
Commit 336bf30

mjkravetz authored and torvalds committed
hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]:

  LTP: starting move_pages12
  BUG: unable to handle page fault for address: ffffffffffffffe0
  ...
  RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170
  avc_start_pgoff at mm/interval_tree.c:63
  Call Trace:
   rmap_walk_anon+0x141/0xa30
   rmap_walk_anon at mm/rmap.c:1864
   try_to_unmap+0x209/0x2d0
   try_to_unmap at mm/rmap.c:1763
   migrate_pages+0x1005/0x1fb0
   move_pages_and_store_status.isra.47+0xd7/0x1a0
   __x64_sys_move_pages+0xa5c/0x1100
   do_syscall_64+0x5f/0x310
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Hugh Dickins diagnosed this as a migration bug caused by code introduced to use i_mmap_rwsem for pmd sharing synchronization [2]. Specifically, the routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for anon pages, as the anon_vma lock should be held in that case. Further analysis suggested that i_mmap_rwsem was not required to be held at all when calling try_to_unmap() for anon pages, as an anon page can never be part of a shared pmd mapping.

Discussion also revealed that the hack in hugetlb_page_mapping_lock_write() to drop the page lock and acquire i_mmap_rwsem is wrong: there is no way to keep the mapping valid while dropping the page lock.

This patch does the following:

- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when calling try_to_unmap().

- Remove the hacky code in hugetlb_page_mapping_lock_write(). The routine now simply does a 'trylock' while still holding the page lock. If the trylock fails, it returns NULL. This can affect the callers:
  - migration calling code will receive -EAGAIN and retry up to the hard coded limit (10).
  - memory error code will treat the page as BUSY. This will force killing (SIGKILL) instead of SIGBUS for any mapping tasks.

Do note that this change in behavior only happens when there is a race. None of the standard kernel testing suites actually hit this race, but it is possible.
[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

Fixes: c0d0381 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Reported-by: Qian Cai <cai@lca.pw>
Suggested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent 8b21ca0 commit 336bf30

4 files changed: 47 additions & 128 deletions

mm/hugetlb.c

Lines changed: 5 additions & 85 deletions
@@ -1567,104 +1567,24 @@ int PageHeadHuge(struct page *page_head)
 	return page_head[1].compound_dtor == HUGETLB_PAGE_DTOR;
 }
 
-/*
- * Find address_space associated with hugetlbfs page.
- * Upon entry page is locked and page 'was' mapped although mapped state
- * could change.  If necessary, use anon_vma to find vma and associated
- * address space.  The returned mapping may be stale, but it can not be
- * invalid as page lock (which is held) is required to destroy mapping.
- */
-static struct address_space *_get_hugetlb_page_mapping(struct page *hpage)
-{
-	struct anon_vma *anon_vma;
-	pgoff_t pgoff_start, pgoff_end;
-	struct anon_vma_chain *avc;
-	struct address_space *mapping = page_mapping(hpage);
-
-	/* Simple file based mapping */
-	if (mapping)
-		return mapping;
-
-	/*
-	 * Even anonymous hugetlbfs mappings are associated with an
-	 * underlying hugetlbfs file (see hugetlb_file_setup in mmap
-	 * code).  Find a vma associated with the anonymous vma, and
-	 * use the file pointer to get address_space.
-	 */
-	anon_vma = page_lock_anon_vma_read(hpage);
-	if (!anon_vma)
-		return mapping;  /* NULL */
-
-	/* Use first found vma */
-	pgoff_start = page_to_pgoff(hpage);
-	pgoff_end = pgoff_start + pages_per_huge_page(page_hstate(hpage)) - 1;
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
-					pgoff_start, pgoff_end) {
-		struct vm_area_struct *vma = avc->vma;
-
-		mapping = vma->vm_file->f_mapping;
-		break;
-	}
-
-	anon_vma_unlock_read(anon_vma);
-	return mapping;
-}
-
 /*
  * Find and lock address space (mapping) in write mode.
  *
- * Upon entry, the page is locked which allows us to find the mapping
- * even in the case of an anon page.  However, locking order dictates
- * the i_mmap_rwsem be acquired BEFORE the page lock.  This is hugetlbfs
- * specific.  So, we first try to lock the sema while still holding the
- * page lock.  If this works, great!  If not, then we need to drop the
- * page lock and then acquire i_mmap_rwsem and reacquire page lock.  Of
- * course, need to revalidate state along the way.
+ * Upon entry, the page is locked which means that page_mapping() is
+ * stable.  Due to locking order, we can only trylock_write.  If we can
+ * not get the lock, simply return NULL to caller.
  */
 struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
 {
-	struct address_space *mapping, *mapping2;
+	struct address_space *mapping = page_mapping(hpage);
 
-	mapping = _get_hugetlb_page_mapping(hpage);
-retry:
 	if (!mapping)
 		return mapping;
 
-	/*
-	 * If no contention, take lock and return
-	 */
 	if (i_mmap_trylock_write(mapping))
 		return mapping;
 
-	/*
-	 * Must drop page lock and wait on mapping sema.
-	 * Note:  Once page lock is dropped, mapping could become invalid.
-	 * As a hack, increase map count until we lock page again.
-	 */
-	atomic_inc(&hpage->_mapcount);
-	unlock_page(hpage);
-	i_mmap_lock_write(mapping);
-	lock_page(hpage);
-	atomic_add_negative(-1, &hpage->_mapcount);
-
-	/* verify page is still mapped */
-	if (!page_mapped(hpage)) {
-		i_mmap_unlock_write(mapping);
-		return NULL;
-	}
-
-	/*
-	 * Get address space again and verify it is the same one
-	 * we locked.  If not, drop lock and retry.
-	 */
-	mapping2 = _get_hugetlb_page_mapping(hpage);
-	if (mapping2 != mapping) {
-		i_mmap_unlock_write(mapping);
-		mapping = mapping2;
-		goto retry;
-	}
-
-	return mapping;
+	return NULL;
 }
 
 pgoff_t __basepage_index(struct page *page)

mm/memory-failure.c

Lines changed: 17 additions & 19 deletions
@@ -1057,27 +1057,25 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	if (!PageHuge(hpage)) {
 		unmap_success = try_to_unmap(hpage, ttu);
 	} else {
-		/*
-		 * For hugetlb pages, try_to_unmap could potentially call
-		 * huge_pmd_unshare.  Because of this, take semaphore in
-		 * write mode here and set TTU_RMAP_LOCKED to indicate we
-		 * have taken the lock at this higer level.
-		 *
-		 * Note that the call to hugetlb_page_mapping_lock_write
-		 * is necessary even if mapping is already set.  It handles
-		 * ugliness of potentially having to drop page lock to obtain
-		 * i_mmap_rwsem.
-		 */
-		mapping = hugetlb_page_mapping_lock_write(hpage);
-
-		if (mapping) {
-			unmap_success = try_to_unmap(hpage,
+		if (!PageAnon(hpage)) {
+			/*
+			 * For hugetlb pages in shared mappings, try_to_unmap
+			 * could potentially call huge_pmd_unshare.  Because of
+			 * this, take semaphore in write mode here and set
+			 * TTU_RMAP_LOCKED to indicate we have taken the lock
+			 * at this higer level.
+			 */
+			mapping = hugetlb_page_mapping_lock_write(hpage);
+			if (mapping) {
+				unmap_success = try_to_unmap(hpage,
 						ttu|TTU_RMAP_LOCKED);
-			i_mmap_unlock_write(mapping);
+				i_mmap_unlock_write(mapping);
+			} else {
+				pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn);
+				unmap_success = false;
+			}
 		} else {
-			pr_info("Memory failure: %#lx: could not find mapping for mapped huge page\n",
-				pfn);
-			unmap_success = false;
+			unmap_success = try_to_unmap(hpage, ttu);
 		}
 	}
 	if (!unmap_success)

mm/migrate.c

Lines changed: 24 additions & 20 deletions
@@ -1328,34 +1328,38 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		goto put_anon;
 
 	if (page_mapped(hpage)) {
-		/*
-		 * try_to_unmap could potentially call huge_pmd_unshare.
-		 * Because of this, take semaphore in write mode here and
-		 * set TTU_RMAP_LOCKED to let lower levels know we have
-		 * taken the lock.
-		 */
-		mapping = hugetlb_page_mapping_lock_write(hpage);
-		if (unlikely(!mapping))
-			goto unlock_put_anon;
+		bool mapping_locked = false;
+		enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK|
+					TTU_IGNORE_ACCESS;
+
+		if (!PageAnon(hpage)) {
+			/*
+			 * In shared mappings, try_to_unmap could potentially
+			 * call huge_pmd_unshare.  Because of this, take
+			 * semaphore in write mode here and set TTU_RMAP_LOCKED
+			 * to let lower levels know we have taken the lock.
+			 */
+			mapping = hugetlb_page_mapping_lock_write(hpage);
+			if (unlikely(!mapping))
+				goto unlock_put_anon;
+
+			mapping_locked = true;
+			ttu |= TTU_RMAP_LOCKED;
+		}
 
-		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
-			TTU_RMAP_LOCKED);
+		try_to_unmap(hpage, ttu);
 		page_was_mapped = 1;
-		/*
-		 * Leave mapping locked until after subsequent call to
-		 * remove_migration_ptes()
-		 */
+
+		if (mapping_locked)
+			i_mmap_unlock_write(mapping);
 	}
 
 	if (!page_mapped(hpage))
 		rc = move_to_new_page(new_hpage, hpage, mode);
 
-	if (page_was_mapped) {
+	if (page_was_mapped)
 		remove_migration_ptes(hpage,
-			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, true);
-		i_mmap_unlock_write(mapping);
-	}
+			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false);
 
 unlock_put_anon:
 	unlock_page(new_hpage);
mm/rmap.c

Lines changed: 1 addition & 4 deletions
@@ -1413,9 +1413,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
-		 *
-		 * If called for a huge page, caller must hold i_mmap_rwsem
-		 * in write mode as it is possible to call huge_pmd_unshare.
 		 */
 		adjust_range_if_pmd_sharing_possible(vma, &range.start,
 						     &range.end);
@@ -1462,7 +1459,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
-		if (PageHuge(page)) {
+		if (PageHuge(page) && !PageAnon(page)) {
 			/*
 			 * To call huge_pmd_unshare, i_mmap_rwsem must be
 			 * held in write mode.  Caller needs to explicitly
