author Linus Torvalds <torvalds@linux-foundation.org> 2024-03-14 17:43:30 -0700
committer Linus Torvalds <torvalds@linux-foundation.org> 2024-03-14 17:43:30 -0700
commit 902861e34c401696ed9ad17a54c8790e7e8e3069
tree 126324c3ec4101b1e17f002ef029d3ffb296ada7 /mm/swap.c
parent 1bbeaf83dd7b5e3628b98bec66ff8fe2646e14aa
parent 270700dd06ca41a4779c19eb46608f076bb7d40e
Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390".
- More folio conversions from Matthew Wilcox in the series
    "Convert memcontrol charge moving to use folios"
    "mm: convert mm counter to take a folio"
- Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree".
- Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations.
- And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest.
- zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()".
- In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory.
- Johannes Weiner has added the large series "mm: zswap: cleanups", which does that.
- More DAMON work from SeongJae Park in the series
    "mm/damon: make DAMON debugfs interface deprecation unignorable"
    "selftests/damon: add more tests for core functionalities and corner cases"
    "Docs/mm/damon: misc readability improvements"
    "mm/damon: let DAMOS feeds and tame/auto-tune itself"
- In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL.
- Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
- Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests".
- Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process our selftesting results.
- Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios.
- David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice.
- And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work.
- In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code.
- In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims.
- Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
- Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed.
- In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on architectures which have virtually aliasing data caches. The arm architecture is the main beneficiary.
- Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations.
- Some page_owner enhancements and maintenance work from Oscar Salvador in his series
    "page_owner: print stacks and their outstanding allocations"
    "page_owner: Fixup and cleanup"
- Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark.
- Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items".
- Some zsmalloc maintenance work from Chengming Zhou in the series
    "mm/zsmalloc: fix and optimize objects/page migration"
    "mm/zsmalloc: some cleanup for get/set_zspage_mapping()"
- Zi Yan has taught the MM to perform compaction on folios larger than order=0. This is a step along the path to implementation of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction".
- Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator".
- Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock".
- Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios".
- David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup.
- Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing".
- Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages.
- Matthew Wilcox's series "PageFlags cleanups" does that.
- Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected.
- Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()".
- Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
- Also, of course, many singleton patches to many things. Please see the individual changelogs for details.

* tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
  mm/zswap: remove the memcpy if acomp is not sleepable
  crypto: introduce: acomp_is_async to expose if comp drivers might sleep
  memtest: use {READ,WRITE}_ONCE in memory scanning
  mm: prohibit the last subpage from reusing the entire large folio
  mm: recover pud_leaf() definitions in nopmd case
  selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
  selftests/mm: skip uffd hugetlb tests with insufficient hugepages
  selftests/mm: dont fail testsuite due to a lack of hugepages
  mm/huge_memory: skip invalid debugfs new_order input for folio split
  mm/huge_memory: check new folio order when split a folio
  mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
  mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
  mm: fix list corruption in put_pages_list
  mm: remove folio from deferred split list before uncharging it
  filemap: avoid unnecessary major faults in filemap_fault()
  mm,page_owner: drop unnecessary check
  mm,page_owner: check for null stack_record before bumping its refcount
  mm: swap: fix race between free_swap_and_cache() and swapoff()
  mm/treewide: align up pXd_leaf() retval across archs
  mm/treewide: drop pXd_large()
  ...
Diffstat (limited to 'mm/swap.c')
-rw-r--r-- mm/swap.c 197
1 files changed, 114 insertions, 83 deletions
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3aa8..500a09a48dfd3a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -74,22 +74,21 @@ static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
 	.lock = INIT_LOCAL_LOCK(lock),
 };
 
-/*
- * This path almost never happens for VM activity - pages are normally freed
- * in batches. But it gets used by networking - and for compound pages.
- */
-static void __page_cache_release(struct folio *folio)
+static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
+		unsigned long *flagsp)
 {
 	if (folio_test_lru(folio)) {
-		struct lruvec *lruvec;
-		unsigned long flags;
-
-		lruvec = folio_lruvec_lock_irqsave(folio, &flags);
-		lruvec_del_folio(lruvec, folio);
+		folio_lruvec_relock_irqsave(folio, lruvecp, flagsp);
+		lruvec_del_folio(*lruvecp, folio);
 		__folio_clear_lru_flags(folio);
-		unlock_page_lruvec_irqrestore(lruvec, flags);
 	}
-	/* See comment on folio_test_mlocked in release_pages() */
+
+	/*
+	 * In rare cases, when truncation or holepunching raced with
+	 * munlock after VM_LOCKED was cleared, Mlocked may still be
+	 * found set here. This does not indicate a problem, unless
+	 * "unevictable_pgs_cleared" appears worryingly large.
+	 */
 	if (unlikely(folio_test_mlocked(folio))) {
 		long nr_pages = folio_nr_pages(folio);
 
@@ -99,9 +98,23 @@ static void __page_cache_release(struct folio *folio)
 	}
 }
 
+/*
+ * This path almost never happens for VM activity - pages are normally freed
+ * in batches. But it gets used by networking - and for compound pages.
+ */
+static void page_cache_release(struct folio *folio)
+{
+	struct lruvec *lruvec = NULL;
+	unsigned long flags;
+
+	__page_cache_release(folio, &lruvec, &flags);
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
+}
+
 static void __folio_put_small(struct folio *folio)
 {
-	__page_cache_release(folio);
+	page_cache_release(folio);
 	mem_cgroup_uncharge(folio);
 	free_unref_page(&folio->page, 0);
 }
@@ -115,7 +128,7 @@ static void __folio_put_large(struct folio *folio)
 	 * be called for hugetlb (it has a separate hugetlb_cgroup.)
 	 */
 	if (!folio_test_hugetlb(folio))
-		__page_cache_release(folio);
+		page_cache_release(folio);
 	destroy_large_folio(folio);
 }
 
@@ -138,22 +151,25 @@ EXPORT_SYMBOL(__folio_put);
  */
 void put_pages_list(struct list_head *pages)
 {
+	struct folio_batch fbatch;
 	struct folio *folio, *next;
 
+	folio_batch_init(&fbatch);
 	list_for_each_entry_safe(folio, next, pages, lru) {
-		if (!folio_put_testzero(folio)) {
-			list_del(&folio->lru);
+		if (!folio_put_testzero(folio))
 			continue;
-		}
 		if (folio_test_large(folio)) {
-			list_del(&folio->lru);
 			__folio_put_large(folio);
 			continue;
 		}
 		/* LRU flag must be clear because it's passed using the lru */
+		if (folio_batch_add(&fbatch, folio) > 0)
+			continue;
+		free_unref_folios(&fbatch);
 	}
 
-	free_unref_page_list(pages);
+	if (fbatch.nr)
+		free_unref_folios(&fbatch);
 	INIT_LIST_HEAD(pages);
 }
 EXPORT_SYMBOL(put_pages_list);
@@ -175,7 +191,7 @@ static void lru_add_fn(struct lruvec *lruvec, struct folio *folio)
 	 * while the LRU lock is held.
 	 *
 	 * (That is not true of __page_cache_release(), and not necessarily
-	 * true of release_pages(): but those only clear the mlocked flag after
+	 * true of folios_put(): but those only clear the mlocked flag after
 	 * folio_put_testzero() has excluded any other users of the folio.)
 	 */
 	if (folio_evictable(folio)) {
@@ -213,7 +229,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 		if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
 			continue;
 
-		lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
+		folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
 		move_fn(lruvec, folio);
 
 		folio_set_lru(folio);
@@ -221,8 +237,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
-	folios_put(fbatch->folios, folio_batch_count(fbatch));
-	folio_batch_reinit(fbatch);
+	folios_put(fbatch);
 }
 
 static void folio_batch_add_and_move(struct folio_batch *fbatch,
@@ -946,41 +961,29 @@ void lru_cache_disable(void)
 }
 
 /**
- * release_pages - batched put_page()
- * @arg: array of pages to release
- * @nr: number of pages
+ * folios_put_refs - Reduce the reference count on a batch of folios.
+ * @folios: The folios.
+ * @refs: The number of refs to subtract from each folio.
  *
- * Decrement the reference count on all the pages in @arg. If it
- * fell to zero, remove the page from the LRU and free it.
+ * Like folio_put(), but for a batch of folios. This is more efficient
+ * than writing the loop yourself as it will optimise the locks which need
+ * to be taken if the folios are freed. The folios batch is returned
+ * empty and ready to be reused for another batch; there is no need
+ * to reinitialise it. If @refs is NULL, we subtract one from each
+ * folio refcount.
  *
- * Note that the argument can be an array of pages, encoded pages,
- * or folio pointers. We ignore any encoded bits, and turn any of
- * them into just a folio that gets free'd.
+ * Context: May be called in process or interrupt context, but not in NMI
+ * context. May be called while holding a spinlock.
  */
-void release_pages(release_pages_arg arg, int nr)
+void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
 {
-	int i;
-	struct encoded_page **encoded = arg.encoded_pages;
-	LIST_HEAD(pages_to_free);
+	int i, j;
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
-	unsigned int lock_batch;
 
-	for (i = 0; i < nr; i++) {
-		struct folio *folio;
-
-		/* Turn any of the argument types into a folio */
-		folio = page_folio(encoded_page_ptr(encoded[i]));
-
-		/*
-		 * Make sure the IRQ-safe lock-holding time does not get
-		 * excessive with a continuous string of pages from the
-		 * same lruvec. The lock is held only if lruvec != NULL.
-		 */
-		if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
-			unlock_page_lruvec_irqrestore(lruvec, flags);
-			lruvec = NULL;
-		}
+	for (i = 0, j = 0; i < folios->nr; i++) {
+		struct folio *folio = folios->folios[i];
+		unsigned int nr_refs = refs ? refs[i] : 1;
 
 		if (is_huge_zero_page(&folio->page))
 			continue;
@@ -990,56 +993,85 @@ void release_pages(release_pages_arg arg, int nr)
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			if (put_devmap_managed_page(&folio->page))
+			if (put_devmap_managed_page_refs(&folio->page, nr_refs))
 				continue;
-			if (folio_put_testzero(folio))
+			if (folio_ref_sub_and_test(folio, nr_refs))
 				free_zone_device_page(&folio->page);
 			continue;
 		}
 
-		if (!folio_put_testzero(folio))
+		if (!folio_ref_sub_and_test(folio, nr_refs))
 			continue;
 
-		if (folio_test_large(folio)) {
+		/* hugetlb has its own memcg */
+		if (folio_test_hugetlb(folio)) {
 			if (lruvec) {
 				unlock_page_lruvec_irqrestore(lruvec, flags);
 				lruvec = NULL;
 			}
-			__folio_put_large(folio);
+			free_huge_folio(folio);
 			continue;
 		}
+		if (folio_test_large(folio) &&
+		    folio_test_large_rmappable(folio))
+			folio_undo_large_rmappable(folio);
 
-		if (folio_test_lru(folio)) {
-			struct lruvec *prev_lruvec = lruvec;
+		__page_cache_release(folio, &lruvec, &flags);
 
-			lruvec = folio_lruvec_relock_irqsave(folio, lruvec,
-									&flags);
-			if (prev_lruvec != lruvec)
-				lock_batch = 0;
+		if (j != i)
+			folios->folios[j] = folio;
+		j++;
+	}
+	if (lruvec)
+		unlock_page_lruvec_irqrestore(lruvec, flags);
+	if (!j) {
+		folio_batch_reinit(folios);
+		return;
+	}
 
-			lruvec_del_folio(lruvec, folio);
-			__folio_clear_lru_flags(folio);
-		}
+	folios->nr = j;
+	mem_cgroup_uncharge_folios(folios);
+	free_unref_folios(folios);
+}
+EXPORT_SYMBOL(folios_put_refs);
 
-		/*
-		 * In rare cases, when truncation or holepunching raced with
-		 * munlock after VM_LOCKED was cleared, Mlocked may still be
-		 * found set here. This does not indicate a problem, unless
-		 * "unevictable_pgs_cleared" appears worryingly large.
-		 */
-		if (unlikely(folio_test_mlocked(folio))) {
-			__folio_clear_mlocked(folio);
-			zone_stat_sub_folio(folio, NR_MLOCK);
-			count_vm_event(UNEVICTABLE_PGCLEARED);
-		}
+/**
+ * release_pages - batched put_page()
+ * @arg: array of pages to release
+ * @nr: number of pages
+ *
+ * Decrement the reference count on all the pages in @arg. If it
+ * fell to zero, remove the page from the LRU and free it.
+ *
+ * Note that the argument can be an array of pages, encoded pages,
+ * or folio pointers. We ignore any encoded bits, and turn any of
+ * them into just a folio that gets free'd.
+ */
+void release_pages(release_pages_arg arg, int nr)
+{
+	struct folio_batch fbatch;
+	int refs[PAGEVEC_SIZE];
+	struct encoded_page **encoded = arg.encoded_pages;
+	int i;
+
+	folio_batch_init(&fbatch);
+	for (i = 0; i < nr; i++) {
+		/* Turn any of the argument types into a folio */
+		struct folio *folio = page_folio(encoded_page_ptr(encoded[i]));
+
+		/* Is our next entry actually "nr_pages" -> "nr_refs" ? */
+		refs[fbatch.nr] = 1;
+		if (unlikely(encoded_page_flags(encoded[i]) &
+			     ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+			refs[fbatch.nr] = encoded_nr_pages(encoded[++i]);
 
-		list_add(&folio->lru, &pages_to_free);
+		if (folio_batch_add(&fbatch, folio) > 0)
+			continue;
+		folios_put_refs(&fbatch, refs);
 	}
-	if (lruvec)
-		unlock_page_lruvec_irqrestore(lruvec, flags);
 
-	mem_cgroup_uncharge_list(&pages_to_free);
-	free_unref_page_list(&pages_to_free);
+	if (fbatch.nr)
+		folios_put_refs(&fbatch, refs);
 }
 EXPORT_SYMBOL(release_pages);
@@ -1059,8 +1091,7 @@ void __folio_batch_release(struct folio_batch *fbatch)
 		lru_add_drain();
 		fbatch->percpu_pvec_drained = true;
 	}
-	release_pages(fbatch->folios, folio_batch_count(fbatch));
-	folio_batch_reinit(fbatch);
+	folios_put(fbatch);
 }
 EXPORT_SYMBOL(__folio_batch_release);