kernel_xiaomi_cepheus

Evolution-X-Devices/kernel_xiaomi_cepheus

Author	SHA1	Message	Date
Yu Zhao	c0367061af	FROMLIST: mm: multi-gen LRU: support page table walks To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[5.5, 7.5]% Ops/sec KB/sec patch1-7: 1014393.57 39455.42 patch1-8: 1078507.59 41949.15 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 45.54% lzo1x_1_do_compress (real work) 9.56% page_vma_mapped_walk 6.70% _raw_spin_unlock_irq 2.78% ptep_clear_flush 2.47% do_raw_spin_lock 2.22% __zram_bvec_write 1.87% lru_gen_look_around 1.78% memmove 1.77% obj_malloc 1.44% free_unref_page_list patch1-8 47.02% lzo1x_1_do_compress (real work) 6.73% page_vma_mapped_walk 6.14% _raw_spin_unlock_irq 3.39% walk_pte_range 2.63% ptep_clear_flush 2.29% __zram_bvec_write 2.10% do_raw_spin_lock 1.81% memmove 1.73% obj_malloc 1.53% free_unref_page_list Configurations: no change [1] https://lwn.net/Articles/23732/ [2] https://source.android.com/devices/tech/debug/scudo Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 228114874 Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e	2022-06-27 14:32:38 +00:00
Yu Zhao	09a2a70211	FROMLIST: mm: multi-gen LRU: exploit locality in rmap Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, the rmap has a high CPU cost in the reclaim path. This patch exploits spatial locality to reduce the trips into the rmap. When shrink_page_list() walks the rmap and finds a young PTE, a new function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent PTEs. On finding another young PTE, it clears the accessed bit and updates the gen counter of the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[4, 6]% Ops/sec KB/sec patch1-6: 964656.80 37520.88 patch1-7: 1014393.57 39455.42 Configurations: no change Client benchmark results: kswapd profiles: patch1-6 36.13% lzo1x_1_do_compress (real work) 19.16% page_vma_mapped_walk 6.55% _raw_spin_unlock_irq 4.02% do_raw_spin_lock 2.32% anon_vma_interval_tree_iter_first 2.11% ptep_clear_flush 1.76% __zram_bvec_write 1.64% folio_referenced_one 1.40% memmove 1.35% obj_malloc patch1-7 45.54% lzo1x_1_do_compress (real work) 9.56% page_vma_mapped_walk 6.70% _raw_spin_unlock_irq 2.78% ptep_clear_flush 2.47% do_raw_spin_lock 2.22% __zram_bvec_write 1.87% lru_gen_look_around 1.78% memmove 1.77% obj_malloc 1.44% free_unref_page_list Configurations: no change Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 228114874 Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c	2022-06-27 14:32:38 +00:00
Yu Zhao	42440bf2f0	FROMLIST: mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. Promotion in the aging path does not require any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The eviction consumes old generations. Given an lruvec, it increments min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. Each generation is divided into multiple tiers. Tiers represent different ranges of numbers of accesses through file descriptors. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in page->flags. In contrast to moving across generations, which requires the LRU lock, moving across tiers only involves operations on page->flags. The feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. The eviction moves a page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[38, 40]% IOPS BW 5.18-ed4643521e6a: 2547k 9989MiB/s patch1-6: 3540k 13.5GiB/s Single workload: memcached (anon): +[103, 107]% Ops/sec KB/sec 5.18-ed4643521e6a: 469048.66 18243.91 patch1-6: 964656.80 37520.88 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO \| __GFP_ZERO \| __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO \| __GFP_ZERO \| __GFP_HIGHMEM \| __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.18-ed4643521e6a 39.56% page_vma_mapped_walk 19.32% lzo1x_1_do_compress (real work) 7.18% do_raw_spin_lock 4.23% _raw_spin_unlock_irq 2.26% vma_interval_tree_subtree_search 2.12% vma_interval_tree_iter_next 2.11% folio_referenced_one 1.90% anon_vma_interval_tree_iter_first 1.47% ptep_clear_flush 0.97% __anon_vma_interval_tree_subtree_search patch1-6 36.13% lzo1x_1_do_compress (real work) 19.16% page_vma_mapped_walk 6.55% _raw_spin_unlock_irq 4.02% do_raw_spin_lock 2.32% anon_vma_interval_tree_iter_first 2.11% ptep_clear_flush 1.76% __zram_bvec_write 1.64% folio_referenced_one 1.40% memmove 1.35% obj_malloc Configurations: CPU: single Snapdragon 7c Mem: total 4G Chrome OS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 228114874 Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86	2022-06-27 14:32:38 +00:00
kondors1995	0cca5fb93e	Revert "Merge branch 'dev/mglru' into dev/12-2" This reverts commit `e3d1ce4d09`, reversing changes made to `d3cf1f4d72`.	2022-05-17 08:16:01 +00:00
Yu Zhao	1e84d38423	FROMLIST: mm: multi-gen LRU: support page table walks To further exploit spatial locality, the aging prefers to walk page tables to search for young PTEs and promote hot pages. A kill switch will be added in the next patch to disable this behavior. When disabled, the aging relies on the rmap only. NB: this behavior has nothing similar with the page table scanning in the 2.4 kernel [1], which searches page tables for old PTEs, adds cold pages to swapcache and unmaps them. To avoid confusion, the term "iteration" specifically means the traversal of an entire mm_struct list; the term "walk" will be applied to page tables and the rmap, as usual. An mm_struct list is maintained for each memcg, and an mm_struct follows its owner task to the new memcg when this task is migrated. Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls walk_page_range() with each mm_struct on this list to promote hot pages before it increments max_seq. When multiple page table walkers iterate the same list, each of them gets a unique mm_struct; therefore they can run concurrently. Page table walkers ignore any misplaced pages, e.g., if an mm_struct was migrated, pages it left in the previous memcg will not be promoted when its current memcg is under reclaim. Similarly, page table walkers will not promote pages from nodes other than the one under reclaim. This patch uses the following optimizations when walking page tables: 1. It tracks the usage of mm_struct's between context switches so that page table walkers can skip processes that have been sleeping since the last iteration. 2. It uses generational Bloom filters to record populated branches so that page table walkers can reduce their search space based on the query results, e.g., to skip page tables containing mostly holes or misplaced pages. 3. It takes advantage of the accessed bit in non-leaf PMD entries when CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y. 4. It does not zigzag between a PGD table and the same PMD table spanning multiple VMAs. IOW, it finishes all the VMAs within the range of the same PMD table before it returns to a PGD table. This improves the cache performance for workloads that have large numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[5.5, 7.5]% Ops/sec KB/sec patch1-7: 1014393.57 39455.42 patch1-8: 1078507.59 41949.15 Configurations: no change Client benchmark results: kswapd profiles: patch1-7 45.54% lzo1x_1_do_compress (real work) 9.56% page_vma_mapped_walk 6.70% _raw_spin_unlock_irq 2.78% ptep_clear_flush 2.47% do_raw_spin_lock 2.22% __zram_bvec_write 1.87% lru_gen_look_around 1.78% memmove 1.77% obj_malloc 1.44% free_unref_page_list patch1-8 47.02% lzo1x_1_do_compress (real work) 6.73% page_vma_mapped_walk 6.14% _raw_spin_unlock_irq 3.39% walk_pte_range 2.63% ptep_clear_flush 2.29% __zram_bvec_write 2.10% do_raw_spin_lock 1.81% memmove 1.73% obj_malloc 1.53% free_unref_page_list Configurations: no change [1] https://lwn.net/Articles/23732/ [2] https://source.android.com/devices/tech/debug/scudo Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 228114874 Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e	2022-05-03 11:10:21 +00:00
Yu Zhao	1bb75d617f	FROMLIST: mm: multi-gen LRU: exploit locality in rmap Searching the rmap for PTEs mapping each page on an LRU list (to test and clear the accessed bit) can be expensive because pages from different VMAs (PA space) are not cache friendly to the rmap (VA space). For workloads mostly using mapped pages, the rmap has a high CPU cost in the reclaim path. This patch exploits spatial locality to reduce the trips into the rmap. When shrink_page_list() walks the rmap and finds a young PTE, a new function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent PTEs. On finding another young PTE, it clears the accessed bit and updates the gen counter of the page mapped by this PTE to (max_seq%MAX_NR_GENS)+1. Server benchmark results: Single workload: fio (buffered I/O): no change Single workload: memcached (anon): +[4, 6]% Ops/sec KB/sec patch1-6: 964656.80 37520.88 patch1-7: 1014393.57 39455.42 Configurations: no change Client benchmark results: kswapd profiles: patch1-6 36.13% lzo1x_1_do_compress (real work) 19.16% page_vma_mapped_walk 6.55% _raw_spin_unlock_irq 4.02% do_raw_spin_lock 2.32% anon_vma_interval_tree_iter_first 2.11% ptep_clear_flush 1.76% __zram_bvec_write 1.64% folio_referenced_one 1.40% memmove 1.35% obj_malloc patch1-7 45.54% lzo1x_1_do_compress (real work) 9.56% page_vma_mapped_walk 6.70% _raw_spin_unlock_irq 2.78% ptep_clear_flush 2.47% do_raw_spin_lock 2.22% __zram_bvec_write 1.87% lru_gen_look_around 1.78% memmove 1.77% obj_malloc 1.44% free_unref_page_list Configurations: no change Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 228114874 Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c	2022-05-03 11:10:21 +00:00
Yu Zhao	ed95143ac2	FROMLIST: mm: multi-gen LRU: minimal implementation To avoid confusion, the terms "promotion" and "demotion" will be applied to the multi-gen LRU, as a new convention; the terms "activation" and "deactivation" will be applied to the active/inactive LRU, as usual. The aging produces young generations. Given an lruvec, it increments max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging promotes hot pages to the youngest generation when it finds them accessed through page tables; the demotion of cold pages happens consequently when it increments max_seq. The aging has the complexity O(nr_hot_pages), since it is only interested in hot pages. Promotion in the aging path does not require any LRU list operations, only the updates of the gen counter and lrugen->nr_pages[]; demotion, unless as the result of the increment of max_seq, requires LRU list operations, e.g., lru_deactivate_fn(). The eviction consumes old generations. Given an lruvec, it increments min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A feedback loop modeled after the PID controller monitors refaults over anon and file types and decides which type to evict when both types are available from the same generation. Each generation is divided into multiple tiers. Tiers represent different ranges of numbers of accesses through file descriptors. A page accessed N times through file descriptors is in tier order_base_2(N). Tiers do not have dedicated lrugen->lists[], only bits in page->flags. In contrast to moving across generations, which requires the LRU lock, moving across tiers only involves operations on page->flags. The feedback loop also monitors refaults over all tiers and decides when to protect pages in which tiers (N>1), using the first tier (N=0,1) as a baseline. The first tier contains single-use unmapped clean pages, which are most likely the best choices. The eviction moves a page to the next generation, i.e., min_seq+1, if the feedback loop decides so. This approach has the following advantages: 1. It removes the cost of activation in the buffered access path by inferring whether pages accessed multiple times through file descriptors are statistically hot and thus worth protecting in the eviction path. 2. It takes pages accessed through page tables into account and avoids overprotecting pages accessed multiple times through file descriptors. (Pages accessed through page tables are in the first tier, since N=0.) 3. More tiers provide better protection for pages accessed more than twice through file descriptors, when under heavy buffered I/O workloads. Server benchmark results: Single workload: fio (buffered I/O): +[38, 40]% IOPS BW 5.18-ed4643521e6a: 2547k 9989MiB/s patch1-6: 3540k 13.5GiB/s Single workload: memcached (anon): +[103, 107]% Ops/sec KB/sec 5.18-ed4643521e6a: 469048.66 18243.91 patch1-6: 964656.80 37520.88 Configurations: CPU: two Xeon 6154 Mem: total 256G Node 1 was only used as a ram disk to reduce the variance in the results. patch drivers/block/brd.c <<EOF 99,100c99,100 < gfp_flags = GFP_NOIO \| __GFP_ZERO \| __GFP_HIGHMEM; < page = alloc_page(gfp_flags); --- > gfp_flags = GFP_NOIO \| __GFP_ZERO \| __GFP_HIGHMEM \| __GFP_THISNODE; > page = alloc_pages_node(1, gfp_flags, 0); EOF cat >>/etc/systemd/system.conf <<EOF CPUAffinity=numa NUMAPolicy=bind NUMAMask=0 EOF cat >>/etc/memcached.conf <<EOF -m 184320 -s /var/run/memcached/memcached.sock -a 0766 -t 36 -B binary EOF cat fio.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkfs.ext4 /dev/ram0 mount -t ext4 /dev/ram0 /mnt mkdir /sys/fs/cgroup/user.slice/test echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \ --buffered=1 --ioengine=io_uring --iodepth=128 \ --iodepth_batch_submit=32 --iodepth_batch_complete=32 \ --rw=randread --random_distribution=random --norandommap \ --time_based --ramp_time=10m --runtime=5m --group_reporting cat memcached.sh modprobe brd rd_nr=1 rd_size=113246208 swapoff -a mkswap /dev/ram0 swapon /dev/ram0 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \ --ratio 1:0 --pipeline 8 -d 2000 memtier_benchmark -S /var/run/memcached/memcached.sock \ -P memcache_binary -n allkeys --key-minimum=1 \ --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \ --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed Client benchmark results: kswapd profiles: 5.18-ed4643521e6a 39.56% page_vma_mapped_walk 19.32% lzo1x_1_do_compress (real work) 7.18% do_raw_spin_lock 4.23% _raw_spin_unlock_irq 2.26% vma_interval_tree_subtree_search 2.12% vma_interval_tree_iter_next 2.11% folio_referenced_one 1.90% anon_vma_interval_tree_iter_first 1.47% ptep_clear_flush 0.97% __anon_vma_interval_tree_subtree_search patch1-6 36.13% lzo1x_1_do_compress (real work) 19.16% page_vma_mapped_walk 6.55% _raw_spin_unlock_irq 4.02% do_raw_spin_lock 2.32% anon_vma_interval_tree_iter_first 2.11% ptep_clear_flush 1.76% __zram_bvec_write 1.64% folio_referenced_one 1.40% memmove 1.35% obj_malloc Configurations: CPU: single Snapdragon 7c Mem: total 4G Chrome OS MemoryPressure [1] [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/ Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/ Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> Acked-by: Steven Barrett <steven@liquorix.net> Acked-by: Suleiman Souhlal <suleiman@google.com> Tested-by: Daniel Byrne <djbyrne@mtu.edu> Tested-by: Donald Carr <d@chaos-reins.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> Tested-by: Sofia Trinh <sofia.trinh@edi.works> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> Bug: 228114874 Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86	2022-05-03 11:10:21 +00:00
UtsavBalar1231	f5daca7dd9	Merge remote-tracking branch 'aosp/android-4.14-stable' into android11-base * aosp/android-4.14-stable: Linux 4.14.230 can: flexcan: flexcan_chip_freeze(): fix chip freeze for missing bitrate init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM init/Kconfig: make COMPILE_TEST depend on !S390 bpf, x86: Validate computation of branch displacements for x86-64 cifs: Silently ignore unknown oplock break handle cifs: revalidate mapping when we open files for SMB1 POSIX ia64: mca: allocate early mca with GFP_ATOMIC scsi: target: pscsi: Clean up after failure in pscsi_map_sg() x86/build: Turn off -fcf-protection for realmode targets platform/x86: thinkpad_acpi: Allow the FnLock LED to change state drm/msm: Ratelimit invalid-fence message mac80211: choose first enabled channel for monitor mISDN: fix crash in fritzpci net: pxa168_eth: Fix a potential data race in pxa168_eth_remove ARM: dts: am33xx: add aliases for mmc interfaces Linux 4.14.229 drivers: video: fbcon: fix NULL dereference in fbcon_cursor() staging: rtl8192e: Change state information from u16 to u8 staging: rtl8192e: Fix incorrect source in memcpy() usb: gadget: udc: amd5536udc_pci fix null-ptr-dereference USB: cdc-acm: fix use-after-free after probe failure USB: cdc-acm: downgrade message to debug USB: cdc-acm: untangle a circular dependency between callback and softint cdc-acm: fix BREAK rx code path adding necessary calls usb: xhci-mtk: fix broken streams issue on 0.96 xHCI usb: musb: Fix suspend with devices connected for a64 USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control() firewire: nosy: Fix a use-after-free bug in nosy_ioctl() extcon: Fix error handling in extcon_dev_register extcon: Add stubs for extcon_register_notifier_all() functions pinctrl: rockchip: fix restore error in resume mm: writeback: use exact memcg dirty counts mm: fix oom_kill event handling mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the same cacheline mm: memcg: make sure memory.events is uptodate when waking pollers mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats reiserfs: update reiserfs_xattrs_initialized() condition drm/amdgpu: check alignment on CPU page for bo map drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings() mm: fix race by making init_zero_pfn() early_initcall tracing: Fix stack trace event size ALSA: hda/realtek: call alc_update_headset_mode() in hp_automute_hook ALSA: hda/realtek: fix a determine_headset_type issue for a Dell AIO ALSA: usb-audio: Apply sample rate quirk to Logitech Connect bpf: Remove MTU check in __bpf_skb_max_len net: wan/lmc: unregister device when no matching device is found appletalk: Fix skb allocation size in loopback case net: ethernet: aquantia: Handle error cleanup of start on open brcmfmac: clear EAP/association status bits on linkdown events ext4: do not iput inode under running transaction in ext4_rename() ASoC: rt5659: Update MCLK rate in set_sysclk() staging: comedi: cb_pcidas64: fix request_irq() warn staging: comedi: cb_pcidas: fix request_irq() warn scsi: qla2xxx: Fix broken #endif placement scsi: st: Fix a use after free in st_open() vhost: Fix vhost_vq_reset() powerpc: Force inlining of cpu_has_feature() to avoid build failure ASoC: cs42l42: Always wait at least 3ms after reset ASoC: cs42l42: Fix mixer volume control ASoC: es8316: Simplify adc_pga_gain_tlv table ASoC: sgtl5000: set DAP_AVC_CTRL register to correct default value on probe ASoC: rt5651: Fix dac- and adc- vol-tlv values being off by a factor of 10 ASoC: rt5640: Fix dac- and adc- vol-tlv values being off by a factor of 10 rpc: fix NULL dereference on kmalloc failure ext4: fix bh ref count on error paths ipv6: weaken the v4mapped source check selinux: vsock: Set SID for socket returned by accept() BACKPORT: drm/virtio: Use vmalloc for command buffer allocations. UPSTREAM: drm/virtio: Rewrite virtio_gpu_queue_ctrl_buffer using fenced version. Conflicts: drivers/usb/core/quirks.c Change-Id: I23466726e73acff25a6a397f23d3752c48379a51 Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>	2021-04-11 05:23:34 +00:00
Greg Thelen	f34769a0b8	mm: writeback: use exact memcg dirty counts commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream. Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") memcg dirty and writeback counters are managed as: 1) per-memcg per-cpu values in range of [-32..32] 2) per-memcg atomic counter When a per-cpu counter cannot fit in [-32..32] it's flushed to the atomic. Stat readers only check the atomic. Thus readers such as balance_dirty_pages() may see a nontrivial error margin: 32 pages per cpu. Assuming 100 cpus: 4k x86 page_size: 13 MiB error per memcg 64k ppc page_size: 200 MiB error per memcg Considering that dirty+writeback are used together for some decisions the errors double. This inaccuracy can lead to undeserved oom kills. One nasty case is when all per-cpu counters hold positive values offsetting an atomic negative value (i.e. per_cpu[]=32, atomic=n_cpu-32). balance_dirty_pages() only consults the atomic and does not consider throttling the next n_cpu32 dirty pages. If the file_lru is in the 13..200 MiB range then there's absolutely no dirty throttling, which burdens vmscan with only dirty+writeback pages thus resorting to oom kill. It could be argued that tiny containers are not supported, but it's more subtle. It's the amount the space available for file lru that matters. If a container has memory.max-200MiB of non reclaimable memory, then it will also suffer such oom kills on a 100 cpu machine. The following test reliably ooms without this patch. This patch avoids oom kills. $ cat test mount -t cgroup2 none /dev/cgroup cd /dev/cgroup echo +io +memory > cgroup.subtree_control mkdir test cd test echo 10M > memory.max (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo) (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100) $ cat memcg-writeback-stress.c / * Dirty pages from all but one cpu. * Clean pages from the non dirtying cpu. * This is to stress per cpu counter imbalance. * On a 100 cpu machine: * - per memcg per cpu dirty count is 32 pages for each of 99 cpus * - per memcg atomic is -9932 pages - thus the complete dirty limit: sum of all counters 0 * - balance_dirty_pages() only sees atomic count -9932 pages, which it max()s to 0. * - So a workload can dirty -9932 pages before balance_dirty_pages() cares. / #define _GNU_SOURCE #include <err.h> #include <fcntl.h> #include <sched.h> #include <stdlib.h> #include <stdio.h> #include <sys/stat.h> #include <sys/sysinfo.h> #include <sys/types.h> #include <unistd.h> static char buf; static int bufSize; static void set_affinity(int cpu) { cpu_set_t affinity; CPU_ZERO(&affinity); CPU_SET(cpu, &affinity); if (sched_setaffinity(0, sizeof(affinity), &affinity)) err(1, "sched_setaffinity"); } static void dirty_on(int output_fd, int cpu) { int i, wrote; set_affinity(cpu); for (i = 0; i < 32; i++) { for (wrote = 0; wrote < bufSize; ) { int ret = write(output_fd, buf+wrote, bufSize-wrote); if (ret == -1) err(1, "write"); wrote += ret; } } } int main(int argc, char *argv) { int cpu, flush_cpu = 1, output_fd; const char output; if (argc != 2) errx(1, "usage: output_file"); output = argv[1]; bufSize = getpagesize(); buf = malloc(getpagesize()); if (buf == NULL) errx(1, "malloc failed"); output_fd = open(output, O_CREAT\|O_RDWR); if (output_fd == -1) err(1, "open(%s)", output); for (cpu = 0; cpu < get_nprocs(); cpu++) { if (cpu != flush_cpu) dirty_on(output_fd, cpu); } set_affinity(flush_cpu); if (fsync(output_fd)) err(1, "fsync(%s)", output); if (close(output_fd)) err(1, "close(%s)", output); free(buf); } Make balance_dirty_pages() and wb_over_bg_thresh() work harder to collect exact per memcg counters. This avoids the aforementioned oom kills. This does not affect the overhead of memory.stat, which still reads the single atomic counter. Why not use percpu_counter? memcg already handles cpus going offline, so no need for that overhead from percpu_counter. And the percpu_counter spinlocks are more heavyweight than is required. It probably also makes sense to use exact dirty and writeback counters in memcg oom reports. But that is saved for later. Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> [4.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-04-07 12:47:03 +02:00
Roman Gushchin	6f599379fd	mm: fix oom_kill event handling commit fe6bdfc8e1e131720abbe77a2eb990c94c9024cb upstream. Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate when waking pollers") converted most of memcg event counters to per-memcg atomics, which made them less confusing for a user. The "oom_kill" counter remained untouched, so now it behaves differently than other counters (including "oom"). This adds nothing but confusion. Let's fix this by adding the MEMCG_OOM_KILL event, and follow the MEMCG_OOM approach. This also removes a hack from count_memcg_event_mm(), introduced earlier specially for the OOM_KILL counter. [akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch] Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [fllinden@amazon.com: backport to 4.14, minor contextual changes] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-04-07 12:47:03 +02:00
Aaron Lu	ad6e2898e6	mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the same cacheline commit e81bf9793b1861d74953ef041b4f6c7faecc2dbd upstream. The LKP robot found a 27% will-it-scale/page_fault3 performance regression regarding commit e27be240df53("mm: memcg: make sure memory.events is uptodate when waking pollers"). What the test does is: 1 mkstemp() a 128M file on a tmpfs; 2 start $nr_cpu processes, each to loop the following: 2.1 mmap() this file in shared write mode; 2.2 write 0 to this file in a PAGE_SIZE step till the end of the file; 2.3 unmap() this file and repeat this process. 3 After 5 minutes, check how many loops they managed to complete, the higher the better. The commit itself looks innocent enough as it merely changed some event counting mechanism and this test didn't trigger those events at all. Perf shows increased cycles spent on accessing root_mem_cgroup->stat_cpu in count_memcg_event_mm()(called by handle_mm_fault()) and in __mod_memcg_state() called by page_add_file_rmap(). So it's likely due to the changed layout of 'struct mem_cgroup' that either make stat_cpu falling into a constantly modifying cacheline or some hot fields stop being in the same cacheline. I verified this by moving memory_events[] back to where it was: : --- a/include/linux/memcontrol.h : +++ b/include/linux/memcontrol.h : @@ -205,7 +205,6 @@ struct mem_cgroup { : int oom_kill_disable; : : /* memory.events / : - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; : struct cgroup_file events_file; : : / protect arrays of thresholds / : @@ -238,6 +237,7 @@ struct mem_cgroup { : struct mem_cgroup_stat_cpu __percpu stat_cpu; : atomic_long_t stat[MEMCG_NR_STAT]; : atomic_long_t events[NR_VM_EVENT_ITEMS]; : + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS]; : : unsigned long socket_pressure; And performance restored. Later investigation found that as long as the following 3 fields moving_account, move_lock_task and stat_cpu are in the same cacheline, performance will be good. To avoid future performance surprise by other commits changing the layout of 'struct mem_cgroup', this patch makes sure the 3 fields stay in the same cacheline. One concern of this approach is, moving_account and move_lock_task could be modified when a process changes memory cgroup while stat_cpu is a always read field, it might hurt to place them in the same cacheline. I assume it is rare for a process to change memory cgroup so this should be OK. Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop Link: http://lkml.kernel.org/r/20180601071115.GA27302@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Reported-by: kernel test robot <xiaolong.ye@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-04-07 12:47:03 +02:00
Johannes Weiner	09031a8b68	mm: memcg: make sure memory.events is uptodate when waking pollers commit e27be240df53f1a20c659168e722b5d9f16cc7f4 upstream. Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") added per-cpu drift to all memory cgroup stats and events shown in memory.stat and memory.events. For memory.stat this is acceptable. But memory.events issues file notifications, and somebody polling the file for changes will be confused when the counters in it are unchanged after a wakeup. Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX, MEMCG_OOM - are sufficiently rare and high-level that we don't need per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most frequent, but they're counting invocations of reclaim, which is a complex operation that touches many shared cachelines. This splits memory.events from the generic VM events and tracks them in their own, unbuffered atomic counters. That's also cleaner, as it eliminates the ugly enum nesting of VM and cgroup events. [hannes@cmpxchg.org: "array subscript is above array bounds"] Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-04-07 12:47:03 +02:00
Johannes Weiner	cfc5a8f43e	mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats commit c3cc39118c3610eb6ab4711bc624af7fc48a35fe upstream. After commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting"), we observed slowly upward creeping NR_WRITEBACK counts over the course of several days, both the per-memcg stats as well as the system counter in e.g. /proc/meminfo. The conversion from full per-cpu stat counts to per-cpu cached atomic stat counts introduced an irq-unsafe RMW operation into the updates. Most stat updates come from process context, but one notable exception is the NR_WRITEBACK counter. While writebacks are issued from process context, they are retired from (soft)irq context. When writeback completions interrupt the RMW counter updates of new writebacks being issued, the decs from the completions are lost. Since the global updates are routed through the joint lruvec API, both the memcg counters as well as the system counters are affected. This patch makes the joint stat and event API irq safe. Link: http://lkml.kernel.org/r/20180203082353.17284-1-hannes@cmpxchg.org Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Debugged-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2021-04-07 12:47:03 +02:00
UtsavBalar1231	52dc535677	Merge remote-tracking branch 'aosp/android-4.14-stable' into android11-base * aosp/android-4.14-stable: Linux 4.14.216 net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet block: fix use-after-free in disk_part_iter_next KVM: arm64: Don't access PMCR_EL0 when no PMU is available wan: ds26522: select CONFIG_BITREVERSE net/mlx5e: Fix two double free cases net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups iommu/intel: Fix memleak in intel_irq_remapping_alloc block: rsxx: select CONFIG_CRC32 wil6210: select CONFIG_CRC32 dmaengine: xilinx_dma: fix mixed_enum_type coverity warning dmaengine: xilinx_dma: check dma_async_device_register return value spi: stm32: FIFO threshold level - fix align packet size cpufreq: powernow-k8: pass policy rather than use cpufreq_cpu_get() i2c: sprd: use a specific timeout to avoid system hang up issue ARM: OMAP2+: omap_device: fix idling of devices during probe iio: imu: st_lsm6dsx: fix edge-trigger interrupts iio: imu: st_lsm6dsx: flip irq return logic spi: pxa2xx: Fix use-after-free on unbind ubifs: wbuf: Don't leak kernel memory to flash drm/i915: Fix mismatch between misplaced vma check and vma insert vmlinux.lds.h: Add PGO and AutoFDO input sections x86/resctrl: Don't move a task to the same resource group x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR net: fix pmtu check in nopmtudisc mode net: ip: always refragment ip defragmented packets net: vlan: avoid leaks on register_vlan_dev() failures net: cdc_ncm: correct overhead in delayed_ndp_size powerpc: Fix incorrect stw{, ux, u, x} instructions in __set_pte_at Linux 4.14.215 scsi: target: Fix XCOPY NAA identifier lookup KVM: x86: fix shift out of bounds reported by UBSAN x86/mtrr: Correct the range check before performing MTRR type lookups netfilter: xt_RATEEST: reject non-null terminated string from userspace netfilter: ipset: fix shift-out-of-bounds in htable_bits() Revert "device property: Keep secondary firmware node secondary by type" ALSA: hda/realtek - Fix speaker volume control on Lenovo C940 ALSA: hda/conexant: add a new hda codec CX11970 x86/mm: Fix leak of pmd ptlock USB: serial: keyspan_pda: remove unused variable usb: gadget: configfs: Fix use-after-free issue with udc_name usb: gadget: configfs: Preserve function ordering after bind failure usb: gadget: Fix spinlock lockup on usb_function_deactivate USB: gadget: legacy: fix return error code in acm_ms_bind() usb: gadget: function: printer: Fix a memory leak for interface descriptor usb: gadget: f_uac2: reset wMaxPacketSize usb: gadget: select CONFIG_CRC32 ALSA: usb-audio: Fix UBSAN warnings for MIDI jacks USB: usblp: fix DMA to stack USB: yurex: fix control-URB timeout handling USB: serial: option: add Quectel EM160R-GL USB: serial: option: add LongSung M5710 module support USB: serial: iuu_phoenix: fix DMA from stack usb: uas: Add PNY USB Portable SSD to unusual_uas usb: usbip: vhci_hcd: protect shift size USB: xhci: fix U1/U2 handling for hardware with XHCI_INTEL_HOST quirk set usb: chipidea: ci_hdrc_imx: add missing put_device() call in usbmisc_get_init_data() usb: dwc3: ulpi: Use VStsDone to detect PHY regs access completion USB: cdc-acm: blacklist another IR Droid device usb: gadget: enable super speed plus crypto: ecdh - avoid buffer overflow in ecdh_set_secret() video: hyperv_fb: Fix the mmap() regression for v5.4.y and older net: systemport: set dev->max_mtu to UMAC_MAX_MTU_SIZE net: mvpp2: Fix GoP port 3 Networking Complex Control configurations net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc net: sched: prevent invalid Scell_log shift count vhost_net: fix ubuf refcount incorrectly when sendmsg fails net: usb: qmi_wwan: add Quectel EM160R-GL CDC-NCM: remove "connected" log message net: hdlc_ppp: Fix issues when mod_timer is called while timer is running net: hns: fix return value check in __lb_other_process() ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() net: ethernet: ti: cpts: fix ethtool output when no ptp_clock registered net-sysfs: take the rtnl lock when storing xps_cpus net: ethernet: Fix memleak in ethoc_probe net/ncsi: Use real net-device for response handler virtio_net: Fix recursive call to cpus_read_lock() qede: fix offload for IPIP tunnel packets atm: idt77252: call pci_disable_device() on error path ethernet: ucc_geth: set dev->max_mtu to 1518 ethernet: ucc_geth: fix use-after-free in ucc_geth_remove() depmod: handle the case of /sbin/depmod without /sbin in PATH lib/genalloc: fix the overflow when size is too big scsi: ide: Do not set the RQF_PREEMPT flag for sense requests scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff() workqueue: Kick a worker based on the actual activation of delayed works kbuild: don't hardcode depmod path Linux 4.14.214 mwifiex: Fix possible buffer overflows in mwifiex_cmd_802_11_ad_hoc_start iio:magnetometer:mag3110: Fix alignment and data leak issues. iio:imu:bmi160: Fix alignment and data leak issues kdev_t: always inline major/minor helper functions dm verity: skip verity work if I/O error when system is shutting down ALSA: pcm: Clear the full allocated memory at hw_params module: delay kobject uevent until after module init call powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe() quota: Don't overflow quota file offsets module: set MODULE_STATE_GOING state when a module fails to load rtc: sun6i: Fix memleak in sun6i_rtc_clk_init ALSA: seq: Use bool for snd_seq_queue internal flags media: gp8psk: initialize stats at power control logic misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells() reiserfs: add check for an invalid ih_entry_count of: fix linker-section match-table corruption uapi: move constants from <linux/kernel.h> to <linux/const.h> powerpc/bitops: Fix possible undefined behaviour with fls() and fls64() USB: serial: digi_acceleport: fix write-wakeup deadlocks s390/dasd: fix hanging device offline processing vfio/pci: Move dummy_resources_list init in vfio_pci_probe() mm: memcontrol: fix excessive complexity in memory.stat reporting mm: memcontrol: implement lruvec stat functions on top of each other mm: memcontrol: eliminate raw access to stat and event counters ALSA: usb-audio: fix sync-ep altsetting sanity check ALSA: usb-audio: simplify set_sync_ep_implicit_fb_quirk ALSA: hda/ca0132 - Fix work handling in delayed HP detection md/raid10: initialize r10_bio->read_slot before use. x86/entry/64: Add instruction suffix ANDROID: usb: f_accessory: Don't drop NULL reference in acc_disconnect() ANDROID: usb: f_accessory: Avoid bitfields for shared variables ANDROID: usb: f_accessory: Cancel any pending work before teardown ANDROID: usb: f_accessory: Don't corrupt global state on double registration ANDROID: usb: f_accessory: Fix teardown ordering in acc_release() ANDROID: usb: f_accessory: Add refcounting to global 'acc_dev' ANDROID: usb: f_accessory: Wrap '_acc_dev' in get()/put() accessors ANDROID: usb: f_accessory: Remove useless assignment ANDROID: usb: f_accessory: Remove useless non-debug prints ANDROID: usb: f_accessory: Remove stale comments ANDROID: USB: f_accessory: Check dev pointer before decoding ctrl request ANDROID: usb: gadget: f_accessory: fix CTS test stuck Linux 4.14.213 PCI: Fix pci_slot_release() NULL pointer dereference libnvdimm/namespace: Fix reaping of invalidated block-window-namespace labels xenbus/xenbus_backend: Disallow pending watch messages xen/xenbus: Count pending messages for each watch xen/xenbus/xen_bus_type: Support will_handle watch callback xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path() xen/xenbus: Allow watches discard events before queueing xen-blkback: set ring->xenblkd to NULL after kthread_stop() clk: mvebu: a3700: fix the XTAL MODE pin to MPP1_9 md/cluster: fix deadlock when node is doing resync job iio:imu:bmi160: Fix too large a buffer. iio:pressure:mpl3115: Force alignment of buffer iio:light:rpr0521: Fix timestamp alignment and prevent data leak. iio: adc: rockchip_saradc: fix missing clk_disable_unprepare() on error in rockchip_saradc_resume iio: buffer: Fix demux update mtd: parser: cmdline: Fix parsing of part-names with colons soc: qcom: smp2p: Safely acquire spinlock without IRQs spi: st-ssc4: Fix unbalanced pm_runtime_disable() in probe error path spi: sc18is602: Don't leak SPI master in probe error path spi: rb4xx: Don't leak SPI master in probe error path spi: pic32: Don't leak DMA channels in probe error path spi: davinci: Fix use-after-free on unbind spi: spi-sh: Fix use-after-free on unbind drm/dp_aux_dev: check aux_dev before use in drm_dp_aux_dev_get_by_minor() jfs: Fix array index bounds check in dbAdjTree jffs2: Fix GC exit abnormally ceph: fix race in concurrent __ceph_remove_cap invocations ima: Don't modify file descriptor mode on the fly powerpc/powernv/memtrace: Don't leak kernel memory to user space powerpc/xmon: Change printk() to pr_cont() powerpc/rtas: Fix typo of ibm,open-errinjct in RTAS filter ARM: dts: at91: sama5d2: fix CAN message ram offset and size KVM: arm64: Introduce handling of AArch32 TTBCR2 traps ext4: fix deadlock with fs freezing and EA inodes ext4: fix a memory leak of ext4_free_data btrfs: fix return value mixup in btrfs_get_extent Btrfs: fix selftests failure due to uninitialized i_mode in test inodes USB: serial: keyspan_pda: fix write unthrottling USB: serial: keyspan_pda: fix tx-unthrottle use-after-free USB: serial: keyspan_pda: fix write-wakeup use-after-free USB: serial: keyspan_pda: fix stalled writes USB: serial: keyspan_pda: fix write deadlock USB: serial: keyspan_pda: fix dropped unthrottle interrupts USB: serial: mos7720: fix parallel-port state restore EDAC/amd64: Fix PCI component registration crypto: ecdh - avoid unaligned accesses in ecdh_set_secret() powerpc/perf: Exclude kernel samples while counting events in user space. staging: comedi: mf6x4: Fix AI end-of-conversion detection s390/dasd: fix list corruption of lcu list s390/dasd: fix list corruption of pavgroup group list s390/dasd: prevent inconsistent LCU device data s390/smp: perform initial CPU reset also for SMT siblings ALSA: usb-audio: Disable sample read check if firmware doesn't give back ALSA: pcm: oss: Fix a few more UBSAN fixes ALSA: hda/realtek - Enable headset mic of ASUS Q524UQK with ALC255 ACPI: PNP: compare the string length in the matching_id() Revert "ACPI / resources: Use AE_CTRL_TERMINATE to terminate resources walks" PM: ACPI: PCI: Drop acpi_pm_set_bridge_wakeup() Input: cyapa_gen6 - fix out-of-bounds stack access media: netup_unidvb: Don't leak SPI master in probe error path media: sunxi-cir: ensure IR is handled when it is continuous media: gspca: Fix memory leak in probe Input: goodix - add upside-down quirk for Teclast X98 Pro tablet Input: cros_ec_keyb - send 'scancodes' in addition to key events fix namespaced fscaps when !CONFIG_SECURITY cfg80211: initialize rekey_data clk: sunxi-ng: Make sure divider tables have sentinel clk: s2mps11: Fix a resource leak in error handling paths in the probe function qlcnic: Fix error code in probe perf record: Fix memory leak when using '--user-regs=?' to list registers pwm: lp3943: Dynamically allocate PWM chip base pwm: zx: Add missing cleanup in error path clk: ti: Fix memleak in ti_fapll_synth_setup watchdog: coh901327: add COMMON_CLK dependency watchdog: qcom: Avoid context switch in restart handler net: korina: fix return value net: allwinner: Fix some resources leak in the error handling path of the probe and in the remove function net: bcmgenet: Fix a resource leak in an error handling path in the probe functin checkpatch: fix unescaped left brace powerpc/ps3: use dma_mapping_error() nfc: s3fwrn5: Release the nfc firmware um: chan_xterm: Fix fd leak watchdog: sirfsoc: Add missing dependency on HAS_IOMEM irqchip/alpine-msi: Fix freeing of interrupts on allocation error path ASoC: wm_adsp: remove "ctl" from list on error in wm_adsp_create_control() extcon: max77693: Fix modalias string clk: tegra: Fix duplicated SE clock entry x86/kprobes: Restore BTF if the single-stepping is cancelled nfs_common: need lock during iterate through the list nfsd: Fix message level for normal termination speakup: fix uninitialized flush_lock usb: oxu210hp-hcd: Fix memory leak in oxu_create usb: ehci-omap: Fix PM disable depth umbalance in ehci_hcd_omap_probe powerpc/pseries/hibernation: remove redundant cacheinfo update powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops scsi: fnic: Fix error return code in fnic_probe() seq_buf: Avoid type mismatch for seq_buf_init scsi: pm80xx: Fix error return in pm8001_pci_probe() scsi: qedi: Fix missing destroy_workqueue() on error in __qedi_probe cpufreq: scpi: Add missing MODULE_ALIAS cpufreq: loongson1: Add missing MODULE_ALIAS cpufreq: st: Add missing MODULE_DEVICE_TABLE cpufreq: mediatek: Add missing MODULE_DEVICE_TABLE cpufreq: highbank: Add missing MODULE_DEVICE_TABLE clocksource/drivers/arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI dm ioctl: fix error return code in target_message ASoC: jz4740-i2s: add missed checks for clk_get() net/mlx5: Properly convey driver version to firmware memstick: r592: Fix error return in r592_probe() arm64: dts: rockchip: Fix UART pull-ups on rk3328 pinctrl: falcon: add missing put_device() call in pinctrl_falcon_probe() ARM: dts: at91: sama5d2: map securam as device clocksource/drivers/cadence_ttc: Fix memory leak in ttc_setup_clockevent() media: saa7146: fix array overflow in vidioc_s_audio() vfio-pci: Use io_remap_pfn_range() for PCI IO memory NFS: switch nfsiod to be an UNBOUND workqueue. lockd: don't use interval-based rebinding over TCP SUNRPC: xprt_load_transport() needs to support the netid "rdma6" NFSv4.2: condition READDIR's mask for security label based on LSM state ath10k: Release some resources in an error handling path ath10k: Fix an error handling path ARM: dts: at91: at91sam9rl: fix ADC triggers PCI: iproc: Fix out-of-bound array accesses genirq/irqdomain: Don't try to free an interrupt that has no mapping power: supply: bq24190_charger: fix reference leak ARM: dts: Remove non-existent i2c1 from 98dx3236 HSI: omap_ssi: Don't jump to free ID in ssi_add_controller() media: max2175: fix max2175_set_csm_mode() error code mips: cdmm: fix use-after-free in mips_cdmm_bus_discover samples: bpf: Fix lwt_len_hist reusing previous BPF map media: siano: fix memory leak of debugfs members in smsdvb_hotplug cw1200: fix missing destroy_workqueue() on error in cw1200_init_common orinoco: Move context allocation after processing the skb ARM: dts: at91: sama5d3_xplained: add pincontrol for USB Host ARM: dts: at91: sama5d4_xplained: add pincontrol for USB Host memstick: fix a double-free bug in memstick_check RDMA/cxgb4: Validate the number of CQEs Input: omap4-keypad - fix runtime PM error handling drivers: soc: ti: knav_qmss_queue: Fix error return code in knav_queue_probe soc: ti: Fix reference imbalance in knav_dma_probe soc: ti: knav_qmss: fix reference leak in knav_queue_probe crypto: omap-aes - Fix PM disable depth imbalance in omap_aes_probe powerpc/feature: Fix CPU_FTRS_ALWAYS by removing CPU_FTRS_GENERIC_32 Input: ads7846 - fix unaligned access on 7845 Input: ads7846 - fix integer overflow on Rt calculation Input: ads7846 - fix race that causes missing releases drm/omap: dmm_tiler: fix return error code in omap_dmm_probe() media: solo6x10: fix missing snd_card_free in error handling case scsi: core: Fix VPD LUN ID designator priorities media: mtk-vcodec: add missing put_device() call in mtk_vcodec_release_dec_pm() staging: greybus: codecs: Fix reference counter leak in error handling MIPS: BCM47XX: fix kconfig dependency bug for BCM47XX_BCMA RDMa/mthca: Work around -Wenum-conversion warning ASoC: arizona: Fix a wrong free in wm8997_probe ASoC: wm8998: Fix PM disable depth imbalance on error mwifiex: fix mwifiex_shutdown_sw() causing sw reset failure spi: tegra114: fix reference leak in tegra spi ops spi: tegra20-sflash: fix reference leak in tegra_sflash_resume spi: tegra20-slink: fix reference leak in slink ops of tegra20 spi: spi-ti-qspi: fix reference leak in ti_qspi_setup Bluetooth: Fix null pointer dereference in hci_event_packet() arm64: dts: exynos: Correct psci compatible used on Exynos7 selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling ASoC: pcm: DRAIN support reactivation spi: img-spfi: fix reference leak in img_spfi_resume crypto: talitos - Fix return type of current_desc_hdr() sched: Reenable interrupts in do_sched_yield() sched/deadline: Fix sched_dl_global_validate() ARM: p2v: fix handling of LPAE translation in BE mode x86/mm/ident_map: Check for errors from ident_pud_init() RDMA/rxe: Compute PSN windows correctly selinux: fix error initialization in inode_doinit_with_dentry() RDMA/bnxt_re: Set queue pair state when being queried soc: mediatek: Check if power domains can be powered on at boot time soc: renesas: rmobile-sysc: Fix some leaks in rmobile_init_pm_domains() drm/gma500: fix double free of gma_connector Bluetooth: Fix slab-out-of-bounds read in hci_le_direct_adv_report_evt() md: fix a warning caused by a race between concurrent md_ioctl()s crypto: af_alg - avoid undefined behavior accessing salg_name media: msi2500: assign SPI bus number dynamically quota: Sanity-check quota file headers on load serial_core: Check for port state when tty is in error state HID: i2c-hid: add Vero K147 to descriptor override ARM: dts: exynos: fix USB 3.0 pins supply being turned off on Odroid XU ARM: dts: exynos: fix USB 3.0 VBUS control and over-current pins on Exynos5410 ARM: dts: exynos: fix roles of USB 3.0 ports on Odroid XU usb: chipidea: ci_hdrc_imx: Pass DISABLE_DEVICE_STREAMING flag to imx6ul USB: gadget: f_rndis: fix bitrate for SuperSpeed and above usb: gadget: f_fs: Re-use SS descriptors for SuperSpeedPlus USB: gadget: f_midi: setup SuperSpeed Plus descriptors USB: gadget: f_acm: add support for SuperSpeed Plus USB: serial: option: add interface-number sanity check to flag handling soc/tegra: fuse: Fix index bug in get_process_id dm table: Remove BUG_ON(in_interrupt()) scsi: mpt3sas: Increase IOCInit request timeout to 30s vxlan: Copy needed_tailroom from lowerdev vxlan: Add needed_headroom for lower device drm/tegra: sor: Disable clocks on error in tegra_sor_init() kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling RDMA/cm: Fix an attempt to use non-valid pointer when cleaning timewait can: softing: softing_netdev_open(): fix error handling scsi: bnx2i: Requires MMU gpio: mvebu: fix potential user-after-free on probe ARM: dts: sun8i: v3s: fix GIC node memory range pinctrl: baytrail: Avoid clearing debounce value when turning it off pinctrl: merrifield: Set default bias in case no particular value given drm: fix drm_dp_mst_port refcount leaks in drm_dp_mst_allocate_vcpi serial: 8250_omap: Avoid FIFO corruption caused by MDR1 access ALSA: pcm: oss: Fix potential out-of-bounds shift USB: sisusbvga: Make console support depend on BROKEN USB: UAS: introduce a quirk to set no_write_same xhci: Give USB2 ports time to enter U3 in bus suspend ALSA: usb-audio: Fix control 'access overflow' errors from chmap ALSA: usb-audio: Fix potential out-of-bounds shift USB: add RESET_RESUME quirk for Snapscan 1212 USB: dummy-hcd: Fix uninitialized array use in init() mac80211: mesh: fix mesh_pathtbl_init() error path net: bridge: vlan: fix error return code in __vlan_add() net: stmmac: dwmac-meson8b: fix mask definition of the m250_sel mux net: stmmac: delete the eee_ctrl_timer after napi disabled net/mlx4_en: Handle TX error CQE net/mlx4_en: Avoid scheduling restart task if it is already running tcp: fix cwnd-limited bug for TSO deferral where we send nothing net: stmmac: free tx skb buffer in stmmac_resume() PCI: qcom: Add missing reset for ipq806x x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP scsi: be2iscsi: Revert "Fix a theoretical leak in beiscsi_create_eqs()" kbuild: avoid static_assert for genksyms pinctrl: amd: remove debounce filter setting in IRQ type setting Input: i8042 - add Acer laptops to the i8042 reset list Input: cm109 - do not stomp on control URB platform/x86: acer-wmi: add automatic keyboard background light toggle key as KEY_LIGHTS_TOGGLE soc: fsl: dpio: Get the cpumask through cpumask_of(cpu) scsi: ufs: Make sure clk scaling happens only when HBA is runtime ACTIVE ARC: stack unwinding: don't assume non-current task is sleeping iwlwifi: mvm: fix kernel panic in case of assert during CSA arm64: dts: rockchip: Assign a fixed index to mmc devices on rk3399 boards. iwlwifi: pcie: limit memory read spin time spi: bcm2835aux: Restore err assignment in bcm2835aux_spi_probe spi: bcm2835aux: Fix use-after-free on unbind Linux 4.14.212 x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes Input: i8042 - fix error return code in i8042_setup_aux() i2c: qup: Fix error return code in qup_i2c_bam_schedule_desc() gfs2: check for empty rgrp tree in gfs2_ri_update tracing: Fix userstacktrace option for instances spi: bcm2835: Release the DMA channel if probe fails after dma_init spi: bcm2835: Fix use-after-free on unbind spi: bcm-qspi: Fix use-after-free on unbind spi: Introduce device-managed SPI controller allocation iommu/amd: Set DTE[IntTabLen] to represent 512 IRTEs speakup: Reject setting the speakup line discipline outside of speakup i2c: imx: Check for I2SR_IAL after every byte i2c: imx: Fix reset of I2SR_IAL flag mm/swapfile: do not sleep with a spin lock held cifs: fix potential use-after-free in cifs_echo_request() ftrace: Fix updating FTRACE_FL_TRAMP ALSA: hda/generic: Add option to enforce preferred_dacs pairs ALSA: hda/realtek - Add new codec supported for ALC897 tty: Fix ->session locking tty: Fix ->pgrp locking in tiocspgrp() USB: serial: option: fix Quectel BG96 matching USB: serial: option: add support for Thales Cinterion EXS82 USB: serial: option: add Fibocom NL668 variants USB: serial: ch341: sort device-id entries USB: serial: ch341: add new Product ID for CH341A USB: serial: kl5kusb105: fix memleak on open usb: gadget: f_fs: Use local copy of descriptors for userspace copy vlan: consolidate VLAN parsing code and limit max parsing depth pinctrl: baytrail: Fix pin being driven low for a while on gpiod_get(..., GPIOD_OUT_HIGH) pinctrl: baytrail: Replace WARN with dev_info_once when setting direct-irq pin to output ANDROID: Incremental fs: Set credentials before reading/writing ANDROID: Incremental fs: Fix incfs_test use of atol, open ANDROID: Incremental fs: Change per UID timeouts to microseconds ANDROID: Incremental fs: Add v2 feature flag ANDROID: Incremental fs: Add zstd feature flag Linux 4.14.211 RDMA/i40iw: Address an mmap handler exploit in i40iw Input: i8042 - add ByteSpeed touchpad to noloop table Input: xpad - support Ardwiino Controllers ALSA: usb-audio: US16x08: fix value count for level meters dt-bindings: net: correct interrupt flags in examples net/mlx5: Fix wrong address reclaim when command interface is down net: pasemi: fix error return code in pasemi_mac_open() cxgb3: fix error return code in t3_sge_alloc_qset() net/x25: prevent a couple of overflows ibmvnic: Fix TX completion error handling ibmvnic: Ensure that SCRQ entry reads are correctly ordered ipv4: Fix tos mask in inet_rtm_getroute() netfilter: bridge: reset skb->pkt_type after NF_INET_POST_ROUTING traversal bonding: wait for sysfs kobject destruction before freeing struct slave usbnet: ipheth: fix connectivity with iOS 14 tun: honor IOCB_NOWAIT flag tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control sock: set sk_err to ee_errno on dequeue from errq rose: Fix Null pointer dereference in rose_send_frame() net/af_iucv: set correct sk_protocol for child sockets ANDROID: Incremental fs: Fix 32-bit build UPSTREAM: arm64: sysreg: Clean up instructions for modifying PSTATE fields [CP] CHROMIUM: drm/virtio: rebase zero-copy patches to virgl/drm-misc-next Linux 4.14.210 USB: core: Fix regression in Hercules audio card USB: core: add endpoint-blacklist quirk x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb usb: gadget: Fix memleak in gadgetfs_fill_super usb: gadget: f_midi: Fix memleak in f_midi_alloc USB: core: Change %pK for __user pointers to %px perf probe: Fix to die_entrypc() returns error correctly can: m_can: fix nominal bitiming tseg2 min for version >= 3.1 platform/x86: toshiba_acpi: Fix the wrong variable assignment can: gs_usb: fix endianess problem with candleLight firmware efivarfs: revert "fix memory leak in efivarfs_create()" ibmvnic: fix NULL pointer dereference in ibmvic_reset_crq ibmvnic: fix NULL pointer dereference in reset_sub_crq_queues net: ena: set initial DMA width to avoid intel iommu issue nfc: s3fwrn5: use signed integer for parsing GPIO numbers IB/mthca: fix return value of error branch in mthca_init_cq() bnxt_en: Release PCI regions when DMA mask setup fails during probe. video: hyperv_fb: Fix the cache type when mapping the VRAM bnxt_en: fix error return code in bnxt_init_board() bnxt_en: fix error return code in bnxt_init_one() scsi: ufs: Fix race between shutdown and runtime resume flow batman-adv: set .owner to THIS_MODULE phy: tegra: xusb: Fix dangling pointer on probe failure perf/x86: fix sysfs type mismatches scsi: target: iscsi: Fix cmd abort fabric stop race scsi: libiscsi: Fix NOP race condition dmaengine: pl330: _prep_dma_memcpy: Fix wrong burst size nvme: free sq/cq dbbuf pointers when dbbuf set fails proc: don't allow async path resolution of /proc/self components HID: Add Logitech Dinovo Edge battery quirk x86/xen: don't unbind uninitialized lock_kicker_irq dmaengine: xilinx_dma: use readl_poll_timeout_atomic variant HID: hid-sensor-hub: Fix issue with devices with no report ID Input: i8042 - allow insmod to succeed on devices without an i8042 controller HID: cypress: Support Varmilo Keyboards' media hotkeys ALSA: hda/hdmi: fix incorrect locking in hdmi_pcm_close ALSA: hda/hdmi: Use single mutex unlock in error paths arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect() arm64: pgtable: Fix pte_accessible() btrfs: inode: Verify inode mode to avoid NULL pointer dereference btrfs: adjust return values of btrfs_inode_by_name btrfs: tree-checker: Enhance chunk checker to validate chunk profile PCI: Add device even if driver attach failed wireless: Use linux/stddef.h instead of stddef.h btrfs: fix lockdep splat when reading qgroup config on mount mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault() perf event: Check ref_reloc_sym before using it Linux 4.14.209 x86/microcode/intel: Check patch signature before saving microcode for early loading s390/dasd: fix null pointer dereference for ERP requests s390/cpum_sf.c: fix file permission for cpum_sfb_size mac80211: free sta in sta_info_insert_finish() on errors mac80211: minstrel: fix tx status processing corner case mac80211: minstrel: remove deferred sampling code xtensa: disable preemption around cache alias management calls regulator: workaround self-referent regulators regulator: avoid resolve_supply() infinite recursion regulator: fix memory leak with repeated set_machine_constraints() iio: accel: kxcjk1013: Add support for KIOX010A ACPI DSM for setting tablet-mode iio: accel: kxcjk1013: Replace is_smo8500_device with an acpi_type enum ext4: fix bogus warning in ext4_update_dx_flag() staging: rtl8723bs: Add 024c:0627 to the list of SDIO device-ids efivarfs: fix memory leak in efivarfs_create() tty: serial: imx: keep console clocks always on ALSA: mixart: Fix mutex deadlock ALSA: ctl: fix error path at adding user-defined element set speakup: Do not let the line discipline be used several times powerpc/uaccess-flush: fix missing includes in kup-radix.h libfs: fix error cast of negative value in simple_attr_write() xfs: revert "xfs: fix rmap key and record comparison functions" regulator: ti-abb: Fix array out of bound read access on the first transition MIPS: Alchemy: Fix memleak in alchemy_clk_setup_cpu ASoC: qcom: lpass-platform: Fix memory leak can: m_can: m_can_handle_state_change(): fix state change can: peak_usb: fix potential integer overflow on shift of a int can: mcba_usb: mcba_usb_start_xmit(): first fill skb, then pass to can_put_echo_skb() can: ti_hecc: Fix memleak in ti_hecc_probe can: dev: can_restart(): post buffer from the right context can: af_can: prevent potential access of uninitialized member in canfd_rcv() can: af_can: prevent potential access of uninitialized member in can_rcv() perf lock: Don't free "lock_seq_stat" if read_count isn't zero ARM: dts: imx50-evk: Fix the chip select 1 IOMUX arm: dts: imx6qdl-udoo: fix rgmii phy-mode for ksz9031 phy MIPS: export has_transparent_hugepage() for modules Input: adxl34x - clean up a data type in adxl34x_probe() vfs: remove lockdep bogosity in __sb_start_write arm64: psci: Avoid printing in cpu_psci_cpu_die() pinctrl: rockchip: enable gpio pclk for rockchip_gpio_to_irq net: ftgmac100: Fix crash when removing driver tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate net: usb: qmi_wwan: Set DTR quirk for MR400 net/mlx5: Disable QoS when min_rates on all VFs are zero sctp: change to hold/put transport for proto_unreach_timer qlcnic: fix error return code in qlcnic_83xx_restart_hw() net: x25: Increase refcnt of "struct x25_neigh" in x25_rx_call_request net/mlx4_core: Fix init_hca fields offset netlabel: fix an uninitialized warning in netlbl_unlabel_staticlist() netlabel: fix our progress tracking in netlbl_unlabel_staticlist() net: Have netpoll bring-up DSA management interface net: dsa: mv88e6xxx: Avoid VTU corruption on 6097 net: bridge: add missing counters to ndo_get_stats64 callback net: b44: fix error return code in b44_init_one() mlxsw: core: Use variable timeout for EMAD retries inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill() devlink: Add missing genlmsg_cancel() in devlink_nl_sb_port_pool_fill() bnxt_en: read EEPROM A2h address using page 0 atm: nicstar: Unmap DMA on send error ah6: fix error return code in ah6_input() ANDROID: Fix cuttlefish defconfigs now incfs uses zstd ANDROID: Incremental fs: Add zstd compression support ANDROID: Incremental fs: Small improvements ANDROID: Incremental fs: Initialize mount options correctly ANDROID: Incremental fs: Fix read_log_test which failed sporadically ANDROID: Incremental fs: Fix misuse of cpu_to_leXX and poll return ANDROID: Incremental fs: Add per UID read timeouts ANDROID: Incremental fs: Add .incomplete folder ANDROID: Incremental fs: Fix dangling else ANDROID: Incremental fs: Fix uninitialized variable ANDROID: Incremental fs: Fix filled block count from get filled blocks ANDROID: Incremental fs: Add hash block counts to IOC_IOCTL_GET_BLOCK_COUNT ANDROID: Incremental fs: Add INCFS_IOC_GET_BLOCK_COUNT ANDROID: Incremental fs: Make compatible with existing files ANDROID: Incremental fs: Remove block HASH flag ANDROID: Incremental fs: Remove back links and crcs ANDROID: Incremental fs: Remove attributes from file ANDROID: Incremental fs: Add .blocks_written file ANDROID: Incremental fs: Separate pseudo-file code ANDROID: Incremental fs: Add UID to pending_read ANDROID: Incremental fs: Create mapped file ANDROID: Incremental fs: Don't allow renaming .index directory. ANDROID: Incremental fs: Fix incfs to work on virtio-9p ANDROID: Incremental fs: Allow running a single test ANDROID: Incremental fs: Adding perf test ANDROID: Incremental fs: Stress tool ANDROID: Incremental fs: Use R/W locks to read/write segment blockmap. ANDROID: Incremental fs: Remove unnecessary dependencies ANDROID: Incremental fs: Remove annoying pr_debugs ANDROID: Incremental fs: dentry_revalidate should not return -EBADF. ANDROID: Incremental fs: Fix minor bugs ANDROID: Incremental fs: RCU locks instead of mutex for pending_reads. ANDROID: Incremental fs: fix up attempt to copy structures with READ/WRITE_ONCE Linux 4.14.208 ACPI: GED: fix -Wformat KVM: x86: clflushopt should be treated as a no-op by emulation can: proc: can_remove_proc(): silence remove_proc_entry warning mac80211: always wind down STA state Input: sunkbd - avoid use-after-free in teardown paths powerpc/8xx: Always fault when _PAGE_ACCESSED is not set gpio: mockup: fix resource leak in error path i2c: imx: Fix external abort on interrupt in exit paths i2c: imx: use clk notifier for rate changes powerpc/64s: flush L1D after user accesses powerpc/uaccess: Evaluate macro arguments once, before user access is allowed powerpc: Fix __clear_user() with KUAP enabled powerpc: Implement user_access_begin and friends powerpc: Add a framework for user access tracking powerpc/64s: flush L1D on kernel entry powerpc/64s: move some exception handlers out of line powerpc/64s: Define MASKABLE_RELON_EXCEPTION_PSERIES_OOL Linux 4.14.207 mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race Convert trailing spaces and periods in path components reboot: fix overflow parsing reboot cpu number Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint" perf/core: Fix race in the perf_mmap_close() function xen/events: block rogue events for some time xen/events: defer eoi in case of excessive number of events xen/events: use a common cpu hotplug hook for event channels xen/events: switch user event channels to lateeoi model xen/pciback: use lateeoi irq binding xen/pvcallsback: use lateeoi irq binding xen/scsiback: use lateeoi irq binding xen/netback: use lateeoi irq binding xen/blkback: use lateeoi irq binding xen/events: add a new "late EOI" evtchn framework xen/events: fix race in evtchn_fifo_unmask() xen/events: add a proper barrier to 2-level uevent unmasking xen/events: avoid removing an event channel while handling it perf/core: Fix a memory leak in perf_event_parse_addr_filter() perf/core: Fix crash when using HW tracing kernel filters perf/core: Fix bad use of igrab() x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP random32: make prandom_u32() output unpredictable net: Update window_clamp if SOCK_RCVBUF is set r8169: fix potential skb double free in an error path vrf: Fix fast path output packet handling with async Netfilter rules net/x25: Fix null-ptr-deref in x25_connect net/af_iucv: fix null pointer dereference on shutdown IPv6: Set SIT tunnel hard_header_len to zero swiotlb: fix "x86: Don't panic if can not alloc buffer for swiotlb" pinctrl: amd: fix incorrect way to disable debounce filter pinctrl: amd: use higher precision for 512 RtcClk drm/gma500: Fix out-of-bounds access to struct drm_device.vblank[] don't dump the threads that had been already exiting when zapped. selinux: Fix error return code in sel_ib_pkey_sid_slow() ocfs2: initialize ip_next_orphan futex: Don't enable IRQs unconditionally in put_pi_state() mei: protect mei_cl_mtu from null dereference usb: cdc-acm: Add DISABLE_ECHO for Renesas USB Download mode uio: Fix use-after-free in uio_unregister_device() thunderbolt: Add the missed ida_simple_remove() in ring_request_msix() ext4: unlock xattr_sem properly in ext4_inline_data_truncate() ext4: correctly report "not supported" for {usr,grp}jquota when !CONFIG_QUOTA perf: Fix get_recursion_context() cosa: Add missing kfree in error path of cosa_write of/address: Fix of_node memory leak in of_dma_is_coherent xfs: fix a missing unlock on error in xfs_fs_map_blocks xfs: fix rmap key and record comparison functions xfs: fix flags argument to rmap lookup when converting shared file rmaps nbd: fix a block_device refcount leak in nbd_release pinctrl: aspeed: Fix GPI only function problem. ARM: 9019/1: kprobes: Avoid fortify_panic() when copying optprobe template pinctrl: intel: Set default bias in case no particular value given iommu/amd: Increase interrupt remapping table limit to 512 entries scsi: scsi_dh_alua: Avoid crash during alua_bus_detach() cfg80211: regulatory: Fix inconsistent format argument mac80211: fix use of skb payload instead of header drm/amdgpu: perform srbm soft reset always on SDMA resume scsi: hpsa: Fix memory leak in hpsa_init_one() gfs2: check for live vs. read-only file system in gfs2_fitrim gfs2: Add missing truncate_inode_pages_final for sd_aspace gfs2: Free rd_bits later in gfs2_clear_rgrpd to fix use-after-free usb: gadget: goku_udc: fix potential crashes in probe ath9k_htc: Use appropriate rs_datalen type Btrfs: fix missing error return if writeback for extent buffer never started xfs: flush new eof page on truncate to avoid post-eof corruption can: peak_canfd: pucan_handle_can_rx(): fix echo management when loopback is on can: peak_usb: peak_usb_get_ts_time(): fix timestamp wrapping can: peak_usb: add range checking in decode operations can: can_create_echo_skb(): fix echo skb generation: always use skb_clone() can: dev: __can_get_echo_skb(): fix real payload length return value for RTR frames can: dev: can_get_echo_skb(): prevent call to kfree_skb() in hard IRQ context can: rx-offload: don't call kfree_skb() from IRQ context ALSA: hda: prevent undefined shift in snd_hdac_ext_bus_get_link() perf tools: Add missing swap for ino_generation net: xfrm: fix a race condition during allocing spi hv_balloon: disable warning when floor reached genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_HIERARCHY btrfs: reschedule when cloning lots of extents btrfs: sysfs: init devices outside of the chunk_mutex nbd: don't update block size after device is started time: Prevent undefined behaviour in timespec64_to_ns() mm: mempolicy: fix potential pte_unmap_unlock pte error ring-buffer: Fix recursion protection transitions between interrupt context regulator: defer probe when trying to get voltage from unresolved supply UPSTREAM: sched: idle: Avoid retaining the tick when it has been stopped UPSTREAM: cpuidle: menu: Handle stopped tick more aggressively UPSTREAM: staging: android: vsoc: fix copy_from_user overrun UPSTREAM: ipv6: ndisc: RFC-ietf-6man-ra-pref64-09 is now published as RFC8781 BACKPORT: drm/virtio: fix missing dma_fence_put() in virtio_gpu_execbuffer_ioctl() Conflicts: drivers/scsi/ufs/ufshcd.c drivers/soc/qcom/smp2p.c drivers/usb/gadget/function/f_accessory.c drivers/usb/gadget/function/f_fs.c drivers/usb/gadget/function/f_uac2.c Change-Id: I7ec83fef94dcc943abd858eb81e4a78983faae8c	2021-01-22 21:56:05 +05:30
Johannes Weiner	aa93e9471a	mm: memcontrol: fix excessive complexity in memory.stat reporting commit a983b5ebee57209c99f68c8327072f25e0e6e3da upstream We've seen memory.stat reads in top-level cgroups take up to fourteen seconds during a userspace bug that created tens of thousands of ghost cgroups pinned by lingering page cache. Even with a more reasonable number of cgroups, aggregating memory.stat is unnecessarily heavy. The complexity is this: nr_cgroups * nr_stat_items * nr_possible_cpus where the stat items are ~70 at this point. With 128 cgroups and 128 CPUs - decent, not enormous setups - reading the top-level memory.stat has to aggregate over a million per-cpu counters. This doesn't scale. Instead of spreading the source of truth across all CPUs, use the per-cpu counters merely to batch updates to shared atomic counters. This is the same as the per-cpu stocks we use for charging memory to the shared atomic page_counters, and also the way the global vmstat counters are implemented. Vmstat has elaborate spilling thresholds that depend on the number of CPUs, amount of memory, and memory pressure - carefully balancing the cost of counter updates with the amount of per-cpu error. That's because the vmstat counters are system-wide, but also used for decisions inside the kernel (e.g. NR_FREE_PAGES in the allocator). Neither is true for the memory controller. Use the same static batch size we already use for page_counter updates during charging. The per-cpu error in the stats will be 128k, which is an acceptable ratio of cores to memory accounting granularity. [hannes@cmpxchg.org: fix warning in __this_cpu_xchg() calls] Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org [shaoyi@amazon.com: resolved the conflict brought by commit `17ffa29c35` in mm/memcontrol.c by contextual fix] Signed-off-by: Shaoying Xu <shaoyi@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-01-09 13:37:37 +01:00
Johannes Weiner	955f8bc9eb	mm: memcontrol: implement lruvec stat functions on top of each other commit 284542656e22c43fdada8c8cc0ca9ede8453eed7 upstream The implementation of the lruvec stat functions and their variants for accounting through a page, or accounting from a preemptible context, are mostly identical and needlessly repetitive. Implement the lruvec_page functions by looking up the page's lruvec and then using the lruvec function. Implement the functions for preemptible contexts by disabling preemption before calling the atomic context functions. Link: http://lkml.kernel.org/r/20171103153336.24044-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Signed-off-by: Shaoying Xu <shaoyi@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-01-09 13:37:37 +01:00
Johannes Weiner	fdcda71d87	mm: memcontrol: eliminate raw access to stat and event counters commit c9019e9bf42e66d028d70d2da6206cad4dd9250d upstream Replace all raw 'this_cpu_' modifications of the stat and event per-cpu counters with API functions such as mod_memcg_state(). This makes the code easier to read, but is also in preparation for the next patch, which changes the per-cpu implementation of those counters. Link: http://lkml.kernel.org/r/20171103153336.24044-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Shaoying Xu <shaoyi@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-01-09 13:37:37 +01:00
Michal Hocko	e52a7c1cb9	memcg, oom: move out_of_memory back to the charge path Commit `3812c8c8f3` ("mm: memcg: do not trap chargers with full callstack on OOM") has changed the ENOMEM semantic of memcg charges. Rather than invoking the oom killer from the charging context it delays the oom killer to the page fault path (pagefault_out_of_memory). This in turn means that many users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg hits the hard limit and the memcg is is OOM. This is behavior is inconsistent with !memcg case where the oom killer is invoked from the allocation context and the allocator keeps retrying until it succeeds. The difference in the behavior is user visible. mmap(MAP_POPULATE) might result in not fully populated ranges while the mmap return code doesn't tell that to the userspace. Random syscalls might fail with ENOMEM etc. The primary motivation of the different memcg oom semantic was the deadlock avoidance. Things have changed since then, though. We have an async oom teardown by the oom reaper now and so we do not have to rely on the victim to tear down its memory anymore. Therefore we can return to the original semantic as long as the memcg oom killer is not handed over to the users space. There is still one thing to be careful about here though. If the oom killer is not able to make any forward progress - e.g. because there is no eligible task to kill - then we have to bail out of the charge path to prevent from same class of deadlocks. We have basically two options here. Either we fail the charge with ENOMEM or force the charge and allow overcharge. The first option has been considered more harmful than useful because rare inconsistencies in the ENOMEM behavior is hard to test for and error prone. Basically the same reason why the page allocator doesn't fail allocations under such conditions. The later might allow runaways but those should be really unlikely unless somebody misconfigures the system. E.g. allowing to migrate tasks away from the memcg to a different unlimited memcg with move_charge_at_immigrate disabled. Change-Id: Ib861df994357b907df37705bf626d405320e2b06 Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Git-commit: 29ef680ae7c21110af8e6416d84d8a72fc147b14 Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org>	2019-05-22 22:05:36 -07:00
Matthias Kaehlcke	04fecbf51b	mm: memcontrol: use int for event/state parameter in several functions Several functions use an enum type as parameter for an event/state, but are called in some locations with an argument of a different enum type. Adjust the interface of these functions to reality by changing the parameter to int. This fixes a ton of enum-conversion warnings that are generated when building the kernel with clang. [mka@chromium.org: also change parameter type of inc/dec/mod_memcg_page_state()] Link: http://lkml.kernel.org/r/20170728213442.93823-1-mka@chromium.org Link: http://lkml.kernel.org/r/20170727211004.34435-1-mka@chromium.org Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Doug Anderson <dianders@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-09-06 17:27:27 -07:00
Johannes Weiner	739f79fc9d	mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() Jaegeuk and Brad report a NULL pointer crash when writeback ending tries to update the memcg stats: BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0 IP: test_clear_page_writeback+0x12e/0x2c0 [...] RIP: 0010:test_clear_page_writeback+0x12e/0x2c0 Call Trace: <IRQ> end_page_writeback+0x47/0x70 f2fs_write_end_io+0x76/0x180 [f2fs] bio_endio+0x9f/0x120 blk_update_request+0xa8/0x2f0 scsi_end_request+0x39/0x1d0 scsi_io_completion+0x211/0x690 scsi_finish_command+0xd9/0x120 scsi_softirq_done+0x127/0x150 __blk_mq_complete_request_remote+0x13/0x20 flush_smp_call_function_queue+0x56/0x110 generic_smp_call_function_single_interrupt+0x13/0x30 smp_call_function_single_interrupt+0x27/0x40 call_function_single_interrupt+0x89/0x90 RIP: 0010:native_safe_halt+0x6/0x10 (gdb) l (test_clear_page_writeback+0x12e) 0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619). 614 mod_node_page_state(page_pgdat(page), idx, val); 615 if (mem_cgroup_disabled() \|\| !page->mem_cgroup) 616 return; 617 mod_memcg_state(page->mem_cgroup, idx, val); 618 pn = page->mem_cgroup->nodeinfo[page_to_nid(page)]; 619 this_cpu_add(pn->lruvec_stat->count[idx], val); 620 } 621 622 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t pgdat, int order, 623 gfp_t gfp_mask, The issue is that writeback doesn't hold a page reference and the page might get freed after PG_writeback is cleared (and the mapping is unlocked) in test_clear_page_writeback(). The stat functions looking up the page's node or zone are safe, as those attributes are static across allocation and free cycles. But page->mem_cgroup is not, and it will get cleared if we race with truncation or migration. It appears this race window has been around for a while, but less likely to trigger when the memcg stats were updated first thing after PG_writeback is cleared. Recent changes reshuffled this code to update the global node stats before the memcg ones, though, stretching the race window out to an extent where people can reproduce the problem. Update test_clear_page_writeback() to look up and pin page->mem_cgroup before clearing PG_writeback, then not use that pointer afterward. It is a partial revert of `62cccb8c8e` ("mm: simplify lock_page_memcg()") but leaves the pageref-holding callsites that aren't affected alone. Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org Fixes: `62cccb8c8e` ("mm: simplify lock_page_memcg()") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Jaegeuk Kim <jaegeuk@kernel.org> Tested-by: Jaegeuk Kim <jaegeuk@kernel.org> Reported-by: Bradley Bolen <bradleybolen@gmail.com> Tested-by: Brad Bolen <bradleybolen@gmail.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> [4.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-18 15:32:01 -07:00
Johannes Weiner	00f3ca2c2d	mm: memcontrol: per-lruvec stats infrastructure lruvecs are at the intersection of the NUMA node and memcg, which is the scope for most paging activity. Introduce a convenient accounting infrastructure that maintains statistics per node, per memcg, and the lruvec itself. Then convert over accounting sites for statistics that are already tracked in both nodes and memcgs and can be easily switched. [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code] Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org [hannes@cmpxchg.org: don't track uncharged pages at all Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org [hannes@cmpxchg.org: add missing free_percpu()] Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org [linux@roeck-us.net: hexagon: fix build error caused by include file order] Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:35 -07:00
Johannes Weiner	ed52be7bfd	mm: memcontrol: use generic mod_memcg_page_state for kmem pages The kmem-specific functions do the same thing. Switch and drop. Link: http://lkml.kernel.org/r/20170530181724.27197-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:35 -07:00
Johannes Weiner	320492961c	mm: memcontrol: use the node-native slab memory counters Now that the slab counters are moved from the zone to the node level we can drop the private memcg node stats and use the official ones. Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:35 -07:00
Konstantin Khlebnikov	8e675f7af5	mm/oom_kill: count global and memory cgroup oom kills Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob "memory.events" (in memory.oom_control for v1 cgroup). Also describe difference between "oom" and "oom_kill" in memory cgroup documentation. Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault. These counters helps in monitoring oom kills - for now the only way is grepping for magic words in kernel log. [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename] [akpm@linux-foundation.org: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Roman Guschin <guroan@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:35 -07:00
Roman Gushchin	2262185c5b	mm: per-cgroup memory reclaim stats Track the following reclaim counters for every memory cgroup: PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED. These values are exposed using the memory.stats interface of cgroup v2. The meaning of each value is the same as for global counters, available using /proc/vmstat. Also, for consistency, rename mem_cgroup_count_vm_event() to count_memcg_event_mm(). Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:35 -07:00
Johannes Weiner	ccda7f4360	mm: memcontrol: use node page state naming scheme for memcg The memory controllers stat function names are awkwardly long and arbitrarily different from the zone and node stat functions. The current interface is named: mem_cgroup_read_stat() mem_cgroup_update_stat() mem_cgroup_inc_stat() mem_cgroup_dec_stat() mem_cgroup_update_page_stat() mem_cgroup_inc_page_stat() mem_cgroup_dec_page_stat() This patch renames it to match the corresponding node stat functions: memcg_page_state() [node_page_state()] mod_memcg_state() [mod_node_state()] inc_memcg_state() [inc_node_state()] dec_memcg_state() [dec_node_state()] mod_memcg_page_state() [mod_node_page_state()] inc_memcg_page_state() [inc_node_page_state()] dec_memcg_page_state() [dec_node_page_state()] Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:11 -07:00
Johannes Weiner	71cd31135d	mm: memcontrol: re-use node VM page state enum The current duplication is a high-maintenance mess, and it's painful to add new items or query memcg state from the rest of the VM. This increases the size of the stat array marginally, but we should aim to track all these stats on a per-cgroup level anyway. Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:11 -07:00
Johannes Weiner	df0e53d061	mm: memcontrol: re-use global VM event enum The current duplication is a high-maintenance mess, and it's painful to add new items. This increases the size of the event array, but we'll eventually want most of the VM events tracked on a per-cgroup basis anyway. Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:11 -07:00
Johannes Weiner	31176c7815	mm: memcontrol: clean up memory.events counting function We only ever count single events, drop the @nr parameter. Rename the function accordingly. Remove low-information kerneldoc. Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:11 -07:00
Johannes Weiner	2a2e48854d	mm: vmscan: fix IO/refault regression in cache workingset transition Since commit `59dc76b0d4` ("mm: vmscan: reduce size of inactive file list") we noticed bigger IO spikes during changes in cache access patterns. The patch in question shrunk the inactive list size to leave more room for the current workingset in the presence of streaming IO. However, workingset transitions that previously happened on the inactive list are now pushed out of memory and incur more refaults to complete. This patch disables active list protection when refaults are being observed. This accelerates workingset transitions, and allows more of the new set to establish itself from memory, without eating into the ability to protect the established workingset during stable periods. The workloads that were measurably affected for us were hit pretty bad by it, with refault/majfault rates doubling and tripling during cache transitions, and the machines sustaining half-hour periods of 100% IO utilization, where they'd previously have sub-minute peaks at 60-90%. Stateful services that handle user data tend to be more conservative with kernel upgrades. As a result we hit most page cache issues with some delay, as was the case here. The severity seemed to warrant a stable tag. Fixes: `59dc76b0d4` ("mm: vmscan: reduce size of inactive file list") Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:11 -07:00
Johannes Weiner	9a4caf1e9f	mm: memcontrol: provide shmem statistics Cgroups currently don't report how much shmem they use, which can be useful data to have, in particular since shmem is included in the cache/file item while being reclaimed like anonymous memory. Add a counter to track shmem pages during charging and uncharging. Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Chris Down <cdown@fb.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:08 -07:00
Johannes Weiner	553af430e7	mm: rmap: fix huge file mmap accounting in the memcg stats Huge pages are accounted as single units in the memcg's "file_mapped" counter. Account the correct number of base pages, like we do in the corresponding node counter. Link: http://lkml.kernel.org/r/20170322005111.3156-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-03-31 17:13:30 -07:00
Tejun Heo	17cc4dfeda	slab: use memcg_kmem_cache_wq for slab destruction operations If there's contention on slab_mutex, queueing the per-cache destruction work item on the system_wq can unnecessarily create and tie up a lot of kworkers. Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it global and use that workqueue for the destruction work items too. While at it, convert the workqueue from an unbound workqueue to a per-cpu one with concurrency limited to 1. It's generally preferable to use per-cpu workqueues and concurrency limit of 1 is safe enough. This is suggested by Joonsoo Kim. Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:27 -08:00
Tejun Heo	bc2791f857	slab: link memcg kmem_caches on their associated memory cgroup With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. While a memcg kmem_cache is listed on its root cache's ->children list, there is no direct way to iterate all kmem_caches which are assocaited with a memory cgroup. The only way to iterate them is walking all caches while filtering out caches which don't match, which would be most of them. This makes memcg destruction operations O(N^2) where N is the total number of slab caches which can be huge. This combined with the synchronous RCU operations can tie up a CPU and affect the whole machine for many hours when memory reclaim triggers offlining and destruction of the stale memcgs. This patch adds mem_cgroup->kmem_caches list which goes through memcg_cache_params->kmem_caches_node of all kmem_caches which are associated with the memcg. All memcg specific iterations, including stat file access, are updated to use the new list instead. Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:27 -08:00
Michal Hocko	b4536f0c82	mm, memcg: fix the active list aging for lowmem requests when memcg is enabled Nils Holland and Klaus Ethgen have reported unexpected OOM killer invocations with 32b kernel starting with 4.8 kernels kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS\|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0 kworker/u4:5 cpuset=/ mems_allowed=0 CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2 [...] Mem-Info: active_anon:58685 inactive_anon:90 isolated_anon:0 active_file:274324 inactive_file:281962 isolated_file:0 unevictable:0 dirty:649 writeback:0 unstable:0 slab_reclaimable:40662 slab_unreclaimable:17754 mapped:7382 shmem:202 pagetables:351 bounce:0 free:206736 free_pcp:332 free_cma:0 Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 813 3474 3474 Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB lowmem_reserve[]: 0 0 21292 21292 HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB the oom killer is clearly pre-mature because there there is still a lot of page cache in the zone Normal which should satisfy this lowmem request. Further debugging has shown that the reclaim cannot make any forward progress because the page cache is hidden in the active list which doesn't get rotated because inactive_list_is_low is not memcg aware. The code simply subtracts per-zone highmem counters from the respective memcg's lru sizes which doesn't make any sense. We can simply end up always seeing the resulting active and inactive counts 0 and return false. This issue is not limited to 32b kernels but in practice the effect on systems without CONFIG_HIGHMEM would be much harder to notice because we do not invoke the OOM killer for allocations requests targeting < ZONE_NORMAL. Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node and subtract per-memcg highmem counts when memcg is enabled. Introduce helper lruvec_zone_lru_size which redirects to either zone counters or mem_cgroup_get_zone_lru_size when appropriate. We are losing empty LRU but non-zero lru size detection introduced by `ca707239e8` ("mm: update_lru_size warn and reset bad lru_size") because of the inherent zone vs. node discrepancy. Fixes: `f8d1a31163` ("mm: consider whether to decivate based on eligible zones inactive ratio") Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Nils Holland <nholland@tisys.org> Tested-by: Nils Holland <nholland@tisys.org> Reported-by: Klaus Ethgen <Klaus@Ethgen.de> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-01-10 18:31:55 -08:00
Johannes Weiner	2d75807383	mm: memcontrol: consolidate cgroup socket tracking The cgroup core and the memory controller need to track socket ownership for different purposes, but the tracking sites being entirely different is kind of ugly. Be a better citizen and rename the memory controller callbacks to match the cgroup core callbacks, then move them to the same place. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:29 -07:00
Vladimir Davydov	7c5f64f844	mm: oom: deduplicate victim selection code for memcg and global oom When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom. The only difference is the scope of tasks to select the victim from. So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it. That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory. Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks). Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:26 -07:00
Andy Lutomirski	efdc949079	mm: fix memcg stack accounting for sub-page stacks We should account for stacks regardless of stack size, and we need to account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units to kilobytes and Move it into account_kernel_stack(). Fixes: `12580e4b54` ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat") Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	7ee36a14f0	mm, vmscan: Update all zone LRU sizes before updating memcg Minchan Kim reported setting the following warning on a 32-bit system although it can affect 64-bit systems. WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110 mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty Modules linked in: CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x76/0xaf __warn+0xea/0x110 ? mem_cgroup_update_lru_size+0x103/0x110 warn_slowpath_fmt+0x3b/0x40 mem_cgroup_update_lru_size+0x103/0x110 isolate_lru_pages.isra.61+0x2e2/0x360 shrink_active_list+0xac/0x2a0 ? __delay+0xe/0x10 shrink_node_memcg+0x53c/0x7a0 shrink_node+0xab/0x2a0 do_try_to_free_pages+0xc6/0x390 try_to_free_pages+0x245/0x590 LRU list contents and counts are updated separately. Counts are updated before pages are added to the LRU and updated after pages are removed. The warning above is from a check in mem_cgroup_update_lru_size that ensures that list sizes of zero are empty. The problem is that node-lru needs to account for highmem pages if CONFIG_HIGHMEM is set. One impact of the implementation is that the sizes are updated in multiple passes when pages from multiple zones were isolated. This happens whether HIGHMEM is set or not. When multiple zones are isolated, it's possible for a debugging check in memcg to be tripped. This patch forces all the zone counts to be updated before the memcg function is called. Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Reported-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	ef8f232799	mm, memcg: move memcg limit enforcement from zones to nodes Memcg needs adjustment after moving LRUs to the node. Limits are tracked per memcg but the soft-limit excess is tracked per zone. As global page reclaim is based on the node, it is easy to imagine a situation where a zone soft limit is exceeded even though the memcg limit is fine. This patch moves the soft limit tree the node. Technically, all the variable names should also change but people are already familiar by the meaning of "mz" even if "mn" would be a more appropriate name now. Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	a9dd0a8310	mm, vmscan: make shrink_node decisions more node-centric Earlier patches focused on having direct reclaim and kswapd use data that is node-centric for reclaiming but shrink_node() itself still uses too much zone information. This patch removes unnecessary zone-based information with the most important decision being whether to continue reclaim or not. Some memcg APIs are adjusted as a result even though memcg itself still uses some zone information. [mgorman@techsingularity.net: optimization] Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	599d0c954f	mm, vmscan: move LRU lists to node This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Johannes Weiner	55779ec759	mm: fix vm-scalability regression in cgroup-aware workingset code Commit `23047a96d7` ("mm: workingset: per-cgroup cache thrash detection") added a page->mem_cgroup lookup to the cache eviction, refault, and activation paths, as well as locking to the activation path, and the vm-scalability tests showed a regression of -23%. While the test in question is an artificial worst-case scenario that doesn't occur in real workloads - reading two sparse files in parallel at full CPU speed just to hammer the LRU paths - there is still some optimizations that can be done in those paths. Inline the lookup functions to eliminate calls. Also, page->mem_cgroup doesn't need to be stabilized when counting an activation; we merely need to hold the RCU lock to prevent the memcg from being freed. This cuts down on overhead quite a bit: `23047a96d7` 063f6715e77a7be5770d6081fe ---------------- -------------------------- %stddev %change %stddev \ \| \ 21621405 +- 0% +11.3% 24069657 +- 2% vm-scalability.throughput [linux@roeck-us.net: drop unnecessary include file] [hannes@cmpxchg.org: add WARN_ON_ONCE()s] Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org Reported-by: Ye Xiaolong <xiaolong.ye@intel.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Vladimir Davydov	452647784b	mm: memcontrol: cleanup kmem charge functions - Handle memcg_kmem_enabled check out to the caller. This reduces the number of function definitions making the code easier to follow. At the same time it doesn't result in code bloat, because all of these functions are used only in one or two places. - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't have to dive deep into memcg implementation to see which allocations are charged and which are not. - Refresh comments. Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-26 16:19:19 -07:00
Johannes Weiner	73f576c04b	mm: memcontrol: fix cgroup creation failure after many small jobs The memory controller has quite a bit of state that usually outlives the cgroup and pins its CSS until said state disappears. At the same time it imposes a 16-bit limit on the CSS ID space to economically store IDs in the wild. Consequently, when we use cgroups to contain frequent but small and short-lived jobs that leave behind some page cache, we quickly run into the 64k limitations of outstanding CSSs. Creating a new cgroup fails with -ENOSPC while there are only a few, or even no user-visible cgroups in existence. Although pinning CSSs past cgroup removal is common, there are only two instances that actually need an ID after a cgroup is deleted: cache shadow entries and swapout records. Cache shadow entries reference the ID weakly and can deal with the CSS having disappeared when it's looked up later. They pose no hurdle. Swap-out records do need to pin the css to hierarchically attribute swapins after the cgroup has been deleted; though the only pages that remain swapped out after offlining are tmpfs/shmem pages. And those references are under the user's control, so they are manageable. This patch introduces a private 16-bit memcg ID and switches swap and cache shadow entries over to using that. This ID can then be recycled after offlining when the CSS remains pinned only by objects that don't specifically need it. This script demonstrates the problem by faulting one cache page in a new cgroup and deleting it again: set -e mkdir -p pages for x in `seq 128000`; do [ $((x % 1000)) -eq 0 ] && echo $x mkdir /cgroup/foo echo $$ >/cgroup/foo/cgroup.procs echo trex >pages/$x echo $$ >/cgroup/cgroup.procs rmdir /cgroup/foo done When run on an unpatched kernel, we eventually run out of possible IDs even though there are no visible cgroups: [root@ham ~]# ./cssidstress.sh [...] 65000 mkdir: cannot create directory '/cgroup/foo': No space left on device After this patch, the IDs get released upon cgroup destruction and the cache and css objects get released once memory reclaim kicks in. [hannes@cmpxchg.org: init the IDR] Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org Fixes: `b2052564e6` ("mm: memcontrol: continue cache reclaim from offlined groups") Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: John Garcia <john.garcia@mesosphere.io> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Nikolay Borisov <kernel@kyup.com> Cc: <stable@vger.kernel.org> [3.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-23 10:25:54 +09:00
Rik van Riel	59dc76b0d4	mm: vmscan: reduce size of inactive file list The inactive file list should still be large enough to contain readahead windows and freshly written file data, but it no longer is the only source for detecting multiple accesses to file pages. The workingset refault measurement code causes recently evicted file pages that get accessed again after a shorter interval to be promoted directly to the active list. With that mechanism in place, we can afford to (on a larger system) dedicate more memory to the active file list, so we can actually cache more of the frequently used file pages in memory, and not have them pushed out by streaming writes, once-used streaming file reads, etc. This can help things like database workloads, where only half the page cache can currently be used to cache the database working set. This patch automatically increases that fraction on larger systems, using the same ratio that has already been used for anonymous memory. [hannes@cmpxchg.org: cgroup-awareness] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-05-20 17:58:30 -07:00
Hugh Dickins	9d5e6a9f22	mm: update_lru_size do the __mod_zone_page_state Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy reclaim was removed) that lru_size can be updated by -nr_taken once per call to isolate_lru_pages(), instead of page by page. Update it inside isolate_lru_pages(), or at its two callsites? I chose to update it at the callsites, rearranging and grouping the updates by nr_taken and nr_scanned together in both. With one exception, mem_cgroup_update_lru_size(,lru,) is then used where __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding some more calls in a future commit. Make the code a little smaller and simpler by incorporating stat update in lru_size update. The exception was move_active_pages_to_lru(), which aggregated the pgmoved stat update separately from the individual lru_size updates; but I still think this a simplification worth making. However, the __mod_zone_page_state is not peculiar to mem_cgroups: so better use the name update_lru_size, calls mem_cgroup_update_lru_size when CONFIG_MEMCG. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-05-19 19:12:14 -07:00
Vladimir Davydov	0a6b76dd23	mm: workingset: make shadow node shrinker memcg aware Workingset code was recently made memcg aware, but shadow node shrinker is still global. As a result, one small cgroup can consume all memory available for shadow nodes, possibly hurting other cgroups by reclaiming their shadow nodes, even though reclaim distances stored in its shadow nodes have no effect. To avoid this, we need to make shadow node shrinker memcg aware. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-03-17 15:09:34 -07:00
Vladimir Davydov	b6ecd2dea4	mm: memcontrol: zap memcg_kmem_online helper As kmem accounting is now either enabled for all cgroups or disabled system-wide, there's no point in having memcg_kmem_online() helper - instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as shrink_slab() now does. There are only two places left where this helper is used - __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can only be called if memcg_kmem_enabled() returned true. Since the cgroup it operates on is online, mem_cgroup_is_root() check will be enough. memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead of memcg_kmem_online(), because it relies on the fact that in memcg_offline_kmem() memcg->kmem_state is changed before memcg_deactivate_kmem_caches() is called, but there we can just open-code the check. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-03-17 15:09:34 -07:00
Vladimir Davydov	12580e4b54	mm: memcontrol: report kernel stack usage in cgroup2 memory.stat Show how much memory is allocated to kernel stacks. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-03-17 15:09:34 -07:00

1 2 3 4 5 ...

290 Commits