70f98fe87ab5b332fa6308ae9f05da170d65e9f6
290 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c0367061af |
FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.
When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[5.5, 7.5]%
Ops/sec KB/sec
patch1-7: 1014393.57 39455.42
patch1-8: 1078507.59 41949.15
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
patch1-8
47.02% lzo1x_1_do_compress (real work)
6.73% page_vma_mapped_walk
6.14% _raw_spin_unlock_irq
3.39% walk_pte_range
2.63% ptep_clear_flush
2.29% __zram_bvec_write
2.10% do_raw_spin_lock
1.81% memmove
1.73% obj_malloc
1.53% free_unref_page_list
Configurations:
no change
[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo
Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
|
||
|
|
09a2a70211 |
FROMLIST: mm: multi-gen LRU: exploit locality in rmap
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.
This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[4, 6]%
Ops/sec KB/sec
patch1-6: 964656.80 37520.88
patch1-7: 1014393.57 39455.42
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
Configurations:
no change
Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
|
||
|
|
42440bf2f0 |
FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
Server benchmark results:
Single workload:
fio (buffered I/O): +[38, 40]%
IOPS BW
5.18-ed4643521e6a: 2547k 9989MiB/s
patch1-6: 3540k 13.5GiB/s
Single workload:
memcached (anon): +[103, 107]%
Ops/sec KB/sec
5.18-ed4643521e6a: 469048.66 18243.91
patch1-6: 964656.80 37520.88
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.18-ed4643521e6a
39.56% page_vma_mapped_walk
19.32% lzo1x_1_do_compress (real work)
7.18% do_raw_spin_lock
4.23% _raw_spin_unlock_irq
2.26% vma_interval_tree_subtree_search
2.12% vma_interval_tree_iter_next
2.11% folio_referenced_one
1.90% anon_vma_interval_tree_iter_first
1.47% ptep_clear_flush
0.97% __anon_vma_interval_tree_subtree_search
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
Chrome OS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
|
||
|
|
0cca5fb93e |
Revert "Merge branch 'dev/mglru' into dev/12-2"
This reverts commit |
||
|
|
1e84d38423 |
FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.
NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.
To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.
An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.
When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.
This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
page table walkers can skip processes that have been sleeping since
the last iteration.
2. It uses generational Bloom filters to record populated branches so
that page table walkers can reduce their search space based on the
query results, e.g., to skip page tables containing mostly holes or
misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
spanning multiple VMAs. IOW, it finishes all the VMAs within the
range of the same PMD table before it returns to a PGD table. This
improves the cache performance for workloads that have large
numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[5.5, 7.5]%
Ops/sec KB/sec
patch1-7: 1014393.57 39455.42
patch1-8: 1078507.59 41949.15
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
patch1-8
47.02% lzo1x_1_do_compress (real work)
6.73% page_vma_mapped_walk
6.14% _raw_spin_unlock_irq
3.39% walk_pte_range
2.63% ptep_clear_flush
2.29% __zram_bvec_write
2.10% do_raw_spin_lock
1.81% memmove
1.73% obj_malloc
1.53% free_unref_page_list
Configurations:
no change
[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo
Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
|
||
|
|
1bb75d617f |
FROMLIST: mm: multi-gen LRU: exploit locality in rmap
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.
This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
Server benchmark results:
Single workload:
fio (buffered I/O): no change
Single workload:
memcached (anon): +[4, 6]%
Ops/sec KB/sec
patch1-6: 964656.80 37520.88
patch1-7: 1014393.57 39455.42
Configurations:
no change
Client benchmark results:
kswapd profiles:
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
patch1-7
45.54% lzo1x_1_do_compress (real work)
9.56% page_vma_mapped_walk
6.70% _raw_spin_unlock_irq
2.78% ptep_clear_flush
2.47% do_raw_spin_lock
2.22% __zram_bvec_write
1.87% lru_gen_look_around
1.78% memmove
1.77% obj_malloc
1.44% free_unref_page_list
Configurations:
no change
Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
|
||
|
|
ed95143ac2 |
FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.
The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().
The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.
Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
inferring whether pages accessed multiple times through file
descriptors are statistically hot and thus worth protecting in the
eviction path.
2. It takes pages accessed through page tables into account and avoids
overprotecting pages accessed multiple times through file
descriptors. (Pages accessed through page tables are in the first
tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
twice through file descriptors, when under heavy buffered I/O
workloads.
Server benchmark results:
Single workload:
fio (buffered I/O): +[38, 40]%
IOPS BW
5.18-ed4643521e6a: 2547k 9989MiB/s
patch1-6: 3540k 13.5GiB/s
Single workload:
memcached (anon): +[103, 107]%
Ops/sec KB/sec
5.18-ed4643521e6a: 469048.66 18243.91
patch1-6: 964656.80 37520.88
Configurations:
CPU: two Xeon 6154
Mem: total 256G
Node 1 was only used as a ram disk to reduce the variance in the
results.
patch drivers/block/brd.c <<EOF
99,100c99,100
< gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
< page = alloc_page(gfp_flags);
---
> gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
> page = alloc_pages_node(1, gfp_flags, 0);
EOF
cat >>/etc/systemd/system.conf <<EOF
CPUAffinity=numa
NUMAPolicy=bind
NUMAMask=0
EOF
cat >>/etc/memcached.conf <<EOF
-m 184320
-s /var/run/memcached/memcached.sock
-a 0766
-t 36
-B binary
EOF
cat fio.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkfs.ext4 /dev/ram0
mount -t ext4 /dev/ram0 /mnt
mkdir /sys/fs/cgroup/user.slice/test
echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=10m --runtime=5m --group_reporting
cat memcached.sh
modprobe brd rd_nr=1 rd_size=113246208
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
--ratio 1:0 --pipeline 8 -d 2000
memtier_benchmark -S /var/run/memcached/memcached.sock \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
--ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
Client benchmark results:
kswapd profiles:
5.18-ed4643521e6a
39.56% page_vma_mapped_walk
19.32% lzo1x_1_do_compress (real work)
7.18% do_raw_spin_lock
4.23% _raw_spin_unlock_irq
2.26% vma_interval_tree_subtree_search
2.12% vma_interval_tree_iter_next
2.11% folio_referenced_one
1.90% anon_vma_interval_tree_iter_first
1.47% ptep_clear_flush
0.97% __anon_vma_interval_tree_subtree_search
patch1-6
36.13% lzo1x_1_do_compress (real work)
19.16% page_vma_mapped_walk
6.55% _raw_spin_unlock_irq
4.02% do_raw_spin_lock
2.32% anon_vma_interval_tree_iter_first
2.11% ptep_clear_flush
1.76% __zram_bvec_write
1.64% folio_referenced_one
1.40% memmove
1.35% obj_malloc
Configurations:
CPU: single Snapdragon 7c
Mem: total 4G
Chrome OS MemoryPressure [1]
[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
|
||
|
|
f5daca7dd9 |
Merge remote-tracking branch 'aosp/android-4.14-stable' into android11-base
* aosp/android-4.14-stable: Linux 4.14.230 can: flexcan: flexcan_chip_freeze(): fix chip freeze for missing bitrate init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM init/Kconfig: make COMPILE_TEST depend on !S390 bpf, x86: Validate computation of branch displacements for x86-64 cifs: Silently ignore unknown oplock break handle cifs: revalidate mapping when we open files for SMB1 POSIX ia64: mca: allocate early mca with GFP_ATOMIC scsi: target: pscsi: Clean up after failure in pscsi_map_sg() x86/build: Turn off -fcf-protection for realmode targets platform/x86: thinkpad_acpi: Allow the FnLock LED to change state drm/msm: Ratelimit invalid-fence message mac80211: choose first enabled channel for monitor mISDN: fix crash in fritzpci net: pxa168_eth: Fix a potential data race in pxa168_eth_remove ARM: dts: am33xx: add aliases for mmc interfaces Linux 4.14.229 drivers: video: fbcon: fix NULL dereference in fbcon_cursor() staging: rtl8192e: Change state information from u16 to u8 staging: rtl8192e: Fix incorrect source in memcpy() usb: gadget: udc: amd5536udc_pci fix null-ptr-dereference USB: cdc-acm: fix use-after-free after probe failure USB: cdc-acm: downgrade message to debug USB: cdc-acm: untangle a circular dependency between callback and softint cdc-acm: fix BREAK rx code path adding necessary calls usb: xhci-mtk: fix broken streams issue on 0.96 xHCI usb: musb: Fix suspend with devices connected for a64 USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control() firewire: nosy: Fix a use-after-free bug in nosy_ioctl() extcon: Fix error handling in extcon_dev_register extcon: Add stubs for extcon_register_notifier_all() functions pinctrl: rockchip: fix restore error in resume mm: writeback: use exact memcg dirty counts mm: fix oom_kill event handling mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the same cacheline mm: memcg: make sure memory.events is uptodate when waking pollers mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats reiserfs: update reiserfs_xattrs_initialized() condition drm/amdgpu: check alignment on CPU page for bo map drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings() mm: fix race by making init_zero_pfn() early_initcall tracing: Fix stack trace event size ALSA: hda/realtek: call alc_update_headset_mode() in hp_automute_hook ALSA: hda/realtek: fix a determine_headset_type issue for a Dell AIO ALSA: usb-audio: Apply sample rate quirk to Logitech Connect bpf: Remove MTU check in __bpf_skb_max_len net: wan/lmc: unregister device when no matching device is found appletalk: Fix skb allocation size in loopback case net: ethernet: aquantia: Handle error cleanup of start on open brcmfmac: clear EAP/association status bits on linkdown events ext4: do not iput inode under running transaction in ext4_rename() ASoC: rt5659: Update MCLK rate in set_sysclk() staging: comedi: cb_pcidas64: fix request_irq() warn staging: comedi: cb_pcidas: fix request_irq() warn scsi: qla2xxx: Fix broken #endif placement scsi: st: Fix a use after free in st_open() vhost: Fix vhost_vq_reset() powerpc: Force inlining of cpu_has_feature() to avoid build failure ASoC: cs42l42: Always wait at least 3ms after reset ASoC: cs42l42: Fix mixer volume control ASoC: es8316: Simplify adc_pga_gain_tlv table ASoC: sgtl5000: set DAP_AVC_CTRL register to correct default value on probe ASoC: rt5651: Fix dac- and adc- vol-tlv values being off by a factor of 10 ASoC: rt5640: Fix dac- and adc- vol-tlv values being off by a factor of 10 rpc: fix NULL dereference on kmalloc failure ext4: fix bh ref count on error paths ipv6: weaken the v4mapped source check selinux: vsock: Set SID for socket returned by accept() BACKPORT: drm/virtio: Use vmalloc for command buffer allocations. UPSTREAM: drm/virtio: Rewrite virtio_gpu_queue_ctrl_buffer using fenced version. Conflicts: drivers/usb/core/quirks.c Change-Id: I23466726e73acff25a6a397f23d3752c48379a51 Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> |
||
|
|
f34769a0b8 |
mm: writeback: use exact memcg dirty counts
commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.
Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:
1) per-memcg per-cpu values in range of [-32..32]
2) per-memcg atomic counter
When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic. Stat readers only check the atomic. Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu.
Assuming 100 cpus:
4k x86 page_size: 13 MiB error per memcg
64k ppc page_size: 200 MiB error per memcg
Considering that dirty+writeback are used together for some decisions the
errors double.
This inaccuracy can lead to undeserved oom kills. One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages. If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.
It could be argued that tiny containers are not supported, but it's more
subtle. It's the amount the space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.
The following test reliably ooms without this patch. This patch avoids
oom kills.
$ cat test
mount -t cgroup2 none /dev/cgroup
cd /dev/cgroup
echo +io +memory > cgroup.subtree_control
mkdir test
cd test
echo 10M > memory.max
(echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
(echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
$ cat memcg-writeback-stress.c
/*
* Dirty pages from all but one cpu.
* Clean pages from the non dirtying cpu.
* This is to stress per cpu counter imbalance.
* On a 100 cpu machine:
* - per memcg per cpu dirty count is 32 pages for each of 99 cpus
* - per memcg atomic is -99*32 pages
* - thus the complete dirty limit: sum of all counters 0
* - balance_dirty_pages() only sees atomic count -99*32 pages, which
* it max()s to 0.
* - So a workload can dirty -99*32 pages before balance_dirty_pages()
* cares.
*/
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysinfo.h>
#include <sys/types.h>
#include <unistd.h>
static char *buf;
static int bufSize;
static void set_affinity(int cpu)
{
cpu_set_t affinity;
CPU_ZERO(&affinity);
CPU_SET(cpu, &affinity);
if (sched_setaffinity(0, sizeof(affinity), &affinity))
err(1, "sched_setaffinity");
}
static void dirty_on(int output_fd, int cpu)
{
int i, wrote;
set_affinity(cpu);
for (i = 0; i < 32; i++) {
for (wrote = 0; wrote < bufSize; ) {
int ret = write(output_fd, buf+wrote, bufSize-wrote);
if (ret == -1)
err(1, "write");
wrote += ret;
}
}
}
int main(int argc, char **argv)
{
int cpu, flush_cpu = 1, output_fd;
const char *output;
if (argc != 2)
errx(1, "usage: output_file");
output = argv[1];
bufSize = getpagesize();
buf = malloc(getpagesize());
if (buf == NULL)
errx(1, "malloc failed");
output_fd = open(output, O_CREAT|O_RDWR);
if (output_fd == -1)
err(1, "open(%s)", output);
for (cpu = 0; cpu < get_nprocs(); cpu++) {
if (cpu != flush_cpu)
dirty_on(output_fd, cpu);
}
set_affinity(flush_cpu);
if (fsync(output_fd))
err(1, "fsync(%s)", output);
if (close(output_fd))
err(1, "close(%s)", output);
free(buf);
}
Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters. This avoids the aforementioned oom
kills.
This does not affect the overhead of memory.stat, which still reads the
single atomic counter.
Why not use percpu_counter? memcg already handles cpus going offline, so
no need for that overhead from percpu_counter. And the percpu_counter
spinlocks are more heavyweight than is required.
It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports. But that is saved for later.
Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org> [4.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
6f599379fd |
mm: fix oom_kill event handling
commit fe6bdfc8e1e131720abbe77a2eb990c94c9024cb upstream.
Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate
when waking pollers") converted most of memcg event counters to
per-memcg atomics, which made them less confusing for a user. The
"oom_kill" counter remained untouched, so now it behaves differently
than other counters (including "oom"). This adds nothing but confusion.
Let's fix this by adding the MEMCG_OOM_KILL event, and follow the
MEMCG_OOM approach.
This also removes a hack from count_memcg_event_mm(), introduced earlier
specially for the OOM_KILL counter.
[akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch]
Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[fllinden@amazon.com: backport to 4.14, minor contextual changes]
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
ad6e2898e6 |
mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the same cacheline
commit e81bf9793b1861d74953ef041b4f6c7faecc2dbd upstream.
The LKP robot found a 27% will-it-scale/page_fault3 performance
regression regarding commit e27be240df53("mm: memcg: make sure
memory.events is uptodate when waking pollers").
What the test does is:
1 mkstemp() a 128M file on a tmpfs;
2 start $nr_cpu processes, each to loop the following:
2.1 mmap() this file in shared write mode;
2.2 write 0 to this file in a PAGE_SIZE step till the end of the file;
2.3 unmap() this file and repeat this process.
3 After 5 minutes, check how many loops they managed to complete, the
higher the better.
The commit itself looks innocent enough as it merely changed some event
counting mechanism and this test didn't trigger those events at all.
Perf shows increased cycles spent on accessing root_mem_cgroup->stat_cpu
in count_memcg_event_mm()(called by handle_mm_fault()) and in
__mod_memcg_state() called by page_add_file_rmap(). So it's likely due
to the changed layout of 'struct mem_cgroup' that either make stat_cpu
falling into a constantly modifying cacheline or some hot fields stop
being in the same cacheline.
I verified this by moving memory_events[] back to where it was:
: --- a/include/linux/memcontrol.h
: +++ b/include/linux/memcontrol.h
: @@ -205,7 +205,6 @@ struct mem_cgroup {
: int oom_kill_disable;
:
: /* memory.events */
: - atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
: struct cgroup_file events_file;
:
: /* protect arrays of thresholds */
: @@ -238,6 +237,7 @@ struct mem_cgroup {
: struct mem_cgroup_stat_cpu __percpu *stat_cpu;
: atomic_long_t stat[MEMCG_NR_STAT];
: atomic_long_t events[NR_VM_EVENT_ITEMS];
: + atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
:
: unsigned long socket_pressure;
And performance restored.
Later investigation found that as long as the following 3 fields
moving_account, move_lock_task and stat_cpu are in the same cacheline,
performance will be good. To avoid future performance surprise by other
commits changing the layout of 'struct mem_cgroup', this patch makes
sure the 3 fields stay in the same cacheline.
One concern of this approach is, moving_account and move_lock_task could
be modified when a process changes memory cgroup while stat_cpu is a
always read field, it might hurt to place them in the same cacheline. I
assume it is rare for a process to change memory cgroup so this should
be OK.
Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop
Link: http://lkml.kernel.org/r/20180601071115.GA27302@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
09031a8b68 |
mm: memcg: make sure memory.events is uptodate when waking pollers
commit e27be240df53f1a20c659168e722b5d9f16cc7f4 upstream.
Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") added per-cpu drift to all memory cgroup stats
and events shown in memory.stat and memory.events.
For memory.stat this is acceptable. But memory.events issues file
notifications, and somebody polling the file for changes will be
confused when the counters in it are unchanged after a wakeup.
Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
MEMCG_OOM - are sufficiently rare and high-level that we don't need
per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
frequent, but they're counting invocations of reclaim, which is a
complex operation that touches many shared cachelines.
This splits memory.events from the generic VM events and tracks them in
their own, unbuffered atomic counters. That's also cleaner, as it
eliminates the ugly enum nesting of VM and cgroup events.
[hannes@cmpxchg.org: "array subscript is above array bounds"]
Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
cfc5a8f43e |
mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats
commit c3cc39118c3610eb6ab4711bc624af7fc48a35fe upstream.
After commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting"), we observed slowly upward creeping NR_WRITEBACK
counts over the course of several days, both the per-memcg stats as well
as the system counter in e.g. /proc/meminfo.
The conversion from full per-cpu stat counts to per-cpu cached atomic
stat counts introduced an irq-unsafe RMW operation into the updates.
Most stat updates come from process context, but one notable exception
is the NR_WRITEBACK counter. While writebacks are issued from process
context, they are retired from (soft)irq context.
When writeback completions interrupt the RMW counter updates of new
writebacks being issued, the decs from the completions are lost.
Since the global updates are routed through the joint lruvec API, both
the memcg counters as well as the system counters are affected.
This patch makes the joint stat and event API irq safe.
Link: http://lkml.kernel.org/r/20180203082353.17284-1-hannes@cmpxchg.org
Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Debugged-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
52dc535677 |
Merge remote-tracking branch 'aosp/android-4.14-stable' into android11-base
* aosp/android-4.14-stable:
Linux 4.14.216
net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
block: fix use-after-free in disk_part_iter_next
KVM: arm64: Don't access PMCR_EL0 when no PMU is available
wan: ds26522: select CONFIG_BITREVERSE
net/mlx5e: Fix two double free cases
net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups
iommu/intel: Fix memleak in intel_irq_remapping_alloc
block: rsxx: select CONFIG_CRC32
wil6210: select CONFIG_CRC32
dmaengine: xilinx_dma: fix mixed_enum_type coverity warning
dmaengine: xilinx_dma: check dma_async_device_register return value
spi: stm32: FIFO threshold level - fix align packet size
cpufreq: powernow-k8: pass policy rather than use cpufreq_cpu_get()
i2c: sprd: use a specific timeout to avoid system hang up issue
ARM: OMAP2+: omap_device: fix idling of devices during probe
iio: imu: st_lsm6dsx: fix edge-trigger interrupts
iio: imu: st_lsm6dsx: flip irq return logic
spi: pxa2xx: Fix use-after-free on unbind
ubifs: wbuf: Don't leak kernel memory to flash
drm/i915: Fix mismatch between misplaced vma check and vma insert
vmlinux.lds.h: Add PGO and AutoFDO input sections
x86/resctrl: Don't move a task to the same resource group
x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR
net: fix pmtu check in nopmtudisc mode
net: ip: always refragment ip defragmented packets
net: vlan: avoid leaks on register_vlan_dev() failures
net: cdc_ncm: correct overhead in delayed_ndp_size
powerpc: Fix incorrect stw{, ux, u, x} instructions in __set_pte_at
Linux 4.14.215
scsi: target: Fix XCOPY NAA identifier lookup
KVM: x86: fix shift out of bounds reported by UBSAN
x86/mtrr: Correct the range check before performing MTRR type lookups
netfilter: xt_RATEEST: reject non-null terminated string from userspace
netfilter: ipset: fix shift-out-of-bounds in htable_bits()
Revert "device property: Keep secondary firmware node secondary by type"
ALSA: hda/realtek - Fix speaker volume control on Lenovo C940
ALSA: hda/conexant: add a new hda codec CX11970
x86/mm: Fix leak of pmd ptlock
USB: serial: keyspan_pda: remove unused variable
usb: gadget: configfs: Fix use-after-free issue with udc_name
usb: gadget: configfs: Preserve function ordering after bind failure
usb: gadget: Fix spinlock lockup on usb_function_deactivate
USB: gadget: legacy: fix return error code in acm_ms_bind()
usb: gadget: function: printer: Fix a memory leak for interface descriptor
usb: gadget: f_uac2: reset wMaxPacketSize
usb: gadget: select CONFIG_CRC32
ALSA: usb-audio: Fix UBSAN warnings for MIDI jacks
USB: usblp: fix DMA to stack
USB: yurex: fix control-URB timeout handling
USB: serial: option: add Quectel EM160R-GL
USB: serial: option: add LongSung M5710 module support
USB: serial: iuu_phoenix: fix DMA from stack
usb: uas: Add PNY USB Portable SSD to unusual_uas
usb: usbip: vhci_hcd: protect shift size
USB: xhci: fix U1/U2 handling for hardware with XHCI_INTEL_HOST quirk set
usb: chipidea: ci_hdrc_imx: add missing put_device() call in usbmisc_get_init_data()
usb: dwc3: ulpi: Use VStsDone to detect PHY regs access completion
USB: cdc-acm: blacklist another IR Droid device
usb: gadget: enable super speed plus
crypto: ecdh - avoid buffer overflow in ecdh_set_secret()
video: hyperv_fb: Fix the mmap() regression for v5.4.y and older
net: systemport: set dev->max_mtu to UMAC_MAX_MTU_SIZE
net: mvpp2: Fix GoP port 3 Networking Complex Control configurations
net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
net: sched: prevent invalid Scell_log shift count
vhost_net: fix ubuf refcount incorrectly when sendmsg fails
net: usb: qmi_wwan: add Quectel EM160R-GL
CDC-NCM: remove "connected" log message
net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
net: hns: fix return value check in __lb_other_process()
ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
net: ethernet: ti: cpts: fix ethtool output when no ptp_clock registered
net-sysfs: take the rtnl lock when storing xps_cpus
net: ethernet: Fix memleak in ethoc_probe
net/ncsi: Use real net-device for response handler
virtio_net: Fix recursive call to cpus_read_lock()
qede: fix offload for IPIP tunnel packets
atm: idt77252: call pci_disable_device() on error path
ethernet: ucc_geth: set dev->max_mtu to 1518
ethernet: ucc_geth: fix use-after-free in ucc_geth_remove()
depmod: handle the case of /sbin/depmod without /sbin in PATH
lib/genalloc: fix the overflow when size is too big
scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
workqueue: Kick a worker based on the actual activation of delayed works
kbuild: don't hardcode depmod path
Linux 4.14.214
mwifiex: Fix possible buffer overflows in mwifiex_cmd_802_11_ad_hoc_start
iio:magnetometer:mag3110: Fix alignment and data leak issues.
iio:imu:bmi160: Fix alignment and data leak issues
kdev_t: always inline major/minor helper functions
dm verity: skip verity work if I/O error when system is shutting down
ALSA: pcm: Clear the full allocated memory at hw_params
module: delay kobject uevent until after module init call
powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
quota: Don't overflow quota file offsets
module: set MODULE_STATE_GOING state when a module fails to load
rtc: sun6i: Fix memleak in sun6i_rtc_clk_init
ALSA: seq: Use bool for snd_seq_queue internal flags
media: gp8psk: initialize stats at power control logic
misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells()
reiserfs: add check for an invalid ih_entry_count
of: fix linker-section match-table corruption
uapi: move constants from <linux/kernel.h> to <linux/const.h>
powerpc/bitops: Fix possible undefined behaviour with fls() and fls64()
USB: serial: digi_acceleport: fix write-wakeup deadlocks
s390/dasd: fix hanging device offline processing
vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
mm: memcontrol: fix excessive complexity in memory.stat reporting
mm: memcontrol: implement lruvec stat functions on top of each other
mm: memcontrol: eliminate raw access to stat and event counters
ALSA: usb-audio: fix sync-ep altsetting sanity check
ALSA: usb-audio: simplify set_sync_ep_implicit_fb_quirk
ALSA: hda/ca0132 - Fix work handling in delayed HP detection
md/raid10: initialize r10_bio->read_slot before use.
x86/entry/64: Add instruction suffix
ANDROID: usb: f_accessory: Don't drop NULL reference in acc_disconnect()
ANDROID: usb: f_accessory: Avoid bitfields for shared variables
ANDROID: usb: f_accessory: Cancel any pending work before teardown
ANDROID: usb: f_accessory: Don't corrupt global state on double registration
ANDROID: usb: f_accessory: Fix teardown ordering in acc_release()
ANDROID: usb: f_accessory: Add refcounting to global 'acc_dev'
ANDROID: usb: f_accessory: Wrap '_acc_dev' in get()/put() accessors
ANDROID: usb: f_accessory: Remove useless assignment
ANDROID: usb: f_accessory: Remove useless non-debug prints
ANDROID: usb: f_accessory: Remove stale comments
ANDROID: USB: f_accessory: Check dev pointer before decoding ctrl request
ANDROID: usb: gadget: f_accessory: fix CTS test stuck
Linux 4.14.213
PCI: Fix pci_slot_release() NULL pointer dereference
libnvdimm/namespace: Fix reaping of invalidated block-window-namespace labels
xenbus/xenbus_backend: Disallow pending watch messages
xen/xenbus: Count pending messages for each watch
xen/xenbus/xen_bus_type: Support will_handle watch callback
xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path()
xen/xenbus: Allow watches discard events before queueing
xen-blkback: set ring->xenblkd to NULL after kthread_stop()
clk: mvebu: a3700: fix the XTAL MODE pin to MPP1_9
md/cluster: fix deadlock when node is doing resync job
iio:imu:bmi160: Fix too large a buffer.
iio:pressure:mpl3115: Force alignment of buffer
iio:light:rpr0521: Fix timestamp alignment and prevent data leak.
iio: adc: rockchip_saradc: fix missing clk_disable_unprepare() on error in rockchip_saradc_resume
iio: buffer: Fix demux update
mtd: parser: cmdline: Fix parsing of part-names with colons
soc: qcom: smp2p: Safely acquire spinlock without IRQs
spi: st-ssc4: Fix unbalanced pm_runtime_disable() in probe error path
spi: sc18is602: Don't leak SPI master in probe error path
spi: rb4xx: Don't leak SPI master in probe error path
spi: pic32: Don't leak DMA channels in probe error path
spi: davinci: Fix use-after-free on unbind
spi: spi-sh: Fix use-after-free on unbind
drm/dp_aux_dev: check aux_dev before use in drm_dp_aux_dev_get_by_minor()
jfs: Fix array index bounds check in dbAdjTree
jffs2: Fix GC exit abnormally
ceph: fix race in concurrent __ceph_remove_cap invocations
ima: Don't modify file descriptor mode on the fly
powerpc/powernv/memtrace: Don't leak kernel memory to user space
powerpc/xmon: Change printk() to pr_cont()
powerpc/rtas: Fix typo of ibm,open-errinjct in RTAS filter
ARM: dts: at91: sama5d2: fix CAN message ram offset and size
KVM: arm64: Introduce handling of AArch32 TTBCR2 traps
ext4: fix deadlock with fs freezing and EA inodes
ext4: fix a memory leak of ext4_free_data
btrfs: fix return value mixup in btrfs_get_extent
Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
USB: serial: keyspan_pda: fix write unthrottling
USB: serial: keyspan_pda: fix tx-unthrottle use-after-free
USB: serial: keyspan_pda: fix write-wakeup use-after-free
USB: serial: keyspan_pda: fix stalled writes
USB: serial: keyspan_pda: fix write deadlock
USB: serial: keyspan_pda: fix dropped unthrottle interrupts
USB: serial: mos7720: fix parallel-port state restore
EDAC/amd64: Fix PCI component registration
crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
powerpc/perf: Exclude kernel samples while counting events in user space.
staging: comedi: mf6x4: Fix AI end-of-conversion detection
s390/dasd: fix list corruption of lcu list
s390/dasd: fix list corruption of pavgroup group list
s390/dasd: prevent inconsistent LCU device data
s390/smp: perform initial CPU reset also for SMT siblings
ALSA: usb-audio: Disable sample read check if firmware doesn't give back
ALSA: pcm: oss: Fix a few more UBSAN fixes
ALSA: hda/realtek - Enable headset mic of ASUS Q524UQK with ALC255
ACPI: PNP: compare the string length in the matching_id()
Revert "ACPI / resources: Use AE_CTRL_TERMINATE to terminate resources walks"
PM: ACPI: PCI: Drop acpi_pm_set_bridge_wakeup()
Input: cyapa_gen6 - fix out-of-bounds stack access
media: netup_unidvb: Don't leak SPI master in probe error path
media: sunxi-cir: ensure IR is handled when it is continuous
media: gspca: Fix memory leak in probe
Input: goodix - add upside-down quirk for Teclast X98 Pro tablet
Input: cros_ec_keyb - send 'scancodes' in addition to key events
fix namespaced fscaps when !CONFIG_SECURITY
cfg80211: initialize rekey_data
clk: sunxi-ng: Make sure divider tables have sentinel
clk: s2mps11: Fix a resource leak in error handling paths in the probe function
qlcnic: Fix error code in probe
perf record: Fix memory leak when using '--user-regs=?' to list registers
pwm: lp3943: Dynamically allocate PWM chip base
pwm: zx: Add missing cleanup in error path
clk: ti: Fix memleak in ti_fapll_synth_setup
watchdog: coh901327: add COMMON_CLK dependency
watchdog: qcom: Avoid context switch in restart handler
net: korina: fix return value
net: allwinner: Fix some resources leak in the error handling path of the probe and in the remove function
net: bcmgenet: Fix a resource leak in an error handling path in the probe functin
checkpatch: fix unescaped left brace
powerpc/ps3: use dma_mapping_error()
nfc: s3fwrn5: Release the nfc firmware
um: chan_xterm: Fix fd leak
watchdog: sirfsoc: Add missing dependency on HAS_IOMEM
irqchip/alpine-msi: Fix freeing of interrupts on allocation error path
ASoC: wm_adsp: remove "ctl" from list on error in wm_adsp_create_control()
extcon: max77693: Fix modalias string
clk: tegra: Fix duplicated SE clock entry
x86/kprobes: Restore BTF if the single-stepping is cancelled
nfs_common: need lock during iterate through the list
nfsd: Fix message level for normal termination
speakup: fix uninitialized flush_lock
usb: oxu210hp-hcd: Fix memory leak in oxu_create
usb: ehci-omap: Fix PM disable depth umbalance in ehci_hcd_omap_probe
powerpc/pseries/hibernation: remove redundant cacheinfo update
powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops
scsi: fnic: Fix error return code in fnic_probe()
seq_buf: Avoid type mismatch for seq_buf_init
scsi: pm80xx: Fix error return in pm8001_pci_probe()
scsi: qedi: Fix missing destroy_workqueue() on error in __qedi_probe
cpufreq: scpi: Add missing MODULE_ALIAS
cpufreq: loongson1: Add missing MODULE_ALIAS
cpufreq: st: Add missing MODULE_DEVICE_TABLE
cpufreq: mediatek: Add missing MODULE_DEVICE_TABLE
cpufreq: highbank: Add missing MODULE_DEVICE_TABLE
clocksource/drivers/arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI
dm ioctl: fix error return code in target_message
ASoC: jz4740-i2s: add missed checks for clk_get()
net/mlx5: Properly convey driver version to firmware
memstick: r592: Fix error return in r592_probe()
arm64: dts: rockchip: Fix UART pull-ups on rk3328
pinctrl: falcon: add missing put_device() call in pinctrl_falcon_probe()
ARM: dts: at91: sama5d2: map securam as device
clocksource/drivers/cadence_ttc: Fix memory leak in ttc_setup_clockevent()
media: saa7146: fix array overflow in vidioc_s_audio()
vfio-pci: Use io_remap_pfn_range() for PCI IO memory
NFS: switch nfsiod to be an UNBOUND workqueue.
lockd: don't use interval-based rebinding over TCP
SUNRPC: xprt_load_transport() needs to support the netid "rdma6"
NFSv4.2: condition READDIR's mask for security label based on LSM state
ath10k: Release some resources in an error handling path
ath10k: Fix an error handling path
ARM: dts: at91: at91sam9rl: fix ADC triggers
PCI: iproc: Fix out-of-bound array accesses
genirq/irqdomain: Don't try to free an interrupt that has no mapping
power: supply: bq24190_charger: fix reference leak
ARM: dts: Remove non-existent i2c1 from 98dx3236
HSI: omap_ssi: Don't jump to free ID in ssi_add_controller()
media: max2175: fix max2175_set_csm_mode() error code
mips: cdmm: fix use-after-free in mips_cdmm_bus_discover
samples: bpf: Fix lwt_len_hist reusing previous BPF map
media: siano: fix memory leak of debugfs members in smsdvb_hotplug
cw1200: fix missing destroy_workqueue() on error in cw1200_init_common
orinoco: Move context allocation after processing the skb
ARM: dts: at91: sama5d3_xplained: add pincontrol for USB Host
ARM: dts: at91: sama5d4_xplained: add pincontrol for USB Host
memstick: fix a double-free bug in memstick_check
RDMA/cxgb4: Validate the number of CQEs
Input: omap4-keypad - fix runtime PM error handling
drivers: soc: ti: knav_qmss_queue: Fix error return code in knav_queue_probe
soc: ti: Fix reference imbalance in knav_dma_probe
soc: ti: knav_qmss: fix reference leak in knav_queue_probe
crypto: omap-aes - Fix PM disable depth imbalance in omap_aes_probe
powerpc/feature: Fix CPU_FTRS_ALWAYS by removing CPU_FTRS_GENERIC_32
Input: ads7846 - fix unaligned access on 7845
Input: ads7846 - fix integer overflow on Rt calculation
Input: ads7846 - fix race that causes missing releases
drm/omap: dmm_tiler: fix return error code in omap_dmm_probe()
media: solo6x10: fix missing snd_card_free in error handling case
scsi: core: Fix VPD LUN ID designator priorities
media: mtk-vcodec: add missing put_device() call in mtk_vcodec_release_dec_pm()
staging: greybus: codecs: Fix reference counter leak in error handling
MIPS: BCM47XX: fix kconfig dependency bug for BCM47XX_BCMA
RDMa/mthca: Work around -Wenum-conversion warning
ASoC: arizona: Fix a wrong free in wm8997_probe
ASoC: wm8998: Fix PM disable depth imbalance on error
mwifiex: fix mwifiex_shutdown_sw() causing sw reset failure
spi: tegra114: fix reference leak in tegra spi ops
spi: tegra20-sflash: fix reference leak in tegra_sflash_resume
spi: tegra20-slink: fix reference leak in slink ops of tegra20
spi: spi-ti-qspi: fix reference leak in ti_qspi_setup
Bluetooth: Fix null pointer dereference in hci_event_packet()
arm64: dts: exynos: Correct psci compatible used on Exynos7
selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
ASoC: pcm: DRAIN support reactivation
spi: img-spfi: fix reference leak in img_spfi_resume
crypto: talitos - Fix return type of current_desc_hdr()
sched: Reenable interrupts in do_sched_yield()
sched/deadline: Fix sched_dl_global_validate()
ARM: p2v: fix handling of LPAE translation in BE mode
x86/mm/ident_map: Check for errors from ident_pud_init()
RDMA/rxe: Compute PSN windows correctly
selinux: fix error initialization in inode_doinit_with_dentry()
RDMA/bnxt_re: Set queue pair state when being queried
soc: mediatek: Check if power domains can be powered on at boot time
soc: renesas: rmobile-sysc: Fix some leaks in rmobile_init_pm_domains()
drm/gma500: fix double free of gma_connector
Bluetooth: Fix slab-out-of-bounds read in hci_le_direct_adv_report_evt()
md: fix a warning caused by a race between concurrent md_ioctl()s
crypto: af_alg - avoid undefined behavior accessing salg_name
media: msi2500: assign SPI bus number dynamically
quota: Sanity-check quota file headers on load
serial_core: Check for port state when tty is in error state
HID: i2c-hid: add Vero K147 to descriptor override
ARM: dts: exynos: fix USB 3.0 pins supply being turned off on Odroid XU
ARM: dts: exynos: fix USB 3.0 VBUS control and over-current pins on Exynos5410
ARM: dts: exynos: fix roles of USB 3.0 ports on Odroid XU
usb: chipidea: ci_hdrc_imx: Pass DISABLE_DEVICE_STREAMING flag to imx6ul
USB: gadget: f_rndis: fix bitrate for SuperSpeed and above
usb: gadget: f_fs: Re-use SS descriptors for SuperSpeedPlus
USB: gadget: f_midi: setup SuperSpeed Plus descriptors
USB: gadget: f_acm: add support for SuperSpeed Plus
USB: serial: option: add interface-number sanity check to flag handling
soc/tegra: fuse: Fix index bug in get_process_id
dm table: Remove BUG_ON(in_interrupt())
scsi: mpt3sas: Increase IOCInit request timeout to 30s
vxlan: Copy needed_tailroom from lowerdev
vxlan: Add needed_headroom for lower device
drm/tegra: sor: Disable clocks on error in tegra_sor_init()
kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
RDMA/cm: Fix an attempt to use non-valid pointer when cleaning timewait
can: softing: softing_netdev_open(): fix error handling
scsi: bnx2i: Requires MMU
gpio: mvebu: fix potential user-after-free on probe
ARM: dts: sun8i: v3s: fix GIC node memory range
pinctrl: baytrail: Avoid clearing debounce value when turning it off
pinctrl: merrifield: Set default bias in case no particular value given
drm: fix drm_dp_mst_port refcount leaks in drm_dp_mst_allocate_vcpi
serial: 8250_omap: Avoid FIFO corruption caused by MDR1 access
ALSA: pcm: oss: Fix potential out-of-bounds shift
USB: sisusbvga: Make console support depend on BROKEN
USB: UAS: introduce a quirk to set no_write_same
xhci: Give USB2 ports time to enter U3 in bus suspend
ALSA: usb-audio: Fix control 'access overflow' errors from chmap
ALSA: usb-audio: Fix potential out-of-bounds shift
USB: add RESET_RESUME quirk for Snapscan 1212
USB: dummy-hcd: Fix uninitialized array use in init()
mac80211: mesh: fix mesh_pathtbl_init() error path
net: bridge: vlan: fix error return code in __vlan_add()
net: stmmac: dwmac-meson8b: fix mask definition of the m250_sel mux
net: stmmac: delete the eee_ctrl_timer after napi disabled
net/mlx4_en: Handle TX error CQE
net/mlx4_en: Avoid scheduling restart task if it is already running
tcp: fix cwnd-limited bug for TSO deferral where we send nothing
net: stmmac: free tx skb buffer in stmmac_resume()
PCI: qcom: Add missing reset for ipq806x
x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
scsi: be2iscsi: Revert "Fix a theoretical leak in beiscsi_create_eqs()"
kbuild: avoid static_assert for genksyms
pinctrl: amd: remove debounce filter setting in IRQ type setting
Input: i8042 - add Acer laptops to the i8042 reset list
Input: cm109 - do not stomp on control URB
platform/x86: acer-wmi: add automatic keyboard background light toggle key as KEY_LIGHTS_TOGGLE
soc: fsl: dpio: Get the cpumask through cpumask_of(cpu)
scsi: ufs: Make sure clk scaling happens only when HBA is runtime ACTIVE
ARC: stack unwinding: don't assume non-current task is sleeping
iwlwifi: mvm: fix kernel panic in case of assert during CSA
arm64: dts: rockchip: Assign a fixed index to mmc devices on rk3399 boards.
iwlwifi: pcie: limit memory read spin time
spi: bcm2835aux: Restore err assignment in bcm2835aux_spi_probe
spi: bcm2835aux: Fix use-after-free on unbind
Linux 4.14.212
x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
Input: i8042 - fix error return code in i8042_setup_aux()
i2c: qup: Fix error return code in qup_i2c_bam_schedule_desc()
gfs2: check for empty rgrp tree in gfs2_ri_update
tracing: Fix userstacktrace option for instances
spi: bcm2835: Release the DMA channel if probe fails after dma_init
spi: bcm2835: Fix use-after-free on unbind
spi: bcm-qspi: Fix use-after-free on unbind
spi: Introduce device-managed SPI controller allocation
iommu/amd: Set DTE[IntTabLen] to represent 512 IRTEs
speakup: Reject setting the speakup line discipline outside of speakup
i2c: imx: Check for I2SR_IAL after every byte
i2c: imx: Fix reset of I2SR_IAL flag
mm/swapfile: do not sleep with a spin lock held
cifs: fix potential use-after-free in cifs_echo_request()
ftrace: Fix updating FTRACE_FL_TRAMP
ALSA: hda/generic: Add option to enforce preferred_dacs pairs
ALSA: hda/realtek - Add new codec supported for ALC897
tty: Fix ->session locking
tty: Fix ->pgrp locking in tiocspgrp()
USB: serial: option: fix Quectel BG96 matching
USB: serial: option: add support for Thales Cinterion EXS82
USB: serial: option: add Fibocom NL668 variants
USB: serial: ch341: sort device-id entries
USB: serial: ch341: add new Product ID for CH341A
USB: serial: kl5kusb105: fix memleak on open
usb: gadget: f_fs: Use local copy of descriptors for userspace copy
vlan: consolidate VLAN parsing code and limit max parsing depth
pinctrl: baytrail: Fix pin being driven low for a while on gpiod_get(..., GPIOD_OUT_HIGH)
pinctrl: baytrail: Replace WARN with dev_info_once when setting direct-irq pin to output
ANDROID: Incremental fs: Set credentials before reading/writing
ANDROID: Incremental fs: Fix incfs_test use of atol, open
ANDROID: Incremental fs: Change per UID timeouts to microseconds
ANDROID: Incremental fs: Add v2 feature flag
ANDROID: Incremental fs: Add zstd feature flag
Linux 4.14.211
RDMA/i40iw: Address an mmap handler exploit in i40iw
Input: i8042 - add ByteSpeed touchpad to noloop table
Input: xpad - support Ardwiino Controllers
ALSA: usb-audio: US16x08: fix value count for level meters
dt-bindings: net: correct interrupt flags in examples
net/mlx5: Fix wrong address reclaim when command interface is down
net: pasemi: fix error return code in pasemi_mac_open()
cxgb3: fix error return code in t3_sge_alloc_qset()
net/x25: prevent a couple of overflows
ibmvnic: Fix TX completion error handling
ibmvnic: Ensure that SCRQ entry reads are correctly ordered
ipv4: Fix tos mask in inet_rtm_getroute()
netfilter: bridge: reset skb->pkt_type after NF_INET_POST_ROUTING traversal
bonding: wait for sysfs kobject destruction before freeing struct slave
usbnet: ipheth: fix connectivity with iOS 14
tun: honor IOCB_NOWAIT flag
tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control
sock: set sk_err to ee_errno on dequeue from errq
rose: Fix Null pointer dereference in rose_send_frame()
net/af_iucv: set correct sk_protocol for child sockets
ANDROID: Incremental fs: Fix 32-bit build
UPSTREAM: arm64: sysreg: Clean up instructions for modifying PSTATE fields
[CP] CHROMIUM: drm/virtio: rebase zero-copy patches to virgl/drm-misc-next
Linux 4.14.210
USB: core: Fix regression in Hercules audio card
USB: core: add endpoint-blacklist quirk
x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak
x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak
x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
usb: gadget: Fix memleak in gadgetfs_fill_super
usb: gadget: f_midi: Fix memleak in f_midi_alloc
USB: core: Change %pK for __user pointers to %px
perf probe: Fix to die_entrypc() returns error correctly
can: m_can: fix nominal bitiming tseg2 min for version >= 3.1
platform/x86: toshiba_acpi: Fix the wrong variable assignment
can: gs_usb: fix endianess problem with candleLight firmware
efivarfs: revert "fix memory leak in efivarfs_create()"
ibmvnic: fix NULL pointer dereference in ibmvic_reset_crq
ibmvnic: fix NULL pointer dereference in reset_sub_crq_queues
net: ena: set initial DMA width to avoid intel iommu issue
nfc: s3fwrn5: use signed integer for parsing GPIO numbers
IB/mthca: fix return value of error branch in mthca_init_cq()
bnxt_en: Release PCI regions when DMA mask setup fails during probe.
video: hyperv_fb: Fix the cache type when mapping the VRAM
bnxt_en: fix error return code in bnxt_init_board()
bnxt_en: fix error return code in bnxt_init_one()
scsi: ufs: Fix race between shutdown and runtime resume flow
batman-adv: set .owner to THIS_MODULE
phy: tegra: xusb: Fix dangling pointer on probe failure
perf/x86: fix sysfs type mismatches
scsi: target: iscsi: Fix cmd abort fabric stop race
scsi: libiscsi: Fix NOP race condition
dmaengine: pl330: _prep_dma_memcpy: Fix wrong burst size
nvme: free sq/cq dbbuf pointers when dbbuf set fails
proc: don't allow async path resolution of /proc/self components
HID: Add Logitech Dinovo Edge battery quirk
x86/xen: don't unbind uninitialized lock_kicker_irq
dmaengine: xilinx_dma: use readl_poll_timeout_atomic variant
HID: hid-sensor-hub: Fix issue with devices with no report ID
Input: i8042 - allow insmod to succeed on devices without an i8042 controller
HID: cypress: Support Varmilo Keyboards' media hotkeys
ALSA: hda/hdmi: fix incorrect locking in hdmi_pcm_close
ALSA: hda/hdmi: Use single mutex unlock in error paths
arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
arm64: pgtable: Fix pte_accessible()
btrfs: inode: Verify inode mode to avoid NULL pointer dereference
btrfs: adjust return values of btrfs_inode_by_name
btrfs: tree-checker: Enhance chunk checker to validate chunk profile
PCI: Add device even if driver attach failed
wireless: Use linux/stddef.h instead of stddef.h
btrfs: fix lockdep splat when reading qgroup config on mount
mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()
perf event: Check ref_reloc_sym before using it
Linux 4.14.209
x86/microcode/intel: Check patch signature before saving microcode for early loading
s390/dasd: fix null pointer dereference for ERP requests
s390/cpum_sf.c: fix file permission for cpum_sfb_size
mac80211: free sta in sta_info_insert_finish() on errors
mac80211: minstrel: fix tx status processing corner case
mac80211: minstrel: remove deferred sampling code
xtensa: disable preemption around cache alias management calls
regulator: workaround self-referent regulators
regulator: avoid resolve_supply() infinite recursion
regulator: fix memory leak with repeated set_machine_constraints()
iio: accel: kxcjk1013: Add support for KIOX010A ACPI DSM for setting tablet-mode
iio: accel: kxcjk1013: Replace is_smo8500_device with an acpi_type enum
ext4: fix bogus warning in ext4_update_dx_flag()
staging: rtl8723bs: Add 024c:0627 to the list of SDIO device-ids
efivarfs: fix memory leak in efivarfs_create()
tty: serial: imx: keep console clocks always on
ALSA: mixart: Fix mutex deadlock
ALSA: ctl: fix error path at adding user-defined element set
speakup: Do not let the line discipline be used several times
powerpc/uaccess-flush: fix missing includes in kup-radix.h
libfs: fix error cast of negative value in simple_attr_write()
xfs: revert "xfs: fix rmap key and record comparison functions"
regulator: ti-abb: Fix array out of bound read access on the first transition
MIPS: Alchemy: Fix memleak in alchemy_clk_setup_cpu
ASoC: qcom: lpass-platform: Fix memory leak
can: m_can: m_can_handle_state_change(): fix state change
can: peak_usb: fix potential integer overflow on shift of a int
can: mcba_usb: mcba_usb_start_xmit(): first fill skb, then pass to can_put_echo_skb()
can: ti_hecc: Fix memleak in ti_hecc_probe
can: dev: can_restart(): post buffer from the right context
can: af_can: prevent potential access of uninitialized member in canfd_rcv()
can: af_can: prevent potential access of uninitialized member in can_rcv()
perf lock: Don't free "lock_seq_stat" if read_count isn't zero
ARM: dts: imx50-evk: Fix the chip select 1 IOMUX
arm: dts: imx6qdl-udoo: fix rgmii phy-mode for ksz9031 phy
MIPS: export has_transparent_hugepage() for modules
Input: adxl34x - clean up a data type in adxl34x_probe()
vfs: remove lockdep bogosity in __sb_start_write
arm64: psci: Avoid printing in cpu_psci_cpu_die()
pinctrl: rockchip: enable gpio pclk for rockchip_gpio_to_irq
net: ftgmac100: Fix crash when removing driver
tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate
net: usb: qmi_wwan: Set DTR quirk for MR400
net/mlx5: Disable QoS when min_rates on all VFs are zero
sctp: change to hold/put transport for proto_unreach_timer
qlcnic: fix error return code in qlcnic_83xx_restart_hw()
net: x25: Increase refcnt of "struct x25_neigh" in x25_rx_call_request
net/mlx4_core: Fix init_hca fields offset
netlabel: fix an uninitialized warning in netlbl_unlabel_staticlist()
netlabel: fix our progress tracking in netlbl_unlabel_staticlist()
net: Have netpoll bring-up DSA management interface
net: dsa: mv88e6xxx: Avoid VTU corruption on 6097
net: bridge: add missing counters to ndo_get_stats64 callback
net: b44: fix error return code in b44_init_one()
mlxsw: core: Use variable timeout for EMAD retries
inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill()
devlink: Add missing genlmsg_cancel() in devlink_nl_sb_port_pool_fill()
bnxt_en: read EEPROM A2h address using page 0
atm: nicstar: Unmap DMA on send error
ah6: fix error return code in ah6_input()
ANDROID: Fix cuttlefish defconfigs now incfs uses zstd
ANDROID: Incremental fs: Add zstd compression support
ANDROID: Incremental fs: Small improvements
ANDROID: Incremental fs: Initialize mount options correctly
ANDROID: Incremental fs: Fix read_log_test which failed sporadically
ANDROID: Incremental fs: Fix misuse of cpu_to_leXX and poll return
ANDROID: Incremental fs: Add per UID read timeouts
ANDROID: Incremental fs: Add .incomplete folder
ANDROID: Incremental fs: Fix dangling else
ANDROID: Incremental fs: Fix uninitialized variable
ANDROID: Incremental fs: Fix filled block count from get filled blocks
ANDROID: Incremental fs: Add hash block counts to IOC_IOCTL_GET_BLOCK_COUNT
ANDROID: Incremental fs: Add INCFS_IOC_GET_BLOCK_COUNT
ANDROID: Incremental fs: Make compatible with existing files
ANDROID: Incremental fs: Remove block HASH flag
ANDROID: Incremental fs: Remove back links and crcs
ANDROID: Incremental fs: Remove attributes from file
ANDROID: Incremental fs: Add .blocks_written file
ANDROID: Incremental fs: Separate pseudo-file code
ANDROID: Incremental fs: Add UID to pending_read
ANDROID: Incremental fs: Create mapped file
ANDROID: Incremental fs: Don't allow renaming .index directory.
ANDROID: Incremental fs: Fix incfs to work on virtio-9p
ANDROID: Incremental fs: Allow running a single test
ANDROID: Incremental fs: Adding perf test
ANDROID: Incremental fs: Stress tool
ANDROID: Incremental fs: Use R/W locks to read/write segment blockmap.
ANDROID: Incremental fs: Remove unnecessary dependencies
ANDROID: Incremental fs: Remove annoying pr_debugs
ANDROID: Incremental fs: dentry_revalidate should not return -EBADF.
ANDROID: Incremental fs: Fix minor bugs
ANDROID: Incremental fs: RCU locks instead of mutex for pending_reads.
ANDROID: Incremental fs: fix up attempt to copy structures with READ/WRITE_ONCE
Linux 4.14.208
ACPI: GED: fix -Wformat
KVM: x86: clflushopt should be treated as a no-op by emulation
can: proc: can_remove_proc(): silence remove_proc_entry warning
mac80211: always wind down STA state
Input: sunkbd - avoid use-after-free in teardown paths
powerpc/8xx: Always fault when _PAGE_ACCESSED is not set
gpio: mockup: fix resource leak in error path
i2c: imx: Fix external abort on interrupt in exit paths
i2c: imx: use clk notifier for rate changes
powerpc/64s: flush L1D after user accesses
powerpc/uaccess: Evaluate macro arguments once, before user access is allowed
powerpc: Fix __clear_user() with KUAP enabled
powerpc: Implement user_access_begin and friends
powerpc: Add a framework for user access tracking
powerpc/64s: flush L1D on kernel entry
powerpc/64s: move some exception handlers out of line
powerpc/64s: Define MASKABLE_RELON_EXCEPTION_PSERIES_OOL
Linux 4.14.207
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Convert trailing spaces and periods in path components
reboot: fix overflow parsing reboot cpu number
Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"
perf/core: Fix race in the perf_mmap_close() function
xen/events: block rogue events for some time
xen/events: defer eoi in case of excessive number of events
xen/events: use a common cpu hotplug hook for event channels
xen/events: switch user event channels to lateeoi model
xen/pciback: use lateeoi irq binding
xen/pvcallsback: use lateeoi irq binding
xen/scsiback: use lateeoi irq binding
xen/netback: use lateeoi irq binding
xen/blkback: use lateeoi irq binding
xen/events: add a new "late EOI" evtchn framework
xen/events: fix race in evtchn_fifo_unmask()
xen/events: add a proper barrier to 2-level uevent unmasking
xen/events: avoid removing an event channel while handling it
perf/core: Fix a memory leak in perf_event_parse_addr_filter()
perf/core: Fix crash when using HW tracing kernel filters
perf/core: Fix bad use of igrab()
x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP
random32: make prandom_u32() output unpredictable
net: Update window_clamp if SOCK_RCVBUF is set
r8169: fix potential skb double free in an error path
vrf: Fix fast path output packet handling with async Netfilter rules
net/x25: Fix null-ptr-deref in x25_connect
net/af_iucv: fix null pointer dereference on shutdown
IPv6: Set SIT tunnel hard_header_len to zero
swiotlb: fix "x86: Don't panic if can not alloc buffer for swiotlb"
pinctrl: amd: fix incorrect way to disable debounce filter
pinctrl: amd: use higher precision for 512 RtcClk
drm/gma500: Fix out-of-bounds access to struct drm_device.vblank[]
don't dump the threads that had been already exiting when zapped.
selinux: Fix error return code in sel_ib_pkey_sid_slow()
ocfs2: initialize ip_next_orphan
futex: Don't enable IRQs unconditionally in put_pi_state()
mei: protect mei_cl_mtu from null dereference
usb: cdc-acm: Add DISABLE_ECHO for Renesas USB Download mode
uio: Fix use-after-free in uio_unregister_device()
thunderbolt: Add the missed ida_simple_remove() in ring_request_msix()
ext4: unlock xattr_sem properly in ext4_inline_data_truncate()
ext4: correctly report "not supported" for {usr,grp}jquota when !CONFIG_QUOTA
perf: Fix get_recursion_context()
cosa: Add missing kfree in error path of cosa_write
of/address: Fix of_node memory leak in of_dma_is_coherent
xfs: fix a missing unlock on error in xfs_fs_map_blocks
xfs: fix rmap key and record comparison functions
xfs: fix flags argument to rmap lookup when converting shared file rmaps
nbd: fix a block_device refcount leak in nbd_release
pinctrl: aspeed: Fix GPI only function problem.
ARM: 9019/1: kprobes: Avoid fortify_panic() when copying optprobe template
pinctrl: intel: Set default bias in case no particular value given
iommu/amd: Increase interrupt remapping table limit to 512 entries
scsi: scsi_dh_alua: Avoid crash during alua_bus_detach()
cfg80211: regulatory: Fix inconsistent format argument
mac80211: fix use of skb payload instead of header
drm/amdgpu: perform srbm soft reset always on SDMA resume
scsi: hpsa: Fix memory leak in hpsa_init_one()
gfs2: check for live vs. read-only file system in gfs2_fitrim
gfs2: Add missing truncate_inode_pages_final for sd_aspace
gfs2: Free rd_bits later in gfs2_clear_rgrpd to fix use-after-free
usb: gadget: goku_udc: fix potential crashes in probe
ath9k_htc: Use appropriate rs_datalen type
Btrfs: fix missing error return if writeback for extent buffer never started
xfs: flush new eof page on truncate to avoid post-eof corruption
can: peak_canfd: pucan_handle_can_rx(): fix echo management when loopback is on
can: peak_usb: peak_usb_get_ts_time(): fix timestamp wrapping
can: peak_usb: add range checking in decode operations
can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()
can: dev: __can_get_echo_skb(): fix real payload length return value for RTR frames
can: dev: can_get_echo_skb(): prevent call to kfree_skb() in hard IRQ context
can: rx-offload: don't call kfree_skb() from IRQ context
ALSA: hda: prevent undefined shift in snd_hdac_ext_bus_get_link()
perf tools: Add missing swap for ino_generation
net: xfrm: fix a race condition during allocing spi
hv_balloon: disable warning when floor reached
genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_HIERARCHY
btrfs: reschedule when cloning lots of extents
btrfs: sysfs: init devices outside of the chunk_mutex
nbd: don't update block size after device is started
time: Prevent undefined behaviour in timespec64_to_ns()
mm: mempolicy: fix potential pte_unmap_unlock pte error
ring-buffer: Fix recursion protection transitions between interrupt context
regulator: defer probe when trying to get voltage from unresolved supply
UPSTREAM: sched: idle: Avoid retaining the tick when it has been stopped
UPSTREAM: cpuidle: menu: Handle stopped tick more aggressively
UPSTREAM: staging: android: vsoc: fix copy_from_user overrun
UPSTREAM: ipv6: ndisc: RFC-ietf-6man-ra-pref64-09 is now published as RFC8781
BACKPORT: drm/virtio: fix missing dma_fence_put() in virtio_gpu_execbuffer_ioctl()
Conflicts:
drivers/scsi/ufs/ufshcd.c
drivers/soc/qcom/smp2p.c
drivers/usb/gadget/function/f_accessory.c
drivers/usb/gadget/function/f_fs.c
drivers/usb/gadget/function/f_uac2.c
Change-Id: I7ec83fef94dcc943abd858eb81e4a78983faae8c
|
||
|
|
aa93e9471a |
mm: memcontrol: fix excessive complexity in memory.stat reporting
commit a983b5ebee57209c99f68c8327072f25e0e6e3da upstream
We've seen memory.stat reads in top-level cgroups take up to fourteen
seconds during a userspace bug that created tens of thousands of ghost
cgroups pinned by lingering page cache.
Even with a more reasonable number of cgroups, aggregating memory.stat
is unnecessarily heavy. The complexity is this:
nr_cgroups * nr_stat_items * nr_possible_cpus
where the stat items are ~70 at this point. With 128 cgroups and 128
CPUs - decent, not enormous setups - reading the top-level memory.stat
has to aggregate over a million per-cpu counters. This doesn't scale.
Instead of spreading the source of truth across all CPUs, use the
per-cpu counters merely to batch updates to shared atomic counters.
This is the same as the per-cpu stocks we use for charging memory to the
shared atomic page_counters, and also the way the global vmstat counters
are implemented.
Vmstat has elaborate spilling thresholds that depend on the number of
CPUs, amount of memory, and memory pressure - carefully balancing the
cost of counter updates with the amount of per-cpu error. That's
because the vmstat counters are system-wide, but also used for decisions
inside the kernel (e.g. NR_FREE_PAGES in the allocator). Neither is
true for the memory controller.
Use the same static batch size we already use for page_counter updates
during charging. The per-cpu error in the stats will be 128k, which is
an acceptable ratio of cores to memory accounting granularity.
[hannes@cmpxchg.org: fix warning in __this_cpu_xchg() calls]
Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org
Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
[shaoyi@amazon.com: resolved the conflict brought by commit
|
||
|
|
955f8bc9eb |
mm: memcontrol: implement lruvec stat functions on top of each other
commit 284542656e22c43fdada8c8cc0ca9ede8453eed7 upstream The implementation of the lruvec stat functions and their variants for accounting through a page, or accounting from a preemptible context, are mostly identical and needlessly repetitive. Implement the lruvec_page functions by looking up the page's lruvec and then using the lruvec function. Implement the functions for preemptible contexts by disabling preemption before calling the atomic context functions. Link: http://lkml.kernel.org/r/20171103153336.24044-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Signed-off-by: Shaoying Xu <shaoyi@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
fdcda71d87 |
mm: memcontrol: eliminate raw access to stat and event counters
commit c9019e9bf42e66d028d70d2da6206cad4dd9250d upstream Replace all raw 'this_cpu_' modifications of the stat and event per-cpu counters with API functions such as mod_memcg_state(). This makes the code easier to read, but is also in preparation for the next patch, which changes the per-cpu implementation of those counters. Link: http://lkml.kernel.org/r/20171103153336.24044-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Shaoying Xu <shaoyi@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
e52a7c1cb9 |
memcg, oom: move out_of_memory back to the charge path
Commit
|
||
|
|
04fecbf51b |
mm: memcontrol: use int for event/state parameter in several functions
Several functions use an enum type as parameter for an event/state, but are called in some locations with an argument of a different enum type. Adjust the interface of these functions to reality by changing the parameter to int. This fixes a ton of enum-conversion warnings that are generated when building the kernel with clang. [mka@chromium.org: also change parameter type of inc/dec/mod_memcg_page_state()] Link: http://lkml.kernel.org/r/20170728213442.93823-1-mka@chromium.org Link: http://lkml.kernel.org/r/20170727211004.34435-1-mka@chromium.org Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Doug Anderson <dianders@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
739f79fc9d |
mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback()
Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
to update the memcg stats:
BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
IP: test_clear_page_writeback+0x12e/0x2c0
[...]
RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
Call Trace:
<IRQ>
end_page_writeback+0x47/0x70
f2fs_write_end_io+0x76/0x180 [f2fs]
bio_endio+0x9f/0x120
blk_update_request+0xa8/0x2f0
scsi_end_request+0x39/0x1d0
scsi_io_completion+0x211/0x690
scsi_finish_command+0xd9/0x120
scsi_softirq_done+0x127/0x150
__blk_mq_complete_request_remote+0x13/0x20
flush_smp_call_function_queue+0x56/0x110
generic_smp_call_function_single_interrupt+0x13/0x30
smp_call_function_single_interrupt+0x27/0x40
call_function_single_interrupt+0x89/0x90
RIP: 0010:native_safe_halt+0x6/0x10
(gdb) l *(test_clear_page_writeback+0x12e)
0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
614 mod_node_page_state(page_pgdat(page), idx, val);
615 if (mem_cgroup_disabled() || !page->mem_cgroup)
616 return;
617 mod_memcg_state(page->mem_cgroup, idx, val);
618 pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
619 this_cpu_add(pn->lruvec_stat->count[idx], val);
620 }
621
622 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
623 gfp_t gfp_mask,
The issue is that writeback doesn't hold a page reference and the page
might get freed after PG_writeback is cleared (and the mapping is
unlocked) in test_clear_page_writeback(). The stat functions looking up
the page's node or zone are safe, as those attributes are static across
allocation and free cycles. But page->mem_cgroup is not, and it will
get cleared if we race with truncation or migration.
It appears this race window has been around for a while, but less likely
to trigger when the memcg stats were updated first thing after
PG_writeback is cleared. Recent changes reshuffled this code to update
the global node stats before the memcg ones, though, stretching the race
window out to an extent where people can reproduce the problem.
Update test_clear_page_writeback() to look up and pin page->mem_cgroup
before clearing PG_writeback, then not use that pointer afterward. It
is a partial revert of
|
||
|
|
00f3ca2c2d |
mm: memcontrol: per-lruvec stats infrastructure
lruvecs are at the intersection of the NUMA node and memcg, which is the scope for most paging activity. Introduce a convenient accounting infrastructure that maintains statistics per node, per memcg, and the lruvec itself. Then convert over accounting sites for statistics that are already tracked in both nodes and memcgs and can be easily switched. [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code] Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org [hannes@cmpxchg.org: don't track uncharged pages at all Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org [hannes@cmpxchg.org: add missing free_percpu()] Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org [linux@roeck-us.net: hexagon: fix build error caused by include file order] Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ed52be7bfd |
mm: memcontrol: use generic mod_memcg_page_state for kmem pages
The kmem-specific functions do the same thing. Switch and drop. Link: http://lkml.kernel.org/r/20170530181724.27197-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
320492961c |
mm: memcontrol: use the node-native slab memory counters
Now that the slab counters are moved from the zone to the node level we can drop the private memcg node stats and use the official ones. Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8e675f7af5 |
mm/oom_kill: count global and memory cgroup oom kills
Show count of oom killer invocations in /proc/vmstat and count of processes killed in memory cgroup in knob "memory.events" (in memory.oom_control for v1 cgroup). Also describe difference between "oom" and "oom_kill" in memory cgroup documentation. Currently oom in memory cgroup kills tasks iff shortage has happened inside page fault. These counters helps in monitoring oom kills - for now the only way is grepping for magic words in kernel log. [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename] [akpm@linux-foundation.org: fix comment, per Konstantin] Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Roman Guschin <guroan@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2262185c5b |
mm: per-cgroup memory reclaim stats
Track the following reclaim counters for every memory cgroup: PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED. These values are exposed using the memory.stats interface of cgroup v2. The meaning of each value is the same as for global counters, available using /proc/vmstat. Also, for consistency, rename mem_cgroup_count_vm_event() to count_memcg_event_mm(). Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ccda7f4360 |
mm: memcontrol: use node page state naming scheme for memcg
The memory controllers stat function names are awkwardly long and arbitrarily different from the zone and node stat functions. The current interface is named: mem_cgroup_read_stat() mem_cgroup_update_stat() mem_cgroup_inc_stat() mem_cgroup_dec_stat() mem_cgroup_update_page_stat() mem_cgroup_inc_page_stat() mem_cgroup_dec_page_stat() This patch renames it to match the corresponding node stat functions: memcg_page_state() [node_page_state()] mod_memcg_state() [mod_node_state()] inc_memcg_state() [inc_node_state()] dec_memcg_state() [dec_node_state()] mod_memcg_page_state() [mod_node_page_state()] inc_memcg_page_state() [inc_node_page_state()] dec_memcg_page_state() [dec_node_page_state()] Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
71cd31135d |
mm: memcontrol: re-use node VM page state enum
The current duplication is a high-maintenance mess, and it's painful to add new items or query memcg state from the rest of the VM. This increases the size of the stat array marginally, but we should aim to track all these stats on a per-cgroup level anyway. Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
df0e53d061 |
mm: memcontrol: re-use global VM event enum
The current duplication is a high-maintenance mess, and it's painful to add new items. This increases the size of the event array, but we'll eventually want most of the VM events tracked on a per-cgroup basis anyway. Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
31176c7815 |
mm: memcontrol: clean up memory.events counting function
We only ever count single events, drop the @nr parameter. Rename the function accordingly. Remove low-information kerneldoc. Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
2a2e48854d |
mm: vmscan: fix IO/refault regression in cache workingset transition
Since commit |
||
|
|
9a4caf1e9f |
mm: memcontrol: provide shmem statistics
Cgroups currently don't report how much shmem they use, which can be useful data to have, in particular since shmem is included in the cache/file item while being reclaimed like anonymous memory. Add a counter to track shmem pages during charging and uncharging. Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Chris Down <cdown@fb.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
553af430e7 |
mm: rmap: fix huge file mmap accounting in the memcg stats
Huge pages are accounted as single units in the memcg's "file_mapped" counter. Account the correct number of base pages, like we do in the corresponding node counter. Link: http://lkml.kernel.org/r/20170322005111.3156-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
17cc4dfeda |
slab: use memcg_kmem_cache_wq for slab destruction operations
If there's contention on slab_mutex, queueing the per-cache destruction work item on the system_wq can unnecessarily create and tie up a lot of kworkers. Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it global and use that workqueue for the destruction work items too. While at it, convert the workqueue from an unbound workqueue to a per-cpu one with concurrency limited to 1. It's generally preferable to use per-cpu workqueues and concurrency limit of 1 is safe enough. This is suggested by Joonsoo Kim. Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
bc2791f857 |
slab: link memcg kmem_caches on their associated memory cgroup
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. While a memcg kmem_cache is listed on its root cache's ->children list, there is no direct way to iterate all kmem_caches which are assocaited with a memory cgroup. The only way to iterate them is walking all caches while filtering out caches which don't match, which would be most of them. This makes memcg destruction operations O(N^2) where N is the total number of slab caches which can be huge. This combined with the synchronous RCU operations can tie up a CPU and affect the whole machine for many hours when memory reclaim triggers offlining and destruction of the stale memcgs. This patch adds mem_cgroup->kmem_caches list which goes through memcg_cache_params->kmem_caches_node of all kmem_caches which are associated with the memcg. All memcg specific iterations, including stat file access, are updated to use the new list instead. Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b4536f0c82 |
mm, memcg: fix the active list aging for lowmem requests when memcg is enabled
Nils Holland and Klaus Ethgen have reported unexpected OOM killer invocations with 32b kernel starting with 4.8 kernels kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0 kworker/u4:5 cpuset=/ mems_allowed=0 CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2 [...] Mem-Info: active_anon:58685 inactive_anon:90 isolated_anon:0 active_file:274324 inactive_file:281962 isolated_file:0 unevictable:0 dirty:649 writeback:0 unstable:0 slab_reclaimable:40662 slab_unreclaimable:17754 mapped:7382 shmem:202 pagetables:351 bounce:0 free:206736 free_pcp:332 free_cma:0 Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 813 3474 3474 Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB lowmem_reserve[]: 0 0 21292 21292 HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB the oom killer is clearly pre-mature because there there is still a lot of page cache in the zone Normal which should satisfy this lowmem request. Further debugging has shown that the reclaim cannot make any forward progress because the page cache is hidden in the active list which doesn't get rotated because inactive_list_is_low is not memcg aware. The code simply subtracts per-zone highmem counters from the respective memcg's lru sizes which doesn't make any sense. We can simply end up always seeing the resulting active and inactive counts 0 and return false. This issue is not limited to 32b kernels but in practice the effect on systems without CONFIG_HIGHMEM would be much harder to notice because we do not invoke the OOM killer for allocations requests targeting < ZONE_NORMAL. Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node and subtract per-memcg highmem counts when memcg is enabled. Introduce helper lruvec_zone_lru_size which redirects to either zone counters or mem_cgroup_get_zone_lru_size when appropriate. We are losing empty LRU but non-zero lru size detection introduced by |
||
|
|
2d75807383 |
mm: memcontrol: consolidate cgroup socket tracking
The cgroup core and the memory controller need to track socket ownership for different purposes, but the tracking sites being entirely different is kind of ugly. Be a better citizen and rename the memory controller callbacks to match the cgroup core callbacks, then move them to the same place. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7c5f64f844 |
mm: oom: deduplicate victim selection code for memcg and global oom
When selecting an oom victim, we use the same heuristic for both memory cgroup and global oom. The only difference is the scope of tasks to select the victim from. So we could just export an iterator over all memcg tasks and keep all oom related logic in oom_kill.c, but instead we duplicate pieces of it in memcontrol.c reusing some initially private functions of oom_kill.c in order to not duplicate all of it. That looks ugly and error prone, because any modification of select_bad_process should also be propagated to mem_cgroup_out_of_memory. Let's rework this as follows: keep all oom heuristic related code private to oom_kill.c and make oom_kill.c use exported memcg functions when it's really necessary (like in case of iterating over memcg tasks). Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
efdc949079 |
mm: fix memcg stack accounting for sub-page stacks
We should account for stacks regardless of stack size, and we need to
account in sub-page units if THREAD_SIZE < PAGE_SIZE. Change the units
to kilobytes and Move it into account_kernel_stack().
Fixes:
|
||
|
|
7ee36a14f0 |
mm, vmscan: Update all zone LRU sizes before updating memcg
Minchan Kim reported setting the following warning on a 32-bit system although it can affect 64-bit systems. WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110 mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty Modules linked in: CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x76/0xaf __warn+0xea/0x110 ? mem_cgroup_update_lru_size+0x103/0x110 warn_slowpath_fmt+0x3b/0x40 mem_cgroup_update_lru_size+0x103/0x110 isolate_lru_pages.isra.61+0x2e2/0x360 shrink_active_list+0xac/0x2a0 ? __delay+0xe/0x10 shrink_node_memcg+0x53c/0x7a0 shrink_node+0xab/0x2a0 do_try_to_free_pages+0xc6/0x390 try_to_free_pages+0x245/0x590 LRU list contents and counts are updated separately. Counts are updated before pages are added to the LRU and updated after pages are removed. The warning above is from a check in mem_cgroup_update_lru_size that ensures that list sizes of zero are empty. The problem is that node-lru needs to account for highmem pages if CONFIG_HIGHMEM is set. One impact of the implementation is that the sizes are updated in multiple passes when pages from multiple zones were isolated. This happens whether HIGHMEM is set or not. When multiple zones are isolated, it's possible for a debugging check in memcg to be tripped. This patch forces all the zone counts to be updated before the memcg function is called. Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Minchan Kim <minchan@kernel.org> Reported-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
ef8f232799 |
mm, memcg: move memcg limit enforcement from zones to nodes
Memcg needs adjustment after moving LRUs to the node. Limits are tracked per memcg but the soft-limit excess is tracked per zone. As global page reclaim is based on the node, it is easy to imagine a situation where a zone soft limit is exceeded even though the memcg limit is fine. This patch moves the soft limit tree the node. Technically, all the variable names should also change but people are already familiar by the meaning of "mz" even if "mn" would be a more appropriate name now. Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a9dd0a8310 |
mm, vmscan: make shrink_node decisions more node-centric
Earlier patches focused on having direct reclaim and kswapd use data that is node-centric for reclaiming but shrink_node() itself still uses too much zone information. This patch removes unnecessary zone-based information with the most important decision being whether to continue reclaim or not. Some memcg APIs are adjusted as a result even though memcg itself still uses some zone information. [mgorman@techsingularity.net: optimization] Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
599d0c954f |
mm, vmscan: move LRU lists to node
This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
55779ec759 |
mm: fix vm-scalability regression in cgroup-aware workingset code
Commit |
||
|
|
452647784b |
mm: memcontrol: cleanup kmem charge functions
- Handle memcg_kmem_enabled check out to the caller. This reduces the number of function definitions making the code easier to follow. At the same time it doesn't result in code bloat, because all of these functions are used only in one or two places. - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't have to dive deep into memcg implementation to see which allocations are charged and which are not. - Refresh comments. Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
73f576c04b |
mm: memcontrol: fix cgroup creation failure after many small jobs
The memory controller has quite a bit of state that usually outlives the
cgroup and pins its CSS until said state disappears. At the same time
it imposes a 16-bit limit on the CSS ID space to economically store IDs
in the wild. Consequently, when we use cgroups to contain frequent but
small and short-lived jobs that leave behind some page cache, we quickly
run into the 64k limitations of outstanding CSSs. Creating a new cgroup
fails with -ENOSPC while there are only a few, or even no user-visible
cgroups in existence.
Although pinning CSSs past cgroup removal is common, there are only two
instances that actually need an ID after a cgroup is deleted: cache
shadow entries and swapout records.
Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later. They pose no hurdle.
Swap-out records do need to pin the css to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after offlining are tmpfs/shmem pages. And those
references are under the user's control, so they are manageable.
This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that. This ID can then be recycled
after offlining when the CSS remains pinned only by objects that don't
specifically need it.
This script demonstrates the problem by faulting one cache page in a new
cgroup and deleting it again:
set -e
mkdir -p pages
for x in `seq 128000`; do
[ $((x % 1000)) -eq 0 ] && echo $x
mkdir /cgroup/foo
echo $$ >/cgroup/foo/cgroup.procs
echo trex >pages/$x
echo $$ >/cgroup/cgroup.procs
rmdir /cgroup/foo
done
When run on an unpatched kernel, we eventually run out of possible IDs
even though there are no visible cgroups:
[root@ham ~]# ./cssidstress.sh
[...]
65000
mkdir: cannot create directory '/cgroup/foo': No space left on device
After this patch, the IDs get released upon cgroup destruction and the
cache and css objects get released once memory reclaim kicks in.
[hannes@cmpxchg.org: init the IDR]
Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
Fixes:
|
||
|
|
59dc76b0d4 |
mm: vmscan: reduce size of inactive file list
The inactive file list should still be large enough to contain readahead windows and freshly written file data, but it no longer is the only source for detecting multiple accesses to file pages. The workingset refault measurement code causes recently evicted file pages that get accessed again after a shorter interval to be promoted directly to the active list. With that mechanism in place, we can afford to (on a larger system) dedicate more memory to the active file list, so we can actually cache more of the frequently used file pages in memory, and not have them pushed out by streaming writes, once-used streaming file reads, etc. This can help things like database workloads, where only half the page cache can currently be used to cache the database working set. This patch automatically increases that fraction on larger systems, using the same ratio that has already been used for anonymous memory. [hannes@cmpxchg.org: cgroup-awareness] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9d5e6a9f22 |
mm: update_lru_size do the __mod_zone_page_state
Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy reclaim was removed) that lru_size can be updated by -nr_taken once per call to isolate_lru_pages(), instead of page by page. Update it inside isolate_lru_pages(), or at its two callsites? I chose to update it at the callsites, rearranging and grouping the updates by nr_taken and nr_scanned together in both. With one exception, mem_cgroup_update_lru_size(,lru,) is then used where __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding some more calls in a future commit. Make the code a little smaller and simpler by incorporating stat update in lru_size update. The exception was move_active_pages_to_lru(), which aggregated the pgmoved stat update separately from the individual lru_size updates; but I still think this a simplification worth making. However, the __mod_zone_page_state is not peculiar to mem_cgroups: so better use the name update_lru_size, calls mem_cgroup_update_lru_size when CONFIG_MEMCG. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0a6b76dd23 |
mm: workingset: make shadow node shrinker memcg aware
Workingset code was recently made memcg aware, but shadow node shrinker is still global. As a result, one small cgroup can consume all memory available for shadow nodes, possibly hurting other cgroups by reclaiming their shadow nodes, even though reclaim distances stored in its shadow nodes have no effect. To avoid this, we need to make shadow node shrinker memcg aware. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
b6ecd2dea4 |
mm: memcontrol: zap memcg_kmem_online helper
As kmem accounting is now either enabled for all cgroups or disabled system-wide, there's no point in having memcg_kmem_online() helper - instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as shrink_slab() now does. There are only two places left where this helper is used - __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can only be called if memcg_kmem_enabled() returned true. Since the cgroup it operates on is online, mem_cgroup_is_root() check will be enough. memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead of memcg_kmem_online(), because it relies on the fact that in memcg_offline_kmem() memcg->kmem_state is changed before memcg_deactivate_kmem_caches() is called, but there we can just open-code the check. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
12580e4b54 |
mm: memcontrol: report kernel stack usage in cgroup2 memory.stat
Show how much memory is allocated to kernel stacks. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |