Commit Graph

290 Commits

Author SHA1 Message Date
Yu Zhao
c0367061af FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.

NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[5.5, 7.5]%
                         Ops/sec      KB/sec
      patch1-7:          1014393.57   39455.42
      patch1-8:          1078507.59   41949.15

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

    patch1-8
      47.02%  lzo1x_1_do_compress (real work)
       6.73%  page_vma_mapped_walk
       6.14%  _raw_spin_unlock_irq
       3.39%  walk_pte_range
       2.63%  ptep_clear_flush
       2.29%  __zram_bvec_write
       2.10%  do_raw_spin_lock
       1.81%  memmove
       1.73%  obj_malloc
       1.53%  free_unref_page_list

  Configurations:
    no change

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo

Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
2022-06-27 14:32:38 +00:00
Yu Zhao
09a2a70211 FROMLIST: mm: multi-gen LRU: exploit locality in rmap
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[4, 6]%
                         Ops/sec      KB/sec
      patch1-6:          964656.80    37520.88
      patch1-7:          1014393.57   39455.42

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

  Configurations:
    no change

Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
2022-06-27 14:32:38 +00:00
Yu Zhao
42440bf2f0 FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[38, 40]%
                         IOPS         BW
      5.18-ed4643521e6a: 2547k        9989MiB/s
      patch1-6:          3540k        13.5GiB/s

  Single workload:
    memcached (anon): +[103, 107]%
                         Ops/sec      KB/sec
      5.18-ed4643521e6a: 469048.66    18243.91
      patch1-6:          964656.80    37520.88

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.18-ed4643521e6a
      39.56%  page_vma_mapped_walk
      19.32%  lzo1x_1_do_compress (real work)
       7.18%  do_raw_spin_lock
       4.23%  _raw_spin_unlock_irq
       2.26%  vma_interval_tree_subtree_search
       2.12%  vma_interval_tree_iter_next
       2.11%  folio_referenced_one
       1.90%  anon_vma_interval_tree_iter_first
       1.47%  ptep_clear_flush
       0.97%  __anon_vma_interval_tree_subtree_search

    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
2022-06-27 14:32:38 +00:00
kondors1995
0cca5fb93e Revert "Merge branch 'dev/mglru' into dev/12-2"
This reverts commit e3d1ce4d09, reversing
changes made to d3cf1f4d72.
2022-05-17 08:16:01 +00:00
Yu Zhao
1e84d38423 FROMLIST: mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page
tables to search for young PTEs and promote hot pages. A kill switch
will be added in the next patch to disable this behavior. When
disabled, the aging relies on the rmap only.

NB: this behavior has nothing similar with the page table scanning in
the 2.4 kernel [1], which searches page tables for old PTEs, adds cold
pages to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the
traversal of an entire mm_struct list; the term "walk" will be applied
to page tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct
follows its owner task to the new memcg when this task is migrated.
Given an lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot
pages before it increments max_seq.

When multiple page table walkers iterate the same list, each of them
gets a unique mm_struct; therefore they can run concurrently. Page
table walkers ignore any misplaced pages, e.g., if an mm_struct was
migrated, pages it left in the previous memcg will not be promoted
when its current memcg is under reclaim. Similarly, page table walkers
will not promote pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[5.5, 7.5]%
                         Ops/sec      KB/sec
      patch1-7:          1014393.57   39455.42
      patch1-8:          1078507.59   41949.15

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

    patch1-8
      47.02%  lzo1x_1_do_compress (real work)
       6.73%  page_vma_mapped_walk
       6.14%  _raw_spin_unlock_irq
       3.39%  walk_pte_range
       2.63%  ptep_clear_flush
       2.29%  __zram_bvec_write
       2.10%  do_raw_spin_lock
       1.81%  memmove
       1.73%  obj_malloc
       1.53%  free_unref_page_list

  Configurations:
    no change

[1] https://lwn.net/Articles/23732/
[2] https://source.android.com/devices/tech/debug/scudo

Link: https://lore.kernel.org/r/20220309021230.721028-9-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I5a3c97cf8ebf8d65d5f9528cd979a637c190053e
2022-05-03 11:10:21 +00:00
Yu Zhao
1bb75d617f FROMLIST: mm: multi-gen LRU: exploit locality in rmap
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[4, 6]%
                         Ops/sec      KB/sec
      patch1-6:          964656.80    37520.88
      patch1-7:          1014393.57   39455.42

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

    patch1-7
      45.54%  lzo1x_1_do_compress (real work)
       9.56%  page_vma_mapped_walk
       6.70%  _raw_spin_unlock_irq
       2.78%  ptep_clear_flush
       2.47%  do_raw_spin_lock
       2.22%  __zram_bvec_write
       1.87%  lru_gen_look_around
       1.78%  memmove
       1.77%  obj_malloc
       1.44%  free_unref_page_list

  Configurations:
    no change

Link: https://lore.kernel.org/r/20220309021230.721028-8-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I9a290343840f3cf925c891c8e360c7cdc24ffb9c
2022-05-03 11:10:21 +00:00
Yu Zhao
ed95143ac2 FROMLIST: mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be
applied to the multi-gen LRU, as a new convention; the terms
"activation" and "deactivation" will be applied to the active/inactive
LRU, as usual.

The aging produces young generations. Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
promotes hot pages to the youngest generation when it finds them
accessed through page tables; the demotion of cold pages happens
consequently when it increments max_seq. The aging has the complexity
O(nr_hot_pages), since it is only interested in hot pages. Promotion
in the aging path does not require any LRU list operations, only the
updates of the gen counter and lrugen->nr_pages[]; demotion, unless as
the result of the increment of max_seq, requires LRU list operations,
e.g., lru_deactivate_fn().

The eviction consumes old generations. Given an lruvec, it increments
min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types
are available from the same generation.

Each generation is divided into multiple tiers. Tiers represent
different ranges of numbers of accesses through file descriptors. A
page accessed N times through file descriptors is in tier
order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
bits in page->flags. In contrast to moving across generations, which
requires the LRU lock, moving across tiers only involves operations on
page->flags. The feedback loop also monitors refaults over all tiers
and decides when to protect pages in which tiers (N>1), using the
first tier (N=0,1) as a baseline. The first tier contains single-use
unmapped clean pages, which are most likely the best choices. The
eviction moves a page to the next generation, i.e., min_seq+1, if the
feedback loop decides so. This approach has the following advantages:
1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[38, 40]%
                         IOPS         BW
      5.18-ed4643521e6a: 2547k        9989MiB/s
      patch1-6:          3540k        13.5GiB/s

  Single workload:
    memcached (anon): +[103, 107]%
                         Ops/sec      KB/sec
      5.18-ed4643521e6a: 469048.66    18243.91
      patch1-6:          964656.80    37520.88

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.18-ed4643521e6a
      39.56%  page_vma_mapped_walk
      19.32%  lzo1x_1_do_compress (real work)
       7.18%  do_raw_spin_lock
       4.23%  _raw_spin_unlock_irq
       2.26%  vma_interval_tree_subtree_search
       2.12%  vma_interval_tree_iter_next
       2.11%  folio_referenced_one
       1.90%  anon_vma_interval_tree_iter_first
       1.47%  ptep_clear_flush
       0.97%  __anon_vma_interval_tree_subtree_search

    patch1-6
      36.13%  lzo1x_1_do_compress (real work)
      19.16%  page_vma_mapped_walk
       6.55%  _raw_spin_unlock_irq
       4.02%  do_raw_spin_lock
       2.32%  anon_vma_interval_tree_iter_first
       2.11%  ptep_clear_flush
       1.76%  __zram_bvec_write
       1.64%  folio_referenced_one
       1.40%  memmove
       1.35%  obj_malloc

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    Chrome OS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Bug: 228114874
Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
2022-05-03 11:10:21 +00:00
UtsavBalar1231
f5daca7dd9 Merge remote-tracking branch 'aosp/android-4.14-stable' into android11-base
* aosp/android-4.14-stable:
  Linux 4.14.230
  can: flexcan: flexcan_chip_freeze(): fix chip freeze for missing bitrate
  init/Kconfig: make COMPILE_TEST depend on HAS_IOMEM
  init/Kconfig: make COMPILE_TEST depend on !S390
  bpf, x86: Validate computation of branch displacements for x86-64
  cifs: Silently ignore unknown oplock break handle
  cifs: revalidate mapping when we open files for SMB1 POSIX
  ia64: mca: allocate early mca with GFP_ATOMIC
  scsi: target: pscsi: Clean up after failure in pscsi_map_sg()
  x86/build: Turn off -fcf-protection for realmode targets
  platform/x86: thinkpad_acpi: Allow the FnLock LED to change state
  drm/msm: Ratelimit invalid-fence message
  mac80211: choose first enabled channel for monitor
  mISDN: fix crash in fritzpci
  net: pxa168_eth: Fix a potential data race in pxa168_eth_remove
  ARM: dts: am33xx: add aliases for mmc interfaces
  Linux 4.14.229
  drivers: video: fbcon: fix NULL dereference in fbcon_cursor()
  staging: rtl8192e: Change state information from u16 to u8
  staging: rtl8192e: Fix incorrect source in memcpy()
  usb: gadget: udc: amd5536udc_pci fix null-ptr-dereference
  USB: cdc-acm: fix use-after-free after probe failure
  USB: cdc-acm: downgrade message to debug
  USB: cdc-acm: untangle a circular dependency between callback and softint
  cdc-acm: fix BREAK rx code path adding necessary calls
  usb: xhci-mtk: fix broken streams issue on 0.96 xHCI
  usb: musb: Fix suspend with devices connected for a64
  USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem
  usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control()
  firewire: nosy: Fix a use-after-free bug in nosy_ioctl()
  extcon: Fix error handling in extcon_dev_register
  extcon: Add stubs for extcon_register_notifier_all() functions
  pinctrl: rockchip: fix restore error in resume
  mm: writeback: use exact memcg dirty counts
  mm: fix oom_kill event handling
  mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the same cacheline
  mm: memcg: make sure memory.events is uptodate when waking pollers
  mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats
  reiserfs: update reiserfs_xattrs_initialized() condition
  drm/amdgpu: check alignment on CPU page for bo map
  drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings()
  mm: fix race by making init_zero_pfn() early_initcall
  tracing: Fix stack trace event size
  ALSA: hda/realtek: call alc_update_headset_mode() in hp_automute_hook
  ALSA: hda/realtek: fix a determine_headset_type issue for a Dell AIO
  ALSA: usb-audio: Apply sample rate quirk to Logitech Connect
  bpf: Remove MTU check in __bpf_skb_max_len
  net: wan/lmc: unregister device when no matching device is found
  appletalk: Fix skb allocation size in loopback case
  net: ethernet: aquantia: Handle error cleanup of start on open
  brcmfmac: clear EAP/association status bits on linkdown events
  ext4: do not iput inode under running transaction in ext4_rename()
  ASoC: rt5659: Update MCLK rate in set_sysclk()
  staging: comedi: cb_pcidas64: fix request_irq() warn
  staging: comedi: cb_pcidas: fix request_irq() warn
  scsi: qla2xxx: Fix broken #endif placement
  scsi: st: Fix a use after free in st_open()
  vhost: Fix vhost_vq_reset()
  powerpc: Force inlining of cpu_has_feature() to avoid build failure
  ASoC: cs42l42: Always wait at least 3ms after reset
  ASoC: cs42l42: Fix mixer volume control
  ASoC: es8316: Simplify adc_pga_gain_tlv table
  ASoC: sgtl5000: set DAP_AVC_CTRL register to correct default value on probe
  ASoC: rt5651: Fix dac- and adc- vol-tlv values being off by a factor of 10
  ASoC: rt5640: Fix dac- and adc- vol-tlv values being off by a factor of 10
  rpc: fix NULL dereference on kmalloc failure
  ext4: fix bh ref count on error paths
  ipv6: weaken the v4mapped source check
  selinux: vsock: Set SID for socket returned by accept()
  BACKPORT: drm/virtio: Use vmalloc for command buffer allocations.
  UPSTREAM: drm/virtio: Rewrite virtio_gpu_queue_ctrl_buffer using fenced version.

Conflicts:
	drivers/usb/core/quirks.c

Change-Id: I23466726e73acff25a6a397f23d3752c48379a51
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
2021-04-11 05:23:34 +00:00
Greg Thelen
f34769a0b8 mm: writeback: use exact memcg dirty counts
commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.

Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:

 1) per-memcg per-cpu values in range of [-32..32]

 2) per-memcg atomic counter

When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.  Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu.

Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg

Considering that dirty+writeback are used together for some decisions the
errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount the space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdlib.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
  	cpu_set_t affinity;

  	CPU_ZERO(&affinity);
  	CPU_SET(cpu, &affinity);
  	if (sched_setaffinity(0, sizeof(affinity), &affinity))
  		err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
  	int i, wrote;

  	set_affinity(cpu);
  	for (i = 0; i < 32; i++) {
  		for (wrote = 0; wrote < bufSize; ) {
  			int ret = write(output_fd, buf+wrote, bufSize-wrote);
  			if (ret == -1)
  				err(1, "write");
  			wrote += ret;
  		}
  	}
  }

  int main(int argc, char **argv)
  {
  	int cpu, flush_cpu = 1, output_fd;
  	const char *output;

  	if (argc != 2)
  		errx(1, "usage: output_file");

  	output = argv[1];
  	bufSize = getpagesize();
  	buf = malloc(getpagesize());
  	if (buf == NULL)
  		errx(1, "malloc failed");

  	output_fd = open(output, O_CREAT|O_RDWR);
  	if (output_fd == -1)
  		err(1, "open(%s)", output);

  	for (cpu = 0; cpu < get_nprocs(); cpu++) {
  		if (cpu != flush_cpu)
  			dirty_on(output_fd, cpu);
  	}

  	set_affinity(flush_cpu);
  	if (fsync(output_fd))
  		err(1, "fsync(%s)", output);
  	if (close(output_fd))
  		err(1, "close(%s)", output);
  	free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter? memcg already handles cpus going offline, so
no need for that overhead from percpu_counter.  And the percpu_counter
spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports.  But that is saved for later.

Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>	[4.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-07 12:47:03 +02:00
Roman Gushchin
6f599379fd mm: fix oom_kill event handling
commit fe6bdfc8e1e131720abbe77a2eb990c94c9024cb upstream.

Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate
when waking pollers") converted most of memcg event counters to
per-memcg atomics, which made them less confusing for a user.  The
"oom_kill" counter remained untouched, so now it behaves differently
than other counters (including "oom").  This adds nothing but confusion.

Let's fix this by adding the MEMCG_OOM_KILL event, and follow the
MEMCG_OOM approach.

This also removes a hack from count_memcg_event_mm(), introduced earlier
specially for the OOM_KILL counter.

[akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch]
Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[fllinden@amazon.com: backport to 4.14, minor contextual changes]
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-07 12:47:03 +02:00
Aaron Lu
ad6e2898e6 mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the same cacheline
commit e81bf9793b1861d74953ef041b4f6c7faecc2dbd upstream.

The LKP robot found a 27% will-it-scale/page_fault3 performance
regression regarding commit e27be240df53("mm: memcg: make sure
memory.events is uptodate when waking pollers").

What the test does is:
 1 mkstemp() a 128M file on a tmpfs;
 2 start $nr_cpu processes, each to loop the following:
   2.1 mmap() this file in shared write mode;
   2.2 write 0 to this file in a PAGE_SIZE step till the end of the file;
   2.3 unmap() this file and repeat this process.
 3 After 5 minutes, check how many loops they managed to complete, the
   higher the better.

The commit itself looks innocent enough as it merely changed some event
counting mechanism and this test didn't trigger those events at all.
Perf shows increased cycles spent on accessing root_mem_cgroup->stat_cpu
in count_memcg_event_mm()(called by handle_mm_fault()) and in
__mod_memcg_state() called by page_add_file_rmap().  So it's likely due
to the changed layout of 'struct mem_cgroup' that either make stat_cpu
falling into a constantly modifying cacheline or some hot fields stop
being in the same cacheline.

I verified this by moving memory_events[] back to where it was:

: --- a/include/linux/memcontrol.h
: +++ b/include/linux/memcontrol.h
: @@ -205,7 +205,6 @@ struct mem_cgroup {
:  	int		oom_kill_disable;
:
:  	/* memory.events */
: -	atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
:  	struct cgroup_file events_file;
:
:  	/* protect arrays of thresholds */
: @@ -238,6 +237,7 @@ struct mem_cgroup {
:  	struct mem_cgroup_stat_cpu __percpu *stat_cpu;
:  	atomic_long_t		stat[MEMCG_NR_STAT];
:  	atomic_long_t		events[NR_VM_EVENT_ITEMS];
: +	atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
:
:  	unsigned long		socket_pressure;

And performance restored.

Later investigation found that as long as the following 3 fields
moving_account, move_lock_task and stat_cpu are in the same cacheline,
performance will be good.  To avoid future performance surprise by other
commits changing the layout of 'struct mem_cgroup', this patch makes
sure the 3 fields stay in the same cacheline.

One concern of this approach is, moving_account and move_lock_task could
be modified when a process changes memory cgroup while stat_cpu is a
always read field, it might hurt to place them in the same cacheline.  I
assume it is rare for a process to change memory cgroup so this should
be OK.

Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop
Link: http://lkml.kernel.org/r/20180601071115.GA27302@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-07 12:47:03 +02:00
Johannes Weiner
09031a8b68 mm: memcg: make sure memory.events is uptodate when waking pollers
commit e27be240df53f1a20c659168e722b5d9f16cc7f4 upstream.

Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") added per-cpu drift to all memory cgroup stats
and events shown in memory.stat and memory.events.

For memory.stat this is acceptable.  But memory.events issues file
notifications, and somebody polling the file for changes will be
confused when the counters in it are unchanged after a wakeup.

Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
MEMCG_OOM - are sufficiently rare and high-level that we don't need
per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
frequent, but they're counting invocations of reclaim, which is a
complex operation that touches many shared cachelines.

This splits memory.events from the generic VM events and tracks them in
their own, unbuffered atomic counters.  That's also cleaner, as it
eliminates the ugly enum nesting of VM and cgroup events.

[hannes@cmpxchg.org: "array subscript is above array bounds"]
  Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-07 12:47:03 +02:00
Johannes Weiner
cfc5a8f43e mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats
commit c3cc39118c3610eb6ab4711bc624af7fc48a35fe upstream.

After commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting"), we observed slowly upward creeping NR_WRITEBACK
counts over the course of several days, both the per-memcg stats as well
as the system counter in e.g.  /proc/meminfo.

The conversion from full per-cpu stat counts to per-cpu cached atomic
stat counts introduced an irq-unsafe RMW operation into the updates.

Most stat updates come from process context, but one notable exception
is the NR_WRITEBACK counter.  While writebacks are issued from process
context, they are retired from (soft)irq context.

When writeback completions interrupt the RMW counter updates of new
writebacks being issued, the decs from the completions are lost.

Since the global updates are routed through the joint lruvec API, both
the memcg counters as well as the system counters are affected.

This patch makes the joint stat and event API irq safe.

Link: http://lkml.kernel.org/r/20180203082353.17284-1-hannes@cmpxchg.org
Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Debugged-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-04-07 12:47:03 +02:00
UtsavBalar1231
52dc535677 Merge remote-tracking branch 'aosp/android-4.14-stable' into android11-base
* aosp/android-4.14-stable:
  Linux 4.14.216
  net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
  block: fix use-after-free in disk_part_iter_next
  KVM: arm64: Don't access PMCR_EL0 when no PMU is available
  wan: ds26522: select CONFIG_BITREVERSE
  net/mlx5e: Fix two double free cases
  net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups
  iommu/intel: Fix memleak in intel_irq_remapping_alloc
  block: rsxx: select CONFIG_CRC32
  wil6210: select CONFIG_CRC32
  dmaengine: xilinx_dma: fix mixed_enum_type coverity warning
  dmaengine: xilinx_dma: check dma_async_device_register return value
  spi: stm32: FIFO threshold level - fix align packet size
  cpufreq: powernow-k8: pass policy rather than use cpufreq_cpu_get()
  i2c: sprd: use a specific timeout to avoid system hang up issue
  ARM: OMAP2+: omap_device: fix idling of devices during probe
  iio: imu: st_lsm6dsx: fix edge-trigger interrupts
  iio: imu: st_lsm6dsx: flip irq return logic
  spi: pxa2xx: Fix use-after-free on unbind
  ubifs: wbuf: Don't leak kernel memory to flash
  drm/i915: Fix mismatch between misplaced vma check and vma insert
  vmlinux.lds.h: Add PGO and AutoFDO input sections
  x86/resctrl: Don't move a task to the same resource group
  x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR
  net: fix pmtu check in nopmtudisc mode
  net: ip: always refragment ip defragmented packets
  net: vlan: avoid leaks on register_vlan_dev() failures
  net: cdc_ncm: correct overhead in delayed_ndp_size
  powerpc: Fix incorrect stw{, ux, u, x} instructions in __set_pte_at
  Linux 4.14.215
  scsi: target: Fix XCOPY NAA identifier lookup
  KVM: x86: fix shift out of bounds reported by UBSAN
  x86/mtrr: Correct the range check before performing MTRR type lookups
  netfilter: xt_RATEEST: reject non-null terminated string from userspace
  netfilter: ipset: fix shift-out-of-bounds in htable_bits()
  Revert "device property: Keep secondary firmware node secondary by type"
  ALSA: hda/realtek - Fix speaker volume control on Lenovo C940
  ALSA: hda/conexant: add a new hda codec CX11970
  x86/mm: Fix leak of pmd ptlock
  USB: serial: keyspan_pda: remove unused variable
  usb: gadget: configfs: Fix use-after-free issue with udc_name
  usb: gadget: configfs: Preserve function ordering after bind failure
  usb: gadget: Fix spinlock lockup on usb_function_deactivate
  USB: gadget: legacy: fix return error code in acm_ms_bind()
  usb: gadget: function: printer: Fix a memory leak for interface descriptor
  usb: gadget: f_uac2: reset wMaxPacketSize
  usb: gadget: select CONFIG_CRC32
  ALSA: usb-audio: Fix UBSAN warnings for MIDI jacks
  USB: usblp: fix DMA to stack
  USB: yurex: fix control-URB timeout handling
  USB: serial: option: add Quectel EM160R-GL
  USB: serial: option: add LongSung M5710 module support
  USB: serial: iuu_phoenix: fix DMA from stack
  usb: uas: Add PNY USB Portable SSD to unusual_uas
  usb: usbip: vhci_hcd: protect shift size
  USB: xhci: fix U1/U2 handling for hardware with XHCI_INTEL_HOST quirk set
  usb: chipidea: ci_hdrc_imx: add missing put_device() call in usbmisc_get_init_data()
  usb: dwc3: ulpi: Use VStsDone to detect PHY regs access completion
  USB: cdc-acm: blacklist another IR Droid device
  usb: gadget: enable super speed plus
  crypto: ecdh - avoid buffer overflow in ecdh_set_secret()
  video: hyperv_fb: Fix the mmap() regression for v5.4.y and older
  net: systemport: set dev->max_mtu to UMAC_MAX_MTU_SIZE
  net: mvpp2: Fix GoP port 3 Networking Complex Control configurations
  net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
  net: sched: prevent invalid Scell_log shift count
  vhost_net: fix ubuf refcount incorrectly when sendmsg fails
  net: usb: qmi_wwan: add Quectel EM160R-GL
  CDC-NCM: remove "connected" log message
  net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
  net: hns: fix return value check in __lb_other_process()
  ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
  net: ethernet: ti: cpts: fix ethtool output when no ptp_clock registered
  net-sysfs: take the rtnl lock when storing xps_cpus
  net: ethernet: Fix memleak in ethoc_probe
  net/ncsi: Use real net-device for response handler
  virtio_net: Fix recursive call to cpus_read_lock()
  qede: fix offload for IPIP tunnel packets
  atm: idt77252: call pci_disable_device() on error path
  ethernet: ucc_geth: set dev->max_mtu to 1518
  ethernet: ucc_geth: fix use-after-free in ucc_geth_remove()
  depmod: handle the case of /sbin/depmod without /sbin in PATH
  lib/genalloc: fix the overflow when size is too big
  scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
  scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
  workqueue: Kick a worker based on the actual activation of delayed works
  kbuild: don't hardcode depmod path
  Linux 4.14.214
  mwifiex: Fix possible buffer overflows in mwifiex_cmd_802_11_ad_hoc_start
  iio:magnetometer:mag3110: Fix alignment and data leak issues.
  iio:imu:bmi160: Fix alignment and data leak issues
  kdev_t: always inline major/minor helper functions
  dm verity: skip verity work if I/O error when system is shutting down
  ALSA: pcm: Clear the full allocated memory at hw_params
  module: delay kobject uevent until after module init call
  powerpc: sysdev: add missing iounmap() on error in mpic_msgr_probe()
  quota: Don't overflow quota file offsets
  module: set MODULE_STATE_GOING state when a module fails to load
  rtc: sun6i: Fix memleak in sun6i_rtc_clk_init
  ALSA: seq: Use bool for snd_seq_queue internal flags
  media: gp8psk: initialize stats at power control logic
  misc: vmw_vmci: fix kernel info-leak by initializing dbells in vmci_ctx_get_chkpt_doorbells()
  reiserfs: add check for an invalid ih_entry_count
  of: fix linker-section match-table corruption
  uapi: move constants from <linux/kernel.h> to <linux/const.h>
  powerpc/bitops: Fix possible undefined behaviour with fls() and fls64()
  USB: serial: digi_acceleport: fix write-wakeup deadlocks
  s390/dasd: fix hanging device offline processing
  vfio/pci: Move dummy_resources_list init in vfio_pci_probe()
  mm: memcontrol: fix excessive complexity in memory.stat reporting
  mm: memcontrol: implement lruvec stat functions on top of each other
  mm: memcontrol: eliminate raw access to stat and event counters
  ALSA: usb-audio: fix sync-ep altsetting sanity check
  ALSA: usb-audio: simplify set_sync_ep_implicit_fb_quirk
  ALSA: hda/ca0132 - Fix work handling in delayed HP detection
  md/raid10: initialize r10_bio->read_slot before use.
  x86/entry/64: Add instruction suffix
  ANDROID: usb: f_accessory: Don't drop NULL reference in acc_disconnect()
  ANDROID: usb: f_accessory: Avoid bitfields for shared variables
  ANDROID: usb: f_accessory: Cancel any pending work before teardown
  ANDROID: usb: f_accessory: Don't corrupt global state on double registration
  ANDROID: usb: f_accessory: Fix teardown ordering in acc_release()
  ANDROID: usb: f_accessory: Add refcounting to global 'acc_dev'
  ANDROID: usb: f_accessory: Wrap '_acc_dev' in get()/put() accessors
  ANDROID: usb: f_accessory: Remove useless assignment
  ANDROID: usb: f_accessory: Remove useless non-debug prints
  ANDROID: usb: f_accessory: Remove stale comments
  ANDROID: USB: f_accessory: Check dev pointer before decoding ctrl request
  ANDROID: usb: gadget: f_accessory: fix CTS test stuck
  Linux 4.14.213
  PCI: Fix pci_slot_release() NULL pointer dereference
  libnvdimm/namespace: Fix reaping of invalidated block-window-namespace labels
  xenbus/xenbus_backend: Disallow pending watch messages
  xen/xenbus: Count pending messages for each watch
  xen/xenbus/xen_bus_type: Support will_handle watch callback
  xen/xenbus: Add 'will_handle' callback support in xenbus_watch_path()
  xen/xenbus: Allow watches discard events before queueing
  xen-blkback: set ring->xenblkd to NULL after kthread_stop()
  clk: mvebu: a3700: fix the XTAL MODE pin to MPP1_9
  md/cluster: fix deadlock when node is doing resync job
  iio:imu:bmi160: Fix too large a buffer.
  iio:pressure:mpl3115: Force alignment of buffer
  iio:light:rpr0521: Fix timestamp alignment and prevent data leak.
  iio: adc: rockchip_saradc: fix missing clk_disable_unprepare() on error in rockchip_saradc_resume
  iio: buffer: Fix demux update
  mtd: parser: cmdline: Fix parsing of part-names with colons
  soc: qcom: smp2p: Safely acquire spinlock without IRQs
  spi: st-ssc4: Fix unbalanced pm_runtime_disable() in probe error path
  spi: sc18is602: Don't leak SPI master in probe error path
  spi: rb4xx: Don't leak SPI master in probe error path
  spi: pic32: Don't leak DMA channels in probe error path
  spi: davinci: Fix use-after-free on unbind
  spi: spi-sh: Fix use-after-free on unbind
  drm/dp_aux_dev: check aux_dev before use in drm_dp_aux_dev_get_by_minor()
  jfs: Fix array index bounds check in dbAdjTree
  jffs2: Fix GC exit abnormally
  ceph: fix race in concurrent __ceph_remove_cap invocations
  ima: Don't modify file descriptor mode on the fly
  powerpc/powernv/memtrace: Don't leak kernel memory to user space
  powerpc/xmon: Change printk() to pr_cont()
  powerpc/rtas: Fix typo of ibm,open-errinjct in RTAS filter
  ARM: dts: at91: sama5d2: fix CAN message ram offset and size
  KVM: arm64: Introduce handling of AArch32 TTBCR2 traps
  ext4: fix deadlock with fs freezing and EA inodes
  ext4: fix a memory leak of ext4_free_data
  btrfs: fix return value mixup in btrfs_get_extent
  Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
  USB: serial: keyspan_pda: fix write unthrottling
  USB: serial: keyspan_pda: fix tx-unthrottle use-after-free
  USB: serial: keyspan_pda: fix write-wakeup use-after-free
  USB: serial: keyspan_pda: fix stalled writes
  USB: serial: keyspan_pda: fix write deadlock
  USB: serial: keyspan_pda: fix dropped unthrottle interrupts
  USB: serial: mos7720: fix parallel-port state restore
  EDAC/amd64: Fix PCI component registration
  crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
  powerpc/perf: Exclude kernel samples while counting events in user space.
  staging: comedi: mf6x4: Fix AI end-of-conversion detection
  s390/dasd: fix list corruption of lcu list
  s390/dasd: fix list corruption of pavgroup group list
  s390/dasd: prevent inconsistent LCU device data
  s390/smp: perform initial CPU reset also for SMT siblings
  ALSA: usb-audio: Disable sample read check if firmware doesn't give back
  ALSA: pcm: oss: Fix a few more UBSAN fixes
  ALSA: hda/realtek - Enable headset mic of ASUS Q524UQK with ALC255
  ACPI: PNP: compare the string length in the matching_id()
  Revert "ACPI / resources: Use AE_CTRL_TERMINATE to terminate resources walks"
  PM: ACPI: PCI: Drop acpi_pm_set_bridge_wakeup()
  Input: cyapa_gen6 - fix out-of-bounds stack access
  media: netup_unidvb: Don't leak SPI master in probe error path
  media: sunxi-cir: ensure IR is handled when it is continuous
  media: gspca: Fix memory leak in probe
  Input: goodix - add upside-down quirk for Teclast X98 Pro tablet
  Input: cros_ec_keyb - send 'scancodes' in addition to key events
  fix namespaced fscaps when !CONFIG_SECURITY
  cfg80211: initialize rekey_data
  clk: sunxi-ng: Make sure divider tables have sentinel
  clk: s2mps11: Fix a resource leak in error handling paths in the probe function
  qlcnic: Fix error code in probe
  perf record: Fix memory leak when using '--user-regs=?' to list registers
  pwm: lp3943: Dynamically allocate PWM chip base
  pwm: zx: Add missing cleanup in error path
  clk: ti: Fix memleak in ti_fapll_synth_setup
  watchdog: coh901327: add COMMON_CLK dependency
  watchdog: qcom: Avoid context switch in restart handler
  net: korina: fix return value
  net: allwinner: Fix some resources leak in the error handling path of the probe and in the remove function
  net: bcmgenet: Fix a resource leak in an error handling path in the probe functin
  checkpatch: fix unescaped left brace
  powerpc/ps3: use dma_mapping_error()
  nfc: s3fwrn5: Release the nfc firmware
  um: chan_xterm: Fix fd leak
  watchdog: sirfsoc: Add missing dependency on HAS_IOMEM
  irqchip/alpine-msi: Fix freeing of interrupts on allocation error path
  ASoC: wm_adsp: remove "ctl" from list on error in wm_adsp_create_control()
  extcon: max77693: Fix modalias string
  clk: tegra: Fix duplicated SE clock entry
  x86/kprobes: Restore BTF if the single-stepping is cancelled
  nfs_common: need lock during iterate through the list
  nfsd: Fix message level for normal termination
  speakup: fix uninitialized flush_lock
  usb: oxu210hp-hcd: Fix memory leak in oxu_create
  usb: ehci-omap: Fix PM disable depth umbalance in ehci_hcd_omap_probe
  powerpc/pseries/hibernation: remove redundant cacheinfo update
  powerpc/pseries/hibernation: drop pseries_suspend_begin() from suspend ops
  scsi: fnic: Fix error return code in fnic_probe()
  seq_buf: Avoid type mismatch for seq_buf_init
  scsi: pm80xx: Fix error return in pm8001_pci_probe()
  scsi: qedi: Fix missing destroy_workqueue() on error in __qedi_probe
  cpufreq: scpi: Add missing MODULE_ALIAS
  cpufreq: loongson1: Add missing MODULE_ALIAS
  cpufreq: st: Add missing MODULE_DEVICE_TABLE
  cpufreq: mediatek: Add missing MODULE_DEVICE_TABLE
  cpufreq: highbank: Add missing MODULE_DEVICE_TABLE
  clocksource/drivers/arm_arch_timer: Correct fault programming of CNTKCTL_EL1.EVNTI
  dm ioctl: fix error return code in target_message
  ASoC: jz4740-i2s: add missed checks for clk_get()
  net/mlx5: Properly convey driver version to firmware
  memstick: r592: Fix error return in r592_probe()
  arm64: dts: rockchip: Fix UART pull-ups on rk3328
  pinctrl: falcon: add missing put_device() call in pinctrl_falcon_probe()
  ARM: dts: at91: sama5d2: map securam as device
  clocksource/drivers/cadence_ttc: Fix memory leak in ttc_setup_clockevent()
  media: saa7146: fix array overflow in vidioc_s_audio()
  vfio-pci: Use io_remap_pfn_range() for PCI IO memory
  NFS: switch nfsiod to be an UNBOUND workqueue.
  lockd: don't use interval-based rebinding over TCP
  SUNRPC: xprt_load_transport() needs to support the netid "rdma6"
  NFSv4.2: condition READDIR's mask for security label based on LSM state
  ath10k: Release some resources in an error handling path
  ath10k: Fix an error handling path
  ARM: dts: at91: at91sam9rl: fix ADC triggers
  PCI: iproc: Fix out-of-bound array accesses
  genirq/irqdomain: Don't try to free an interrupt that has no mapping
  power: supply: bq24190_charger: fix reference leak
  ARM: dts: Remove non-existent i2c1 from 98dx3236
  HSI: omap_ssi: Don't jump to free ID in ssi_add_controller()
  media: max2175: fix max2175_set_csm_mode() error code
  mips: cdmm: fix use-after-free in mips_cdmm_bus_discover
  samples: bpf: Fix lwt_len_hist reusing previous BPF map
  media: siano: fix memory leak of debugfs members in smsdvb_hotplug
  cw1200: fix missing destroy_workqueue() on error in cw1200_init_common
  orinoco: Move context allocation after processing the skb
  ARM: dts: at91: sama5d3_xplained: add pincontrol for USB Host
  ARM: dts: at91: sama5d4_xplained: add pincontrol for USB Host
  memstick: fix a double-free bug in memstick_check
  RDMA/cxgb4: Validate the number of CQEs
  Input: omap4-keypad - fix runtime PM error handling
  drivers: soc: ti: knav_qmss_queue: Fix error return code in knav_queue_probe
  soc: ti: Fix reference imbalance in knav_dma_probe
  soc: ti: knav_qmss: fix reference leak in knav_queue_probe
  crypto: omap-aes - Fix PM disable depth imbalance in omap_aes_probe
  powerpc/feature: Fix CPU_FTRS_ALWAYS by removing CPU_FTRS_GENERIC_32
  Input: ads7846 - fix unaligned access on 7845
  Input: ads7846 - fix integer overflow on Rt calculation
  Input: ads7846 - fix race that causes missing releases
  drm/omap: dmm_tiler: fix return error code in omap_dmm_probe()
  media: solo6x10: fix missing snd_card_free in error handling case
  scsi: core: Fix VPD LUN ID designator priorities
  media: mtk-vcodec: add missing put_device() call in mtk_vcodec_release_dec_pm()
  staging: greybus: codecs: Fix reference counter leak in error handling
  MIPS: BCM47XX: fix kconfig dependency bug for BCM47XX_BCMA
  RDMa/mthca: Work around -Wenum-conversion warning
  ASoC: arizona: Fix a wrong free in wm8997_probe
  ASoC: wm8998: Fix PM disable depth imbalance on error
  mwifiex: fix mwifiex_shutdown_sw() causing sw reset failure
  spi: tegra114: fix reference leak in tegra spi ops
  spi: tegra20-sflash: fix reference leak in tegra_sflash_resume
  spi: tegra20-slink: fix reference leak in slink ops of tegra20
  spi: spi-ti-qspi: fix reference leak in ti_qspi_setup
  Bluetooth: Fix null pointer dereference in hci_event_packet()
  arm64: dts: exynos: Correct psci compatible used on Exynos7
  selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
  ASoC: pcm: DRAIN support reactivation
  spi: img-spfi: fix reference leak in img_spfi_resume
  crypto: talitos - Fix return type of current_desc_hdr()
  sched: Reenable interrupts in do_sched_yield()
  sched/deadline: Fix sched_dl_global_validate()
  ARM: p2v: fix handling of LPAE translation in BE mode
  x86/mm/ident_map: Check for errors from ident_pud_init()
  RDMA/rxe: Compute PSN windows correctly
  selinux: fix error initialization in inode_doinit_with_dentry()
  RDMA/bnxt_re: Set queue pair state when being queried
  soc: mediatek: Check if power domains can be powered on at boot time
  soc: renesas: rmobile-sysc: Fix some leaks in rmobile_init_pm_domains()
  drm/gma500: fix double free of gma_connector
  Bluetooth: Fix slab-out-of-bounds read in hci_le_direct_adv_report_evt()
  md: fix a warning caused by a race between concurrent md_ioctl()s
  crypto: af_alg - avoid undefined behavior accessing salg_name
  media: msi2500: assign SPI bus number dynamically
  quota: Sanity-check quota file headers on load
  serial_core: Check for port state when tty is in error state
  HID: i2c-hid: add Vero K147 to descriptor override
  ARM: dts: exynos: fix USB 3.0 pins supply being turned off on Odroid XU
  ARM: dts: exynos: fix USB 3.0 VBUS control and over-current pins on Exynos5410
  ARM: dts: exynos: fix roles of USB 3.0 ports on Odroid XU
  usb: chipidea: ci_hdrc_imx: Pass DISABLE_DEVICE_STREAMING flag to imx6ul
  USB: gadget: f_rndis: fix bitrate for SuperSpeed and above
  usb: gadget: f_fs: Re-use SS descriptors for SuperSpeedPlus
  USB: gadget: f_midi: setup SuperSpeed Plus descriptors
  USB: gadget: f_acm: add support for SuperSpeed Plus
  USB: serial: option: add interface-number sanity check to flag handling
  soc/tegra: fuse: Fix index bug in get_process_id
  dm table: Remove BUG_ON(in_interrupt())
  scsi: mpt3sas: Increase IOCInit request timeout to 30s
  vxlan: Copy needed_tailroom from lowerdev
  vxlan: Add needed_headroom for lower device
  drm/tegra: sor: Disable clocks on error in tegra_sor_init()
  kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
  RDMA/cm: Fix an attempt to use non-valid pointer when cleaning timewait
  can: softing: softing_netdev_open(): fix error handling
  scsi: bnx2i: Requires MMU
  gpio: mvebu: fix potential user-after-free on probe
  ARM: dts: sun8i: v3s: fix GIC node memory range
  pinctrl: baytrail: Avoid clearing debounce value when turning it off
  pinctrl: merrifield: Set default bias in case no particular value given
  drm: fix drm_dp_mst_port refcount leaks in drm_dp_mst_allocate_vcpi
  serial: 8250_omap: Avoid FIFO corruption caused by MDR1 access
  ALSA: pcm: oss: Fix potential out-of-bounds shift
  USB: sisusbvga: Make console support depend on BROKEN
  USB: UAS: introduce a quirk to set no_write_same
  xhci: Give USB2 ports time to enter U3 in bus suspend
  ALSA: usb-audio: Fix control 'access overflow' errors from chmap
  ALSA: usb-audio: Fix potential out-of-bounds shift
  USB: add RESET_RESUME quirk for Snapscan 1212
  USB: dummy-hcd: Fix uninitialized array use in init()
  mac80211: mesh: fix mesh_pathtbl_init() error path
  net: bridge: vlan: fix error return code in __vlan_add()
  net: stmmac: dwmac-meson8b: fix mask definition of the m250_sel mux
  net: stmmac: delete the eee_ctrl_timer after napi disabled
  net/mlx4_en: Handle TX error CQE
  net/mlx4_en: Avoid scheduling restart task if it is already running
  tcp: fix cwnd-limited bug for TSO deferral where we send nothing
  net: stmmac: free tx skb buffer in stmmac_resume()
  PCI: qcom: Add missing reset for ipq806x
  x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
  scsi: be2iscsi: Revert "Fix a theoretical leak in beiscsi_create_eqs()"
  kbuild: avoid static_assert for genksyms
  pinctrl: amd: remove debounce filter setting in IRQ type setting
  Input: i8042 - add Acer laptops to the i8042 reset list
  Input: cm109 - do not stomp on control URB
  platform/x86: acer-wmi: add automatic keyboard background light toggle key as KEY_LIGHTS_TOGGLE
  soc: fsl: dpio: Get the cpumask through cpumask_of(cpu)
  scsi: ufs: Make sure clk scaling happens only when HBA is runtime ACTIVE
  ARC: stack unwinding: don't assume non-current task is sleeping
  iwlwifi: mvm: fix kernel panic in case of assert during CSA
  arm64: dts: rockchip: Assign a fixed index to mmc devices on rk3399 boards.
  iwlwifi: pcie: limit memory read spin time
  spi: bcm2835aux: Restore err assignment in bcm2835aux_spi_probe
  spi: bcm2835aux: Fix use-after-free on unbind
  Linux 4.14.212
  x86/uprobes: Do not use prefixes.nbytes when looping over prefixes.bytes
  Input: i8042 - fix error return code in i8042_setup_aux()
  i2c: qup: Fix error return code in qup_i2c_bam_schedule_desc()
  gfs2: check for empty rgrp tree in gfs2_ri_update
  tracing: Fix userstacktrace option for instances
  spi: bcm2835: Release the DMA channel if probe fails after dma_init
  spi: bcm2835: Fix use-after-free on unbind
  spi: bcm-qspi: Fix use-after-free on unbind
  spi: Introduce device-managed SPI controller allocation
  iommu/amd: Set DTE[IntTabLen] to represent 512 IRTEs
  speakup: Reject setting the speakup line discipline outside of speakup
  i2c: imx: Check for I2SR_IAL after every byte
  i2c: imx: Fix reset of I2SR_IAL flag
  mm/swapfile: do not sleep with a spin lock held
  cifs: fix potential use-after-free in cifs_echo_request()
  ftrace: Fix updating FTRACE_FL_TRAMP
  ALSA: hda/generic: Add option to enforce preferred_dacs pairs
  ALSA: hda/realtek - Add new codec supported for ALC897
  tty: Fix ->session locking
  tty: Fix ->pgrp locking in tiocspgrp()
  USB: serial: option: fix Quectel BG96 matching
  USB: serial: option: add support for Thales Cinterion EXS82
  USB: serial: option: add Fibocom NL668 variants
  USB: serial: ch341: sort device-id entries
  USB: serial: ch341: add new Product ID for CH341A
  USB: serial: kl5kusb105: fix memleak on open
  usb: gadget: f_fs: Use local copy of descriptors for userspace copy
  vlan: consolidate VLAN parsing code and limit max parsing depth
  pinctrl: baytrail: Fix pin being driven low for a while on gpiod_get(..., GPIOD_OUT_HIGH)
  pinctrl: baytrail: Replace WARN with dev_info_once when setting direct-irq pin to output
  ANDROID: Incremental fs: Set credentials before reading/writing
  ANDROID: Incremental fs: Fix incfs_test use of atol, open
  ANDROID: Incremental fs: Change per UID timeouts to microseconds
  ANDROID: Incremental fs: Add v2 feature flag
  ANDROID: Incremental fs: Add zstd feature flag
  Linux 4.14.211
  RDMA/i40iw: Address an mmap handler exploit in i40iw
  Input: i8042 - add ByteSpeed touchpad to noloop table
  Input: xpad - support Ardwiino Controllers
  ALSA: usb-audio: US16x08: fix value count for level meters
  dt-bindings: net: correct interrupt flags in examples
  net/mlx5: Fix wrong address reclaim when command interface is down
  net: pasemi: fix error return code in pasemi_mac_open()
  cxgb3: fix error return code in t3_sge_alloc_qset()
  net/x25: prevent a couple of overflows
  ibmvnic: Fix TX completion error handling
  ibmvnic: Ensure that SCRQ entry reads are correctly ordered
  ipv4: Fix tos mask in inet_rtm_getroute()
  netfilter: bridge: reset skb->pkt_type after NF_INET_POST_ROUTING traversal
  bonding: wait for sysfs kobject destruction before freeing struct slave
  usbnet: ipheth: fix connectivity with iOS 14
  tun: honor IOCB_NOWAIT flag
  tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control
  sock: set sk_err to ee_errno on dequeue from errq
  rose: Fix Null pointer dereference in rose_send_frame()
  net/af_iucv: set correct sk_protocol for child sockets
  ANDROID: Incremental fs: Fix 32-bit build
  UPSTREAM: arm64: sysreg: Clean up instructions for modifying PSTATE fields
  [CP] CHROMIUM: drm/virtio: rebase zero-copy patches to virgl/drm-misc-next
  Linux 4.14.210
  USB: core: Fix regression in Hercules audio card
  USB: core: add endpoint-blacklist quirk
  x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak
  x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak
  x86/speculation: Fix prctl() when spectre_v2_user={seccomp,prctl},ibpb
  usb: gadget: Fix memleak in gadgetfs_fill_super
  usb: gadget: f_midi: Fix memleak in f_midi_alloc
  USB: core: Change %pK for __user pointers to %px
  perf probe: Fix to die_entrypc() returns error correctly
  can: m_can: fix nominal bitiming tseg2 min for version >= 3.1
  platform/x86: toshiba_acpi: Fix the wrong variable assignment
  can: gs_usb: fix endianess problem with candleLight firmware
  efivarfs: revert "fix memory leak in efivarfs_create()"
  ibmvnic: fix NULL pointer dereference in ibmvic_reset_crq
  ibmvnic: fix NULL pointer dereference in reset_sub_crq_queues
  net: ena: set initial DMA width to avoid intel iommu issue
  nfc: s3fwrn5: use signed integer for parsing GPIO numbers
  IB/mthca: fix return value of error branch in mthca_init_cq()
  bnxt_en: Release PCI regions when DMA mask setup fails during probe.
  video: hyperv_fb: Fix the cache type when mapping the VRAM
  bnxt_en: fix error return code in bnxt_init_board()
  bnxt_en: fix error return code in bnxt_init_one()
  scsi: ufs: Fix race between shutdown and runtime resume flow
  batman-adv: set .owner to THIS_MODULE
  phy: tegra: xusb: Fix dangling pointer on probe failure
  perf/x86: fix sysfs type mismatches
  scsi: target: iscsi: Fix cmd abort fabric stop race
  scsi: libiscsi: Fix NOP race condition
  dmaengine: pl330: _prep_dma_memcpy: Fix wrong burst size
  nvme: free sq/cq dbbuf pointers when dbbuf set fails
  proc: don't allow async path resolution of /proc/self components
  HID: Add Logitech Dinovo Edge battery quirk
  x86/xen: don't unbind uninitialized lock_kicker_irq
  dmaengine: xilinx_dma: use readl_poll_timeout_atomic variant
  HID: hid-sensor-hub: Fix issue with devices with no report ID
  Input: i8042 - allow insmod to succeed on devices without an i8042 controller
  HID: cypress: Support Varmilo Keyboards' media hotkeys
  ALSA: hda/hdmi: fix incorrect locking in hdmi_pcm_close
  ALSA: hda/hdmi: Use single mutex unlock in error paths
  arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
  arm64: pgtable: Fix pte_accessible()
  btrfs: inode: Verify inode mode to avoid NULL pointer dereference
  btrfs: adjust return values of btrfs_inode_by_name
  btrfs: tree-checker: Enhance chunk checker to validate chunk profile
  PCI: Add device even if driver attach failed
  wireless: Use linux/stddef.h instead of stddef.h
  btrfs: fix lockdep splat when reading qgroup config on mount
  mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()
  perf event: Check ref_reloc_sym before using it
  Linux 4.14.209
  x86/microcode/intel: Check patch signature before saving microcode for early loading
  s390/dasd: fix null pointer dereference for ERP requests
  s390/cpum_sf.c: fix file permission for cpum_sfb_size
  mac80211: free sta in sta_info_insert_finish() on errors
  mac80211: minstrel: fix tx status processing corner case
  mac80211: minstrel: remove deferred sampling code
  xtensa: disable preemption around cache alias management calls
  regulator: workaround self-referent regulators
  regulator: avoid resolve_supply() infinite recursion
  regulator: fix memory leak with repeated set_machine_constraints()
  iio: accel: kxcjk1013: Add support for KIOX010A ACPI DSM for setting tablet-mode
  iio: accel: kxcjk1013: Replace is_smo8500_device with an acpi_type enum
  ext4: fix bogus warning in ext4_update_dx_flag()
  staging: rtl8723bs: Add 024c:0627 to the list of SDIO device-ids
  efivarfs: fix memory leak in efivarfs_create()
  tty: serial: imx: keep console clocks always on
  ALSA: mixart: Fix mutex deadlock
  ALSA: ctl: fix error path at adding user-defined element set
  speakup: Do not let the line discipline be used several times
  powerpc/uaccess-flush: fix missing includes in kup-radix.h
  libfs: fix error cast of negative value in simple_attr_write()
  xfs: revert "xfs: fix rmap key and record comparison functions"
  regulator: ti-abb: Fix array out of bound read access on the first transition
  MIPS: Alchemy: Fix memleak in alchemy_clk_setup_cpu
  ASoC: qcom: lpass-platform: Fix memory leak
  can: m_can: m_can_handle_state_change(): fix state change
  can: peak_usb: fix potential integer overflow on shift of a int
  can: mcba_usb: mcba_usb_start_xmit(): first fill skb, then pass to can_put_echo_skb()
  can: ti_hecc: Fix memleak in ti_hecc_probe
  can: dev: can_restart(): post buffer from the right context
  can: af_can: prevent potential access of uninitialized member in canfd_rcv()
  can: af_can: prevent potential access of uninitialized member in can_rcv()
  perf lock: Don't free "lock_seq_stat" if read_count isn't zero
  ARM: dts: imx50-evk: Fix the chip select 1 IOMUX
  arm: dts: imx6qdl-udoo: fix rgmii phy-mode for ksz9031 phy
  MIPS: export has_transparent_hugepage() for modules
  Input: adxl34x - clean up a data type in adxl34x_probe()
  vfs: remove lockdep bogosity in __sb_start_write
  arm64: psci: Avoid printing in cpu_psci_cpu_die()
  pinctrl: rockchip: enable gpio pclk for rockchip_gpio_to_irq
  net: ftgmac100: Fix crash when removing driver
  tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate
  net: usb: qmi_wwan: Set DTR quirk for MR400
  net/mlx5: Disable QoS when min_rates on all VFs are zero
  sctp: change to hold/put transport for proto_unreach_timer
  qlcnic: fix error return code in qlcnic_83xx_restart_hw()
  net: x25: Increase refcnt of "struct x25_neigh" in x25_rx_call_request
  net/mlx4_core: Fix init_hca fields offset
  netlabel: fix an uninitialized warning in netlbl_unlabel_staticlist()
  netlabel: fix our progress tracking in netlbl_unlabel_staticlist()
  net: Have netpoll bring-up DSA management interface
  net: dsa: mv88e6xxx: Avoid VTU corruption on 6097
  net: bridge: add missing counters to ndo_get_stats64 callback
  net: b44: fix error return code in b44_init_one()
  mlxsw: core: Use variable timeout for EMAD retries
  inet_diag: Fix error path to cancel the meseage in inet_req_diag_fill()
  devlink: Add missing genlmsg_cancel() in devlink_nl_sb_port_pool_fill()
  bnxt_en: read EEPROM A2h address using page 0
  atm: nicstar: Unmap DMA on send error
  ah6: fix error return code in ah6_input()
  ANDROID: Fix cuttlefish defconfigs now incfs uses zstd
  ANDROID: Incremental fs: Add zstd compression support
  ANDROID: Incremental fs: Small improvements
  ANDROID: Incremental fs: Initialize mount options correctly
  ANDROID: Incremental fs: Fix read_log_test which failed sporadically
  ANDROID: Incremental fs: Fix misuse of cpu_to_leXX and poll return
  ANDROID: Incremental fs: Add per UID read timeouts
  ANDROID: Incremental fs: Add .incomplete folder
  ANDROID: Incremental fs: Fix dangling else
  ANDROID: Incremental fs: Fix uninitialized variable
  ANDROID: Incremental fs: Fix filled block count from get filled blocks
  ANDROID: Incremental fs: Add hash block counts to IOC_IOCTL_GET_BLOCK_COUNT
  ANDROID: Incremental fs: Add INCFS_IOC_GET_BLOCK_COUNT
  ANDROID: Incremental fs: Make compatible with existing files
  ANDROID: Incremental fs: Remove block HASH flag
  ANDROID: Incremental fs: Remove back links and crcs
  ANDROID: Incremental fs: Remove attributes from file
  ANDROID: Incremental fs: Add .blocks_written file
  ANDROID: Incremental fs: Separate pseudo-file code
  ANDROID: Incremental fs: Add UID to pending_read
  ANDROID: Incremental fs: Create mapped file
  ANDROID: Incremental fs: Don't allow renaming .index directory.
  ANDROID: Incremental fs: Fix incfs to work on virtio-9p
  ANDROID: Incremental fs: Allow running a single test
  ANDROID: Incremental fs: Adding perf test
  ANDROID: Incremental fs: Stress tool
  ANDROID: Incremental fs: Use R/W locks to read/write segment blockmap.
  ANDROID: Incremental fs: Remove unnecessary dependencies
  ANDROID: Incremental fs: Remove annoying pr_debugs
  ANDROID: Incremental fs: dentry_revalidate should not return -EBADF.
  ANDROID: Incremental fs: Fix minor bugs
  ANDROID: Incremental fs: RCU locks instead of mutex for pending_reads.
  ANDROID: Incremental fs: fix up attempt to copy structures with READ/WRITE_ONCE
  Linux 4.14.208
  ACPI: GED: fix -Wformat
  KVM: x86: clflushopt should be treated as a no-op by emulation
  can: proc: can_remove_proc(): silence remove_proc_entry warning
  mac80211: always wind down STA state
  Input: sunkbd - avoid use-after-free in teardown paths
  powerpc/8xx: Always fault when _PAGE_ACCESSED is not set
  gpio: mockup: fix resource leak in error path
  i2c: imx: Fix external abort on interrupt in exit paths
  i2c: imx: use clk notifier for rate changes
  powerpc/64s: flush L1D after user accesses
  powerpc/uaccess: Evaluate macro arguments once, before user access is allowed
  powerpc: Fix __clear_user() with KUAP enabled
  powerpc: Implement user_access_begin and friends
  powerpc: Add a framework for user access tracking
  powerpc/64s: flush L1D on kernel entry
  powerpc/64s: move some exception handlers out of line
  powerpc/64s: Define MASKABLE_RELON_EXCEPTION_PSERIES_OOL
  Linux 4.14.207
  mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  Convert trailing spaces and periods in path components
  reboot: fix overflow parsing reboot cpu number
  Revert "kernel/reboot.c: convert simple_strtoul to kstrtoint"
  perf/core: Fix race in the perf_mmap_close() function
  xen/events: block rogue events for some time
  xen/events: defer eoi in case of excessive number of events
  xen/events: use a common cpu hotplug hook for event channels
  xen/events: switch user event channels to lateeoi model
  xen/pciback: use lateeoi irq binding
  xen/pvcallsback: use lateeoi irq binding
  xen/scsiback: use lateeoi irq binding
  xen/netback: use lateeoi irq binding
  xen/blkback: use lateeoi irq binding
  xen/events: add a new "late EOI" evtchn framework
  xen/events: fix race in evtchn_fifo_unmask()
  xen/events: add a proper barrier to 2-level uevent unmasking
  xen/events: avoid removing an event channel while handling it
  perf/core: Fix a memory leak in perf_event_parse_addr_filter()
  perf/core: Fix crash when using HW tracing kernel filters
  perf/core: Fix bad use of igrab()
  x86/speculation: Allow IBPB to be conditionally enabled on CPUs with always-on STIBP
  random32: make prandom_u32() output unpredictable
  net: Update window_clamp if SOCK_RCVBUF is set
  r8169: fix potential skb double free in an error path
  vrf: Fix fast path output packet handling with async Netfilter rules
  net/x25: Fix null-ptr-deref in x25_connect
  net/af_iucv: fix null pointer dereference on shutdown
  IPv6: Set SIT tunnel hard_header_len to zero
  swiotlb: fix "x86: Don't panic if can not alloc buffer for swiotlb"
  pinctrl: amd: fix incorrect way to disable debounce filter
  pinctrl: amd: use higher precision for 512 RtcClk
  drm/gma500: Fix out-of-bounds access to struct drm_device.vblank[]
  don't dump the threads that had been already exiting when zapped.
  selinux: Fix error return code in sel_ib_pkey_sid_slow()
  ocfs2: initialize ip_next_orphan
  futex: Don't enable IRQs unconditionally in put_pi_state()
  mei: protect mei_cl_mtu from null dereference
  usb: cdc-acm: Add DISABLE_ECHO for Renesas USB Download mode
  uio: Fix use-after-free in uio_unregister_device()
  thunderbolt: Add the missed ida_simple_remove() in ring_request_msix()
  ext4: unlock xattr_sem properly in ext4_inline_data_truncate()
  ext4: correctly report "not supported" for {usr,grp}jquota when !CONFIG_QUOTA
  perf: Fix get_recursion_context()
  cosa: Add missing kfree in error path of cosa_write
  of/address: Fix of_node memory leak in of_dma_is_coherent
  xfs: fix a missing unlock on error in xfs_fs_map_blocks
  xfs: fix rmap key and record comparison functions
  xfs: fix flags argument to rmap lookup when converting shared file rmaps
  nbd: fix a block_device refcount leak in nbd_release
  pinctrl: aspeed: Fix GPI only function problem.
  ARM: 9019/1: kprobes: Avoid fortify_panic() when copying optprobe template
  pinctrl: intel: Set default bias in case no particular value given
  iommu/amd: Increase interrupt remapping table limit to 512 entries
  scsi: scsi_dh_alua: Avoid crash during alua_bus_detach()
  cfg80211: regulatory: Fix inconsistent format argument
  mac80211: fix use of skb payload instead of header
  drm/amdgpu: perform srbm soft reset always on SDMA resume
  scsi: hpsa: Fix memory leak in hpsa_init_one()
  gfs2: check for live vs. read-only file system in gfs2_fitrim
  gfs2: Add missing truncate_inode_pages_final for sd_aspace
  gfs2: Free rd_bits later in gfs2_clear_rgrpd to fix use-after-free
  usb: gadget: goku_udc: fix potential crashes in probe
  ath9k_htc: Use appropriate rs_datalen type
  Btrfs: fix missing error return if writeback for extent buffer never started
  xfs: flush new eof page on truncate to avoid post-eof corruption
  can: peak_canfd: pucan_handle_can_rx(): fix echo management when loopback is on
  can: peak_usb: peak_usb_get_ts_time(): fix timestamp wrapping
  can: peak_usb: add range checking in decode operations
  can: can_create_echo_skb(): fix echo skb generation: always use skb_clone()
  can: dev: __can_get_echo_skb(): fix real payload length return value for RTR frames
  can: dev: can_get_echo_skb(): prevent call to kfree_skb() in hard IRQ context
  can: rx-offload: don't call kfree_skb() from IRQ context
  ALSA: hda: prevent undefined shift in snd_hdac_ext_bus_get_link()
  perf tools: Add missing swap for ino_generation
  net: xfrm: fix a race condition during allocing spi
  hv_balloon: disable warning when floor reached
  genirq: Let GENERIC_IRQ_IPI select IRQ_DOMAIN_HIERARCHY
  btrfs: reschedule when cloning lots of extents
  btrfs: sysfs: init devices outside of the chunk_mutex
  nbd: don't update block size after device is started
  time: Prevent undefined behaviour in timespec64_to_ns()
  mm: mempolicy: fix potential pte_unmap_unlock pte error
  ring-buffer: Fix recursion protection transitions between interrupt context
  regulator: defer probe when trying to get voltage from unresolved supply
  UPSTREAM: sched: idle: Avoid retaining the tick when it has been stopped
  UPSTREAM: cpuidle: menu: Handle stopped tick more aggressively
  UPSTREAM: staging: android: vsoc: fix copy_from_user overrun
  UPSTREAM: ipv6: ndisc: RFC-ietf-6man-ra-pref64-09 is now published as RFC8781
  BACKPORT: drm/virtio: fix missing dma_fence_put() in virtio_gpu_execbuffer_ioctl()

Conflicts:
	drivers/scsi/ufs/ufshcd.c
	drivers/soc/qcom/smp2p.c
	drivers/usb/gadget/function/f_accessory.c
	drivers/usb/gadget/function/f_fs.c
	drivers/usb/gadget/function/f_uac2.c

Change-Id: I7ec83fef94dcc943abd858eb81e4a78983faae8c
2021-01-22 21:56:05 +05:30
Johannes Weiner
aa93e9471a mm: memcontrol: fix excessive complexity in memory.stat reporting
commit a983b5ebee57209c99f68c8327072f25e0e6e3da upstream

We've seen memory.stat reads in top-level cgroups take up to fourteen
seconds during a userspace bug that created tens of thousands of ghost
cgroups pinned by lingering page cache.

Even with a more reasonable number of cgroups, aggregating memory.stat
is unnecessarily heavy.  The complexity is this:

	nr_cgroups * nr_stat_items * nr_possible_cpus

where the stat items are ~70 at this point.  With 128 cgroups and 128
CPUs - decent, not enormous setups - reading the top-level memory.stat
has to aggregate over a million per-cpu counters.  This doesn't scale.

Instead of spreading the source of truth across all CPUs, use the
per-cpu counters merely to batch updates to shared atomic counters.

This is the same as the per-cpu stocks we use for charging memory to the
shared atomic page_counters, and also the way the global vmstat counters
are implemented.

Vmstat has elaborate spilling thresholds that depend on the number of
CPUs, amount of memory, and memory pressure - carefully balancing the
cost of counter updates with the amount of per-cpu error.  That's
because the vmstat counters are system-wide, but also used for decisions
inside the kernel (e.g.  NR_FREE_PAGES in the allocator).  Neither is
true for the memory controller.

Use the same static batch size we already use for page_counter updates
during charging.  The per-cpu error in the stats will be 128k, which is
an acceptable ratio of cores to memory accounting granularity.

[hannes@cmpxchg.org: fix warning in __this_cpu_xchg() calls]
  Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org
Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
[shaoyi@amazon.com: resolved the conflict brought by commit 17ffa29c35 in mm/memcontrol.c by contextual fix]
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-09 13:37:37 +01:00
Johannes Weiner
955f8bc9eb mm: memcontrol: implement lruvec stat functions on top of each other
commit 284542656e22c43fdada8c8cc0ca9ede8453eed7 upstream

The implementation of the lruvec stat functions and their variants for
accounting through a page, or accounting from a preemptible context, are
mostly identical and needlessly repetitive.

Implement the lruvec_page functions by looking up the page's lruvec and
then using the lruvec function.

Implement the functions for preemptible contexts by disabling preemption
before calling the atomic context functions.

Link: http://lkml.kernel.org/r/20171103153336.24044-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-09 13:37:37 +01:00
Johannes Weiner
fdcda71d87 mm: memcontrol: eliminate raw access to stat and event counters
commit c9019e9bf42e66d028d70d2da6206cad4dd9250d upstream

Replace all raw 'this_cpu_' modifications of the stat and event per-cpu
counters with API functions such as mod_memcg_state().

This makes the code easier to read, but is also in preparation for the
next patch, which changes the per-cpu implementation of those counters.

Link: http://lkml.kernel.org/r/20171103153336.24044-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-09 13:37:37 +01:00
Michal Hocko
e52a7c1cb9 memcg, oom: move out_of_memory back to the charge path
Commit 3812c8c8f3 ("mm: memcg: do not trap chargers with full
callstack on OOM") has changed the ENOMEM semantic of memcg charges.
Rather than invoking the oom killer from the charging context it delays
the oom killer to the page fault path (pagefault_out_of_memory).  This
in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
the corresponding memcg hits the hard limit and the memcg is is OOM.
This is behavior is inconsistent with !memcg case where the oom killer
is invoked from the allocation context and the allocator keeps retrying
until it succeeds.

The difference in the behavior is user visible.  mmap(MAP_POPULATE)
might result in not fully populated ranges while the mmap return code
doesn't tell that to the userspace.  Random syscalls might fail with
ENOMEM etc.

The primary motivation of the different memcg oom semantic was the
deadlock avoidance.  Things have changed since then, though.  We have an
async oom teardown by the oom reaper now and so we do not have to rely
on the victim to tear down its memory anymore.  Therefore we can return
to the original semantic as long as the memcg oom killer is not handed
over to the users space.

There is still one thing to be careful about here though.  If the oom
killer is not able to make any forward progress - e.g.  because there is
no eligible task to kill - then we have to bail out of the charge path
to prevent from same class of deadlocks.  We have basically two options
here.  Either we fail the charge with ENOMEM or force the charge and
allow overcharge.  The first option has been considered more harmful
than useful because rare inconsistencies in the ENOMEM behavior is hard
to test for and error prone.  Basically the same reason why the page
allocator doesn't fail allocations under such conditions.  The later
might allow runaways but those should be really unlikely unless somebody
misconfigures the system.  E.g.  allowing to migrate tasks away from the
memcg to a different unlimited memcg with move_charge_at_immigrate
disabled.

Change-Id: Ib861df994357b907df37705bf626d405320e2b06
Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 29ef680ae7c21110af8e6416d84d8a72fc147b14
Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Vijayanand Jitta <vjitta@codeaurora.org>
2019-05-22 22:05:36 -07:00
Matthias Kaehlcke
04fecbf51b mm: memcontrol: use int for event/state parameter in several functions
Several functions use an enum type as parameter for an event/state, but
are called in some locations with an argument of a different enum type.
Adjust the interface of these functions to reality by changing the
parameter to int.

This fixes a ton of enum-conversion warnings that are generated when
building the kernel with clang.

[mka@chromium.org: also change parameter type of inc/dec/mod_memcg_page_state()]
  Link: http://lkml.kernel.org/r/20170728213442.93823-1-mka@chromium.org
Link: http://lkml.kernel.org/r/20170727211004.34435-1-mka@chromium.org
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Doug Anderson <dianders@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Johannes Weiner
739f79fc9d mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback()
Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
to update the memcg stats:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
    IP: test_clear_page_writeback+0x12e/0x2c0
    [...]
    RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
    Call Trace:
     <IRQ>
     end_page_writeback+0x47/0x70
     f2fs_write_end_io+0x76/0x180 [f2fs]
     bio_endio+0x9f/0x120
     blk_update_request+0xa8/0x2f0
     scsi_end_request+0x39/0x1d0
     scsi_io_completion+0x211/0x690
     scsi_finish_command+0xd9/0x120
     scsi_softirq_done+0x127/0x150
     __blk_mq_complete_request_remote+0x13/0x20
     flush_smp_call_function_queue+0x56/0x110
     generic_smp_call_function_single_interrupt+0x13/0x30
     smp_call_function_single_interrupt+0x27/0x40
     call_function_single_interrupt+0x89/0x90
    RIP: 0010:native_safe_halt+0x6/0x10

    (gdb) l *(test_clear_page_writeback+0x12e)
    0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
    614		mod_node_page_state(page_pgdat(page), idx, val);
    615		if (mem_cgroup_disabled() || !page->mem_cgroup)
    616			return;
    617		mod_memcg_state(page->mem_cgroup, idx, val);
    618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
    619		this_cpu_add(pn->lruvec_stat->count[idx], val);
    620	}
    621
    622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
    623							gfp_t gfp_mask,

The issue is that writeback doesn't hold a page reference and the page
might get freed after PG_writeback is cleared (and the mapping is
unlocked) in test_clear_page_writeback().  The stat functions looking up
the page's node or zone are safe, as those attributes are static across
allocation and free cycles.  But page->mem_cgroup is not, and it will
get cleared if we race with truncation or migration.

It appears this race window has been around for a while, but less likely
to trigger when the memcg stats were updated first thing after
PG_writeback is cleared.  Recent changes reshuffled this code to update
the global node stats before the memcg ones, though, stretching the race
window out to an extent where people can reproduce the problem.

Update test_clear_page_writeback() to look up and pin page->mem_cgroup
before clearing PG_writeback, then not use that pointer afterward.  It
is a partial revert of 62cccb8c8e ("mm: simplify lock_page_memcg()")
but leaves the pageref-holding callsites that aren't affected alone.

Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
Fixes: 62cccb8c8e ("mm: simplify lock_page_memcg()")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Reported-by: Bradley Bolen <bradleybolen@gmail.com>
Tested-by: Brad Bolen <bradleybolen@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>	[4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-18 15:32:01 -07:00
Johannes Weiner
00f3ca2c2d mm: memcontrol: per-lruvec stats infrastructure
lruvecs are at the intersection of the NUMA node and memcg, which is the
scope for most paging activity.

Introduce a convenient accounting infrastructure that maintains
statistics per node, per memcg, and the lruvec itself.

Then convert over accounting sites for statistics that are already
tracked in both nodes and memcgs and can be easily switched.

[hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
  Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
[hannes@cmpxchg.org: don't track uncharged pages at all
  Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
[hannes@cmpxchg.org: add missing free_percpu()]
  Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
[linux@roeck-us.net: hexagon: fix build error caused by include file order]
  Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:35 -07:00
Johannes Weiner
ed52be7bfd mm: memcontrol: use generic mod_memcg_page_state for kmem pages
The kmem-specific functions do the same thing.  Switch and drop.

Link: http://lkml.kernel.org/r/20170530181724.27197-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:35 -07:00
Johannes Weiner
320492961c mm: memcontrol: use the node-native slab memory counters
Now that the slab counters are moved from the zone to the node level we
can drop the private memcg node stats and use the official ones.

Link: http://lkml.kernel.org/r/20170530181724.27197-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:35 -07:00
Konstantin Khlebnikov
8e675f7af5 mm/oom_kill: count global and memory cgroup oom kills
Show count of oom killer invocations in /proc/vmstat and count of
processes killed in memory cgroup in knob "memory.events" (in
memory.oom_control for v1 cgroup).

Also describe difference between "oom" and "oom_kill" in memory cgroup
documentation.  Currently oom in memory cgroup kills tasks iff shortage
has happened inside page fault.

These counters helps in monitoring oom kills - for now the only way is
grepping for magic words in kernel log.

[akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
[akpm@linux-foundation.org: fix comment, per Konstantin]
Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Roman Guschin <guroan@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:35 -07:00
Roman Gushchin
2262185c5b mm: per-cgroup memory reclaim stats
Track the following reclaim counters for every memory cgroup: PGREFILL,
PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

These values are exposed using the memory.stats interface of cgroup v2.

The meaning of each value is the same as for global counters, available
using /proc/vmstat.

Also, for consistency, rename mem_cgroup_count_vm_event() to
count_memcg_event_mm().

Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:35 -07:00
Johannes Weiner
ccda7f4360 mm: memcontrol: use node page state naming scheme for memcg
The memory controllers stat function names are awkwardly long and
arbitrarily different from the zone and node stat functions.

The current interface is named:

  mem_cgroup_read_stat()
  mem_cgroup_update_stat()
  mem_cgroup_inc_stat()
  mem_cgroup_dec_stat()
  mem_cgroup_update_page_stat()
  mem_cgroup_inc_page_stat()
  mem_cgroup_dec_page_stat()

This patch renames it to match the corresponding node stat functions:

  memcg_page_state()		[node_page_state()]
  mod_memcg_state()		[mod_node_state()]
  inc_memcg_state()		[inc_node_state()]
  dec_memcg_state()		[dec_node_state()]
  mod_memcg_page_state()	[mod_node_page_state()]
  inc_memcg_page_state()	[inc_node_page_state()]
  dec_memcg_page_state()	[dec_node_page_state()]

Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:11 -07:00
Johannes Weiner
71cd31135d mm: memcontrol: re-use node VM page state enum
The current duplication is a high-maintenance mess, and it's painful to
add new items or query memcg state from the rest of the VM.

This increases the size of the stat array marginally, but we should aim
to track all these stats on a per-cgroup level anyway.

Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:11 -07:00
Johannes Weiner
df0e53d061 mm: memcontrol: re-use global VM event enum
The current duplication is a high-maintenance mess, and it's painful to
add new items.

This increases the size of the event array, but we'll eventually want
most of the VM events tracked on a per-cgroup basis anyway.

Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:11 -07:00
Johannes Weiner
31176c7815 mm: memcontrol: clean up memory.events counting function
We only ever count single events, drop the @nr parameter.  Rename the
function accordingly.  Remove low-information kerneldoc.

Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:11 -07:00
Johannes Weiner
2a2e48854d mm: vmscan: fix IO/refault regression in cache workingset transition
Since commit 59dc76b0d4 ("mm: vmscan: reduce size of inactive file
list") we noticed bigger IO spikes during changes in cache access
patterns.

The patch in question shrunk the inactive list size to leave more room
for the current workingset in the presence of streaming IO.  However,
workingset transitions that previously happened on the inactive list are
now pushed out of memory and incur more refaults to complete.

This patch disables active list protection when refaults are being
observed.  This accelerates workingset transitions, and allows more of
the new set to establish itself from memory, without eating into the
ability to protect the established workingset during stable periods.

The workloads that were measurably affected for us were hit pretty bad
by it, with refault/majfault rates doubling and tripling during cache
transitions, and the machines sustaining half-hour periods of 100% IO
utilization, where they'd previously have sub-minute peaks at 60-90%.

Stateful services that handle user data tend to be more conservative
with kernel upgrades.  As a result we hit most page cache issues with
some delay, as was the case here.

The severity seemed to warrant a stable tag.

Fixes: 59dc76b0d4 ("mm: vmscan: reduce size of inactive file list")
Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>	[4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:11 -07:00
Johannes Weiner
9a4caf1e9f mm: memcontrol: provide shmem statistics
Cgroups currently don't report how much shmem they use, which can be
useful data to have, in particular since shmem is included in the
cache/file item while being reclaimed like anonymous memory.

Add a counter to track shmem pages during charging and uncharging.

Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Chris Down <cdown@fb.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:08 -07:00
Johannes Weiner
553af430e7 mm: rmap: fix huge file mmap accounting in the memcg stats
Huge pages are accounted as single units in the memcg's "file_mapped"
counter.  Account the correct number of base pages, like we do in the
corresponding node counter.

Link: http://lkml.kernel.org/r/20170322005111.3156-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-03-31 17:13:30 -07:00
Tejun Heo
17cc4dfeda slab: use memcg_kmem_cache_wq for slab destruction operations
If there's contention on slab_mutex, queueing the per-cache destruction
work item on the system_wq can unnecessarily create and tie up a lot of
kworkers.

Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
global and use that workqueue for the destruction work items too.  While
at it, convert the workqueue from an unbound workqueue to a per-cpu one
with concurrency limited to 1.  It's generally preferable to use per-cpu
workqueues and concurrency limit of 1 is safe enough.

This is suggested by Joonsoo Kim.

Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov@tarantool.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:27 -08:00
Tejun Heo
bc2791f857 slab: link memcg kmem_caches on their associated memory cgroup
With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is not
under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

While a memcg kmem_cache is listed on its root cache's ->children list,
there is no direct way to iterate all kmem_caches which are assocaited
with a memory cgroup.  The only way to iterate them is walking all
caches while filtering out caches which don't match, which would be most
of them.

This makes memcg destruction operations O(N^2) where N is the total
number of slab caches which can be huge.  This combined with the
synchronous RCU operations can tie up a CPU and affect the whole machine
for many hours when memory reclaim triggers offlining and destruction of
the stale memcgs.

This patch adds mem_cgroup->kmem_caches list which goes through
memcg_cache_params->kmem_caches_node of all kmem_caches which are
associated with the memcg.  All memcg specific iterations, including
stat file access, are updated to use the new list instead.

Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:27 -08:00
Michal Hocko
b4536f0c82 mm, memcg: fix the active list aging for lowmem requests when memcg is enabled
Nils Holland and Klaus Ethgen have reported unexpected OOM killer
invocations with 32b kernel starting with 4.8 kernels

	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
	kworker/u4:5 cpuset=/ mems_allowed=0
	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
	[...]
	Mem-Info:
	active_anon:58685 inactive_anon:90 isolated_anon:0
	 active_file:274324 inactive_file:281962 isolated_file:0
	 unevictable:0 dirty:649 writeback:0 unstable:0
	 slab_reclaimable:40662 slab_unreclaimable:17754
	 mapped:7382 shmem:202 pagetables:351 bounce:0
	 free:206736 free_pcp:332 free_cma:0
	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
	lowmem_reserve[]: 0 813 3474 3474
	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
	lowmem_reserve[]: 0 0 21292 21292
	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

the oom killer is clearly pre-mature because there there is still a lot
of page cache in the zone Normal which should satisfy this lowmem
request.  Further debugging has shown that the reclaim cannot make any
forward progress because the page cache is hidden in the active list
which doesn't get rotated because inactive_list_is_low is not memcg
aware.

The code simply subtracts per-zone highmem counters from the respective
memcg's lru sizes which doesn't make any sense.  We can simply end up
always seeing the resulting active and inactive counts 0 and return
false.  This issue is not limited to 32b kernels but in practice the
effect on systems without CONFIG_HIGHMEM would be much harder to notice
because we do not invoke the OOM killer for allocations requests
targeting < ZONE_NORMAL.

Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled.  Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.

We are losing empty LRU but non-zero lru size detection introduced by
ca707239e8 ("mm: update_lru_size warn and reset bad lru_size") because
of the inherent zone vs. node discrepancy.

Fixes: f8d1a31163 ("mm: consider whether to decivate based on eligible zones inactive ratio")
Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Nils Holland <nholland@tisys.org>
Tested-by: Nils Holland <nholland@tisys.org>
Reported-by: Klaus Ethgen <Klaus@Ethgen.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:55 -08:00
Johannes Weiner
2d75807383 mm: memcontrol: consolidate cgroup socket tracking
The cgroup core and the memory controller need to track socket ownership
for different purposes, but the tracking sites being entirely different
is kind of ugly.

Be a better citizen and rename the memory controller callbacks to match
the cgroup core callbacks, then move them to the same place.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Vladimir Davydov
7c5f64f844 mm: oom: deduplicate victim selection code for memcg and global oom
When selecting an oom victim, we use the same heuristic for both memory
cgroup and global oom.  The only difference is the scope of tasks to
select the victim from.  So we could just export an iterator over all
memcg tasks and keep all oom related logic in oom_kill.c, but instead we
duplicate pieces of it in memcontrol.c reusing some initially private
functions of oom_kill.c in order to not duplicate all of it.  That looks
ugly and error prone, because any modification of select_bad_process
should also be propagated to mem_cgroup_out_of_memory.

Let's rework this as follows: keep all oom heuristic related code private
to oom_kill.c and make oom_kill.c use exported memcg functions when it's
really necessary (like in case of iterating over memcg tasks).

Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Andy Lutomirski
efdc949079 mm: fix memcg stack accounting for sub-page stacks
We should account for stacks regardless of stack size, and we need to
account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the units
to kilobytes and Move it into account_kernel_stack().

Fixes: 12580e4b54 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
7ee36a14f0 mm, vmscan: Update all zone LRU sizes before updating memcg
Minchan Kim reported setting the following warning on a 32-bit system
although it can affect 64-bit systems.

  WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
  mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
  Modules linked in:
  CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x76/0xaf
    __warn+0xea/0x110
    ? mem_cgroup_update_lru_size+0x103/0x110
    warn_slowpath_fmt+0x3b/0x40
    mem_cgroup_update_lru_size+0x103/0x110
    isolate_lru_pages.isra.61+0x2e2/0x360
    shrink_active_list+0xac/0x2a0
    ? __delay+0xe/0x10
    shrink_node_memcg+0x53c/0x7a0
    shrink_node+0xab/0x2a0
    do_try_to_free_pages+0xc6/0x390
    try_to_free_pages+0x245/0x590

LRU list contents and counts are updated separately.  Counts are updated
before pages are added to the LRU and updated after pages are removed.
The warning above is from a check in mem_cgroup_update_lru_size that
ensures that list sizes of zero are empty.

The problem is that node-lru needs to account for highmem pages if
CONFIG_HIGHMEM is set.  One impact of the implementation is that the
sizes are updated in multiple passes when pages from multiple zones were
isolated.  This happens whether HIGHMEM is set or not.  When multiple
zones are isolated, it's possible for a debugging check in memcg to be
tripped.

This patch forces all the zone counts to be updated before the memcg
function is called.

Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Minchan Kim <minchan@kernel.org>
Reported-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
ef8f232799 mm, memcg: move memcg limit enforcement from zones to nodes
Memcg needs adjustment after moving LRUs to the node.  Limits are
tracked per memcg but the soft-limit excess is tracked per zone.  As
global page reclaim is based on the node, it is easy to imagine a
situation where a zone soft limit is exceeded even though the memcg
limit is fine.

This patch moves the soft limit tree the node.  Technically, all the
variable names should also change but people are already familiar by the
meaning of "mz" even if "mn" would be a more appropriate name now.

Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
a9dd0a8310 mm, vmscan: make shrink_node decisions more node-centric
Earlier patches focused on having direct reclaim and kswapd use data
that is node-centric for reclaiming but shrink_node() itself still uses
too much zone information.  This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not.  Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.

[mgorman@techsingularity.net: optimization]
  Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
599d0c954f mm, vmscan: move LRU lists to node
This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic.  Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes.  It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but using per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU. It potentially
   could be implemented similar to the UNEVICTABLE list.

   That would reduce the skip rate with the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Johannes Weiner
55779ec759 mm: fix vm-scalability regression in cgroup-aware workingset code
Commit 23047a96d7 ("mm: workingset: per-cgroup cache thrash
detection") added a page->mem_cgroup lookup to the cache eviction,
refault, and activation paths, as well as locking to the activation
path, and the vm-scalability tests showed a regression of -23%.

While the test in question is an artificial worst-case scenario that
doesn't occur in real workloads - reading two sparse files in parallel
at full CPU speed just to hammer the LRU paths - there is still some
optimizations that can be done in those paths.

Inline the lookup functions to eliminate calls.  Also, page->mem_cgroup
doesn't need to be stabilized when counting an activation; we merely
need to hold the RCU lock to prevent the memcg from being freed.

This cuts down on overhead quite a bit:

23047a96d7 063f6715e77a7be5770d6081fe
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
  21621405 +- 0%     +11.3%   24069657 +- 2%  vm-scalability.throughput

[linux@roeck-us.net: drop unnecessary include file]
[hannes@cmpxchg.org: add WARN_ON_ONCE()s]
  Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Vladimir Davydov
452647784b mm: memcontrol: cleanup kmem charge functions
- Handle memcg_kmem_enabled check out to the caller. This reduces the
   number of function definitions making the code easier to follow. At
   the same time it doesn't result in code bloat, because all of these
   functions are used only in one or two places.

 - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't
   have to dive deep into memcg implementation to see which allocations
   are charged and which are not.

 - Refresh comments.

Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Johannes Weiner
73f576c04b mm: memcontrol: fix cgroup creation failure after many small jobs
The memory controller has quite a bit of state that usually outlives the
cgroup and pins its CSS until said state disappears.  At the same time
it imposes a 16-bit limit on the CSS ID space to economically store IDs
in the wild.  Consequently, when we use cgroups to contain frequent but
small and short-lived jobs that leave behind some page cache, we quickly
run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
fails with -ENOSPC while there are only a few, or even no user-visible
cgroups in existence.

Although pinning CSSs past cgroup removal is common, there are only two
instances that actually need an ID after a cgroup is deleted: cache
shadow entries and swapout records.

Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later.  They pose no hurdle.

Swap-out records do need to pin the css to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after offlining are tmpfs/shmem pages.  And those
references are under the user's control, so they are manageable.

This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that.  This ID can then be recycled
after offlining when the CSS remains pinned only by objects that don't
specifically need it.

This script demonstrates the problem by faulting one cache page in a new
cgroup and deleting it again:

  set -e
  mkdir -p pages
  for x in `seq 128000`; do
    [ $((x % 1000)) -eq 0 ] && echo $x
    mkdir /cgroup/foo
    echo $$ >/cgroup/foo/cgroup.procs
    echo trex >pages/$x
    echo $$ >/cgroup/cgroup.procs
    rmdir /cgroup/foo
  done

When run on an unpatched kernel, we eventually run out of possible IDs
even though there are no visible cgroups:

  [root@ham ~]# ./cssidstress.sh
  [...]
  65000
  mkdir: cannot create directory '/cgroup/foo': No space left on device

After this patch, the IDs get released upon cgroup destruction and the
cache and css objects get released once memory reclaim kicks in.

[hannes@cmpxchg.org: init the IDR]
  Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
Fixes: b2052564e6 ("mm: memcontrol: continue cache reclaim from offlined groups")
Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: John Garcia <john.garcia@mesosphere.io>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Nikolay Borisov <kernel@kyup.com>
Cc: <stable@vger.kernel.org>	[3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-23 10:25:54 +09:00
Rik van Riel
59dc76b0d4 mm: vmscan: reduce size of inactive file list
The inactive file list should still be large enough to contain readahead
windows and freshly written file data, but it no longer is the only
source for detecting multiple accesses to file pages.  The workingset
refault measurement code causes recently evicted file pages that get
accessed again after a shorter interval to be promoted directly to the
active list.

With that mechanism in place, we can afford to (on a larger system)
dedicate more memory to the active file list, so we can actually cache
more of the frequently used file pages in memory, and not have them
pushed out by streaming writes, once-used streaming file reads, etc.

This can help things like database workloads, where only half the page
cache can currently be used to cache the database working set.  This
patch automatically increases that fraction on larger systems, using the
same ratio that has already been used for anonymous memory.

[hannes@cmpxchg.org: cgroup-awareness]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Hugh Dickins
9d5e6a9f22 mm: update_lru_size do the __mod_zone_page_state
Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
reclaim was removed) that lru_size can be updated by -nr_taken once per
call to isolate_lru_pages(), instead of page by page.

Update it inside isolate_lru_pages(), or at its two callsites? I chose
to update it at the callsites, rearranging and grouping the updates by
nr_taken and nr_scanned together in both.

With one exception, mem_cgroup_update_lru_size(,lru,) is then used where
__mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding
some more calls in a future commit.  Make the code a little smaller and
simpler by incorporating stat update in lru_size update.

The exception was move_active_pages_to_lru(), which aggregated the
pgmoved stat update separately from the individual lru_size updates; but
I still think this a simplification worth making.

However, the __mod_zone_page_state is not peculiar to mem_cgroups: so
better use the name update_lru_size, calls mem_cgroup_update_lru_size
when CONFIG_MEMCG.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Vladimir Davydov
0a6b76dd23 mm: workingset: make shadow node shrinker memcg aware
Workingset code was recently made memcg aware, but shadow node shrinker
is still global.  As a result, one small cgroup can consume all memory
available for shadow nodes, possibly hurting other cgroups by reclaiming
their shadow nodes, even though reclaim distances stored in its shadow
nodes have no effect.  To avoid this, we need to make shadow node
shrinker memcg aware.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Vladimir Davydov
b6ecd2dea4 mm: memcontrol: zap memcg_kmem_online helper
As kmem accounting is now either enabled for all cgroups or disabled
system-wide, there's no point in having memcg_kmem_online() helper -
instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
shrink_slab() now does.

There are only two places left where this helper is used -
__memcg_kmem_charge() and memcg_create_kmem_cache().  The former can
only be called if memcg_kmem_enabled() returned true.  Since the cgroup
it operates on is online, mem_cgroup_is_root() check will be enough.

memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
of memcg_kmem_online(), because it relies on the fact that in
memcg_offline_kmem() memcg->kmem_state is changed before
memcg_deactivate_kmem_caches() is called, but there we can just
open-code the check.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Vladimir Davydov
12580e4b54 mm: memcontrol: report kernel stack usage in cgroup2 memory.stat
Show how much memory is allocated to kernel stacks.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00