Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Change-Id: Ied2500a773b06ac1fdc378e61fd5403a270114a6
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[ Upstream commit 079be8fc630943d9fc70a97807feb73d169ee3fc ]
The validation of the value written to sched_rt_period_us was broken
because:
- the sysclt_sched_rt_period is declared as unsigned int
- parsed by proc_do_intvec()
- the range is asserted after the value parsed by proc_do_intvec()
Because of this negative values written to the file were written into a
unsigned integer that were later on interpreted as large positive
integers which did passed the check:
if (sysclt_sched_rt_period <= 0)
return EINVAL;
This commit fixes the parsing by setting explicit range for both
perid_us and runtime_us into the sched_rt_sysctls table and processes
the values with proc_dointvec_minmax() instead.
Alternatively if we wanted to use full range of unsigned int for the
period value we would have to split the proc_handler and use
proc_douintvec() for it however even the
Documentation/scheduller/sched-rt-group.rst describes the range as 1 to
INT_MAX.
As far as I can tell the only problem this causes is that the sysctl
file allows writing negative values which when read back may confuse
userspace.
There is also a LTP test being submitted for these sysctl files at:
http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
[ pvorel: rebased for 4.19 ]
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2d931472d4740d3ada7011cc4c3499948d3a22fa)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
commit bce9332220bd677d83b19d21502776ad555a0e73 upstream.
proc_skip_spaces() seems to think it is working on C strings, and ends
up being just a wrapper around skip_spaces() with a really odd calling
convention.
Instead of basing it on skip_spaces(), it should have looked more like
proc_skip_char(), which really is the exact same function (except it
skips a particular character, rather than whitespace). So use that as
inspiration, odd coding and all.
Now the calling convention actually makes sense and works for the
intended purpose.
Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit e6cfaf34be9fcd1a8285a294e18986bfc41a409c upstream.
proc_get_long() is passed a size_t, but then assigns it to an 'int'
variable for the length. Let's not do that, even if our IO paths are
limited to MAX_RW_COUNT (exactly because of these kinds of type errors).
So do the proper test in the rigth type.
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This is perhaps the worst performance-related feature ever, and is
thoroughly abused by userspace in all the worst ways possible.
Triggering a massive task migration via sched_boost during
latency-critical operations such as app launches is a very bad idea, and
leads to noticeably-horrible stuttering.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: Tashfin Shakeer Rhythm <tashfinshakeerrhythm@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.
This is also referred to as 'the default boost value of RT tasks'.
See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.
For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.
Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.
They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.
The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.
Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.
To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.
I anticipate this to be mostly in the form of modifying the init script
of a particular device.
To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.
Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.
With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
(cherry picked from commit 13685c4a08fca9dd76bf53bfcbadc044ab2a08cb)
Conflicts:
kernel/fork.c
kernel/sysctl.c
Upstream has commit 5a5cf5cb30d7 ("cgroup: refactor fork helpers") and
further commit ef2c41cf38a7 ("clone3: allow spawning processes into
cgroups") which affect the calls after this. Picking the first would
be easy but the 2nd would be much bigger. Also, my cherry-pick put my
sysctl in the wrong place in the table in sysctl.c, so I manually
moved it. Weird.
BUG=b:160171130
TEST=With series rt tasks don't get boosted
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Change-Id: I678d8ee899ecfbe0a1f0bb94da85d54fff924a57
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2340433
Reviewed-by: Joel Fernandes <joelaf@google.com>
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.
Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.
Add a privileged interface to define a system default configuration via:
/proc/sys/kernel/sched_uclamp_util_{min,max}
which works as an unconditional clamp range restriction for all tasks.
With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.
Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.
The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.
An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit e8f14172c6b11e9a86c65532497087f8eb0f91b1)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I4f014c5ec9c312aaad606518f6e205fd0cfbcaa2
Signed-off-by: Quentin Perret <qperret@google.com>
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.
This is also referred to as 'the default boost value of RT tasks'.
See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.
For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.
Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.
They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.
The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.
Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.
To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.
I anticipate this to be mostly in the form of modifying the init script
of a particular device.
To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.
Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.
With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
(cherry picked from commit 13685c4a08fca9dd76bf53bfcbadc044ab2a08cb)
Conflicts:
kernel/fork.c
kernel/sysctl.c
Upstream has commit 5a5cf5cb30d7 ("cgroup: refactor fork helpers") and
further commit ef2c41cf38a7 ("clone3: allow spawning processes into
cgroups") which affect the calls after this. Picking the first would
be easy but the 2nd would be much bigger. Also, my cherry-pick put my
sysctl in the wrong place in the table in sysctl.c, so I manually
moved it. Weird.
BUG=b:160171130
TEST=With series rt tasks don't get boosted
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Change-Id: I678d8ee899ecfbe0a1f0bb94da85d54fff924a57
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2340433
Reviewed-by: Joel Fernandes <joelaf@google.com>
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.
Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.
Add a privileged interface to define a system default configuration via:
/proc/sys/kernel/sched_uclamp_util_{min,max}
which works as an unconditional clamp range restriction for all tasks.
With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.
Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.
The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.
An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit e8f14172c6b11e9a86c65532497087f8eb0f91b1)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I4f014c5ec9c312aaad606518f6e205fd0cfbcaa2
Signed-off-by: Quentin Perret <qperret@google.com>
init modifies this and makes mm worse. Thanks to arter97 for figuring
this out.
Reported-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Kazuki Hashimoto <kazukih@tuta.io>
Page replacement is handled in the Linux Kernel in one of two ways:
1) Asynchronously via kswapd
2) Synchronously, via direct reclaim
At page allocation time the allocating task is immediately given a page
from the zone free list allowing it to go right back to work doing
whatever it was doing; Probably directly or indirectly executing business
logic.
Just prior to satisfying the allocation, free pages is checked to see if
it has reached the zone low watermark and if so, kswapd is awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.
When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.
The time spent performing a direct reclaim can be substantial, often
taking tens to hundreds of milliseconds for small order0 allocations to
half a second or more for order9 huge-page allocations. In fact, kswapd is
not actually required on a linux system. It exists for the sole purpose of
optimizing performance by preventing direct reclaims.
When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.
The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a single
threaded kswapd that will get worse with each generation of new hardware.
Test Details
NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details
The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.
Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.
The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.
During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:
CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value
doesn't tend to fluctuate much so I just grab the highest value.
Each sample is averaged over 10 seconds
- dd_cpu - for all of the dd tasks averaged across the top samples since
there is a lot of variation.
Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total
This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks are increased.
The dd command for this test looks like this:
Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M
Test #1: Direct IO
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40
Throughput was close to peak with only 22 dd tasks. Very little system CPU
was consumed as expected as the drives DMA directly into the user address
space when using direct IO.
In this next test, the iflag=direct option is removed and we only run the
test until the pgscan_kswapd from /proc/vmstat starts to increment. At
that point metrics are parsed and reported and the pagecache contents are
dropped prior to the next test. Lather, rinse, repeat.
Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6 2 28.78 5134316.40
10 3 31.40 8051218.40
16 5 34.73 11438106.80
22 7 33.65 14140596.40
28 8 31.24 16393455.20
34 10 29.88 18219463.60
40 11 28.33 19644159.60
46 11 25.05 20802497.60
52 13 26.92 22092370.00
58 13 23.29 22884881.20
64 14 23.12 23452248.80
70 15 22.40 23916468.00
76 16 22.06 24328737.20
82 17 20.97 24718693.20
88 16 18.57 25149404.40
90 16 18.31 25245565.60
Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.
The next test measures throughput after kswapd starts running. This is the
same test only we wait for kswapd to wake up before we start collecting
metrics. The script actually keeps track of a few things that were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd.
Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.
Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901
In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.
Same test, more kswapd tasks:
Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615
By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely due
to a decrease in the number of parallel tasks at any given time doing page
replacement.
Change-Id: I966d4a9c33bad188b3409f7ceea1df205a63c3bd
Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Patch-mainline: linux-mm @ Mon, 2 Apr 2018 09:24:22
Link: https://lore.kernel.org/lkml/1522661062-39745-1-git-send-email-buddy.lumpkin@oracle.com
[charante@codeaurora.org]: Changes done to ensure QGKI compliance.
Signed-off-by: Charan Teja Kalla <charante@codeaurora.org>
- Add a separate swappiness (default 60) value to use if task isn't handled by kswapd.
- Don't throttle tasks with OOM value < 0.
- Increase zsmalloc's maximum page order.
[cyberknight777]:
- Reword Kconfig option.
- Import with clean indents.
- Be more descriptive in Kconfig option's description.
Signed-off-by: Dark-Matter7232 <kerneldeveloper7232@gmail.com>
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
xNombre: Android modifies some scheduler parameters on boot.
Applying these manually resulted in better hackbench performance.
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
To identify certain apps which request max cpu freq to affine its
tasks to specific cpus, besides checking its lib name, task name is
also a factor that we can identify the suspcious task.
Test: build and test the 'perfect kick 2' game.
Bug: 163293825
Bug: 161324271
Change-Id: I4359859db743b4c9122e9df40af0b109370e8f1f
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
This change is for general scheduler improvement.
Change-Id: I50d41aa3338803cbd45ff6314b2bb3978c59282b
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Certain userspace applications, to achieve max performance, affines its
threads to cpus that run the fastest. This is not always the
correct strategy. For e.g. in certain architectures all the
cores have the same max freq but few of them have a bigger
cache. Affining to the cpus that have bigger cache is advantageous
but such an application would end up affining them to all the cores.
Similarly if an architecture has just one cpu that runs at max freq,
it ends up crowding all its thread on that single core, which is
detrimental for performance.
To address this issue, we need to detect a suspicious looking affinity
request from userspace and check if it links in a particular library.
The latter can easily be detected by traversing executable vm areas
that map a file and checking for that library name.
When such a affinity request is found, change it to use a proper
affinity. The suspicious affinity request, the proper affinity request
and the library name can be configured by the userspace.
Change-Id: I6bb8c310ca54c03261cc721f28dfd6023ab5591a
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
With the introduction of placement hint patch, boosted tasks will not
scheduled from big cores. We tune capacity margin to let important
boosted tasks get scheduled on big cores. However, the capacity margin
affects all group of tasks, so that non-boosted tasks get more chances
to be scheduled on big cores, too. This could be solved by separating
capacity margin for boosted tasks.
Bug: 147785606
Test: margin set correctly
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: I2b02e138e36a6844afbc1ade60fe86a001814b30
Energy aware feature control is previously done through debugfs,
which will be deprecated, so move the control to sysctl.
Bug: 141333728
Test: function works as expected
Change-Id: I55411d3bb2669ba1fae3225d67cdf1cf8b3b3a7f
Signed-off-by: Rick Yiu <rickyiu@google.com>
Changes in 4.14.271
x86/speculation: Merge one test in spectre_v2_user_select_mitigation()
x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE
x86/speculation: Add eIBRS + Retpoline options
Documentation/hw-vuln: Update spectre doc
x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting
x86/speculation: Use generic retpoline by default on AMD
x86/speculation: Update link to AMD speculation whitepaper
x86/speculation: Warn about Spectre v2 LFENCE mitigation
x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMT
arm/arm64: Provide a wrapper for SMCCC 1.1 calls
arm/arm64: smccc/psci: add arm_smccc_1_1_get_conduit()
ARM: report Spectre v2 status through sysfs
ARM: early traps initialisation
ARM: use LOADADDR() to get load address of sections
ARM: Spectre-BHB workaround
ARM: include unprivileged BPF status in Spectre V2 reporting
ARM: fix build error when BPF_SYSCALL is disabled
ARM: fix co-processor register typo
ARM: Do not use NOCROSSREFS directive with ld.lld
ARM: fix build warning in proc-v7-bugs.c
xen/xenbus: don't let xenbus_grant_ring() remove grants in error case
xen/grant-table: add gnttab_try_end_foreign_access()
xen/blkfront: don't use gnttab_query_foreign_access() for mapped status
xen/netfront: don't use gnttab_query_foreign_access() for mapped status
xen/scsifront: don't use gnttab_query_foreign_access() for mapped status
xen/gntalloc: don't use gnttab_query_foreign_access()
xen: remove gnttab_query_foreign_access()
xen/9p: use alloc/free_pages_exact()
xen/gnttab: fix gnttab_end_foreign_access() without page specified
xen/netfront: react properly to failing gnttab_end_foreign_access_ref()
Linux 4.14.271
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I9f41bca7a4638913cfd2b97006d19185dd9bf584
commit 44a3918c8245ab10c6c9719dd12e7a8d291980d8 upstream.
With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable
to Spectre v2 BHB-based attacks.
When both are enabled, print a warning message and report it in the
'spectre_v2' sysfs vulnerabilities file.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
[fllinden@amazon.com: backported to 4.14]
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 4.14.267
integrity: check the return value of audit_log_start()
ima: Remove ima_policy file before directory
ima: Allow template selection with ima_template[_fmt]= after ima_hash=
mmc: sdhci-of-esdhc: Check for error num after setting mask
net: phy: marvell: Fix MDI-x polarity setting in 88e1118-compatible PHYs
NFS: Fix initialisation of nfs_client cl_flags field
NFSD: Clamp WRITE offsets
NFSv4 only print the label when its queried
nfs: nfs4clinet: check the return value of kstrdup()
NFSv4.1: Fix uninitialised variable in devicenotify
NFSv4 remove zero number of fs_locations entries error check
NFSv4 expose nfs_parse_server_name function
scsi: target: iscsi: Make sure the np under each tpg is unique
usb: dwc2: gadget: don't try to disable ep0 in dwc2_hsotg_suspend
net: stmmac: dwmac-sun8i: use return val of readl_poll_timeout()
Revert "net: axienet: Wait for PhyRstCmplt after core reset"
bpf: Add kconfig knob for disabling unpriv bpf by default
ARM: dts: imx23-evk: Remove MX23_PAD_SSP1_DETECT from hog group
ARM: dts: meson: Fix the UART compatible strings
staging: fbtft: Fix error path in fbtft_driver_module_init()
ARM: dts: imx6qdl-udoo: Properly describe the SD card detect
usb: f_fs: Fix use-after-free for epfile
bonding: pair enable_port with slave_arr_updates
ipmr,ip6mr: acquire RTNL before calling ip[6]mr_free_table() on failure path
net: do not keep the dst cache when uncloning an skb dst and its metadata
net: fix a memleak when uncloning an skb dst and its metadata
tipc: rate limit warning for received illegal binding update
net: amd-xgbe: disable interrupts during pci removal
vt_ioctl: fix array_index_nospec in vt_setactivate
vt_ioctl: add array_index_nospec to VT_ACTIVATE
n_tty: wake up poll(POLLRDNORM) on receiving data
usb: ulpi: Move of_node_put to ulpi_dev_release
usb: ulpi: Call of_node_put correctly
usb: dwc3: gadget: Prevent core from processing stale TRBs
USB: gadget: validate interface OS descriptor requests
usb: gadget: rndis: check size of RNDIS_MSG_SET command
USB: serial: ftdi_sio: add support for Brainboxes US-159/235/320
USB: serial: option: add ZTE MF286D modem
USB: serial: ch341: add support for GW Instek USB2.0-Serial devices
USB: serial: cp210x: add NCR Retail IO box id
USB: serial: cp210x: add CPI Bulk Coin Recycler id
seccomp: Invalidate seccomp mode to catch death failures
hwmon: (dell-smm) Speed up setting of fan speed
perf: Fix list corruption in perf_cgroup_switch()
Linux 4.14.267
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie562265e833202e361fd0734d18731990c0b1cbc
commit 08389d888287c3823f80b0216766b71e17f0aba5 upstream.
Add a kconfig knob which allows for unprivileged bpf to be disabled by default.
If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2.
This still allows a transition of 2 -> {0,1} through an admin. Similarly,
this also still keeps 1 -> {1} behavior intact, so that once set to permanently
disabled, it cannot be undone aside from a reboot.
We've also added extra2 with max of 2 for the procfs handler, so that an admin
still has a chance to toggle between 0 <-> 2.
Either way, as an additional alternative, applications can make use of CAP_BPF
that we added a while ago.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
[fllinden@amazon.com: backported to 4.14]
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If sysctl_sched_prefer_spread is enabled, then tasks would be freely
migrated to idle cpus within same cluster to reduce runnables.
By default, the feature is disabled.
User can trigger feature with:
echo 1 > /proc/sys/kernel/sched_prefer_spread
Aggressively spread tasks with in little cluster.
echo 2 > /proc/sys/kernel/sched_prefer_spread
Aggressively spread tasks with in little cluster as well as
big cluster, but not between big and little.
Change-Id: I0a4d87bd17de3525548765472e6f388a9970f13c
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[render: minor fixups]
Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
This change is for general scheduler improvement.
Change-Id: I5fbadf248c0bfe27bc761686de7a925cec2e4163
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Sai Harshini Nimmala <snimmala@codeaurora.org>
This change is for general scheduler improvement.
Change-Id: I33e9ec890f8b54d673770d5d02dba489a8e08ce7
Signed-off-by: Sai Harshini Nimmala <snimmala@codeaurora.org>