Patch series "Add error_report_end tracepoint to KFENCE and KASAN", v3.
This patchset adds a tracepoint, error_repor_end, that is to be used by
KFENCE, KASAN, and potentially other bug detection tools, when they print
an error report. One of the possible use cases is userspace collection of
kernel error reports: interested parties can subscribe to the tracing
event via tracefs, and get notified when an error report occurs.
This patch (of 3):
Introduce error_report_end tracepoint. It can be used in debugging tools
like KASAN, KFENCE, etc. to provide extensions to the error reporting
mechanisms (e.g. allow tests hook into error reporting, ease error report
collection from production kernels). Another benefit would be making use
of ftrace for debugging or benchmarking the tools themselves.
Should we need it, the tracepoint name leaves us with the possibility to
introduce a complementary error_report_start tracepoint in the future.
Link: https://lkml.kernel.org/r/20210121131915.1331302-1-glider@google.com
Link: https://lkml.kernel.org/r/20210121131915.1331302-2-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Suggested-by: Marco Elver <elver@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>
Based on upstream commit 794a56ebd9a57db12abaec63f038c6eb073461f7
Change-Id: I2a6be669c847da253f09e72c6f41437a9c0f11ef
Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>
Under CONFIG_FORTIFY_SOURCE, memcpy() will check the size of destination
and source buffers. Defining kernel_headers_data as "char" would trip
this check. Since these addresses are treated as byte arrays, define
them as arrays (as done everywhere else).
This was seen with:
$ cat /sys/kernel/kheaders.tar.xz >> /dev/null
detected buffer overflow in memcpy
kernel BUG at lib/string_helpers.c:1027!
...
RIP: 0010:fortify_panic+0xf/0x20
[...]
Call Trace:
<TASK>
ikheaders_read+0x45/0x50 [kheaders]
kernfs_fop_read_iter+0x1a4/0x2f0
...
Bug: 254441685
Reported-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20230302112130.6e402a98@kernel.org/
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 43d8ce9d65a5 ("Provide in-kernel headers to make extending kernel easier")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230302224946.never.243-kees@kernel.org
(cherry picked from commit b69edab47f1da8edd8e7bfdf8c70f51a2a5d89fb)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I73c7530b9c558c1c8dac5f8962dbc31c553c0be7
Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>
If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed in the following path:
do_rmdir
cgroup_rmdir
kernfs_drain_open_files
cgroup_file_release
cgroup_pressure_release
psi_trigger_destroy
However, the polling thread still has a reference to the pressure file and
will access the freed waitqueue when the file is closed or upon exit:
fput
ep_eventpoll_release
ep_free
ep_remove_wait_queue
remove_wait_queue
This results in use-after-free as pasted below.
The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real lifetime.
Using wake_up_pollfree() here might be less than ideal, but it is in line
with the comment at commit 42288cb44c4b ("wait: add wake_up_pollfree()")
since the waitqueue's lifetime is not tied to file's one and can be
considered as another special case. While this would be fixable by somehow
making cgroup_file_release() be tied to the fput(), it would require
sizable refactoring at cgroups or higher layer which might be more
justifiable if we identify more cases like this.
BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
Write of size 4 at addr ffff88810e625328 by task a.out/4404
CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
Call Trace:
<TASK>
dump_stack_lvl+0x73/0xa0
print_report+0x16c/0x4e0
kasan_report+0xc3/0xf0
kasan_check_range+0x2d2/0x310
_raw_spin_lock_irqsave+0x60/0xc0
remove_wait_queue+0x1a/0xa0
ep_free+0x12c/0x170
ep_eventpoll_release+0x26/0x30
__fput+0x202/0x400
task_work_run+0x11d/0x170
do_exit+0x495/0x1130
do_group_exit+0x100/0x100
get_signal+0xd67/0xde0
arch_do_signal_or_restart+0x2a/0x2b0
exit_to_user_mode_prepare+0x94/0x100
syscall_exit_to_user_mode+0x20/0x40
do_syscall_64+0x52/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
</TASK>
Allocated by task 4404:
kasan_set_track+0x3d/0x60
__kasan_kmalloc+0x85/0x90
psi_trigger_create+0x113/0x3e0
pressure_write+0x146/0x2e0
cgroup_file_write+0x11c/0x250
kernfs_fop_write_iter+0x186/0x220
vfs_write+0x3d8/0x5c0
ksys_write+0x90/0x110
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Freed by task 4407:
kasan_set_track+0x3d/0x60
kasan_save_free_info+0x27/0x40
____kasan_slab_free+0x11d/0x170
slab_free_freelist_hook+0x87/0x150
__kmem_cache_free+0xcb/0x180
psi_trigger_destroy+0x2e8/0x310
cgroup_file_release+0x4f/0xb0
kernfs_drain_open_files+0x165/0x1f0
kernfs_drain+0x162/0x1a0
__kernfs_remove+0x1fb/0x310
kernfs_remove_by_name_ns+0x95/0xe0
cgroup_addrm_files+0x67f/0x700
cgroup_destroy_locked+0x283/0x3c0
cgroup_rmdir+0x29/0x100
kernfs_iop_rmdir+0xd1/0x140
vfs_rmdir+0xfe/0x240
do_rmdir+0x13d/0x280
__x64_sys_rmdir+0x2c/0x30
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Bug: 254441685
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Link: https://lore.kernel.org/r/20230214212705.4058045-1-kamatam@amazon.com
(cherry picked from commit c2dbe32d5db5c4ead121cf86dabd5ab691fb47fe)
Signed-off-by: Lee Jones <joneslee@google.com>
Change-Id: I9677499b2885149a1070f508931113ad8a02277a
Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>
* v1.9.0 - Syncronized suspend/resume driver printing of ignored errors,
* and turned on state notifier debugger..
* - subsys_incall changed to module_init,
* - state notifier - going back to scheduled work, and subsys initcall,
* - initializing work in module init.
* - updated our outdated method of workqueue declaration.
*
* v1.9.1 - Updated the depecrated method of declaring work but simply declaring
* the two work structs. Also actually INITialized the work on init,
* and flushed it on exit.
*
* v2.0.0 - Included State Notifier hooks to run explicitly once power state
* changes are completed to prevent blocking issues.
Credits to robcore (https://github.com/robcore) for some updates.
Signed-off-by: ThunderStorms21th - nalas
kernel/power/powersuspend: new PM kernel driver for Android w/o early…
…_suspend v1.5 (faux123/Yank555.lu)
powersuspend: new PM kernel driver for Android w/o early_suspend
Android early_suspend/late_resume PM kernel driver framework has been
deprecated by Google. This new powersuspend PM kernel driver is a replacement
for it and existing early_suspend drivers can be easily adapted to use this
new replacement driver
Signed-off-by: Paul Reioux <reioux@gmail.com>
powersuspend: fix logci derps :p
Signed-off-by: Paul Reioux <reioux@gmail.com>
kernel/power/powersuspend: remove userspace dependency from powersuspend
make powersuspend not depend on a userspace initiator anymore, but use a hook
in autosleep instead.
Signed-off-by: Paul Reioux <reioux@gmail.com>
kernel/power/powersuspend: add back userpace control w/ default kernel control
make kernel / userspace mode switchable
Signed-off-by: Paul Reioux <reioux@gmail.com>
kernel/power/powersuspend: default to userspace for now
Signed-off-by: Paul Reioux <reioux@gmail.com>
kernel/power/powersuspend: LCD screen on/off hooks (Yank555.lu)
- add an alternative hook for screen on / off detection in the display
pannel driver
- sleeps ~0.1s sooner, wakes up ~1s later then autosleep
- clean up source formatting a bit (Paul Reioux)
Signed-off-by: Paul Reioux <reioux@gmail.com>
mdss_dsi_panel.c: add powersuspend hooks
Signed-off-by: Paul Reioux <reioux@gmail.com>
kernel/power/powersuspend: cumulative update to version 1.5
- fix hybrid-kernel mode cannot be set through sysfs
kernel/power/powersuspend: new PM kernel driver for Android w/o early_suspend v1.4 (Yank555.lu)
- add hybrid-kernel mode (autosleep and panel, first wins)
- default to hybrid-kernel mode
- harmonize debug message with my other stuff
- include all 3 modes as default, so it's only commenting out to change
kernel/power/powersuspend: new PM kernel driver for Android w/o early_suspend v1.3 (Yank555.lu)
- fix stupid typo
- add an alternative hook for screen on / off detection in the display panel driver
(sleeps ~0.1s sooner, wakes up ~1s later then autosleep)
kernel/power/powersuspend: new PM kernel driver for Android w/o early_suspend v1.2 (Yank555.lu)
- make kernel / userspace mode switchable
kernel/power/powersuspend: new PM kernel driver for Android w/o early_suspend v1.1 (Yank555.lu)
- make powersuspend not depend on a userspace initiator anymore, but use a hook in autosleep instead.
kernel/power/powersuspend: new PM kernel driver for Android w/o early_suspend (faux123)
Android early_suspend/late_resume PM kernel driver framework has been
deprecated by Google. This new powersuspend PM kernel driver is a replacement
for it and existing early_suspend drivers can be easily adapted to use this
new replacement driver
Signed-off-by: Paul Reioux <reioux@gmail.com>
drivers/video/msm/mdss/mdss_dsi_panel.c: update powersuspend hook calls
fix typo!
Change-Id: If9ee91d5ff2ba2d7623865758d0d8cdf582c341c
Signed-off-by: Paul Reioux <reioux@gmail.com>
Removed mdss_dsi_panel.c panel hooks not for exynos/decon display
Signed-off-by: UpInTheAir <upintheair.xda@gmail.com>
Conflicts:
drivers/video/msm/mdss/mdss_dsi_panel.c
kernel/power/powersuspend: new PM kernel driver for Android w/o early…
…_suspend v1.6 (faux123/Yank555.lu)
- autosleep hook removed, not working on shamu
- autosleep and hybrid modes removed
- panel mode is now default
- debug output switchable in defconfig
kernel/power/powersuspend: v1.6.1 add autosleep & hybrid modes
hybrid mode is default
Signed-off-by: UpInTheAir <upintheair.xda@gmail.com>
kernel/power/powersuspend: new PM kernel driver for Android w/o early…
…_suspend v1.7 (faux123/Yank555.lu)
- do only run state change if change actually requests a new state
Change-Id: I3f3989ce939cd4d60831fb05dc8790cf10bcbd17
Signed-off-by: UpInTheAir <upintheair.xda@gmail.com>
Conflicts:
kernel/power/powersuspend.c
kernel/power/powersuspend: new PM kernel driver for Android w/o early…
…_suspend v1.7 (faux123/Yank555.lu)
- fix a #ifdef / #endif derp
Thanx to AuxXxilium (Christian Schulthess) for pointing this out to me !
Change-Id: I5827ea36a69e4a42d6a9196749ccac12c8bee746
Signed-off-by: yank555-lu <yank555.lu@gmail.com>
powersuspend: add power_suspended boolean for global access
Some routines just need the boolean for whether powersuspend is active or not.
Rather than hooking to the powersuspend on every occasions,
add power_suspended boolean for global access.
Signed-off-by: arter97 <qkrwngud825@gmail.com>
Signed-off-by: UpInTheAir <upintheair.xda@gmail.com>
Conflicts:
kernel/power/powersuspend.c
powersuspend: Replaced deprecated singlethread workqueue with updated…
… schedule_work
powersuspend: add debug sysfs trigger to see how driver work
Signed-off-by: UpInTheAir <upintheair.skyhigh@gmail.com>
powersuspend: disable debugging by default
Signed-off-by: UpInTheAir <upintheair.skyhigh@gmail.com>
fix powersuspend compile error
Signed-off-by: morogoku <morogoku@hotmail.com>
commit 678379e1d4f7443b170939525d3312cfc37bf86b upstream.
Cloning a descriptor table picks the size that would cover all currently
opened files. That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor used to
be open before our call has taken it out.
Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().
The problem is, a range that used to extend to the end of descriptor
table back when close_range() had looked at it might very well have stuff
grown after it by the time dup_fd() has allocated a new files_struct
and started to figure out the capacity of fdtable to be attached to that.
That leads to interesting pathological cases; at the very least it's a
QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
a snapshot of descriptor table one might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.
It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size(). There we are under ->files_lock,
so the race is trivially avoided.
So we do the following:
* switch close_files() from calling unshare_fd() to calling
dup_fd().
* undo the calling convention change done to unshare_fd() in
60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting. NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.
Fixes: 60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
Cc: stable@vger.kernel.org
Change-Id: I6782a2edf98970b6c2d662048061e28f7e57b9c9
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
One of the use-cases of close_range() is to drop file descriptors just before
execve(). This would usually be expressed in the sequence:
unshare(CLONE_FILES);
close_range(3, ~0U);
as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE.
This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
maximum number of file descriptors to copy from the old struct files. When the
user requests that all file descriptors are supposed to be closed via
close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
to do any of the heavy fput() work for everything above min.
The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
fact currently share our file descriptor table we create a new private copy.
We then close all fds in the requested range and finally after we're done we
install the new fd table.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: I0813045886501e40a45693ee1edad50bdf2b66e5
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
* to ensure all pending call functions are processed correctly when the CPU is idle. This
synchronizes the idle task handling with recent core SMP changes.
Signed-off-by: StimLuks87 <san10031987san@gmail.com>
Several users reported stutters in gameplay with v2 which was resolved after restoring the default value of the timer frequency (also known as the tick rate). Some users noticed improvments with 300Hz, but there weren't any noticeable improvmenets with higher tick rate values like 600Hz. So, sticking to 300Hz for now.
Signed-off-by: alternoegraha <noegrahachan@gmail.com>
[ Upstream commit afe5960dc208fe069ddaaeb0994d857b24ac19d1 ]
When a tracepoint event is created with attr.freq = 1,
'hwc->period_left' is not initialized correctly. As a result,
in the perf_swevent_overflow() function, when the first time the event occurs,
it calculates the event overflow and the perf_swevent_set_period() returns 3,
this leads to the event are recorded for three duplicate times.
Step to reproduce:
1. Enable the tracepoint event & starting tracing
$ echo 1 > /sys/kernel/tracing/events/module/module_free
$ echo 1 > /sys/kernel/tracing/tracing_on
2. Record with perf
$ perf record -a --strict-freq -F 1 -e "module:module_free"
3. Trigger module_free event.
$ modprobe -i sunrpc
$ modprobe -r sunrpc
Result:
- Trace pipe result:
$ cat trace_pipe
modprobe-174509 [003] ..... 6504.868896: module_free: sunrpc
- perf sample:
modprobe 174509 [003] 6504.868980: module:module_free: sunrpc
modprobe 174509 [003] 6504.868980: module:module_free: sunrpc
modprobe 174509 [003] 6504.868980: module:module_free: sunrpc
By setting period_left via perf_swevent_set_period() as other sw_event did,
This problem could be solved.
After patch:
- Trace pipe result:
$ cat trace_pipe
modprobe 1153096 [068] 613468.867774: module:module_free: xfs
- perf sample
modprobe 1153096 [068] 613468.867794: module:module_free: xfs
Link: https://lore.kernel.org/20240913021347.595330-1-yeoreum.yun@arm.com
Fixes: bd2b5b1 ("perf_counter: More aggressive frequency adjustment")
Signed-off-by: Levi Yun <yeoreum.yun@arm.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Protect the working set under memory pressure to prevent thrashing, avoid high latency and prevent livelock in near-OOM conditions
The kernel does not provide a way to protect the working set under memory pressure. A certain amount of anonymous and clean file pages is required by the userspace for normal operation. First of all, the userspace needs a cache of shared libraries and executable binaries. If the amount of the clean file pages falls below a certain level, then thrashing and even livelock can take place.
The patch provides sysctl knobs for protecting the working set (anonymous and clean file pages) under memory pressure.
The vm.anon_min_kbytes sysctl knob provides hard protection of anonymous pages. The anonymous pages on the current node won't be reclaimed under any conditions when their amount is below vm.anon_min_kbytes. This knob may be used to prevent excessive swap thrashing when anonymous memory is low (for example, when memory is going to be overfilled by compressed data of zram module).
The vm.clean_low_kbytes sysctl knob provides best-effort protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_low_kbytes unless we threaten to OOM. Protection of clean file pages using this knob may be used when swapping is still possible to
- Prevent disk I/O thrashing under memory pressure.
- Improve performance in disk cache-bound tasks under memory pressure.
The vm.clean_min_kbytes sysctl knob provides hard protection of clean file pages. The file pages on the current node won't be reclaimed under memory pressure when the amount of clean file pages is below vm.clean_min_kbytes. Hard protection of clean file pages using this knob may be used to
- Prevent disk I/O thrashing under memory pressure even with no free swap space.
- Improve performance in disk cache-bound tasks under memory pressure.
- Avoid high latency and prevent livelock in near-OOM conditions.
le9ec patches provide three sysctl knobs (vm.anon_min_kbytes, vm.clean_low_kbytes, vm.clean_min_kbytes) with zero values and does not protect the working set by default (CONFIG_ANON_MIN_KBYTES=0, CONFIG_CLEAN_LOW_KBYTES=0, CONFIG_CLEAN_MIN_KBYTES=0). You can specify other values during kernel build, or change the knob values on the fly.
Effects
- Improving system responsiveness under low-memory conditions.
- Improving performance in I/O bound tasks under memory pressure;
- OOM killer comes faster (with hard protection).
- Fast system reclaiming after OOM (with hard protection).
Note that the effects depend on the values of the sysctl tunables.
source patch: https://github.com/hakavlad/le9-patch
Signed-off-by: kanonifyX <kanonify01@gmail.com>
We don't really need to know if the CPU is getting disabled or enabled on a production device.
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: Saikrishna1504 <saikrishna26918@gmail.com>
exit_mmap() is responsible for freeing the vast majority of an mm's
memory; in order to unblock Simple LMK faster, report an mm as freed as
soon as exit_mmap() finishes.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The OOM killer sets the TIF_MEMDIE thread flag for its victims to alert
other kernel code that the current process was killed due to memory
pressure, and needs to finish whatever it's doing quickly. In the page
allocator this allows victim processes to quickly allocate memory using
emergency reserves. This is especially important when memory pressure is
high; if all processes are taking a while to allocate memory, then our
victim processes will face the same problem and can potentially get
stuck in the page allocator for a while rather than die expeditiously.
To ensure that victim processes die quickly, set TIF_MEMDIE for the
entire victim thread group.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This is a complete low memory killer solution for Android that is small
and simple. Processes are killed according to the priorities that
Android gives them, so that the least important processes are always
killed first. Processes are killed until memory deficits are satisfied,
as observed from kswapd struggling to free up pages. Simple LMK stops
killing processes when kswapd finally goes back to sleep.
The only tunables are the desired amount of memory to be freed per
reclaim event and desired frequency of reclaim events. Simple LMK tries
to free at least the desired amount of memory per reclaim and waits
until all of its victims' memory is freed before proceeding to kill more
processes.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Change-Id: Ied2500a773b06ac1fdc378e61fd5403a270114a6
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joel Fernandes found that the synchronize_rcu_tasks() was taking a
significant amount of time. He demonstrated it with the following test:
# cd /sys/kernel/tracing
# while [ 1 ]; do x=1; done &
# echo '__schedule_bug:traceon' > set_ftrace_filter
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m1.064s
user 0m0.000s
sys 0m0.004s
Where it takes a little over a second to perform the synchronize,
because there's a loop that waits 1 second at a time for tasks to get
through their quiescent points when there's a task that must be waited
for.
After discussion we came up with a simple way to wait for holdouts but
increase the time for each iteration of the loop but no more than a
full second.
With the new patch we have:
# time echo '!__schedule_bug:traceon' > set_ftrace_filter;
real 0m0.131s
user 0m0.000s
sys 0m0.004s
Which drops it down to 13% of what the original wait time was.
Link: http://lkml.kernel.org/r/20180523063815.198302-2-joel@joelfernandes.org
Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Edwiin Kusuma Jaya <kutemeikito0905@gmail.com>
Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>
97d0fb239c upstream
Sometimes we want to opportunistically get a
ref to a cred in an rcu_read_lock protected section.
get_task_cred() does this, and NFS does as similar thing
with its own credential structures.
To prepare for NFS converting to use 'struct cred' more
uniformly, define get_cred_rcu(), and use it in
get_task_cred().
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
CONFLICTS:
cred: switch to using atomic_long_t
- resolve conflict by using `atomic_long_inc_not_zero`
instead of `atomic_inc_not_zero`
Signed-off-by: backslashxx <118538522+backslashxx@users.noreply.github.com>
Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>
cgroup already uses floating point for percent[ile] numbers and there
are several controllers which want to take them as input. Add a
generic parse helper to handle inputs.
Update the interface convention documentation about the use of
percentage numbers. While at it, also clarify the default time unit.
Bug: 120440300
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit a5e112e6424adb77d953eac20e6936b952fd6b32)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ic1fcf21d7955eb8edd2e8e91517bca6aef41694f
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Naveen <133593113+elohim-etz@users.noreply.github.com>
By default, BPF uses module_alloc() to allocate executable memory,
but this is not necessary on all arches and potentially undesirable
on some of them.
So break out the module_alloc() and module_memfree() calls into __weak
functions to allow them to be overridden in arch code.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Change-Id: I582794881942bc0b766515861f2232354860536b
Signed-off-by: Saikrishna1504 <saikrishna26918@gmail.com>
The 'cgrp' is set but not used in commit <76f969e8948d8>
("cgroup: cgroup v2 freezer").
Remove it to avoid [-Wunused-but-set-variable] warning.
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from 533307dc20a9e84a0687d4ca24aeb669516c0243)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Change-Id: I6221a975c04f06249a4f8d693852776ae08a8d8e
Signed-off-by: prathamdby <134331217+prathamdby@users.noreply.github.com>
ptrace_stop() does preempt_enable_no_resched() to avoid the preemption,
but after that cgroup_enter_frozen() does spin_lock/unlock and this adds
another preemption point.
Reported-and-tested-by: Bruce Ashfield <bruce.ashfield@gmail.com>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: Ic53e0f2d6624b0bb90817b0c57060fb7db971348
(cherry picked from commit 937c6b27c73e02cd4114f95f5c37ba2c29fadba1)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: prathamdby <134331217+prathamdby@users.noreply.github.com>
If a new child cgroup is created in the frozen cgroup hierarchy
(one or more of ancestor cgroups is frozen), the CGRP_FREEZE cgroup
flag should be set. Otherwise if a process will be attached to the
child cgroup, it won't become frozen.
The problem can be reproduced with the test_cgfreezer_mkdir test.
This is the output before this patch:
~/test_freezer
ok 1 test_cgfreezer_simple
ok 2 test_cgfreezer_tree
ok 3 test_cgfreezer_forkbomb
Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
not ok 4 test_cgfreezer_mkdir
ok 5 test_cgfreezer_rmdir
ok 6 test_cgfreezer_migrate
ok 7 test_cgfreezer_ptrace
ok 8 test_cgfreezer_stopped
ok 9 test_cgfreezer_ptraced
ok 10 test_cgfreezer_vfork
And with this patch:
~/test_freezer
ok 1 test_cgfreezer_simple
ok 2 test_cgfreezer_tree
ok 3 test_cgfreezer_forkbomb
ok 4 test_cgfreezer_mkdir
ok 5 test_cgfreezer_rmdir
ok 6 test_cgfreezer_migrate
ok 7 test_cgfreezer_ptrace
ok 8 test_cgfreezer_stopped
ok 9 test_cgfreezer_ptraced
ok 10 test_cgfreezer_vfork
Reported-by: Mark Crossen <mcrossen@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Fixes: 76f969e8948d ("cgroup: cgroup v2 freezer")
Cc: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v5.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
Change-Id: I6ba7b8dec5600e78bb7448f03fd97a9b43838fa0
(cherry picked from commit 97a61369830ab085df5aed0ff9256f35b07d425a)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
Signed-off-by: prathamdby <134331217+prathamdby@users.noreply.github.com>