6cfbbb961bb94de85455fe35140b1350c7ccb76c
29165 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6cfbbb961b |
tracing/kprobe: Fix kprobe_on_func_entry() modification
The commit |
||
|
|
3226fb90cf |
rcu: Fix missed wakeup of exp_wq waiters
commit fd6bc19d7676a060a171d1cf3dcbf6fd797eb05f upstream.
Tasks waiting within exp_funnel_lock() for an expedited grace period to
elapse can be starved due to the following sequence of events:
1. Tasks A and B both attempt to start an expedited grace
period at about the same time. This grace period will have
completed when the lower four bits of the rcu_state structure's
->expedited_sequence field are 0b'0100', for example, when the
initial value of this counter is zero. Task A wins, and thus
does the actual work of starting the grace period, including
acquiring the rcu_state structure's .exp_mutex and sets the
counter to 0b'0001'.
2. Because task B lost the race to start the grace period, it
waits on ->expedited_sequence to reach 0b'0100' inside of
exp_funnel_lock(). This task therefore blocks on the rcu_node
structure's ->exp_wq[1] field, keeping in mind that the
end-of-grace-period value of ->expedited_sequence (0b'0100')
is shifted down two bits before indexing the ->exp_wq[] field.
3. Task C attempts to start another expedited grace period,
but blocks on ->exp_mutex, which is still held by Task A.
4. The aforementioned expedited grace period completes, so that
->expedited_sequence now has the value 0b'0100'. A kworker task
therefore acquires the rcu_state structure's ->exp_wake_mutex
and starts awakening any tasks waiting for this grace period.
5. One of the first tasks awakened happens to be Task A. Task A
therefore releases the rcu_state structure's ->exp_mutex,
which allows Task C to start the next expedited grace period,
which causes the lower four bits of the rcu_state structure's
->expedited_sequence field to become 0b'0101'.
6. Task C's expedited grace period completes, so that the lower four
bits of the rcu_state structure's ->expedited_sequence field now
become 0b'1000'.
7. The kworker task from step 4 above continues its wakeups.
Unfortunately, the wake_up_all() refetches the rcu_state
structure's .expedited_sequence field:
wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
This results in the wakeup being applied to the rcu_node
structure's ->exp_wq[2] field, which is unfortunate given that
Task B is instead waiting on ->exp_wq[1].
On a busy system, no harm is done (or at least no permanent harm is done).
Some later expedited grace period will redo the wakeup. But on a quiet
system, such as many embedded systems, it might be a good long time before
there was another expedited grace period. On such embedded systems,
this situation could therefore result in a system hang.
This issue manifested as DPM device timeout during suspend (which
usually qualifies as a quiet time) due to a SCSI device being stuck in
_synchronize_rcu_expedited(), with the following stack trace:
schedule()
synchronize_rcu_expedited()
synchronize_rcu()
scsi_device_quiesce()
scsi_bus_suspend()
dpm_run_callback()
__device_suspend()
This commit therefore prevents such delays, timeouts, and hangs by
making rcu_exp_wait_wake() use its "s" argument consistently instead of
refetching from rcu_state.expedited_sequence.
Fixes:
|
||
|
|
c09a84aea0 |
events: Reuse value read using READ_ONCE instead of re-reading it
commit b89a05b21f46150ac10a962aa50109250b56b03b upstream.
In perf_event_addr_filters_apply, the task associated with
the event (event->ctx->task) is read using READ_ONCE at the beginning
of the function, checked, and then re-read from event->ctx->task,
voiding all guarantees of the checks. Reuse the value that was read by
READ_ONCE to ensure the consistency of the task struct throughout the
function.
Fixes:
|
||
|
|
2cc559a4b9 |
memcg: enable accounting for pids in nested pid namespaces
commit fab827dbee8c2e06ca4ba000fa6c48bcf9054aba upstream. Commit |
||
|
|
2fed7f8eda |
mm/hugetlb: initialize hugetlb_usage in mm_init
commit 13db8c50477d83ad3e3b9b0ae247e5cd833a7ae4 upstream.
After fork, the child process will get incorrect (2x) hugetlb_usage. If
a process uses 5 2MB hugetlb pages in an anonymous mapping,
HugetlbPages: 10240 kB
and then forks, the child will show,
HugetlbPages: 20480 kB
The reason for double the amount is because hugetlb_usage will be copied
from the parent and then increased when we copy page tables from parent
to child. Child will have 2x actual usage.
Fix this by adding hugetlb_count_init in mm_init.
Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
Fixes:
|
||
|
|
a386fceed7 |
bpf: Fix pointer arithmetic mask tightening under state pruning
commit e042aa532c84d18ff13291d00620502ce7a38dda upstream.
In 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask") we
narrowed the offset mask for unprivileged pointer arithmetic in order to
mitigate a corner case where in the speculative domain it is possible to
advance, for example, the map value pointer by up to value_size-1 out-of-
bounds in order to leak kernel memory via side-channel to user space.
The verifier's state pruning for scalars leaves one corner case open
where in the first verification path R_x holds an unknown scalar with an
aux->alu_limit of e.g. 7, and in a second verification path that same
register R_x, here denoted as R_x', holds an unknown scalar which has
tighter bounds and would thus satisfy range_within(R_x, R_x') as well as
tnum_in(R_x, R_x') for state pruning, yielding an aux->alu_limit of 3:
Given the second path fits the register constraints for pruning, the final
generated mask from aux->alu_limit will remain at 7. While technically
not wrong for the non-speculative domain, it would however be possible
to craft similar cases where the mask would be too wide as in 7fedb63a8307.
One way to fix it is to detect the presence of unknown scalar map pointer
arithmetic and force a deeper search on unknown scalars to ensure that
we do not run into a masking mismatch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
[OP: adjusted context for 4.19]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
a4fe956b03 |
bpf: verifier: Allocate idmap scratch in verifier env
commit c9e73e3d2b1eb1ea7ff068e05007eec3bd8ef1c9 upstream. func_states_equal makes a very short lived allocation for idmap, probably because it's too large to fit on the stack. However the function is called quite often, leading to a lot of alloc / free churn. Replace the temporary allocation with dedicated scratch space in struct bpf_verifier_env. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://lore.kernel.org/bpf/20210429134656.122225-4-lmb@cloudflare.com [OP: adjusted context for 4.19] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
8729685021 |
bpf: Fix leakage due to insufficient speculative store bypass mitigation
commit 2039f26f3aca5b0e419b98f65dd36481337b86ee upstream. Spectre v4 gadgets make use of memory disambiguation, which is a set of techniques that execute memory access instructions, that is, loads and stores, out of program order; Intel's optimization manual, section 2.4.4.5: A load instruction micro-op may depend on a preceding store. Many microarchitectures block loads until all preceding store addresses are known. The memory disambiguator predicts which loads will not depend on any previous stores. When the disambiguator predicts that a load does not have such a dependency, the load takes its data from the L1 data cache. Eventually, the prediction is verified. If an actual conflict is detected, the load and all succeeding instructions are re-executed. |
||
|
|
91cdb5b362 |
bpf: Introduce BPF nospec instruction for mitigating Spectre v4
commit f5e81d1117501546b7be050c5fbafa6efd2c722c upstream. In case of JITs, each of the JIT backends compiles the BPF nospec instruction /either/ to a machine instruction which emits a speculation barrier /or/ to /no/ machine instruction in case the underlying architecture is not affected by Speculative Store Bypass or has different mitigations in place already. This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence' instruction for mitigation. In case of arm64, we rely on the firmware mitigation as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled, it works for all of the kernel code with no need to provide any additional instructions here (hence only comment in arm64 JIT). Other archs can follow as needed. The BPF nospec instruction is specifically targeting Spectre v4 since i) we don't use a serialization barrier for the Spectre v1 case, and ii) mitigation instructions for v1 and v4 might be different on some archs. The BPF nospec is required for a future commit, where the BPF verifier does annotate intermediate BPF programs with speculation barriers. Co-developed-by: Piotr Krysiuk <piotras@gmail.com> Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Piotr Krysiuk <piotras@gmail.com> Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de> Acked-by: Alexei Starovoitov <ast@kernel.org> [OP: adjusted context for 4.19, drop riscv and ppc32 changes] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
1195c579b2 |
bpf: track spill/fill of constants
commit f7cf25b2026dc8441e0fa3a202c2aa8a56211e30 upstream.
Compilers often spill induction variables into the stack,
hence it is necessary for the verifier to track scalar values
of the registers through stack slots.
Also few bpf programs were incorrectly rejected in the past,
since the verifier was not able to track such constants while
they were used to compute offsets into packet headers.
Tracking constants through the stack significantly decreases
the chances of state pruning, since two different constants
are considered to be different by state equivalency.
End result that cilium tests suffer serious degradation in the number
of states processed and corresponding verification time increase.
before after
bpf_lb-DLB_L3.o 1838 6441
bpf_lb-DLB_L4.o 3218 5908
bpf_lb-DUNKNOWN.o 1064 1064
bpf_lxc-DDROP_ALL.o 26935 93790
bpf_lxc-DUNKNOWN.o 34439 123886
bpf_netdev.o 9721 31413
bpf_overlay.o 6184 18561
bpf_lxc_jit.o 39389 359445
After further debugging turned out that cillium progs are
getting hurt by clang due to the same constant tracking issue.
Newer clang generates better code by spilling less to the stack.
Instead it keeps more constants in the registers which
hurts state pruning since the verifier already tracks constants
in the registers:
old clang new clang
(no spill/fill tracking introduced by this patch)
bpf_lb-DLB_L3.o 1838 1923
bpf_lb-DLB_L4.o 3218 3077
bpf_lb-DUNKNOWN.o 1064 1062
bpf_lxc-DDROP_ALL.o 26935 166729
bpf_lxc-DUNKNOWN.o 34439 174607
bpf_netdev.o 9721 8407
bpf_overlay.o 6184 5420
bpf_lcx_jit.o 39389 39389
The final table is depressing:
old clang old clang new clang new clang
const spill/fill const spill/fill
bpf_lb-DLB_L3.o 1838 6441 1923 8128
bpf_lb-DLB_L4.o 3218 5908 3077 6707
bpf_lb-DUNKNOWN.o 1064 1064 1062 1062
bpf_lxc-DDROP_ALL.o 26935 93790 166729 380712
bpf_lxc-DUNKNOWN.o 34439 123886 174607 440652
bpf_netdev.o 9721 31413 8407 31904
bpf_overlay.o 6184 18561 5420 23569
bpf_lxc_jit.o 39389 359445 39389 359445
Tracking constants in the registers hurts state pruning already.
Adding tracking of constants through stack hurts pruning even more.
The later patch address this general constant tracking issue
with coarse/precise logic.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[OP: - drop verbose_linfo() calls, as the function is not implemented in 4.19
- adjust mark_reg_read() calls to match the prototype in 4.19]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
7667818ef1 |
bpf: Sanity check max value for var_off stack access
commit 107c26a70ca81bfc33657366ad69d02fdc9efc9d upstream. As discussed in [1] max value of variable offset has to be checked for overflow on stack access otherwise verifier would accept code like this: 0: (b7) r2 = 6 1: (b7) r3 = 28 2: (7a) *(u64 *)(r10 -16) = 0 3: (7a) *(u64 *)(r10 -8) = 0 4: (79) r4 = *(u64 *)(r1 +168) 5: (c5) if r4 s< 0x0 goto pc+4 R1=ctx(id=0,off=0,imm=0) R2=inv6 R3=inv28 R4=inv(id=0,umax_value=9223372036854775807,var_off=(0x0; 0x7fffffffffffffff)) R10=fp0,call_-1 fp-8=mmmmmmmm fp-16=mmmmmmmm 6: (17) r4 -= 16 7: (0f) r4 += r10 8: (b7) r5 = 8 9: (85) call bpf_getsockopt#57 10: (b7) r0 = 0 11: (95) exit , where R4 obviosly has unbounded max value. Fix it by checking that reg->smax_value is inside (-BPF_MAX_VAR_OFF; BPF_MAX_VAR_OFF) range. reg->smax_value is used instead of reg->umax_value because stack pointers are calculated using negative offset from fp. This is opposite to e.g. map access where offset must be non-negative and where umax_value is used. Also dedicated verbose logs are added for both min and max bound check failures to have diagnostics consistent with variable offset handling in check_map_access(). [1] https://marc.info/?l=linux-netdev&m=155433357510597&w=2 Fixes: 2011fccfb61b ("bpf: Support variable offset stack access from helpers") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
14cf676ba6 |
bpf: Reject indirect var_off stack access in unpriv mode
commit 088ec26d9c2da9d879ab73e3f4117f9df6c566ee upstream.
Proper support of indirect stack access with variable offset in
unprivileged mode (!root) requires corresponding support in Spectre
masking for stack ALU in retrieve_ptr_limit().
There are no use-case for variable offset in unprivileged mode though so
make verifier reject such accesses for simplicity.
Pointer arithmetics is one (and only?) way to cause variable offset and
it's already rejected in unpriv mode so that verifier won't even get to
helper function whose argument contains variable offset, e.g.:
0: (7a) *(u64 *)(r10 -16) = 0
1: (7a) *(u64 *)(r10 -8) = 0
2: (61) r2 = *(u32 *)(r1 +0)
3: (57) r2 &= 4
4: (17) r2 -= 16
5: (0f) r2 += r10
variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1R2
stack pointer arithmetic goes out of range, prohibited for !root
Still it looks like a good idea to reject variable offset indirect stack
access for unprivileged mode in check_stack_boundary() explicitly.
Fixes: 2011fccfb61b ("bpf: Support variable offset stack access from helpers")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
[OP: drop comment in retrieve_ptr_limit()]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
148339cb18 |
bpf: Reject indirect var_off stack access in raw mode
commit f2bcd05ec7b839ff826d2008506ad2d2dff46a59 upstream.
It's hard to guarantee that whole memory is marked as initialized on
helper return if uninitialized stack is accessed with variable offset
since specific bounds are unknown to verifier. This may cause
uninitialized stack leaking.
Reject such an access in check_stack_boundary to prevent possible
leaking.
There are no known use-cases for indirect uninitialized stack access
with variable offset so it shouldn't break anything.
Fixes: 2011fccfb61b ("bpf: Support variable offset stack access from helpers")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
caee4103f0 |
bpf: Support variable offset stack access from helpers
commit 2011fccfb61bbd1d7c8864b2b3ed7012342e9ba3 upstream. Currently there is a difference in how verifier checks memory access for helper arguments for PTR_TO_MAP_VALUE and PTR_TO_STACK with regard to variable part of offset. check_map_access, that is used for PTR_TO_MAP_VALUE, can handle variable offsets just fine, so that BPF program can call a helper like this: some_helper(map_value_ptr + off, size); , where offset is unknown at load time, but is checked by program to be in a safe rage (off >= 0 && off + size < map_value_size). But it's not the case for check_stack_boundary, that is used for PTR_TO_STACK, and same code with pointer to stack is rejected by verifier: some_helper(stack_value_ptr + off, size); For example: 0: (7a) *(u64 *)(r10 -16) = 0 1: (7a) *(u64 *)(r10 -8) = 0 2: (61) r2 = *(u32 *)(r1 +0) 3: (57) r2 &= 4 4: (17) r2 -= 16 5: (0f) r2 += r10 6: (18) r1 = 0xffff888111343a80 8: (85) call bpf_map_lookup_elem#1 invalid variable stack read R2 var_off=(0xfffffffffffffff0; 0x4) Add support for variable offset access to check_stack_boundary so that if offset is checked by program to be in a safe range it's accepted by verifier. Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> [OP: replace reg_state(env, regno) helper with "cur_regs(env) + regno"] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
79aba0ac3d |
bpf: correct slot_type marking logic to allow more stack slot sharing
commit 0bae2d4d62d523f06ff1a8e88ce38b45400acd28 upstream. Verifier is supposed to support sharing stack slot allocated to ptr with SCALAR_VALUE for privileged program. However this doesn't happen for some cases. The reason is verifier is not clearing slot_type STACK_SPILL for all bytes, it only clears part of them, while verifier is using: slot_type[0] == STACK_SPILL as a convention to check one slot is ptr type. So, the consequence of partial clearing slot_type is verifier could treat a partially overridden ptr slot, which should now be a SCALAR_VALUE slot, still as ptr slot, and rejects some valid programs. Before this patch, test_xdp_noinline.o under bpf selftests, bpf_lxc.o and bpf_netdev.o under Cilium bpf repo, when built with -mattr=+alu32 are rejected due to this issue. After this patch, they all accepted. There is no processed insn number change before and after this patch on Cilium bpf programs. Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> [OP: adjusted context for 4.19] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
228e64bdbb |
bpf/verifier: per-register parent pointers
commit 679c782de14bd48c19dd74cd1af20a2bc05dd936 upstream. By giving each register its own liveness chain, we elide the skip_callee() logic. Instead, each register's parent is the state it inherits from; both check_func_call() and prepare_func_exit() automatically connect reg states to the correct chain since when they copy the reg state across (r1-r5 into the callee as args, and r0 out as the return value) they also copy the parent pointer. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> [OP: adjusted context for 4.19] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
e6c3fefc6b |
hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()
[ Upstream commit 627ef5ae2df8eeccb20d5af0e4cfa4df9e61ed28 ] If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then the timer has to be canceled first and then added back. If the timer is the first expiring timer then on removal the clockevent device is reprogrammed to the next expiring timer to avoid that the pending expiry fires needlessly. If the new expiry time ends up to be the first expiry again then the clock event device has to reprogrammed again. Avoid this by checking whether the timer is the first to expire and in that case, keep the timer on the current CPU and delay the reprogramming up to the point where the timer has been enqueued again. Reported-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
293fe77dbf |
sched/deadline: Fix missing clock update in migrate_task_rq_dl()
[ Upstream commit b4da13aa28d4fd0071247b7b41c579ee8a86c81a ] A missing clock update is causing the following warning: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453 sub_running_bw.isra.0+0x190/0x1a0 ... CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G W 5.14.0-rc1 #1 Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP: 1.06.20210526) 2021/05/26 ... Call trace: sub_running_bw.isra.0+0x190/0x1a0 migrate_task_rq_dl+0xf8/0x1e0 set_task_cpu+0xa8/0x1f0 try_to_wake_up+0x150/0x3d4 wake_up_q+0x64/0xc0 __up_write+0xd0/0x1c0 up_write+0x4c/0x2b0 cppc_set_perf+0x120/0x2d0 cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq] __cpufreq_driver_target+0x74/0x140 sugov_work+0x64/0x80 kthread_worker_fn+0xe0/0x230 kthread+0x138/0x140 ret_from_fork+0x10/0x18 The task causing this is the `cppc_fie` DL task introduced by commit 1eb5dde674f5 ("cpufreq: CPPC: Add support for frequency invariance"). With CONFIG_ACPI_CPPC_CPUFREQ_FIE=y and schedutil cpufreq governor on slow-switching system (like on this Ampere Altra WIWYNN Mt. Jade Arm Server): DL task `curr=sugov:112` lets `p=cppc_fie` migrate and since the latter is in `non_contending` state, migrate_task_rq_dl() calls sub_running_bw()->__sub_running_bw()->cpufreq_update_util()-> rq_clock()->assert_clock_updated() on p. Fix this by updating the clock for a non_contending task in migrate_task_rq_dl() before calling sub_running_bw(). Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20210804135925.3734605-1-dietmar.eggemann@arm.com Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
bc0d543f53 |
sched/deadline: Fix reset_on_fork reporting of DL tasks
[ Upstream commit f95091536f78971b269ec321b057b8d630b0ad8a ] It is possible for sched_getattr() to incorrectly report the state of the reset_on_fork flag when called on a deadline task. Indeed, if the flag was set on a deadline task using sched_setattr() with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then p->sched_reset_on_fork will be set, but __setscheduler() will bail out early, which means that the dl_se->flags will not get updated by __setscheduler_params()->__setparam_dl(). Consequently, if sched_getattr() is then called on the task, __getparam_dl() will override kattr.sched_flags with the now out-of-date copy in dl_se->flags and report the stale value to userspace. To fix this, make sure to only copy the flags that are relevant to sched_deadline to and from the dl_se->flags field. Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
b7a041073f |
locking/mutex: Fix HANDOFF condition
[ Upstream commit 048661a1f963e9517630f080687d48af79ed784c ]
Yanfei reported that setting HANDOFF should not depend on recomputing
@first, only on @first state. Which would then give:
if (ww_ctx || !first)
first = __mutex_waiter_is_first(lock, &waiter);
if (first)
__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
But because 'ww_ctx || !first' is basically 'always' and the test for
first is relatively cheap, omit that first branch entirely.
Reported-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Yanfei Xu <yanfei.xu@windriver.com>
Link: https://lore.kernel.org/r/20210630154114.896786297@infradead.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
39f74b7c81 |
bpf: Fix truncation handling for mod32 dst reg wrt zero
Commit 9b00f1b78809309163dda2d044d9e94a3c0248a3 upstream. Recently noticed that when mod32 with a known src reg of 0 is performed, then the dst register is 32-bit truncated in verifier: 0: R1=ctx(id=0,off=0,imm=0) R10=fp0 0: (b7) r0 = 0 1: R0_w=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0 1: (b7) r1 = -1 2: R0_w=inv0 R1_w=inv-1 R10=fp0 2: (b4) w2 = -1 3: R0_w=inv0 R1_w=inv-1 R2_w=inv4294967295 R10=fp0 3: (9c) w1 %= w0 4: R0_w=inv0 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0 4: (b7) r0 = 1 5: R0_w=inv1 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0 5: (1d) if r1 == r2 goto pc+1 R0_w=inv1 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0 6: R0_w=inv1 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0 6: (b7) r0 = 2 7: R0_w=inv2 R1_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2_w=inv4294967295 R10=fp0 7: (95) exit 7: R0=inv1 R1=inv(id=0,umin_value=4294967295,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R2=inv4294967295 R10=fp0 7: (95) exit However, as a runtime result, we get 2 instead of 1, meaning the dst register does not contain (u32)-1 in this case. The reason is fairly straight forward given the 0 test leaves the dst register as-is: # ./bpftool p d x i 23 0: (b7) r0 = 0 1: (b7) r1 = -1 2: (b4) w2 = -1 3: (16) if w0 == 0x0 goto pc+1 4: (9c) w1 %= w0 5: (b7) r0 = 1 6: (1d) if r1 == r2 goto pc+1 7: (b7) r0 = 2 8: (95) exit This was originally not an issue given the dst register was marked as completely unknown (aka 64 bit unknown). However, after |
||
|
|
8313432df2 |
bpf: Fix 32 bit src register truncation on div/mod
Commit e88b2c6e5a4d9ce30d75391e4d950da74bb2bd90 upstream.
While reviewing a different fix, John and I noticed an oddity in one of the
BPF program dumps that stood out, for example:
# bpftool p d x i 13
0: (b7) r0 = 808464450
1: (b4) w4 = 808464432
2: (bc) w0 = w0
3: (15) if r0 == 0x0 goto pc+1
4: (9c) w4 %= w0
[...]
In line 2 we noticed that the mov32 would 32 bit truncate the original src
register for the div/mod operation. While for the two operations the dst
register is typically marked unknown e.g. from adjust_scalar_min_max_vals()
the src register is not, and thus verifier keeps tracking original bounds,
simplified:
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
0: (b7) r0 = -1
1: R0_w=invP-1 R1=ctx(id=0,off=0,imm=0) R10=fp0
1: (b7) r1 = -1
2: R0_w=invP-1 R1_w=invP-1 R10=fp0
2: (3c) w0 /= w1
3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP-1 R10=fp0
3: (77) r1 >>= 32
4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP4294967295 R10=fp0
4: (bf) r0 = r1
5: R0_w=invP4294967295 R1_w=invP4294967295 R10=fp0
5: (95) exit
processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
Runtime result of r0 at exit is 0 instead of expected -1. Remove the
verifier mov32 src rewrite in div/mod and replace it with a jmp32 test
instead. After the fix, we result in the following code generation when
having dividend r1 and divisor r6:
div, 64 bit: div, 32 bit:
0: (b7) r6 = 8 0: (b7) r6 = 8
1: (b7) r1 = 8 1: (b7) r1 = 8
2: (55) if r6 != 0x0 goto pc+2 2: (56) if w6 != 0x0 goto pc+2
3: (ac) w1 ^= w1 3: (ac) w1 ^= w1
4: (05) goto pc+1 4: (05) goto pc+1
5: (3f) r1 /= r6 5: (3c) w1 /= w6
6: (b7) r0 = 0 6: (b7) r0 = 0
7: (95) exit 7: (95) exit
mod, 64 bit: mod, 32 bit:
0: (b7) r6 = 8 0: (b7) r6 = 8
1: (b7) r1 = 8 1: (b7) r1 = 8
2: (15) if r6 == 0x0 goto pc+1 2: (16) if w6 == 0x0 goto pc+1
3: (9f) r1 %= r6 3: (9c) w1 %= w6
4: (b7) r0 = 0 4: (b7) r0 = 0
5: (95) exit 5: (95) exit
x86 in particular can throw a 'divide error' exception for div
instruction not only for divisor being zero, but also for the case
when the quotient is too large for the designated register. For the
edx:eax and rdx:rax dividend pair it is not an issue in x86 BPF JIT
since we always zero edx (rdx). Hence really the only protection
needed is against divisor being zero.
Fixes:
|
||
|
|
c348d806ed |
bpf: Do not use ax register in interpreter on div/mod
Partially undo old commit 144cd91c4c2b ("bpf: move tmp variable into ax
register in interpreter"). The reason we need this here is because ax
register will be used for holding temporary state for div/mod instruction
which otherwise interpreter would corrupt. This will cause a small +8 byte
stack increase for interpreter, but with the gain that we can use it from
verifier rewrites as scratch register.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
[cascardo: This partial revert is needed in order to support using AX for
the following two commits, as there is no JMP32 on 4.19.y]
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
2d349b0a69 |
tracing / histogram: Fix NULL pointer dereference on strcmp() on NULL event name
[ Upstream commit 5acce0bff2a0420ce87d4591daeb867f47d552c2 ]
The following commands:
# echo 'read_max u64 size;' > synthetic_events
# echo 'hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)' > events/syscalls/sys_enter_read/trigger
Causes:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 4 PID: 1763 Comm: bash Not tainted 5.14.0-rc2-test+ #155
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
RIP: 0010:strcmp+0xc/0x20
Code: 75 f7 31 c0 0f b6 0c 06 88 0c 02 48 83 c0 01 84 c9 75 f1 4c 89 c0
c3 0f 1f 80 00 00 00 00 31 c0 eb 08 48 83 c0 01 84 d2 74 0f <0f> b6 14 07
3a 14 06 74 ef 19 c0 83 c8 01 c3 31 c0 c3 66 90 48 89
RSP: 0018:ffffb5fdc0963ca8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffffb3a4e040 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9714c0d0b640 RDI: 0000000000000000
RBP: 0000000000000000 R08: 00000022986b7cde R09: ffffffffb3a4dff8
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9714c50603c8
R13: 0000000000000000 R14: ffff97143fdf9e48 R15: ffff9714c01a2210
FS: 00007f1fa6785740(0000) GS:ffff9714da400000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000002d863004 CR4: 00000000001706e0
Call Trace:
__find_event_file+0x4e/0x80
action_create+0x6b7/0xeb0
? kstrdup+0x44/0x60
event_hist_trigger_func+0x1a07/0x2130
trigger_process_regex+0xbd/0x110
event_trigger_write+0x71/0xd0
vfs_write+0xe9/0x310
ksys_write+0x68/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f1fa6879e87
The problem was the "trace(read_max,count)" where the "count" should be
"$count" as "onmax()" only handles variables (although it really should be
able to figure out that "count" is a field of sys_enter_read). But there's
a path that does not find the variable and ends up passing a NULL for the
event, which ends up getting passed to "strcmp()".
Add a check for NULL to return and error on the command with:
# cat error_log
hist:syscalls:sys_enter_read: error: Couldn't create or find variable
Command: hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)
^
Link: https://lkml.kernel.org/r/20210808003011.4037f8d0@oasis.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
504a4c1057 |
genirq/msi: Ensure deactivation on teardown
commit dbbc93576e03fbe24b365fab0e901eb442237a8a upstream.
msi_domain_alloc_irqs() invokes irq_domain_activate_irq(), but
msi_domain_free_irqs() does not enforce deactivation before tearing down
the interrupts.
This happens when PCI/MSI interrupts are set up and never used before being
torn down again, e.g. in error handling pathes. The only place which cleans
that up is the error handling path in msi_domain_alloc_irqs().
Move the cleanup from msi_domain_alloc_irqs() into msi_domain_free_irqs()
to cure that.
Fixes:
|
||
|
|
cab824f67d |
genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP
commit 826da771291fc25a428e871f9e7fb465e390f852 upstream.
X86 IO/APIC and MSI interrupts (when used without interrupts remapping)
require that the affinity setup on startup is done before the interrupt is
enabled for the first time as the non-remapped operation mode cannot safely
migrate enabled interrupts from arbitrary contexts. Provide a new irq chip
flag which allows affected hardware to request this.
This has to be opt-in because there have been reports in the past that some
interrupt chips cannot handle affinity setting before startup.
Fixes:
|
||
|
|
9df311b2e7 |
bpf: Fix leakage under speculation on mispredicted branches
commit 9183671af6dbf60a1219371d4ed73e23f43b49db upstream.
The verifier only enumerates valid control-flow paths and skips paths that
are unreachable in the non-speculative domain. And so it can miss issues
under speculative execution on mispredicted branches.
For example, a type confusion has been demonstrated with the following
crafted program:
// r0 = pointer to a map array entry
// r6 = pointer to readable stack slot
// r9 = scalar controlled by attacker
1: r0 = *(u64 *)(r0) // cache miss
2: if r0 != 0x0 goto line 4
3: r6 = r9
4: if r0 != 0x1 goto line 6
5: r9 = *(u8 *)(r6)
6: // leak r9
Since line 3 runs iff r0 == 0 and line 5 runs iff r0 == 1, the verifier
concludes that the pointer dereference on line 5 is safe. But: if the
attacker trains both the branches to fall-through, such that the following
is speculatively executed ...
r6 = r9
r9 = *(u8 *)(r6)
// leak r9
... then the program will dereference an attacker-controlled value and could
leak its content under speculative execution via side-channel. This requires
to mistrain the branch predictor, which can be rather tricky, because the
branches are mutually exclusive. However such training can be done at
congruent addresses in user space using different branches that are not
mutually exclusive. That is, by training branches in user space ...
A: if r0 != 0x0 goto line C
B: ...
C: if r0 != 0x0 goto line D
D: ...
... such that addresses A and C collide to the same CPU branch prediction
entries in the PHT (pattern history table) as those of the BPF program's
lines 2 and 4, respectively. A non-privileged attacker could simply brute
force such collisions in the PHT until observing the attack succeeding.
Alternative methods to mistrain the branch predictor are also possible that
avoid brute forcing the collisions in the PHT. A reliable attack has been
demonstrated, for example, using the following crafted program:
// r0 = pointer to a [control] map array entry
// r7 = *(u64 *)(r0 + 0), training/attack phase
// r8 = *(u64 *)(r0 + 8), oob address
// [...]
// r0 = pointer to a [data] map array entry
1: if r7 == 0x3 goto line 3
2: r8 = r0
// crafted sequence of conditional jumps to separate the conditional
// branch in line 193 from the current execution flow
3: if r0 != 0x0 goto line 5
4: if r0 == 0x0 goto exit
5: if r0 != 0x0 goto line 7
6: if r0 == 0x0 goto exit
[...]
187: if r0 != 0x0 goto line 189
188: if r0 == 0x0 goto exit
// load any slowly-loaded value (due to cache miss in phase 3) ...
189: r3 = *(u64 *)(r0 + 0x1200)
// ... and turn it into known zero for verifier, while preserving slowly-
// loaded dependency when executing:
190: r3 &= 1
191: r3 &= 2
// speculatively bypassed phase dependency
192: r7 += r3
193: if r7 == 0x3 goto exit
194: r4 = *(u8 *)(r8 + 0)
// leak r4
As can be seen, in training phase (phase != 0x3), the condition in line 1
turns into false and therefore r8 with the oob address is overridden with
the valid map value address, which in line 194 we can read out without
issues. However, in attack phase, line 2 is skipped, and due to the cache
miss in line 189 where the map value is (zeroed and later) added to the
phase register, the condition in line 193 takes the fall-through path due
to prior branch predictor training, where under speculation, it'll load the
byte at oob address r8 (unknown scalar type at that point) which could then
be leaked via side-channel.
One way to mitigate these is to 'branch off' an unreachable path, meaning,
the current verification path keeps following the is_branch_taken() path
and we push the other branch to the verification stack. Given this is
unreachable from the non-speculative domain, this branch's vstate is
explicitly marked as speculative. This is needed for two reasons: i) if
this path is solely seen from speculative execution, then we later on still
want the dead code elimination to kick in in order to sanitize these
instructions with jmp-1s, and ii) to ensure that paths walked in the
non-speculative domain are not pruned from earlier walks of paths walked in
the speculative domain. Additionally, for robustness, we mark the registers
which have been part of the conditional as unknown in the speculative path
given there should be no assumptions made on their content.
The fix in here mitigates type confusion attacks described earlier due to
i) all code paths in the BPF program being explored and ii) existing
verifier logic already ensuring that given memory access instruction
references one specific data structure.
An alternative to this fix that has also been looked at in this scope was to
mark aux->alu_state at the jump instruction with a BPF_JMP_TAKEN state as
well as direction encoding (always-goto, always-fallthrough, unknown), such
that mixing of different always-* directions themselves as well as mixing of
always-* with unknown directions would cause a program rejection by the
verifier, e.g. programs with constructs like 'if ([...]) { x = 0; } else
{ x = 1; }' with subsequent 'if (x == 1) { [...] }'. For unprivileged, this
would result in only single direction always-* taken paths, and unknown taken
paths being allowed, such that the former could be patched from a conditional
jump to an unconditional jump (ja). Compared to this approach here, it would
have two downsides: i) valid programs that otherwise are not performing any
pointer arithmetic, etc, would potentially be rejected/broken, and ii) we are
required to turn off path pruning for unprivileged, where both can be avoided
in this work through pushing the invalid branch to the verification stack.
The issue was originally discovered by Adam and Ofek, and later independently
discovered and reported as a result of Benedict and Piotr's research work.
Fixes:
|
||
|
|
c510c1845f |
bpf: Do not mark insn as seen under speculative path verification
commit fe9a5ca7e370e613a9a75a13008a3845ea759d6e upstream. ... in such circumstances, we do not want to mark the instruction as seen given the goal is still to jmp-1 rewrite/sanitize dead code, if it is not reachable from the non-speculative path verification. We do however want to verify it for safety regardless. With the patch as-is all the insns that have been marked as seen before the patch will also be marked as seen after the patch (just with a potentially different non-zero count). An upcoming patch will also verify paths that are unreachable in the non-speculative domain, hence this extension is needed. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Benedict Schlueter <benedict.schlueter@rub.de> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> [OP: - env->pass_cnt is not used in 4.19, so adjust sanitize_mark_insn_seen() to assign "true" instead - drop sanitize_insn_aux_data() comment changes, as the function is not present in 4.19] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
0abc8c9754 |
bpf: Inherit expanded/patched seen count from old aux data
commit d203b0fd863a2261e5d00b97f3d060c4c2a6db71 upstream. Instead of relying on current env->pass_cnt, use the seen count from the old aux data in adjust_insn_aux_data(), and expand it to the new range of patched instructions. This change is valid given we always expand 1:n with n>=1, so what applies to the old/original instruction needs to apply for the replacement as well. Not relying on env->pass_cnt is a prerequisite for a later change where we want to avoid marking an instruction seen when verified under speculative execution path. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Benedict Schlueter <benedict.schlueter@rub.de> Reviewed-by: Piotr Krysiuk <piotras@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> [OP: - declare old_data as bool instead of u32 (struct bpf_insn_aux_data.seen is bool in 5.4) - adjusted context for 4.19] Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
7c165d58ef |
tracing: Reject string operand in the histogram expression
commit a9d10ca4986571bffc19778742d508cc8dd13e02 upstream.
Since the string type can not be the target of the addition / subtraction
operation, it must be rejected. Without this fix, the string type silently
converted to digits.
Link: https://lkml.kernel.org/r/162742654278.290973.1523000673366456634.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
c0add455ae |
tracing/histogram: Rename "cpu" to "common_cpu"
commit 1e3bac71c5053c99d438771fc9fa5082ae5d90aa upstream.
Currently the histogram logic allows the user to write "cpu" in as an
event field, and it will record the CPU that the event happened on.
The problem with this is that there's a lot of events that have "cpu"
as a real field, and using "cpu" as the CPU it ran on, makes it
impossible to run histograms on the "cpu" field of events.
For example, if I want to have a histogram on the count of the
workqueue_queue_work event on its cpu field, running:
># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger
Gives a misleading and wrong result.
Change the command to "common_cpu" as no event should have "common_*"
fields as that's a reserved name for fields used by all events. And
this makes sense here as common_cpu would be a field used by all events.
Now we can even do:
># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
># cat events/workqueue/workqueue_queue_work/hist
# event histogram
#
# trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
#
{ common_cpu: 0, cpu: 2 } hitcount: 1
{ common_cpu: 0, cpu: 4 } hitcount: 1
{ common_cpu: 7, cpu: 7 } hitcount: 1
{ common_cpu: 0, cpu: 7 } hitcount: 1
{ common_cpu: 0, cpu: 1 } hitcount: 1
{ common_cpu: 0, cpu: 6 } hitcount: 2
{ common_cpu: 0, cpu: 5 } hitcount: 2
{ common_cpu: 1, cpu: 1 } hitcount: 4
{ common_cpu: 6, cpu: 6 } hitcount: 4
{ common_cpu: 5, cpu: 5 } hitcount: 14
{ common_cpu: 4, cpu: 4 } hitcount: 26
{ common_cpu: 0, cpu: 0 } hitcount: 39
{ common_cpu: 2, cpu: 2 } hitcount: 184
Now for backward compatibility, I added a trick. If "cpu" is used, and
the field is not found, it will fall back to "common_cpu" and work as
it did before. This way, it will still work for old programs that use
"cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
will get that event's "cpu" field, which is probably what it wants
anyway.
I updated the tracefs/README to include documentation about both the
common_timestamp and the common_cpu. This way, if that text is present in
the README, then an application can know that common_cpu is supported over
just plain "cpu".
Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
43cba13ff1 |
tracing / histogram: Give calculation hist_fields a size
commit 2c05caa7ba8803209769b9e4fe02c38d77ae88d0 upstream.
When working on my user space applications, I found a bug in the synthetic
event code where the automated synthetic event field was not matching the
event field calculation it was attached to. Looking deeper into it, it was
because the calculation hist_field was not given a size.
The synthetic event fields are matched to their hist_fields either by
having the field have an identical string type, or if that does not match,
then the size and signed values are used to match the fields.
The problem arose when I tried to match a calculation where the fields
were "unsigned int". My tool created a synthetic event of type "u32". But
it failed to match. The string was:
diff=field1-field2:onmatch(event).trace(synth,$diff)
Adding debugging into the kernel, I found that the size of "diff" was 0.
And since it was given "unsigned int" as a type, the histogram fallback
code used size and signed. The signed matched, but the size of u32 (4) did
not match zero, and the event failed to be created.
This can be worse if the field you want to match is not one of the
acceptable fields for a synthetic event. As event fields can have any type
that is supported in Linux, this can cause an issue. For example, if a
type is an enum. Then there's no way to use that with any calculations.
Have the calculation field simply take on the size of what it is
calculating.
Link: https://lkml.kernel.org/r/20210730171951.59c7743f@oasis.local.home
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
93175d935d |
padata: add separate cpuhp node for CPUHP_PADATA_DEAD
commit 3c2214b6027ff37945799de717c417212e1a8c54 upstream.
Removing the pcrypt module triggers this:
general protection fault, probably for non-canonical
address 0xdead000000000122
CPU: 5 PID: 264 Comm: modprobe Not tainted 5.6.0+ #2
Hardware name: QEMU Standard PC
RIP: 0010:__cpuhp_state_remove_instance+0xcc/0x120
Call Trace:
padata_sysfs_release+0x74/0xce
kobject_put+0x81/0xd0
padata_free+0x12/0x20
pcrypt_exit+0x43/0x8ee [pcrypt]
padata instances wrongly use the same hlist node for the online and dead
states, so __padata_free()'s second cpuhp remove call chokes on the node
that the first poisoned.
cpuhp multi-instance callbacks only walk forward in cpuhp_step->list and
the same node is linked in both the online and dead lists, so the list
corruption that results from padata_alloc() adding the node to a second
list without removing it from the first doesn't cause problems as long
as no instances are freed.
Avoid the issue by giving each state its own node.
Fixes: 894c9ef9780c ("padata: validate cpumask without removed CPU during offline")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
2b1207801c |
padata: validate cpumask without removed CPU during offline
commit 894c9ef9780c5cf2f143415e867ee39a33ecb75d upstream.
Configuring an instance's parallel mask without any online CPUs...
echo 2 > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
echo 0 > /sys/devices/system/cpu/cpu1/online
...makes tcrypt mode=215 crash like this:
divide error: 0000 [#1] SMP PTI
CPU: 4 PID: 283 Comm: modprobe Not tainted 5.4.0-rc8-padata-doc-v2+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191013_105130-anatol 04/01/2014
RIP: 0010:padata_do_parallel+0x114/0x300
Call Trace:
pcrypt_aead_encrypt+0xc0/0xd0 [pcrypt]
crypto_aead_encrypt+0x1f/0x30
do_mult_aead_op+0x4e/0xdf [tcrypt]
test_mb_aead_speed.constprop.0.cold+0x226/0x564 [tcrypt]
do_test+0x28c2/0x4d49 [tcrypt]
tcrypt_mod_init+0x55/0x1000 [tcrypt]
...
cpumask_weight() in padata_cpu_hash() returns 0 because the mask has no
CPUs. The problem is __padata_remove_cpu() checks for valid masks too
early and so doesn't mark the instance PADATA_INVALID as expected, which
would have made padata_do_parallel() return error before doing the
division.
Fix by introducing a second padata CPU hotplug state before
CPUHP_BRINGUP_CPU so that __padata_remove_cpu() sees the online mask
without @cpu. No need for the second argument to padata_replace() since
@cpu is now already missing from the online mask.
Fixes:
|
||
|
|
b7dc671550 |
workqueue: fix UAF in pwq_unbound_release_workfn()
commit b42b0bddcbc87b4c66f6497f66fc72d52b712aa7 upstream.
I got a UAF report when doing fuzz test:
[ 152.880091][ T8030] ==================================================================
[ 152.881240][ T8030] BUG: KASAN: use-after-free in pwq_unbound_release_workfn+0x50/0x190
[ 152.882442][ T8030] Read of size 4 at addr ffff88810d31bd00 by task kworker/3:2/8030
[ 152.883578][ T8030]
[ 152.883932][ T8030] CPU: 3 PID: 8030 Comm: kworker/3:2 Not tainted 5.13.0+ #249
[ 152.885014][ T8030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 152.886442][ T8030] Workqueue: events pwq_unbound_release_workfn
[ 152.887358][ T8030] Call Trace:
[ 152.887837][ T8030] dump_stack_lvl+0x75/0x9b
[ 152.888525][ T8030] ? pwq_unbound_release_workfn+0x50/0x190
[ 152.889371][ T8030] print_address_description.constprop.10+0x48/0x70
[ 152.890326][ T8030] ? pwq_unbound_release_workfn+0x50/0x190
[ 152.891163][ T8030] ? pwq_unbound_release_workfn+0x50/0x190
[ 152.891999][ T8030] kasan_report.cold.15+0x82/0xdb
[ 152.892740][ T8030] ? pwq_unbound_release_workfn+0x50/0x190
[ 152.893594][ T8030] __asan_load4+0x69/0x90
[ 152.894243][ T8030] pwq_unbound_release_workfn+0x50/0x190
[ 152.895057][ T8030] process_one_work+0x47b/0x890
[ 152.895778][ T8030] worker_thread+0x5c/0x790
[ 152.896439][ T8030] ? process_one_work+0x890/0x890
[ 152.897163][ T8030] kthread+0x223/0x250
[ 152.897747][ T8030] ? set_kthread_struct+0xb0/0xb0
[ 152.898471][ T8030] ret_from_fork+0x1f/0x30
[ 152.899114][ T8030]
[ 152.899446][ T8030] Allocated by task 8884:
[ 152.900084][ T8030] kasan_save_stack+0x21/0x50
[ 152.900769][ T8030] __kasan_kmalloc+0x88/0xb0
[ 152.901416][ T8030] __kmalloc+0x29c/0x460
[ 152.902014][ T8030] alloc_workqueue+0x111/0x8e0
[ 152.902690][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0
[ 152.903459][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0
[ 152.904198][ T8030] scrub_workers_get+0x1e8/0x490
[ 152.904929][ T8030] btrfs_scrub_dev+0x1b9/0x9c0
[ 152.905599][ T8030] btrfs_ioctl+0x122c/0x4e50
[ 152.906247][ T8030] __x64_sys_ioctl+0x137/0x190
[ 152.906916][ T8030] do_syscall_64+0x34/0xb0
[ 152.907535][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 152.908365][ T8030]
[ 152.908688][ T8030] Freed by task 8884:
[ 152.909243][ T8030] kasan_save_stack+0x21/0x50
[ 152.909893][ T8030] kasan_set_track+0x20/0x30
[ 152.910541][ T8030] kasan_set_free_info+0x24/0x40
[ 152.911265][ T8030] __kasan_slab_free+0xf7/0x140
[ 152.911964][ T8030] kfree+0x9e/0x3d0
[ 152.912501][ T8030] alloc_workqueue+0x7d7/0x8e0
[ 152.913182][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0
[ 152.913949][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0
[ 152.914703][ T8030] scrub_workers_get+0x1e8/0x490
[ 152.915402][ T8030] btrfs_scrub_dev+0x1b9/0x9c0
[ 152.916077][ T8030] btrfs_ioctl+0x122c/0x4e50
[ 152.916729][ T8030] __x64_sys_ioctl+0x137/0x190
[ 152.917414][ T8030] do_syscall_64+0x34/0xb0
[ 152.918034][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 152.918872][ T8030]
[ 152.919203][ T8030] The buggy address belongs to the object at ffff88810d31bc00
[ 152.919203][ T8030] which belongs to the cache kmalloc-512 of size 512
[ 152.921155][ T8030] The buggy address is located 256 bytes inside of
[ 152.921155][ T8030] 512-byte region [ffff88810d31bc00, ffff88810d31be00)
[ 152.922993][ T8030] The buggy address belongs to the page:
[ 152.923800][ T8030] page:ffffea000434c600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10d318
[ 152.925249][ T8030] head:ffffea000434c600 order:2 compound_mapcount:0 compound_pincount:0
[ 152.926399][ T8030] flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff)
[ 152.927515][ T8030] raw: 057ff00000010200 dead000000000100 dead000000000122 ffff888009c42c80
[ 152.928716][ T8030] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[ 152.929890][ T8030] page dumped because: kasan: bad access detected
[ 152.930759][ T8030]
[ 152.931076][ T8030] Memory state around the buggy address:
[ 152.931851][ T8030] ffff88810d31bc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 152.932967][ T8030] ffff88810d31bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 152.934068][ T8030] >ffff88810d31bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 152.935189][ T8030] ^
[ 152.935763][ T8030] ffff88810d31bd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 152.936847][ T8030] ffff88810d31be00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 152.937940][ T8030] ==================================================================
If apply_wqattrs_prepare() fails in alloc_workqueue(), it will call put_pwq()
which invoke a work queue to call pwq_unbound_release_workfn() and use the 'wq'.
The 'wq' allocated in alloc_workqueue() will be freed in error path when
apply_wqattrs_prepare() fails. So it will lead a UAF.
CPU0 CPU1
alloc_workqueue()
alloc_and_link_pwqs()
apply_wqattrs_prepare() fails
apply_wqattrs_cleanup()
schedule_work(&pwq->unbound_release_work)
kfree(wq)
worker_thread()
pwq_unbound_release_workfn() <- trigger uaf here
If apply_wqattrs_prepare() fails, the new pwq are not linked, it doesn't
hold any reference to the 'wq', 'wq' is invalid to access in the worker,
so add check pwq if linked to fix this.
Fixes:
|
||
|
|
6a99bfee7f |
tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
commit 67f0d6d9883c13174669f88adac4f0ee656cc16a upstream.
The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
"head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
the same buffer page, whose "buffer_data_page" is empty and "read" field
is non-zero.
An error scenario could be constructed as followed (kernel perspective):
1. All pages in the buffer has been accessed by reader(s) so that all of
them will have non-zero "read" field.
2. Read and clear all buffer pages so that "rb_num_of_entries()" will
return 0 rendering there's no more data to read. It is also required
that the "read_page", "commit_page" and "tail_page" points to the same
page, while "head_page" is the next page of them.
3. Invoke "ring_buffer_lock_reserve()" with large enough "length"
so that it shot pass the end of current tail buffer page. Now the
"head_page", "commit_page" and "tail_page" points to the same page.
4. Discard current event with "ring_buffer_discard_commit()", so that
"head_page", "commit_page" and "tail_page" points to a page whose buffer
data page is now empty.
When the error scenario has been constructed, "tracing_read_pipe" will
be trapped inside a deadloop: "trace_empty()" returns 0 since
"rb_per_cpu_empty()" returns 0 when it hits the CPU containing such
constructed ring buffer. Then "trace_find_next_entry_inc()" always
return NULL since "rb_num_of_entries()" reports there's no more entry
to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking
"tracing_read_pipe" back to the start of the "waitagain" loop.
I've also written a proof-of-concept script to construct the scenario
and trigger the bug automatically, you can use it to trace and validate
my reasoning above:
https://github.com/aegistudio/RingBufferDetonator.git
Tests has been carried out on linux kernel 5.14-rc2
(2734d6c1b1a089fb593ef6a23d4b70903526fe0c), my fixed version
of kernel (for testing whether my update fixes the bug) and
some older kernels (for range of affected kernels). Test result is
also attached to the proof-of-concept repository.
Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
aa36bd8fc1 |
sched/fair: Fix CFS bandwidth hrtimer expiry type
[ Upstream commit 72d0ad7cb5bad265adb2014dbe46c4ccb11afaba ] The time remaining until expiry of the refresh_timer can be negative. Casting the type to an unsigned 64-bit value will cause integer underflow, making the runtime_refresh_within return false instead of true. These situations are rare, but they do happen. This does not cause user-facing issues or errors; other than possibly unthrottling cfs_rq's using runtime from the previous period(s), making the CFS bandwidth enforcement less strict in those (special) situations. Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Link: https://lore.kernel.org/r/20210629121452.18429-1-odin@uged.al Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
1834fbbd66 |
tracing: Do not reference char * as a string in histograms
commit 704adfb5a9978462cd861f170201ae2b5e3d3a80 upstream.
The histogram logic was allowing events with char * pointers to be used as
normal strings. But it was easy to crash the kernel with:
# echo 'hist:keys=filename' > events/syscalls/sys_enter_openat/trigger
And open some files, and boom!
BUG: unable to handle page fault for address: 00007f2ced0c3280
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1173fa067 P4D 1173fa067 PUD 1171b6067 PMD 1171dd067 PTE 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 6 PID: 1810 Comm: cat Not tainted 5.13.0-rc5-test+ #61
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
RIP: 0010:strlen+0x0/0x20
Code: f6 82 80 2a 0b a9 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2a 0b
a9 20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74
10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3
RSP: 0018:ffffbdbf81567b50 EFLAGS: 00010246
RAX: 0000000000000003 RBX: ffff93815cdb3800 RCX: ffff9382401a22d0
RDX: 0000000000000100 RSI: 0000000000000000 RDI: 00007f2ced0c3280
RBP: 0000000000000100 R08: ffff9382409ff074 R09: ffffbdbf81567c98
R10: ffff9382409ff074 R11: 0000000000000000 R12: ffff9382409ff074
R13: 0000000000000001 R14: ffff93815a744f00 R15: 00007f2ced0c3280
FS: 00007f2ced0f8580(0000) GS:ffff93825a800000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2ced0c3280 CR3: 0000000107069005 CR4: 00000000001706e0
Call Trace:
event_hist_trigger+0x463/0x5f0
? find_held_lock+0x32/0x90
? sched_clock_cpu+0xe/0xd0
? lock_release+0x155/0x440
? kernel_init_free_pages+0x6d/0x90
? preempt_count_sub+0x9b/0xd0
? kernel_init_free_pages+0x6d/0x90
? get_page_from_freelist+0x12c4/0x1680
? __rb_reserve_next+0xe5/0x460
? ring_buffer_lock_reserve+0x12a/0x3f0
event_triggers_call+0x52/0xe0
ftrace_syscall_enter+0x264/0x2c0
syscall_trace_enter.constprop.0+0x1ee/0x210
do_syscall_64+0x1c/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Where it triggered a fault on strlen(key) where key was the filename.
The reason is that filename is a char * to user space, and the histogram
code just blindly dereferenced it, with obvious bad results.
I originally tried to use strncpy_from_user/kernel_nofault() but found
that there's other places that its dereferenced and not worth the effort.
Just do not allow "char *" to act like strings.
Link: https://lkml.kernel.org/r/20210715000206.025df9d2@rorschach.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: stable@vger.kernel.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Tom Zanussi <zanussi@kernel.org>
Fixes:
|
||
|
|
e1105c6465 |
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
commit 4030a6e6a6a4a42ff8c18414c9e0c93e24cc70b8 upstream.
Currently tgid_map is sized at PID_MAX_DEFAULT entries, which means that
on systems where pid_max is configured higher than PID_MAX_DEFAULT the
ftrace record-tgid option doesn't work so well. Any tasks with PIDs
higher than PID_MAX_DEFAULT are simply not recorded in tgid_map, and
don't show up in the saved_tgids file.
In particular since systemd v243 & above configure pid_max to its
highest possible 1<<22 value by default on 64 bit systems this renders
the record-tgids option of little use.
Increase the size of tgid_map to the configured pid_max instead,
allowing it to cover the full range of PIDs up to the maximum value of
PID_MAX_LIMIT if the system is configured that way.
On 64 bit systems with pid_max == PID_MAX_LIMIT this will increase the
size of tgid_map from 256KiB to 16MiB. Whilst this 64x increase in
memory overhead sounds significant 64 bit systems are presumably best
placed to accommodate it, and since tgid_map is only allocated when the
record-tgid option is actually used presumably the user would rather it
spends sufficient memory to actually record the tgids they expect.
The size of tgid_map could also increase for CONFIG_BASE_SMALL=y
configurations, but these seem unlikely to be systems upon which people
are both configuring a large pid_max and running ftrace with record-tgid
anyway.
Of note is that we only allocate tgid_map once, the first time that the
record-tgid option is enabled. Therefore its size is only set once, to
the value of pid_max at the time the record-tgid option is first
enabled. If a user increases pid_max after that point, the saved_tgids
file will not contain entries for any tasks with pids beyond the earlier
value of pid_max.
Link: https://lkml.kernel.org/r/20210701172407.889626-2-paulburton@google.com
Fixes:
|
||
|
|
44896b31b1 |
tracing: Simplify & fix saved_tgids logic
commit b81b3e959adb107cd5b36c7dc5ba1364bbd31eb2 upstream.
The tgid_map array records a mapping from pid to tgid, where the index
of an entry within the array is the pid & the value stored at that index
is the tgid.
The saved_tgids_next() function iterates over pointers into the tgid_map
array & dereferences the pointers which results in the tgid, but then it
passes that dereferenced value to trace_find_tgid() which treats it as a
pid & does a further lookup within the tgid_map array. It seems likely
that the intent here was to skip over entries in tgid_map for which the
recorded tgid is zero, but instead we end up skipping over entries for
which the thread group leader hasn't yet had its own tgid recorded in
tgid_map.
A minimal fix would be to remove the call to trace_find_tgid, turning:
if (trace_find_tgid(*ptr))
into:
if (*ptr)
..but it seems like this logic can be much simpler if we simply let
seq_read() iterate over the whole tgid_map array & filter out empty
entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that
approach, removing the incorrect logic here entirely.
Link: https://lkml.kernel.org/r/20210630003406.4013668-1-paulburton@google.com
Fixes:
|
||
|
|
7cb77faeda |
cpu/hotplug: Cure the cpusets trainwreck
commit b22afcdf04c96ca58327784e280e10288cfd3303 upstream.
Alexey and Joshua tried to solve a cpusets related hotplug problem which is
user space visible and results in unexpected behaviour for some time after
a CPU has been plugged in and the corresponding uevent was delivered.
cpusets delegate the hotplug work (rebuilding cpumasks etc.) to a
workqueue. This is done because the cpusets code has already a lock
nesting of cgroups_mutex -> cpu_hotplug_lock. A synchronous callback or
waiting for the work to finish with cpu_hotplug_lock held can and will
deadlock because that results in the reverse lock order.
As a consequence the uevent can be delivered before cpusets have consistent
state which means that a user space invocation of sched_setaffinity() to
move a task to the plugged CPU fails up to the point where the scheduled
work has been processed.
The same is true for CPU unplug, but that does not create user observable
failure (yet).
It's still inconsistent to claim that an operation is finished before it
actually is and that's the real issue at hand. uevents just make it
reliably observable.
Obviously the problem should be fixed in cpusets/cgroups, but untangling
that is pretty much impossible because according to the changelog of the
commit which introduced this 8 years ago:
3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
the lock order cgroups_mutex -> cpu_hotplug_lock is a design decision and
the whole code is built around that.
So bite the bullet and invoke the relevant cpuset function, which waits for
the work to finish, in _cpu_up/down() after dropping cpu_hotplug_lock and
only when tasks are not frozen by suspend/hibernate because that would
obviously wait forever.
Waiting there with cpu_add_remove_lock, which is protecting the present
and possible CPU maps, held is not a problem at all because neither work
queues nor cpusets/cgroups have any lockchains related to that lock.
Waiting in the hotplug machinery is not problematic either because there
are already state callbacks which wait for hardware queues to drain. It
makes the operations slightly slower, but hotplug is slow anyway.
This ensures that state is consistent before returning from a hotplug
up/down operation. It's still inconsistent during the operation, but that's
a different story.
Add a large comment which explains why this is done and why this is not a
dump ground for the hack of the day to work around half thought out locking
schemes. Document also the implications vs. hotplug operations and
serialization or the lack of it.
Thanks to Alexy and Joshua for analyzing why this temporary
sched_setaffinity() failure happened.
Fixes: 3a5a6d0c2b03("cpuset: don't nest cgroup_mutex inside get_online_cpus()")
Reported-by: Alexey Klimov <aklimov@redhat.com>
Reported-by: Joshua Baker <jobaker@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexey Klimov <aklimov@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87tuowcnv3.ffs@nanos.tec.linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
73af8425a5 |
clocksource: Retry clock read if long delays detected
[ Upstream commit db3a34e17433de2390eb80d436970edcebd0ca3e ] When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. Therefore, re-read the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_cswd_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If more than one retry was required, a message is printed on the console (the occasional single retry is expected behavior, especially in guest OSes). If the maximum number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
9b808bdcc9 |
sched/fair: Fix ascii art by relpacing tabs
[ Upstream commit 08f7c2f4d0e9f4283f5796b8168044c034a1bfcb ] When using something other than 8 spaces per tab, this ascii art makes not sense, and the reader might end up wondering what this advanced equation "is". Signed-off-by: Odin Ugedal <odin@uged.al> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210518125202.78658-4-odin@uged.al Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
47ab2c7413 |
tracepoint: Add tracepoint_probe_register_may_exist() for BPF tracing
commit 9913d5745bd720c4266805c8d29952a3702e4eca upstream.
All internal use cases for tracepoint_probe_register() is set to not ever
be called with the same function and data. If it is, it is considered a
bug, as that means the accounting of handling tracepoints is corrupted.
If the function and data for a tracepoint is already registered when
tracepoint_probe_register() is called, it will call WARN_ON_ONCE() and
return with EEXISTS.
The BPF system call can end up calling tracepoint_probe_register() with
the same data, which now means that this can trigger the warning because
of a user space process. As WARN_ON_ONCE() should not be called because
user space called a system call with bad data, there needs to be a way to
register a tracepoint without triggering a warning.
Enter tracepoint_probe_register_may_exist(), which can be called, but will
not cause a WARN_ON() if the probe already exists. It will still error out
with EEXIST, which will then be sent to the user space that performed the
BPF system call.
This keeps the previous testing for issues with other users of the
tracepoint code, while letting BPF call it with duplicated data and not
warn about it.
Link: https://lore.kernel.org/lkml/20210626135845.4080-1-penguin-kernel@I-love.SAKURA.ne.jp/
Link: https://syzkaller.appspot.com/bug?id=41f4318cf01762389f4d1c1c459da4f542fe5153
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
b05195282f |
tracing/histograms: Fix parsing of "sym-offset" modifier
commit 26c563731056c3ee66f91106c3078a8c36bb7a9e upstream.
With the addition of simple mathematical operations (plus and minus), the
parsing of the "sym-offset" modifier broke, as it took the '-' part of the
"sym-offset" as a minus, and tried to break it up into a mathematical
operation of "field.sym - offset", in which case it failed to parse
(unless the event had a field called "offset").
Both .sym and .sym-offset modifiers should not be entered into
mathematical calculations anyway. If ".sym-offset" is found in the
modifier, then simply make it not an operation that can be calculated on.
Link: https://lkml.kernel.org/r/20210707110821.188ae255@oasis.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
6e2a98bc90 |
kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
commit 5fa54346caf67b4b1b10b1f390316ae466da4d53 upstream.
The system might hang with the following backtrace:
schedule+0x80/0x100
schedule_timeout+0x48/0x138
wait_for_common+0xa4/0x134
wait_for_completion+0x1c/0x2c
kthread_flush_work+0x114/0x1cc
kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
kthread_cancel_delayed_work_sync+0x18/0x2c
xxxx_pm_notify+0xb0/0xd8
blocking_notifier_call_chain_robust+0x80/0x194
pm_notifier_call_chain_robust+0x28/0x4c
suspend_prepare+0x40/0x260
enter_state+0x80/0x3f4
pm_suspend+0x60/0xdc
state_store+0x108/0x144
kobj_attr_store+0x38/0x88
sysfs_kf_write+0x64/0xc0
kernfs_fop_write_iter+0x108/0x1d0
vfs_write+0x2f4/0x368
ksys_write+0x7c/0xec
It is caused by the following race between kthread_mod_delayed_work()
and kthread_cancel_delayed_work_sync():
CPU0 CPU1
Context: Thread A Context: Thread B
kthread_mod_delayed_work()
spin_lock()
__kthread_cancel_work()
spin_unlock()
del_timer_sync()
kthread_cancel_delayed_work_sync()
spin_lock()
__kthread_cancel_work()
spin_unlock()
del_timer_sync()
spin_lock()
work->canceling++
spin_unlock
spin_lock()
queue_delayed_work()
// dwork is put into the worker->delayed_work_list
spin_unlock()
kthread_flush_work()
// flush_work is put at the tail of the dwork
wait_for_completion()
Context: IRQ
kthread_delayed_work_timer_fn()
spin_lock()
list_del_init(&work->node);
spin_unlock()
BANG: flush_work is not longer linked and will never get proceed.
The problem is that kthread_mod_delayed_work() checks work->canceling
flag before canceling the timer.
A simple solution is to (re)check work->canceling after
__kthread_cancel_work(). But then it is not clear what should be
returned when __kthread_cancel_work() removed the work from the queue
(list) and it can't queue it again with the new @delay.
The return value might be used for reference counting. The caller has
to know whether a new work has been queued or an existing one was
replaced.
The proper solution is that kthread_mod_delayed_work() will remove the
work from the queue (list) _only_ when work->canceling is not set. The
flag must be checked after the timer is stopped and the remaining
operations can be done under worker->lock.
Note that kthread_mod_delayed_work() could remove the timer and then
bail out. It is fine. The other canceling caller needs to cancel the
timer as well. The important thing is that the queue (list)
manipulation is done atomically under worker->lock.
Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com
Fixes:
|
||
|
|
13bcf5aeb3 |
kthread_worker: split code for canceling the delayed work timer
commit 34b3d5344719d14fd2185b2d9459b3abcb8cf9d8 upstream. Patch series "kthread_worker: Fix race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync()". This patchset fixes the race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync() including proper return value handling. This patch (of 2): Simple code refactoring as a preparation step for fixing a race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(). It does not modify the existing behavior. Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: <jenhaochen@google.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
2445837e9c |
mm, futex: fix shared futex pgoff on shmem huge page
[ Upstream commit fe19bd3dae3d15d2fbfdb3de8839a6ea0fe94264 ] If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit |
||
|
|
ff66086362 |
module: limit enabling module.sig_enforce
[ Upstream commit 0c18f29aae7ce3dadd26d8ee3505d07cc982df75 ]
Irrespective as to whether CONFIG_MODULE_SIG is configured, specifying
"module.sig_enforce=1" on the boot command line sets "sig_enforce".
Only allow "sig_enforce" to be set when CONFIG_MODULE_SIG is configured.
This patch makes the presence of /sys/module/module/parameters/sig_enforce
dependent on CONFIG_MODULE_SIG=y.
Fixes:
|
||
|
|
79197fb7ae |
tracing: Do no increment trace_clock_global() by one
commit 89529d8b8f8daf92d9979382b8d2eb39966846ea upstream.
The trace_clock_global() tries to make sure the events between CPUs is
somewhat in order. A global value is used and updated by the latest read
of a clock. If one CPU is ahead by a little, and is read by another CPU, a
lock is taken, and if the timestamp of the other CPU is behind, it will
simply use the other CPUs timestamp.
The lock is also only taken with a "trylock" due to tracing, and strange
recursions can happen. The lock is not taken at all in NMI context.
In the case where the lock is not able to be taken, the non synced
timestamp is returned. But it will not be less than the saved global
timestamp.
The problem arises because when the time goes "backwards" the time
returned is the saved timestamp plus 1. If the lock is not taken, and the
plus one to the timestamp is returned, there's a small race that can cause
the time to go backwards!
CPU0 CPU1
---- ----
trace_clock_global() {
ts = clock() [ 1000 ]
trylock(clock_lock) [ success ]
global_ts = ts; [ 1000 ]
<interrupted by NMI>
trace_clock_global() {
ts = clock() [ 999 ]
if (ts < global_ts)
ts = global_ts + 1 [ 1001 ]
trylock(clock_lock) [ fail ]
return ts [ 1001]
}
unlock(clock_lock);
return ts; [ 1000 ]
}
trace_clock_global() {
ts = clock() [ 1000 ]
if (ts < global_ts) [ false 1000 == 1000 ]
trylock(clock_lock) [ success ]
global_ts = ts; [ 1000 ]
unlock(clock_lock)
return ts; [ 1000 ]
}
The above case shows to reads of trace_clock_global() on the same CPU, but
the second read returns one less than the first read. That is, time when
backwards, and this is not what is allowed by trace_clock_global().
This was triggered by heavy tracing and the ring buffer checker that tests
for the clock going backwards:
Ring buffer clock went backwards: 20613921464 -> 20613921463
------------[ cut here ]------------
WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0
Modules linked in:
[..]
[CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463
[20613915818] PAGE TIME STAMP
[20613915818] delta:0
[20613915819] delta:1
[20613916035] delta:216
[20613916465] delta:430
[20613916575] delta:110
[20613916749] delta:174
[20613917248] delta:499
[20613917333] delta:85
[20613917775] delta:442
[20613917921] delta:146
[20613918321] delta:400
[20613918568] delta:247
[20613918768] delta:200
[20613919306] delta:538
[20613919353] delta:47
[20613919980] delta:627
[20613920296] delta:316
[20613920571] delta:275
[20613920862] delta:291
[20613921152] delta:290
[20613921464] delta:312
[20613921464] delta:0 TIME EXTEND
[20613921464] delta:0
This happened more than once, and always for an off by one result. It also
started happening after commit aafe104aa9096 was added.
Cc: stable@vger.kernel.org
Fixes: aafe104aa9096 ("tracing: Restructure trace_clock_global() to never block")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|