If the leftmost parent node of the tree has does not have a child
on the left side, then trie_get_next_key (and bpftool map dump) will
not look at the child on the right. This leads to the traversal
missing elements.
Lookup is not affected.
Update selftest to handle this case.
Reproducer:
bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \
value 1 entries 256 name test_lpm flags 1
bpftool map update pinned /sys/fs/bpf/lpm key 8 0 0 0 0 0 value 1
bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0 0 128 value 2
bpftool map dump pinned /sys/fs/bpf/lpm
Returns only 1 element. (2 expected)
Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE")
Change-Id: I942431b7feaa82aab38d4c37b3b5920ae70d8e24
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
full paths from the root to leaves. For example, consider a trie with
max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ...
0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with
.prefixlen = 8 make 9 nodes be written on the node stack with size 8.
Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Change-Id: I0626bd93acddf978dc56f8b1ee13305c50c90210
Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
Tested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/Zxx384ZfdlFYnz6J@localhost.localdomain
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify
an exact match, However, it is incorrect because when the target key
doesn't fully match the found node (e.g., node->prefixlen != matchlen),
these two nodes may also have the same prefixlen. It will return
expected result when the passed key exist in the trie. However when a
recently-deleted key or nonexistent key is passed to
trie_get_next_key(), it may skip keys and return incorrect result.
Fix it by using node->prefixlen == matchlen to identify exact matches.
When the condition is true after the search, it also implies
node->prefixlen equals key->prefixlen, otherwise, the search would
return NULL instead.
Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command
for LPM_TRIE map") introduces a bug likes below:
if (!rcu_dereference(trie->root))
return -ENOENT;
if (!key || key->prefixlen > trie->max_prefixlen) {
root = &trie->root;
goto find_leftmost;
}
......
find_leftmost:
for (node = rcu_dereference(*root); node;) {
In the code after label find_leftmost, it is assumed
that *root should not be NULL, but it is not true as
it is possbile trie->root is changed to NULL by an
asynchronous delete operation.
The issue is reported by syzbot and Eric Dumazet with the
below error log:
......
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 8033 Comm: syz-executor3 Not tainted 4.15.0-rc8+ #4
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682
......
This patch fixed the issue by use local rcu_dereferenced
pointer instead of *(&trie->root) later on.
Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command or LPM_TRIE map")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Current LPM_TRIE map type does not implement MAP_GET_NEXT_KEY
command. This command is handy when users want to enumerate
keys. Otherwise, a different map which supports key
enumeration may be required to store the keys. If the
map data is sparse and all map data are to be deleted without
closing file descriptor, using MAP_GET_NEXT_KEY to find
all keys is much faster than enumerating all key space.
This patch implements MAP_GET_NEXT_KEY command for LPM_TRIE map.
If user provided key pointer is NULL or the key does not have
an exact match in the trie, the first key will be returned.
Otherwise, the next key will be returned.
In this implemenation, key enumeration follows a postorder
traversal of internal trie. More specific keys
will be returned first than less specific ones, given
a sequence of MAP_GET_NEXT_KEY syscalls.
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Change-Id: Ied2500a773b06ac1fdc378e61fd5403a270114a6
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
By default, BPF uses module_alloc() to allocate executable memory,
but this is not necessary on all arches and potentially undesirable
on some of them.
So break out the module_alloc() and module_memfree() calls into __weak
functions to allow them to be overridden in arch code.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Change-Id: I582794881942bc0b766515861f2232354860536b
commit bcf86c01ca4676316557dd482c8416ece8c2e143 upstream.
"tracing_map->next_elt" in get_free_elt() is at risk of overflowing.
Once it overflows, new elements can still be inserted into the tracing_map
even though the maximum number of elements (`max_elts`) has been reached.
Continuing to insert elements after the overflow could result in the
tracing_map containing "tracing_map->max_size" elements, leaving no empty
entries.
If any attempt is made to insert an element into a full tracing_map using
`__tracing_map_insert()`, it will cause an infinite loop with preemption
disabled, leading to a CPU hang problem.
Fix this by preventing any further increments to "tracing_map->next_elt"
once it reaches "tracing_map->max_elt".
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 08d43a5fa0 ("tracing: Add lock-free tracing_map")
Co-developed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Link: https://lore.kernel.org/20240805055922.6277-1-Tze-nan.Wu@mediatek.com
Signed-off-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 302ceb625d7b990db205a15e371f9a71238de91c)
[Vegard: s/atomic_fetch_add_unless/__atomic_add_unless/ due to missing
commit bfc18e389c7a09fbbbed6bf4032396685b14246e ("atomics/treewide:
Rename __atomic_add_unless() => atomic_fetch_add_unless()".]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit 06c03c8edce333b9ad9c6b207d93d3a5ae7c10c0 upstream.
Using syzkaller with the recently reintroduced signed integer overflow
sanitizer produces this UBSAN report:
UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:738:18
9223372036854775806 + 4 cannot be represented in type 'long'
Call Trace:
handle_overflow+0x171/0x1b0
__do_adjtimex+0x1236/0x1440
do_adjtimex+0x2be/0x740
The user supplied time_constant value is incremented by four and then
clamped to the operating range.
Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping after incrementing which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 4' operation.
The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.
Similar to the fixups for time_maxerror and time_esterror, clamp the user
space supplied value to the operating range.
[ tglx: Switch to clamping ]
Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-c-v2-1-f3a80096f36f@google.com
Closes: https://github.com/KSPP/linux/issues/352
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a13f8b269b6f4c9371ab149ecb65d2edb52e9669)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit 87d571d6fb77ec342a985afa8744bb9bb75b3622 ]
Using syzkaller alongside the newly reintroduced signed integer overflow
sanitizer spits out this report:
UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
9223372036854775807 + 500 cannot be represented in type 'long'
Call Trace:
handle_overflow+0x171/0x1b0
second_overflow+0x2d6/0x500
accumulate_nsecs_to_secs+0x60/0x160
timekeeping_advance+0x1fe/0x890
update_wall_time+0x10/0x30
time_maxerror is unconditionally incremented and the result is checked
against NTP_PHASE_LIMIT, but the increment itself can overflow, resulting
in wrap-around to negative space.
Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping in handle_overflow() which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 500' operation.
The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.
Miroslav confirmed that the input value should be clamped to the operating
range and the same applies to time_esterror. The latter is not used by the
kernel, but the value still should be in the operating range as it was
before the sanity check got removed.
Clamp them to the operating range.
[ tglx: Changed it to clamping and included time_esterror ]
Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-usec-v2-1-d539180f2b79@google.com
Closes: https://github.com/KSPP/linux/issues/354
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ cast things to __kernel_long_t to fix compiler warnings - gregkh ]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9dfe2eef1ecfbb1f29e678700247de6010784eb9)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit 6881e75237a84093d0986f56223db3724619f26e upstream.
The recent fix for making the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.
This went unnoticed as compilers hoist the access into the non-preemptible
region where the pointer is actually used. But of course it's valid that
the compiler keeps it at the place where the code puts it which rightfully
triggers:
BUG: using smp_processor_id() in preemptible [00000000] code:
caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0
Move it to the actual usage site which is in a non-preemptible region.
Fixes: f7d43dd206e7 ("tick/broadcast: Make takeover of broadcast hrtimer reliable")
Reported-by: David Wang <00107082@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit f54abf332a2bc0413cfa8bd6a8511f7aa99faea0)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit e2e821095949cde46256034975a90f88626a2a73 ]
The function kdb_position_cursor() takes in a "prompt" parameter but
never uses it. This doesn't _really_ matter since all current callers
of the function pass the same value and it's a global variable, but
it's a bit ugly. Let's clean it up.
Found by code inspection. This patch is expected to functionally be a
no-op.
Fixes: 09b35989421d ("kdb: Use format-strings rather than '\0' injection in kdb_read()")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240528071144.1.I0feb49839c6b6f4f2c4bf34764f5e95de3f55a66@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 90f2409c1d552f27a2b2bf8dc598d147c4173128)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit 70867efacf4370b6c7cdfc7a5b11300e9ef7de64 ]
When -Wformat-security is not disabled, using a string pointer
as a format causes a warning:
kernel/debug/kdb/kdb_io.c: In function 'kdb_read':
kernel/debug/kdb/kdb_io.c:365:36: error: format not a string literal and no format arguments [-Werror=format-security]
365 | kdb_printf(kdb_prompt_str);
| ^~~~~~~~~~~~~~
kernel/debug/kdb/kdb_io.c: In function 'kdb_getstr':
kernel/debug/kdb/kdb_io.c:456:20: error: format not a string literal and no format arguments [-Werror=format-security]
456 | kdb_printf(kdb_prompt_str);
| ^~~~~~~~~~~~~~
Use an explcit "%s" format instead.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240528121154.3662553-1-arnd@kernel.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 22a100556ceab8b906ad180788bd6bdc07390f50)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit f944ffcbc2e1c759764850261670586ddf3bdabb upstream.
For systems on which the performance counter can expire early due to turbo
modes the watchdog handler has a safety net in place which validates that
since the last watchdog event there has at least 4/5th of the watchdog
period elapsed.
This works reliably only after the first watchdog event because the per
CPU variable which holds the timestamp of the last event is never
initialized.
So a first spurious event will validate against a timestamp of 0 which
results in a delta which is likely to be way over the 4/5 threshold of the
period. As this might happen before the first watchdog hrtimer event
increments the watchdog counter, this can lead to false positives.
Fix this by initializing the timestamp before enabling the hardware event.
Reset the rearm counter as well, as that might be non zero after the
watchdog was disabled and reenabled.
Link: https://lkml.kernel.org/r/87frsfu15a.ffs@tglx
Fixes: 7edaeb6841 ("kernel/watchdog: Prevent false positives with turbo modes")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6d94ca5d571dfdb34f12dc3f63273ea275e8f40c)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit f7d43dd206e7e18c182f200e67a8db8c209907fa upstream.
Running the LTP hotplug stress test on a aarch64 machine results in
rcu_sched stall warnings when the broadcast hrtimer was owned by the
un-plugged CPU. The issue is the following:
CPU1 (owns the broadcast hrtimer) CPU2
tick_broadcast_enter()
// shutdown local timer device
broadcast_shutdown_local()
...
tick_broadcast_exit()
clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT)
// timer device is not programmed
cpumask_set_cpu(cpu, tick_broadcast_force_mask)
initiates offlining of CPU1
take_cpu_down()
/*
* CPU1 shuts down and does not
* send broadcast IPI anymore
*/
takedown_cpu()
hotplug_cpu__broadcast_tick_pull()
// move broadcast hrtimer to this CPU
clockevents_program_event()
bc_set_next()
hrtimer_start()
/*
* timer device is not programmed
* because only the first expiring
* timer will trigger clockevent
* device reprogramming
*/
What happens is that CPU2 exits broadcast mode with force bit set, then the
local timer device is not reprogrammed and CPU2 expects to receive the
expired event by the broadcast IPI. But this does not happen because CPU1
is offlined by CPU2. CPU switches the clockevent device to ONESHOT state,
but does not reprogram the device.
The subsequent reprogramming of the hrtimer broadcast device does not
program the clockevent device of CPU2 either because the pending expiry
time is already in the past and the CPU expects the event to be delivered.
As a consequence all CPUs which wait for a broadcast event to be delivered
are stuck forever.
Fix this issue by reprogramming the local timer device if the broadcast
force bit of the CPU is set so that the broadcast hrtimer is delivered.
[ tglx: Massage comment and change log. Add Fixes tag ]
Fixes: 989dcb645c ("tick: Handle broadcast wakeup of multiple cpus")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240711124843.64167-1-liaoyu15@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit dfe19aa91378972f10530635ad83b2d77f481044)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit dbc48c8f41c208082cfa95e973560134489e3309 ]
nr_pages is unsigned long but gets passed to rb_alloc_aux() as an int,
and is stored as an int.
Only power-of-2 values are accepted, so if nr_pages is a 64_bit value, it
will be passed to rb_alloc_aux() as zero.
That is not ideal because:
1. the value is incorrect
2. rb_alloc_aux() is at risk of misbehaving, although it manages to
return -ENOMEM in that case, it is a result of passing zero to get_order()
even though the get_order() result is documented to be undefined in that
case.
Fix by simply validating the maximum supported value in the first place.
Use -ENOMEM error code for consistency with the current error code that
is returned in that case.
Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240624201101.60186-6-adrian.hunter@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit d7b1a76f33e6fc93924725b4410126740c890c44)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Commit [1] moved features.h inclusion in sched.h to the top of the file.
This prevents features.h from seeing the definition of HAVE_RT_PUSH_IPI,
which in turn causes the sched_feat(RT_PUSH_IPI) to be always disabled.
Fix this by checking the parent macro of HAVE_RT_PUSH_IPI instead.
[1] ("sched: Resolve sched_feat() at compile time to improve code optimization")
Signed-off-by: shygosh <shygosh@proton.me>
While the feature TTWU_QUEUE has the advantage of reducing cache
bouncing of runqueue locks, it has the side effect that runqueue
statistics are not updated until the remote CPU has a chance to
enqueue the task. Since there is no upper bound on the amount of
time it can take the remote CPU to enqueue the task, several
sequential wakeups can result in suboptimal task placement based
on the stale statistics. Turn off the feature as the cost of
sub-optimal placement is much higher than the cost of cache bouncing
spinlocks for msm based systems.
Change-Id: I0b85c0225237b2bc44f54934769f5e3750c0f3d6
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[pkondeti@codeaurora.org: Resolved minor merge conflict]
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
SBalance's statistic of new interrupts for each CPU is inherently flawed in
that it cannot track IRQ migration that occurs in between balance periods.
As a result, SBalance can observe a flawed number of new interrupts for a
CPU, which hurts its balancing decisions.
Furthermore, SBalance incorrectly assumes that IRQs are affined where
SBalance last placed them, which breaks SBalance entirely when the
assumption doesn't hold true.
As it turns out, it can be quite common to change an IRQ's affinity and
observe a successful return value despite the IRQ not actually moving. At
the very least this is observed on ARM's GICv3, and results in SBalance
never moving such an IRQ ever again because SBalance always thinks it has
zero new interrupts.
Since we can't trust irqchip drivers or hardware, gather IRQ statistics
directly in order to get the true number of new interrupts for each CPU and
the actual affinity of each IRQ based on the last CPU it fired upon.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
The atomic cpumask_clear_cpu() isn't needed. Use __cpumask_clear_cpu()
instead as a micro-optimization, and for clarity.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
SBalance is designed to poll to balance IRQs, but it shouldn't kick CPUs
out of idle to do so because idle CPUs clearly aren't processing
interrupts.
Open code a freezable wait that uses a deferrable timer in order to prevent
SBalance from waking up idle CPUs when there is little interrupt traffic.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Excluded CPUs are excluded from IRQ balancing with the intention that those
CPUs shouldn't really be processing interrupts, and thus shouldn't have
IRQs moved to them. However, SBalance completely ignores excluded CPUs,
which can cause them to end up with a disproportionate amount of interrupt
traffic that SBalance won't spread out. An easy example of this is when
CPU0 is an excluded CPU, since CPU0 ends up with all interrupts affined to
it by default on arm64.
Allow SBalance to move IRQs off of excluded CPUs so that they cannot slip
under the radar and pile up on an excluded CPU, like when CPU0 is excluded.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
This reverts commit 7226f90fb58123c5dbece9d74c9095af946d1219.
As per Sultan:
"This is only after wake up from deep sleep.
Also, the IRQ subsystem remembers which CPU
an IRQ was affined to and should put the
IRQs back upon resume"
commit c9b51ddb66b1d96e4d364c088da0f1dfb004c574 upstream.
Currently when the current line should be removed from the display
kdb_read() uses memset() to fill a temporary buffer with spaces.
The problem is not that this could be trivially implemented using a
format string rather than open coding it. The real problem is that
it is possible, on systems with a long kdb_prompt_str, to write past
the end of the tmpbuffer.
Happily, as mentioned above, this can be trivially implemented using a
format string. Make it so!
Cc: stable@vger.kernel.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-5-f236dbe9828d@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2467f3f182eb35627534effd4956fceb2504c127)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit db2f9c7dc29114f531df4a425d0867d01e1f1e28 upstream.
Currently, if the cursor position is not at the end of the command buffer
and the user uses the Tab-complete functions, then the console does not
leave the cursor in the correct position.
For example consider the following buffer with the cursor positioned
at the ^:
md kdb_pro 10
^
Pressing tab should result in:
md kdb_prompt_str 10
^
However this does not happen. Instead the cursor is placed at the end
(after then 10) and further cursor movement redraws incorrectly. The
same problem exists when we double-Tab but in a different part of the
code.
Fix this by sending a carriage return and then redisplaying the text to
the left of the cursor.
Cc: stable@vger.kernel.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-3-f236dbe9828d@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 21c068c1bbb4c336741749596d004b1965faab2c)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit 09b35989421dfd5573f0b4683c7700a7483c71f9 upstream.
Currently when kdb_read() needs to reposition the cursor it uses copy and
paste code that works by injecting an '\0' at the cursor position before
delivering a carriage-return and reprinting the line (which stops at the
'\0').
Tidy up the code by hoisting the copy and paste code into an appropriately
named function. Additionally let's replace the '\0' injection with a
proper field width parameter so that the string will be abridged during
formatting instead.
Cc: stable@vger.kernel.org # Not a bug fix but it is needed for later bug fixes
Tested-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-2-f236dbe9828d@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4edfbbaca46491b06af14e49dcb79ac661d0bbdc)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit 2a14c9ae15a38148484a128b84bff7e9ffd90d68 ]
It is a useful helper hence move it to common code so others can enjoy
it.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Stable-dep-of: 3ebc46ca8675 ("tcp: Fix shift-out-of-bounds in dctcp_update_alpha().")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 459de98d7a6b3d504b5e8664f32f59a306dd425c)
[Harshit: Also lift param_set_uint_minmax from staging lustre driver,
it is removed in 4.19.y so this upstream commit didnot try cleaning
it up there]
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit a1fd0b9d751f840df23ef0e75b691fc00cfd4743 ]
Change relax_domain_level checks so that it would be possible
to include or exclude all domains from newidle balancing.
This matches the behavior described in the documentation:
-1 no request. use system default or follow request of others.
0 no search.
1 search siblings (hyperthreads in a core).
"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.
Fixes: 1d3504fcf5 ("sched, cpuset: customize sched domains, core")
Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 454de5ed81766fbbf4777c43392d8b0b35e7e16d)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit 9ae7ab20b4835dbea0e5fc6a5c70171dc354a72e ]
As pointed out in commit
182a85f8a1 ("sched: Disable wakeup balancing")
SD_BALANCE_WAKE is a tad too aggressive, and is usually left unset.
However, it turns out cpuset domain relaxation will unconditionally set it
on domains below the relaxation level. This made sense back when
SD_BALANCE_WAKE was set unconditionally, but it no longer is the case.
We can improve things slightly by noticing that set_domain_attribute() is
always called after sd_init(), so rather than setting flags we can rely on
whatever sd_init() is doing and only clear certain flags when above the
relaxation level.
While at it, slightly clean up the function and flip the relax level
check to be more human readable.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mingo@kernel.org
Cc: vincent.guittot@linaro.org
Cc: juri.lelli@redhat.com
Cc: seto.hidetoshi@jp.fujitsu.com
Cc: qperret@google.com
Cc: Dietmar.Eggemann@arm.com
Cc: morten.rasmussen@arm.com
Link: https://lkml.kernel.org/r/20191014164408.32596-1-valentin.schneider@arm.com
Stable-dep-of: a1fd0b9d751f ("sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 046daa54c348ccec12ab38b92923060dd09ef00b)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit c2274b908db05529980ec056359fae916939fdaa upstream.
The reader code in rb_get_reader_page() swaps a new reader page into the
ring buffer by doing cmpxchg on old->list.prev->next to point it to the
new page. Following that, if the operation is successful,
old->list.next->prev gets updated too. This means the underlying
doubly-linked list is temporarily inconsistent, page->prev->next or
page->next->prev might not be equal back to page for some page in the
ring buffer.
The resize operation in ring_buffer_resize() can be invoked in parallel.
It calls rb_check_pages() which can detect the described inconsistency
and stop further tracing:
[ 190.271762] ------------[ cut here ]------------
[ 190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0
[ 190.271789] Modules linked in: [...]
[ 190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1
[ 190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G E 6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f
[ 190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
[ 190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0
[ 190.272023] Code: [...]
[ 190.272028] RSP: 0018:ffff9c37463abb70 EFLAGS: 00010206
[ 190.272034] RAX: ffff8eba04b6cb80 RBX: 0000000000000007 RCX: ffff8eba01f13d80
[ 190.272038] RDX: ffff8eba01f130c0 RSI: ffff8eba04b6cd00 RDI: ffff8eba0004c700
[ 190.272042] RBP: ffff8eba0004c700 R08: 0000000000010002 R09: 0000000000000000
[ 190.272045] R10: 00000000ffff7f52 R11: ffff8eba7f600000 R12: ffff8eba0004c720
[ 190.272049] R13: ffff8eba00223a00 R14: 0000000000000008 R15: ffff8eba067a8000
[ 190.272053] FS: 00007f1bd64752c0(0000) GS:ffff8eba7f680000(0000) knlGS:0000000000000000
[ 190.272057] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 190.272061] CR2: 00007f1bd6662590 CR3: 000000010291e001 CR4: 0000000000370ef0
[ 190.272070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 190.272073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 190.272077] Call Trace:
[ 190.272098] <TASK>
[ 190.272189] ring_buffer_resize+0x2ab/0x460
[ 190.272199] __tracing_resize_ring_buffer.part.0+0x23/0xa0
[ 190.272206] tracing_resize_ring_buffer+0x65/0x90
[ 190.272216] tracing_entries_write+0x74/0xc0
[ 190.272225] vfs_write+0xf5/0x420
[ 190.272248] ksys_write+0x67/0xe0
[ 190.272256] do_syscall_64+0x82/0x170
[ 190.272363] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 190.272373] RIP: 0033:0x7f1bd657d263
[ 190.272381] Code: [...]
[ 190.272385] RSP: 002b:00007ffe72b643f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 190.272391] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1bd657d263
[ 190.272395] RDX: 0000000000000002 RSI: 0000555a6eb538e0 RDI: 0000000000000001
[ 190.272398] RBP: 0000555a6eb538e0 R08: 000000000000000a R09: 0000000000000000
[ 190.272401] R10: 0000555a6eb55190 R11: 0000000000000246 R12: 00007f1bd6662500
[ 190.272404] R13: 0000000000000002 R14: 00007f1bd6667c00 R15: 0000000000000002
[ 190.272412] </TASK>
[ 190.272414] ---[ end trace 0000000000000000 ]---
Note that ring_buffer_resize() calls rb_check_pages() only if the parent
trace_buffer has recording disabled. Recent commit d78ab792705c
("tracing: Stop current tracer when resizing buffer") causes that it is
now always the case which makes it more likely to experience this issue.
The window to hit this race is nonetheless very small. To help
reproducing it, one can add a delay loop in rb_get_reader_page():
ret = rb_head_page_replace(reader, cpu_buffer->reader_page);
if (!ret)
goto spin;
for (unsigned i = 0; i < 1U << 26; i++) /* inserted delay loop */
__asm__ __volatile__ ("" : : : "memory");
rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list;
.. and then run the following commands on the target system:
echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
while true; do
echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
done &
while true; do
for i in /sys/kernel/tracing/per_cpu/*; do
timeout 0.1 cat $i/trace_pipe; sleep 0.2
done
done
To fix the problem, make sure ring_buffer_resize() doesn't invoke
rb_check_pages() concurrently with a reader operating on the same
ring_buffer_per_cpu by taking its cpu_buffer->reader_lock.
Link: https://lore.kernel.org/linux-trace-kernel/20240517134008.24529-3-petr.pavlu@suse.com
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 659f451ff2 ("ring-buffer: Add integrity check at end of iter read")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
[ Fixed whitespace ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b50932ea673b5a089a4bb570a8a868d95c72854e)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
This reverts commit bcf4a115a5068f3331fafb8c176c1af0da3d8b19 which is
commit 0958b33ef5a04ed91f61cef4760ac412080c4e08 upstream.
The change has an incorrect assumption about the return value because
in the current stable trees for versions 5.15 and before, the following
commit responsible for making 0 a success value is not present:
b8cc44a4d3c1 ("tracing: Remove logic for registering multiple event triggers at a time")
The return value should be 0 on failure in the current tree, because in
the functions event_trigger_callback() and event_enable_trigger_func(),
we have:
ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);
/*
* The above returns on success the # of functions enabled,
* but if it didn't find any functions it returns zero.
* Consider no functions a failure too.
*/
if (!ret) {
ret = -ENOENT;
Cc: stable@kernel.org # 5.15, 5.10, 5.4, 4.19
Signed-off-by: Siddh Raman Pant <siddh.raman.pant@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 34925d01baf3ee62ab21c21efd9e2c44c24c004a)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit 325f3fb551f8cd672dbbfc4cf58b14f9ee3fc9e8 upstream.
When unloading a module, its state is changing MODULE_STATE_LIVE ->
MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change will take
a time. `is_module_text_address()` and `__module_text_address()`
works with MODULE_STATE_LIVE and MODULE_STATE_GOING.
If we use `is_module_text_address()` and `__module_text_address()`
separately, there is a chance that the first one is succeeded but the
next one is failed because module->state becomes MODULE_STATE_UNFORMED
between those operations.
In `check_kprobe_address_safe()`, if the second `__module_text_address()`
is failed, that is ignored because it expected a kernel_text address.
But it may have failed simply because module->state has been changed
to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify
non-exist module text address (use-after-free).
To fix this problem, we should not use separated `is_module_text_address()`
and `__module_text_address()`, but use only `__module_text_address()`
once and do `try_module_get(module)` which is only available with
MODULE_STATE_LIVE.
Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/
Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
[Fix conflict due to lack dependency
commit 223a76b268c9 ("kprobes: Fix coding style issues")]
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b5808d40093403334d939e2c3c417144d12a6f33)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit 6b959ba22d34ca793ffdb15b5715457c78e38b1a upstream.
perf_output_read_group may respond to IPI request of other cores and invoke
__perf_install_in_context function. As a result, hwc configuration is modified.
causing inconsistency and unexpected consequences.
Interrupts are not disabled when perf_output_read_group reads PMU counter.
In this case, IPI request may be received from other cores.
As a result, PMU configuration is modified and an error occurs when
reading PMU counter:
CPU0 CPU1
__se_sys_perf_event_open
perf_install_in_context
perf_output_read_group smp_call_function_single
for_each_sibling_event(sub, leader) { generic_exec_single
if ((sub != event) && remote_function
(sub->state == PERF_EVENT_STATE_ACTIVE)) |
<enter IPI handler: __perf_install_in_context> <----RAISE IPI-----+
__perf_install_in_context
ctx_resched
event_sched_out
armpmu_del
...
hwc->idx = -1; // event->hwc.idx is set to -1
...
<exit IPI>
sub->pmu->read(sub);
armpmu_read
armv8pmu_read_counter
armv8pmu_read_hw_counter
int idx = event->hw.idx; // idx = -1
u64 val = armv8pmu_read_evcntr(idx);
u32 counter = ARMV8_IDX_TO_COUNTER(idx); // invalid counter = 30
read_pmevcntrn(counter) // undefined instruction
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220902082918.179248-1-yangjihong1@huawei.com
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a2039c87d30177f0fd349ab000e6af25a0d48de8)
[Vegard: fix conflict in context due to missing commit
ece0857258cbaf20b9828157035999f46ca060c8 ("perf/core: Add a new read
format to get a number of lost samples").]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>