29657 Commits

Author SHA1 Message Date
Jonathan Lemon
3716e60955 bpf: lpm_trie: check left child of last leftmost node for NULL
If the leftmost parent node of the tree has does not have a child
on the left side, then trie_get_next_key (and bpftool map dump) will
not look at the child on the right.  This leads to the traversal
missing elements.

Lookup is not affected.

Update selftest to handle this case.

Reproducer:

 bpftool map create /sys/fs/bpf/lpm type lpm_trie key 6 \
     value 1 entries 256 name test_lpm flags 1
 bpftool map update pinned /sys/fs/bpf/lpm key  8 0 0 0  0   0 value 1
 bpftool map update pinned /sys/fs/bpf/lpm key 16 0 0 0  0 128 value 2
 bpftool map dump   pinned /sys/fs/bpf/lpm

Returns only 1 element. (2 expected)

Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE")
Change-Id: I942431b7feaa82aab38d4c37b3b5920ae70d8e24
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-08-24 00:20:32 +07:00
Byeonguk Jeong
1a2be5e150 BACKPORT: bpf: Fix out-of-bounds write in trie_get_next_key()
trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
full paths from the root to leaves. For example, consider a trie with
max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ...
0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with
.prefixlen = 8 make 9 nodes be written on the node stack with size 8.

Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Change-Id: I0626bd93acddf978dc56f8b1ee13305c50c90210
Signed-off-by: Byeonguk Jeong <jungbu2855@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org>
Tested-by: Hou Tao <houtao1@huawei.com>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/Zxx384ZfdlFYnz6J@localhost.localdomain
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-24 00:20:30 +07:00
Hou Tao
3d9dfcb53e bpf: Fix exact match conditions in trie_get_next_key()
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify
an exact match, However, it is incorrect because when the target key
doesn't fully match the found node (e.g., node->prefixlen != matchlen),
these two nodes may also have the same prefixlen. It will return
expected result when the passed key exist in the trie. However when a
recently-deleted key or nonexistent key is passed to
trie_get_next_key(), it may skip keys and return incorrect result.

Fix it by using node->prefixlen == matchlen to identify exact matches.
When the condition is true after the search, it also implies
node->prefixlen equals key->prefixlen, otherwise, the search would
return NULL instead.

Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-24 00:20:28 +07:00
Yonghong Song
b046b89def bpf: fix kernel page fault in lpm map trie_get_next_key
Commit b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command
for LPM_TRIE map") introduces a bug likes below:

    if (!rcu_dereference(trie->root))
        return -ENOENT;
    if (!key || key->prefixlen > trie->max_prefixlen) {
        root = &trie->root;
        goto find_leftmost;
    }
    ......
  find_leftmost:
    for (node = rcu_dereference(*root); node;) {

In the code after label find_leftmost, it is assumed
that *root should not be NULL, but it is not true as
it is possbile trie->root is changed to NULL by an
asynchronous delete operation.

The issue is reported by syzbot and Eric Dumazet with the
below error log:
  ......
  kasan: CONFIG_KASAN_INLINE enabled
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] SMP KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 1 PID: 8033 Comm: syz-executor3 Not tainted 4.15.0-rc8+ #4
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682
  ......

This patch fixed the issue by use local rcu_dereferenced
pointer instead of *(&trie->root) later on.

Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command or LPM_TRIE map")
Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-08-24 00:20:25 +07:00
Yonghong Song
5240f63619 bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map
Current LPM_TRIE map type does not implement MAP_GET_NEXT_KEY
command. This command is handy when users want to enumerate
keys. Otherwise, a different map which supports key
enumeration may be required to store the keys. If the
map data is sparse and all map data are to be deleted without
closing file descriptor, using MAP_GET_NEXT_KEY to find
all keys is much faster than enumerating all key space.

This patch implements MAP_GET_NEXT_KEY command for LPM_TRIE map.
If user provided key pointer is NULL or the key does not have
an exact match in the trie, the first key will be returned.
Otherwise, the next key will be returned.

In this implemenation, key enumeration follows a postorder
traversal of internal trie. More specific keys
will be returned first than less specific ones, given
a sequence of MAP_GET_NEXT_KEY syscalls.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-08-24 00:20:23 +07:00
Peter Xu
49cc3544e3 BACKPORT: userfaultfd/sysctl: add vm.unprivileged_userfaultfd
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available.  By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit.  While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.

We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.

Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users.  When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.

Andrea said:

: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd.  In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.

[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Change-Id: Ied2500a773b06ac1fdc378e61fd5403a270114a6
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-08-24 00:16:01 +07:00
Alexander Winkowski
07f0e4d3f5 ANDROID: Update CFI patchset from android11-5.4
Change-Id: Iff956ce7c2b89d3a442b0b81deba0da323f5d18b
Signed-off-by: Alexander Winkowski <dereference23@outlook.com>
2025-08-24 00:14:40 +07:00
Ard Biesheuvel
6ea1b901be bpf: add __weak hook for allocating executable memory
By default, BPF uses module_alloc() to allocate executable memory,
but this is not necessary on all arches and potentially undesirable
on some of them.

So break out the module_alloc() and module_memfree() calls into __weak
functions to allow them to be overridden in arch code.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Change-Id: I582794881942bc0b766515861f2232354860536b
2025-08-24 00:13:57 +07:00
John Galt
7e6c46b254 bpf: squash revert spoofing and some backports:
Squashed commit of the following:

commit 8ac5df9c8bc9575059fff6cea0c40463b96fc129
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:58:17 2024 -0500

    Revert "BACKPORT: bpf: add skb_load_bytes_relative helper"

    This reverts commit 029893dcc5d67af16fdf0723bacaae37ec567f67.

commit dbcbceafe848744ec188f74e87e9717916d359ea
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:58:13 2024 -0500

    Revert "BACKPORT: bpf: encapsulate verifier log state into a structure"

    This reverts commit d861145b97d247cbd9fe1400df52155f48639126.

commit 478f4dfee0406b54525e68764cc9ba48af1624fc
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:58:10 2024 -0500

    Revert "BACKPORT: bpf: Rename bpf_verifer_log"

    This reverts commit 5d088635de1bf2d6ae9ea94e3dd1c601d30c0cce.

commit 7bc7c24beb82168b49337530cb56b5dfeeafe19a
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:58:07 2024 -0500

    Revert "BACKPORT: bpf: btf: Introduce BPF Type Format (BTF)"

    This reverts commit 93d34e26514b4d9d15fd176706f57634b2e97485.

commit 7106457ba90a459b6241fdd44df658c1b52c0e4b
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:58:03 2024 -0500

    Revert "bpf: Update logging functions to work with BTF"

    This reverts commit 97e6c528eb2f76c58a3b6a4c1e7fbeafcd97633a.

commit 08e68c7ba56f5e78fd1afcd5a2164716a75b0fe3
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:58:00 2024 -0500

    Revert "bpf: btf: Validate type reference"

    This reverts commit c7b7eecbc1134e5d8865af2cc0692fc7156175d5.

commit 7763cf0831970a64ed62f9b7362fca02ab6e83f1
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:51 2024 -0500

    Revert "bpf: btf: Check members of struct/union"

    This reverts commit 9a77b51cad6f04866ca067ca0e70a89b9f59ed56.

commit eb033235f666b5f66995f4cf89702de7ab4721f8
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:47 2024 -0500

    Revert "bpf: btf: Add pretty print capability for data with BTF type info"

    This reverts commit 745692103435221d6e39bc177811769995540525.

commit c32995674ace91e06c591d2f63177585e81adc75
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:43 2024 -0500

    Revert "BACKPORT: bpf: btf: Add BPF_BTF_LOAD command"

    This reverts commit 4e0afd38e20e5aa2df444361309bc07251ca6b2a.

commit 1310bc8d4aca0015c8723e7624121eddf76b3244
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:38 2024 -0500

    Revert "bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd"

    This reverts commit d4b5d76d9101b97e6fe5181bcefe7f601ed19926.

commit 881a49445608712bdb0a0f0c959838bdbc725f62
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:34 2024 -0500

    Revert "BACKPORT: bpf: btf: Clean up btf.h in uapi"

    This reverts commit 26b661822933d41b3feb59bb284334bfbbc82af4.

commit e2109fd858ebd5fe392c8bf579b9350fbca35a35
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:29 2024 -0500

    Revert "bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=y"

    This reverts commit 9abf878903404e649fef4ad0b189eec1c13d29fe.

commit 088a7d9137f03da4e0fc1d72add3901823081ccd
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:23 2024 -0500

    Revert "bpf: Fix compiler warning on info.map_ids for 32bit platform"

    This reverts commit a3a278e1f6cf167d538ac52f4ad60bb9cf8d4129.

commit 6e14aed6b63f2b266982454d83678445c062cf39
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:13 2024 -0500

    Revert "bpf: btf: Change how section is supported in btf_header"

    This reverts commit 4b60ffd683eb623a184b46761777838d7c49e707.

commit 151a60855c23bf0317734031481d779efb369d6c
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:08 2024 -0500

    Revert "bpf: btf: Check array->index_type"

    This reverts commit b00e10f1a073fadce178b6fb62496722e16db303.

commit 49775e9074a54ac5f60f518e6fc5a26172996eae
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:57:01 2024 -0500

    Revert "bpf: btf: Remove unused bits from uapi/linux/btf.h"

    This reverts commit c90c6ad34f7a8f565f351d21c2d5b9706838767d.

commit b6d6c6ab28e4b018da6ce9e64125e63f4191d3d9
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:58 2024 -0500

    Revert "bpf: btf: Avoid variable length array"

    This reverts commit fe7d1f7750242e77a73839d173ac36c3e39d4171.

commit a45bedecb9b1175fef96f2d64fba2d61777dbf35
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:49 2024 -0500

    Revert "bpf: btf: avoid -Wreturn-type warning"

    This reverts commit 78214f1e390bf1d69d9ae4ee80072ac85c34619e.

commit 445efb8465b9fa5706d81098417f15656265322e
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:46 2024 -0500

    Revert "bpf: btf: Check array t->size"

    This reverts commit aed532e7466f77885a362e4b863bf90c41e834ba.

commit 8aada590d525de735cf39196d88722e727c141e9
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:42 2024 -0500

    Revert "bpf: btf: Ensure t->type == 0 for BTF_KIND_FWD"

    This reverts commit 8c8b601dcc2e62e1276b73dfee8b49e40fb65944.

commit ed67ad09e866c9c30897488088bbb4555ea3dc80
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:38 2024 -0500

    Revert "bpf: btf: Fix bitfield extraction for big endian"

    This reverts commit b0696a226c52868d64963f01665dd1a640a92f2b.

commit 5cc64db782daf86cdf7ac77133ca94181bb29146
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:35 2024 -0500

    Revert "bpf: btf: Clean up BTF_INT_BITS() in uapi btf.h"

    This reverts commit 0f008594540b09c667ea88fc87cf289b8db334da.

commit 3a5c6b9010426449c08ecdcc10e758431b1e515f
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:31 2024 -0500

    Revert "bpf: btf: Ensure the member->offset is in the right order"

    This reverts commit c5e361ecd6d45a7cdbffda02e4691a7a37198bdd.

commit bd6173c1ac458b08d6cedaf06e6e53c93e6b0cc5
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:26 2024 -0500

    Revert "bpf: fix bpf_skb_load_bytes_relative pkt length check"

    This reverts commit 9ea14969874cd7896588df435c890f6f2f547821.

commit 0b61d26b25a65d9ded4611426c6da9c78e41567c
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:22 2024 -0500

    Revert "bpf: btf: Fix end boundary calculation for type section"

    This reverts commit 08ef221c7fb604cb60c490fa999ec7254d492f05.

commit 72fb2b9bb5b90f60ab71915fe4e57eeee3308163
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:18 2024 -0500

    Revert "bpf: btf: Fix a missing check bug"

    This reverts commit 594687e3e01e26086f3b0173e5eda9b9f0b672f8.

commit 575a34ceba4013ad0230038f29f6ea0b3ba41a7e
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:15 2024 -0500

    Revert "bpf, btf: fix a missing check bug in btf_parse"

    This reverts commit 6bf31bbc438663756e92fb0aad4f5a35fd730fb0.

commit bcca98c0bc5e19b38af3ddcd0feee80ad26e1f96
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:11 2024 -0500

    Revert "bpf: fix BTF limits"

    This reverts commit e351b26ae671dfacd82f27c1c5f66cf8089d930d.

commit f71c484e340041d8828c94b39a233ea587d8cc09
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:07 2024 -0500

    Revert "bpf/btf: Fix BTF verification of enum members in struct/union"

    This reverts commit 861e65b744c171d59850e61a01715f194f25e45c.

commit eca310722a2624d33cd49884aa18c36d435b10f8
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:56:02 2024 -0500

    Revert "bpf: btf: fix truncated last_member_type_id in btf_struct_resolve"

    This reverts commit d6cd1eac41b10e606ec7f445162a0617c01be973.

commit caae5c99a3ca7bed0e318b31b6aa7ca8260a1c52
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:58 2024 -0500

    Revert "BACKPORT: net: bpf: rename ndo_xdp to ndo_bpf"

    This reverts commit 2a1ddcb6a384745195d57b4e4cdda2a55d2cbe47.

commit f90bdcdaa095a4f10268bb740470a3e0893be21b
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:54 2024 -0500

    Revert "BACKPORT: bpf: offload: add infrastructure for loading programs for a specific netdev"

    This reverts commit a9516d402726094eafccce26a99cf5110d188be9.

commit c6e0ce9019c06d9a45c030a2bc38eed320afd45a
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:50 2024 -0500

    Revert "bpf: offload: rename the ifindex field"

    This reverts commit 36bc9c7351a1dc78b3e71571998af381e876b4cb.

commit 88b6a4d41b69df804b846a8ebdca410517e08343
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:46 2024 -0500

    Revert "BACKPORT: bpf: Check attach type at prog load time"

    This reverts commit fe5a0d514e4970d86983458136d4a2f6caeee365.

commit 9ccfaa66a5ea042331f0aacdb3667e23c8ed363e
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:43 2024 -0500

    Revert "BACKPORT: bpf: introduce BPF_PROG_QUERY command"

    This reverts commit a5720688858170f1054f9549b5a628db1c252a88.

commit adab2743b3fa0853d0351b33b0a286de745025e5
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:37 2024 -0500

    Revert "BACKPORT: bpf: Hooks for sys_bind"

    This reverts commit e484887c7e7aa026521ddc1773233368a6304b24.

commit d462e09db98ad89b3a836f9b9a925812b0d8cfe7
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:33 2024 -0500

    Revert "BACKPORT: net: Introduce __inet_bind() and __inet6_bind"

    This reverts commit 41a3131c3e94c28fd084dd6f4358baee3824fd17.

commit cdf7f55dc65b4bdf7ecfc924be77c6a039709b3d
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:29 2024 -0500

    Revert "BACKPORT: bpf: Hooks for sys_connect"

    This reverts commit f26fe7233e2885ef489707ab5a5a5dda9f081b80.

commit 97685d5058f76ba4ea6dd2db157f4537f3a8953d
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:23 2024 -0500

    Revert "BACKPORT: bpf: Post-hooks for sys_bind"

    This reverts commit 284ac5bc7c70dac338301445e94e1ad40fb40fdb.

commit d03d9c05036d3109eae643f473cc5a5ad0a80721
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:19 2024 -0500

    Revert "kernel: bpf: devmap: Create __dev_map_alloc_node"

    This reverts commit db726149fa9abfd1ca9add3e2db6b1524f7e90a3.

commit 8c34bcb3e4c6630799764871b4af2e5f9344a371
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:15 2024 -0500

    Revert "BACKPORT: xdp: Add devmap_hash map type for looking up devices by hashed index"

    This reverts commit c4d4e1d201d8433e06b2ac66041d7105095a0204.

commit ef277c7b3a08fd59943eb2b47af64afc513de008
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:11 2024 -0500

    Revert "BACKPORT: devmap: Allow map lookups from eBPF"

    This reverts commit 24d196375871c72de0de977de79afede5a7d1780.

commit 4fcd87869c55c28ed59bff916d640147601816d2
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:07 2024 -0500

    Revert "gen_headers_{arm, arm64}: Add btf.h to the list"

    This reverts commit 37edfe7c90bac355885ffec3327b338a34619792.

commit b89560e0b405b58ecc5fc12c15ad4f56147760d6
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:55:03 2024 -0500

    Revert "syscall: Fake uname to 4.19 for bpfloader/netd"

    This reverts commit 186e74af61269602d0c068d98928b1f25e03eba2.

commit fd49f8c35eb7875d6810a5a52877ebc59bfd4530
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:54:59 2024 -0500

    Revert "syscall: Fake uname to 4.19 also for netbpfload"

    This reverts commit 34b9a1ab387d7dc83ede613b2c12b3741ea08edb.

commit b853fcf2ff892664d0ff522ca7fd530bc94c023e
Author: John Galt <johngaltfirstrun@gmail.com>
Date:   Fri Dec 13 07:54:53 2024 -0500

    Revert "syscall: Increase bpf fake uname to 5.4"

    This reverts commit 9cdc014e11b410a7f03d8c968a35ee0dd6a28fff.
2025-01-01 18:14:02 -05:00
balgxmr
bb43b90fab Merge branch 'linux-4.14.y' of https://github.com/openela/kernel-lts into fourteen 2024-10-12 20:06:00 -05:00
Tze-nan Wu
b28271a442 tracing: Fix overflow in get_free_elt()
commit bcf86c01ca4676316557dd482c8416ece8c2e143 upstream.

"tracing_map->next_elt" in get_free_elt() is at risk of overflowing.

Once it overflows, new elements can still be inserted into the tracing_map
even though the maximum number of elements (`max_elts`) has been reached.
Continuing to insert elements after the overflow could result in the
tracing_map containing "tracing_map->max_size" elements, leaving no empty
entries.
If any attempt is made to insert an element into a full tracing_map using
`__tracing_map_insert()`, it will cause an infinite loop with preemption
disabled, leading to a CPU hang problem.

Fix this by preventing any further increments to "tracing_map->next_elt"
once it reaches "tracing_map->max_elt".

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 08d43a5fa0 ("tracing: Add lock-free tracing_map")
Co-developed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Link: https://lore.kernel.org/20240805055922.6277-1-Tze-nan.Wu@mediatek.com
Signed-off-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com>
Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 302ceb625d7b990db205a15e371f9a71238de91c)
[Vegard: s/atomic_fetch_add_unless/__atomic_add_unless/ due to missing
 commit bfc18e389c7a09fbbbed6bf4032396685b14246e ("atomics/treewide:
 Rename __atomic_add_unless() => atomic_fetch_add_unless()".]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:30 +00:00
Justin Stitt
53390d85b1 ntp: Safeguard against time_constant overflow
commit 06c03c8edce333b9ad9c6b207d93d3a5ae7c10c0 upstream.

Using syzkaller with the recently reintroduced signed integer overflow
sanitizer produces this UBSAN report:

UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:738:18
9223372036854775806 + 4 cannot be represented in type 'long'
Call Trace:
 handle_overflow+0x171/0x1b0
 __do_adjtimex+0x1236/0x1440
 do_adjtimex+0x2be/0x740

The user supplied time_constant value is incremented by four and then
clamped to the operating range.

Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping after incrementing which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 4' operation.

The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.

Similar to the fixups for time_maxerror and time_esterror, clamp the user
space supplied value to the operating range.

[ tglx: Switch to clamping ]

Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-c-v2-1-f3a80096f36f@google.com
Closes: https://github.com/KSPP/linux/issues/352
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a13f8b269b6f4c9371ab149ecb65d2edb52e9669)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:30 +00:00
Justin Stitt
07f7f40df9 ntp: Clamp maxerror and esterror to operating range
[ Upstream commit 87d571d6fb77ec342a985afa8744bb9bb75b3622 ]

Using syzkaller alongside the newly reintroduced signed integer overflow
sanitizer spits out this report:

UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
9223372036854775807 + 500 cannot be represented in type 'long'
Call Trace:
 handle_overflow+0x171/0x1b0
 second_overflow+0x2d6/0x500
 accumulate_nsecs_to_secs+0x60/0x160
 timekeeping_advance+0x1fe/0x890
 update_wall_time+0x10/0x30

time_maxerror is unconditionally incremented and the result is checked
against NTP_PHASE_LIMIT, but the increment itself can overflow, resulting
in wrap-around to negative space.

Before commit eea83d896e ("ntp: NTP4 user space bits update") the user
supplied value was sanity checked to be in the operating range. That change
removed the sanity check and relied on clamping in handle_overflow() which
does not work correctly when the user supplied value is in the overflow
zone of the '+ 500' operation.

The operation requires CAP_SYS_TIME and the side effect of the overflow is
NTP getting out of sync.

Miroslav confirmed that the input value should be clamped to the operating
range and the same applies to time_esterror. The latter is not used by the
kernel, but the value still should be in the operating range as it was
before the sanity check got removed.

Clamp them to the operating range.

[ tglx: Changed it to clamping and included time_esterror ]

Fixes: eea83d896e ("ntp: NTP4 user space bits update")
Signed-off-by: Justin Stitt <justinstitt@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Link: https://lore.kernel.org/all/20240517-b4-sio-ntp-usec-v2-1-d539180f2b79@google.com
Closes: https://github.com/KSPP/linux/issues/354
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ cast things to __kernel_long_t to fix compiler warnings - gregkh ]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9dfe2eef1ecfbb1f29e678700247de6010784eb9)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:29 +00:00
Thomas Gleixner
6fad54cc7a tick/broadcast: Move per CPU pointer access into the atomic section
commit 6881e75237a84093d0986f56223db3724619f26e upstream.

The recent fix for making the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.

This went unnoticed as compilers hoist the access into the non-preemptible
region where the pointer is actually used. But of course it's valid that
the compiler keeps it at the place where the code puts it which rightfully
triggers:

  BUG: using smp_processor_id() in preemptible [00000000] code:
       caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0

Move it to the actual usage site which is in a non-preemptible region.

Fixes: f7d43dd206e7 ("tick/broadcast: Make takeover of broadcast hrtimer reliable")
Reported-by: David Wang <00107082@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit f54abf332a2bc0413cfa8bd6a8511f7aa99faea0)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:29 +00:00
Douglas Anderson
4925aa995a kdb: Use the passed prompt in kdb_position_cursor()
[ Upstream commit e2e821095949cde46256034975a90f88626a2a73 ]

The function kdb_position_cursor() takes in a "prompt" parameter but
never uses it. This doesn't _really_ matter since all current callers
of the function pass the same value and it's a global variable, but
it's a bit ugly. Let's clean it up.

Found by code inspection. This patch is expected to functionally be a
no-op.

Fixes: 09b35989421d ("kdb: Use format-strings rather than '\0' injection in kdb_read()")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240528071144.1.I0feb49839c6b6f4f2c4bf34764f5e95de3f55a66@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 90f2409c1d552f27a2b2bf8dc598d147c4173128)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:25 +00:00
Arnd Bergmann
fbcf6bbfac kdb: address -Wformat-security warnings
[ Upstream commit 70867efacf4370b6c7cdfc7a5b11300e9ef7de64 ]

When -Wformat-security is not disabled, using a string pointer
as a format causes a warning:

kernel/debug/kdb/kdb_io.c: In function 'kdb_read':
kernel/debug/kdb/kdb_io.c:365:36: error: format not a string literal and no format arguments [-Werror=format-security]
  365 |                         kdb_printf(kdb_prompt_str);
      |                                    ^~~~~~~~~~~~~~
kernel/debug/kdb/kdb_io.c: In function 'kdb_getstr':
kernel/debug/kdb/kdb_io.c:456:20: error: format not a string literal and no format arguments [-Werror=format-security]
  456 |         kdb_printf(kdb_prompt_str);
      |                    ^~~~~~~~~~~~~~

Use an explcit "%s" format instead.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 5d5314d679 ("kdb: core for kgdb back end (1 of 2)")
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240528121154.3662553-1-arnd@kernel.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 22a100556ceab8b906ad180788bd6bdc07390f50)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:25 +00:00
Wenlin Kang
2527458f09 kdb: Fix bound check compiler warning
[ Upstream commit ca976bfb3154c7bc67c4651ecd144fdf67ccaee7 ]

The strncpy() function may leave the destination string buffer
unterminated, better use strscpy() instead.

This fixes the following warning with gcc 8.2:

kernel/debug/kdb/kdb_io.c: In function 'kdb_getstr':
kernel/debug/kdb/kdb_io.c:449:3: warning: 'strncpy' specified bound 256 equals destination size [-Wstringop-truncation]
   strncpy(kdb_prompt_str, prompt, CMD_BUFLEN);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Wenlin Kang <wenlin.kang@windriver.com>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Stable-dep-of: 70867efacf43 ("kdb: address -Wformat-security warnings")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit b15593e2904d2ff0094b7170f806dba0eeefac75)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:25 +00:00
Thomas Gleixner
dbffea43e8 watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
commit f944ffcbc2e1c759764850261670586ddf3bdabb upstream.

For systems on which the performance counter can expire early due to turbo
modes the watchdog handler has a safety net in place which validates that
since the last watchdog event there has at least 4/5th of the watchdog
period elapsed.

This works reliably only after the first watchdog event because the per
CPU variable which holds the timestamp of the last event is never
initialized.

So a first spurious event will validate against a timestamp of 0 which
results in a delta which is likely to be way over the 4/5 threshold of the
period.  As this might happen before the first watchdog hrtimer event
increments the watchdog counter, this can lead to false positives.

Fix this by initializing the timestamp before enabling the hardware event.
Reset the rearm counter as well, as that might be non zero after the
watchdog was disabled and reenabled.

Link: https://lkml.kernel.org/r/87frsfu15a.ffs@tglx
Fixes: 7edaeb6841 ("kernel/watchdog: Prevent false positives with turbo modes")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6d94ca5d571dfdb34f12dc3f63273ea275e8f40c)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:24 +00:00
Yu Liao
3065612975 tick/broadcast: Make takeover of broadcast hrtimer reliable
commit f7d43dd206e7e18c182f200e67a8db8c209907fa upstream.

Running the LTP hotplug stress test on a aarch64 machine results in
rcu_sched stall warnings when the broadcast hrtimer was owned by the
un-plugged CPU. The issue is the following:

CPU1 (owns the broadcast hrtimer)	CPU2

				tick_broadcast_enter()
				  // shutdown local timer device
				  broadcast_shutdown_local()
				...
				tick_broadcast_exit()
				  clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT)
				  // timer device is not programmed
				  cpumask_set_cpu(cpu, tick_broadcast_force_mask)

				initiates offlining of CPU1
take_cpu_down()
/*
 * CPU1 shuts down and does not
 * send broadcast IPI anymore
 */
				takedown_cpu()
				  hotplug_cpu__broadcast_tick_pull()
				    // move broadcast hrtimer to this CPU
				    clockevents_program_event()
				      bc_set_next()
					hrtimer_start()
					/*
					 * timer device is not programmed
					 * because only the first expiring
					 * timer will trigger clockevent
					 * device reprogramming
					 */

What happens is that CPU2 exits broadcast mode with force bit set, then the
local timer device is not reprogrammed and CPU2 expects to receive the
expired event by the broadcast IPI. But this does not happen because CPU1
is offlined by CPU2. CPU switches the clockevent device to ONESHOT state,
but does not reprogram the device.

The subsequent reprogramming of the hrtimer broadcast device does not
program the clockevent device of CPU2 either because the pending expiry
time is already in the past and the CPU expects the event to be delivered.
As a consequence all CPUs which wait for a broadcast event to be delivered
are stuck forever.

Fix this issue by reprogramming the local timer device if the broadcast
force bit of the CPU is set so that the broadcast hrtimer is delivered.

[ tglx: Massage comment and change log. Add Fixes tag ]

Fixes: 989dcb645c ("tick: Handle broadcast wakeup of multiple cpus")
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240711124843.64167-1-liaoyu15@huawei.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit dfe19aa91378972f10530635ad83b2d77f481044)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:23 +00:00
Adrian Hunter
26864f03cc perf: Prevent passing zero nr_pages to rb_alloc_aux()
[ Upstream commit dbc48c8f41c208082cfa95e973560134489e3309 ]

nr_pages is unsigned long but gets passed to rb_alloc_aux() as an int,
and is stored as an int.

Only power-of-2 values are accepted, so if nr_pages is a 64_bit value, it
will be passed to rb_alloc_aux() as zero.

That is not ideal because:
 1. the value is incorrect
 2. rb_alloc_aux() is at risk of misbehaving, although it manages to
 return -ENOMEM in that case, it is a result of passing zero to get_order()
 even though the get_order() result is documented to be undefined in that
 case.

Fix by simply validating the maximum supported value in the first place.
Use -ENOMEM error code for consistency with the current error code that
is returned in that case.

Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240624201101.60186-6-adrian.hunter@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit d7b1a76f33e6fc93924725b4410126740c890c44)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:20 +00:00
Adrian Hunter
6f7bc617b3 perf: Fix perf_aux_size() for greater-than 32-bit size
[ Upstream commit 3df94a5b1078dfe2b0c03f027d018800faf44c82 ]

perf_buffer->aux_nr_pages uses a 32-bit type, so a cast is needed to
calculate a 64-bit size.

Fixes: 45bfb2e504 ("perf: Add AUX area to ring buffer for raw data streams")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240624201101.60186-5-adrian.hunter@intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 542abbf58e88f34dfc659b63476a5976acf52c0e)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-10-10 10:27:20 +00:00
balgxmr
a06692c4a8 Revert "Revert "Revert "mm: Import oplus memory management hacks"""
This reverts commit 309a8f8510.
2024-09-16 18:42:44 -05:00
balgxmr
11137c0736 Merge branch 'linux-4.14.y' of https://github.com/openela/kernel-lts into fourteen-staging 2024-09-15 09:34:31 -05:00
Jinliang Zheng
3409634195 mm: optimize the redundant loop of mm_update_owner_next()
commit cf3f9a593dab87a032d2b6a6fb205e7f3de4f0a1 upstream.

When mm_update_owner_next() is racing with swapoff (try_to_unuse()) or
/proc or ptrace or page migration (get_task_mm()), it is impossible to
find an appropriate task_struct in the loop whose mm_struct is the same as
the target mm_struct.

If the above race condition is combined with the stress-ng-zombie and
stress-ng-dup tests, such a long loop can easily cause a Hard Lockup in
write_lock_irq() for tasklist_lock.

Recognize this situation in advance and exit early.

Link: https://lkml.kernel.org/r/20240620122123.3877432-1-alexjlzheng@tencent.com
Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tycho Andersen <tandersen@netflix.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2890a7faf552dd3e4e40e343610ba3e0ba5b788e)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-08-23 16:50:05 +00:00
shygosh
abf49c4b59 sched: Fix erroneous definition of sched_feat(RT_PUSH_IPI)
Commit [1] moved features.h inclusion in sched.h to the top of the file.
This prevents features.h from seeing the definition of HAVE_RT_PUSH_IPI,
which in turn causes the sched_feat(RT_PUSH_IPI) to be always disabled.

Fix this by checking the parent macro of HAVE_RT_PUSH_IPI instead.

[1] ("sched: Resolve sched_feat() at compile time to improve code optimization")

Signed-off-by: shygosh <shygosh@proton.me>
2024-08-18 20:34:35 -05:00
Syed Rameez Mustafa
052c21eb7d sched: turn off the TTWU_QUEUE feature
While the feature TTWU_QUEUE has the advantage of reducing cache
bouncing of runqueue locks, it has the side effect that runqueue
statistics are not updated until the remote CPU has a chance to
enqueue the task. Since there is no upper bound on the amount of
time it can take the remote CPU to enqueue the task, several
sequential wakeups can result in suboptimal task placement based
on the stale statistics. Turn off the feature as the cost of
sub-optimal placement is much higher than the cost of cache bouncing
spinlocks for msm based systems.

Change-Id: I0b85c0225237b2bc44f54934769f5e3750c0f3d6
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
[pkondeti@codeaurora.org: Resolved minor merge conflict]
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2024-08-13 15:19:23 -05:00
Sultan Alsawaf
29ab145632 sbalance: Fix systemic issues caused by flawed IRQ statistics
SBalance's statistic of new interrupts for each CPU is inherently flawed in
that it cannot track IRQ migration that occurs in between balance periods.
As a result, SBalance can observe a flawed number of new interrupts for a
CPU, which hurts its balancing decisions.

Furthermore, SBalance incorrectly assumes that IRQs are affined where
SBalance last placed them, which breaks SBalance entirely when the
assumption doesn't hold true.

As it turns out, it can be quite common to change an IRQ's affinity and
observe a successful return value despite the IRQ not actually moving. At
the very least this is observed on ARM's GICv3, and results in SBalance
never moving such an IRQ ever again because SBalance always thinks it has
zero new interrupts.

Since we can't trust irqchip drivers or hardware, gather IRQ statistics
directly in order to get the true number of new interrupts for each CPU and
the actual affinity of each IRQ based on the last CPU it fired upon.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-08-13 15:19:16 -05:00
Sultan Alsawaf
a4c07db737 sbalance: Use non-atomic cpumask_clear_cpu() variant
The atomic cpumask_clear_cpu() isn't needed. Use __cpumask_clear_cpu()
instead as a micro-optimization, and for clarity.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-08-13 15:19:16 -05:00
Sultan Alsawaf
fa3ef8f843 sbalance: Use a deferrable timer to avoid waking idle CPUs
SBalance is designed to poll to balance IRQs, but it shouldn't kick CPUs
out of idle to do so because idle CPUs clearly aren't processing
interrupts.

Open code a freezable wait that uses a deferrable timer in order to prevent
SBalance from waking up idle CPUs when there is little interrupt traffic.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-08-13 15:19:16 -05:00
Sultan Alsawaf
90fe24a596 sbalance: Allow IRQs to be moved off of excluded CPUs
Excluded CPUs are excluded from IRQ balancing with the intention that those
CPUs shouldn't really be processing interrupts, and thus shouldn't have
IRQs moved to them. However, SBalance completely ignores excluded CPUs,
which can cause them to end up with a disproportionate amount of interrupt
traffic that SBalance won't spread out. An easy example of this is when
CPU0 is an excluded CPU, since CPU0 ends up with all interrupts affined to
it by default on arm64.

Allow SBalance to move IRQs off of excluded CPUs so that they cannot slip
under the radar and pile up on an excluded CPU, like when CPU0 is excluded.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
2024-08-13 15:19:15 -05:00
pwnrazr
469a9911a9 Revert "Trigger sbalance on enable_nonboot_cpus"
This reverts commit 7226f90fb58123c5dbece9d74c9095af946d1219.

As per Sultan:
"This is only after wake up from deep sleep.
Also, the IRQ subsystem remembers which CPU
an IRQ was affined to and should put the
IRQs back upon resume"
2024-08-13 15:19:07 -05:00
balgxmr
0fb6d25210 Merge branch 'linux-4.14.y' of https://github.com/openela/kernel-lts into fourteen 2024-08-09 14:23:24 -05:00
balgxmr
a9b2658915 Merge branch 'linux-4.14.y' of https://github.com/openela/kernel-lts into fourteen-staging 2024-08-08 17:28:37 -05:00
Peter Oberparleiter
cd513bd7a5 gcov: add support for GCC 14
commit c1558bc57b8e5b4da5d821537cd30e2e660861d8 upstream.

Using gcov on kernels compiled with GCC 14 results in truncated 16-byte
long .gcda files with no usable data.  To fix this, update GCOV_COUNTERS
to match the value defined by GCC 14.

Tested with GCC versions 14.1.0 and 13.2.0.

Link: https://lkml.kernel.org/r/20240610092743.1609845-1-oberpar@linux.ibm.com
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reported-by: Allison Henderson <allison.henderson@oracle.com>
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 48d5f4d66962fde59c78dd0a7cf6bd03b3f49c61)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-08-08 15:52:18 +00:00
Paul E. McKenney
d3d05ae427 rcutorture: Fix rcu_torture_one_read() pipe_count overflow comment
[ Upstream commit 8b9b443fa860276822b25057cb3ff3b28734dec0 ]

The "pipe_count > RCU_TORTURE_PIPE_LEN" check has a comment saying "Should
not happen, but...".  This is only true when testing an RCU whose grace
periods are always long enough.  This commit therefore fixes this comment.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/lkml/CAHk-=wi7rJ-eGq+xaxVfzFEgbL9tdf6Kc8Z89rCpfcQOKm74Tw@mail.gmail.com/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 6652029853316f4c273219145ef0e71b148bbe01)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-08-08 15:52:17 +00:00
Daniel Thompson
c89fb4fb75 kdb: Use format-specifiers rather than memset() for padding in kdb_read()
commit c9b51ddb66b1d96e4d364c088da0f1dfb004c574 upstream.

Currently when the current line should be removed from the display
kdb_read() uses memset() to fill a temporary buffer with spaces.
The problem is not that this could be trivially implemented using a
format string rather than open coding it. The real problem is that
it is possible, on systems with a long kdb_prompt_str, to write past
the end of the tmpbuffer.

Happily, as mentioned above, this can be trivially implemented using a
format string. Make it so!

Cc: stable@vger.kernel.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-5-f236dbe9828d@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2467f3f182eb35627534effd4956fceb2504c127)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-07-15 18:30:24 +00:00
Daniel Thompson
201f3d45de kdb: Merge identical case statements in kdb_read()
commit 6244917f377bf64719551b58592a02a0336a7439 upstream.

The code that handles case 14 (down) and case 16 (up) has been copy and
pasted despite being byte-for-byte identical. Combine them.

Cc: stable@vger.kernel.org # Not a bug fix but it is needed for later bug fixes
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-4-f236dbe9828d@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4a89182788f9af9a290c19098382fb972ebe2783)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-07-15 18:30:24 +00:00
Daniel Thompson
fc2a098bc0 kdb: Fix console handling when editing and tab-completing commands
commit db2f9c7dc29114f531df4a425d0867d01e1f1e28 upstream.

Currently, if the cursor position is not at the end of the command buffer
and the user uses the Tab-complete functions, then the console does not
leave the cursor in the correct position.

For example consider the following buffer with the cursor positioned
at the ^:

md kdb_pro 10
          ^

Pressing tab should result in:

md kdb_prompt_str 10
                 ^

However this does not happen. Instead the cursor is placed at the end
(after then 10) and further cursor movement redraws incorrectly. The
same problem exists when we double-Tab but in a different part of the
code.

Fix this by sending a carriage return and then redisplaying the text to
the left of the cursor.

Cc: stable@vger.kernel.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-3-f236dbe9828d@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 21c068c1bbb4c336741749596d004b1965faab2c)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-07-15 18:30:23 +00:00
Daniel Thompson
f990953b6d kdb: Use format-strings rather than '\0' injection in kdb_read()
commit 09b35989421dfd5573f0b4683c7700a7483c71f9 upstream.

Currently when kdb_read() needs to reposition the cursor it uses copy and
paste code that works by injecting an '\0' at the cursor position before
delivering a carriage-return and reprinting the line (which stops at the
'\0').

Tidy up the code by hoisting the copy and paste code into an appropriately
named function. Additionally let's replace the '\0' injection with a
proper field width parameter so that the string will be abridged during
formatting instead.

Cc: stable@vger.kernel.org # Not a bug fix but it is needed for later bug fixes
Tested-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-2-f236dbe9828d@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4edfbbaca46491b06af14e49dcb79ac661d0bbdc)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-07-15 18:30:23 +00:00
Daniel Thompson
f751d9a9f3 kdb: Fix buffer overflow during tab-complete
commit e9730744bf3af04cda23799029342aa3cddbc454 upstream.

Currently, when the user attempts symbol completion with the Tab key, kdb
will use strncpy() to insert the completed symbol into the command buffer.
Unfortunately it passes the size of the source buffer rather than the
destination to strncpy() with predictably horrible results. Most obviously
if the command buffer is already full but cp, the cursor position, is in
the middle of the buffer, then we will write past the end of the supplied
buffer.

Fix this by replacing the dubious strncpy() calls with memmove()/memcpy()
calls plus explicit boundary checks to make sure we have enough space
before we start moving characters around.

Reported-by: Justin Stitt <justinstitt@google.com>
Closes: https://lore.kernel.org/all/CAFhGd8qESuuifuHsNjFPR-Va3P80bxrw+LqvC8deA8GziUJLpw@mail.gmail.com/
Cc: stable@vger.kernel.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Tested-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20240424-kgdb_read_refactor-v3-1-f236dbe9828d@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit fb824a99e148ff272a53d71d84122728b5f00992)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-07-15 18:30:23 +00:00
Sagi Grimberg
45e0b12972 params: lift param_set_uint_minmax to common code
[ Upstream commit 2a14c9ae15a38148484a128b84bff7e9ffd90d68 ]

It is a useful helper hence move it to common code so others can enjoy
it.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Stable-dep-of: 3ebc46ca8675 ("tcp: Fix shift-out-of-bounds in dctcp_update_alpha().")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 459de98d7a6b3d504b5e8664f32f59a306dd425c)
[Harshit: Also lift param_set_uint_minmax from staging lustre driver,
 it is removed in 4.19.y so this upstream commit didnot try cleaning
 it up there]
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-07-15 17:44:31 +00:00
Vitalii Bursov
cd212b4b36 sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level
[ Upstream commit a1fd0b9d751f840df23ef0e75b691fc00cfd4743 ]

Change relax_domain_level checks so that it would be possible
to include or exclude all domains from newidle balancing.

This matches the behavior described in the documentation:

  -1   no request. use system default or follow request of others.
   0   no search.
   1   search siblings (hyperthreads in a core).

"2" enables levels 0 and 1, level_max excludes the last (level_max)
level, and level_max+1 includes all levels.

Fixes: 1d3504fcf5 ("sched, cpuset: customize sched domains, core")
Signed-off-by: Vitalii Bursov <vitaly@bursov.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/bd6de28e80073c79466ec6401cdeae78f0d4423d.1714488502.git.vitaly@bursov.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 454de5ed81766fbbf4777c43392d8b0b35e7e16d)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-07-08 08:15:11 +00:00
Valentin Schneider
03baaba21b sched/topology: Don't set SD_BALANCE_WAKE on cpuset domain relax
[ Upstream commit 9ae7ab20b4835dbea0e5fc6a5c70171dc354a72e ]

As pointed out in commit

  182a85f8a1 ("sched: Disable wakeup balancing")

SD_BALANCE_WAKE is a tad too aggressive, and is usually left unset.

However, it turns out cpuset domain relaxation will unconditionally set it
on domains below the relaxation level. This made sense back when
SD_BALANCE_WAKE was set unconditionally, but it no longer is the case.

We can improve things slightly by noticing that set_domain_attribute() is
always called after sd_init(), so rather than setting flags we can rely on
whatever sd_init() is doing and only clear certain flags when above the
relaxation level.

While at it, slightly clean up the function and flip the relax level
check to be more human readable.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mingo@kernel.org
Cc: vincent.guittot@linaro.org
Cc: juri.lelli@redhat.com
Cc: seto.hidetoshi@jp.fujitsu.com
Cc: qperret@google.com
Cc: Dietmar.Eggemann@arm.com
Cc: morten.rasmussen@arm.com
Link: https://lkml.kernel.org/r/20191014164408.32596-1-valentin.schneider@arm.com
Stable-dep-of: a1fd0b9d751f ("sched/fair: Allow disabling sched_balance_newidle with sched_relax_domain_level")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 046daa54c348ccec12ab38b92923060dd09ef00b)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-07-08 08:15:11 +00:00
Petr Pavlu
a43f1f02b3 ring-buffer: Fix a race between readers and resize checks
commit c2274b908db05529980ec056359fae916939fdaa upstream.

The reader code in rb_get_reader_page() swaps a new reader page into the
ring buffer by doing cmpxchg on old->list.prev->next to point it to the
new page. Following that, if the operation is successful,
old->list.next->prev gets updated too. This means the underlying
doubly-linked list is temporarily inconsistent, page->prev->next or
page->next->prev might not be equal back to page for some page in the
ring buffer.

The resize operation in ring_buffer_resize() can be invoked in parallel.
It calls rb_check_pages() which can detect the described inconsistency
and stop further tracing:

[  190.271762] ------------[ cut here ]------------
[  190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0
[  190.271789] Modules linked in: [...]
[  190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1
[  190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G            E      6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f
[  190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
[  190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0
[  190.272023] Code: [...]
[  190.272028] RSP: 0018:ffff9c37463abb70 EFLAGS: 00010206
[  190.272034] RAX: ffff8eba04b6cb80 RBX: 0000000000000007 RCX: ffff8eba01f13d80
[  190.272038] RDX: ffff8eba01f130c0 RSI: ffff8eba04b6cd00 RDI: ffff8eba0004c700
[  190.272042] RBP: ffff8eba0004c700 R08: 0000000000010002 R09: 0000000000000000
[  190.272045] R10: 00000000ffff7f52 R11: ffff8eba7f600000 R12: ffff8eba0004c720
[  190.272049] R13: ffff8eba00223a00 R14: 0000000000000008 R15: ffff8eba067a8000
[  190.272053] FS:  00007f1bd64752c0(0000) GS:ffff8eba7f680000(0000) knlGS:0000000000000000
[  190.272057] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  190.272061] CR2: 00007f1bd6662590 CR3: 000000010291e001 CR4: 0000000000370ef0
[  190.272070] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  190.272073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  190.272077] Call Trace:
[  190.272098]  <TASK>
[  190.272189]  ring_buffer_resize+0x2ab/0x460
[  190.272199]  __tracing_resize_ring_buffer.part.0+0x23/0xa0
[  190.272206]  tracing_resize_ring_buffer+0x65/0x90
[  190.272216]  tracing_entries_write+0x74/0xc0
[  190.272225]  vfs_write+0xf5/0x420
[  190.272248]  ksys_write+0x67/0xe0
[  190.272256]  do_syscall_64+0x82/0x170
[  190.272363]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  190.272373] RIP: 0033:0x7f1bd657d263
[  190.272381] Code: [...]
[  190.272385] RSP: 002b:00007ffe72b643f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  190.272391] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1bd657d263
[  190.272395] RDX: 0000000000000002 RSI: 0000555a6eb538e0 RDI: 0000000000000001
[  190.272398] RBP: 0000555a6eb538e0 R08: 000000000000000a R09: 0000000000000000
[  190.272401] R10: 0000555a6eb55190 R11: 0000000000000246 R12: 00007f1bd6662500
[  190.272404] R13: 0000000000000002 R14: 00007f1bd6667c00 R15: 0000000000000002
[  190.272412]  </TASK>
[  190.272414] ---[ end trace 0000000000000000 ]---

Note that ring_buffer_resize() calls rb_check_pages() only if the parent
trace_buffer has recording disabled. Recent commit d78ab792705c
("tracing: Stop current tracer when resizing buffer") causes that it is
now always the case which makes it more likely to experience this issue.

The window to hit this race is nonetheless very small. To help
reproducing it, one can add a delay loop in rb_get_reader_page():

 ret = rb_head_page_replace(reader, cpu_buffer->reader_page);
 if (!ret)
 	goto spin;
 for (unsigned i = 0; i < 1U << 26; i++)  /* inserted delay loop */
 	__asm__ __volatile__ ("" : : : "memory");
 rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list;

.. and then run the following commands on the target system:

 echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
 while true; do
 	echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
 	echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
 done &
 while true; do
 	for i in /sys/kernel/tracing/per_cpu/*; do
 		timeout 0.1 cat $i/trace_pipe; sleep 0.2
 	done
 done

To fix the problem, make sure ring_buffer_resize() doesn't invoke
rb_check_pages() concurrently with a reader operating on the same
ring_buffer_per_cpu by taking its cpu_buffer->reader_lock.

Link: https://lore.kernel.org/linux-trace-kernel/20240517134008.24529-3-petr.pavlu@suse.com

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 659f451ff2 ("ring-buffer: Add integrity check at end of iter read")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
[ Fixed whitespace ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b50932ea673b5a089a4bb570a8a868d95c72854e)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-07-08 08:14:19 +00:00
balgxmr
a3e27025de Merge branch 'linux-4.14.y' of https://github.com/openela/kernel-lts into fourteen 2024-06-07 18:31:44 -05:00
Robin H. Johnson
e3b3f139e8 tracing: Show size of requested perf buffer
commit a90afe8d020da9298c98fddb19b7a6372e2feb45 upstream.

If the perf buffer isn't large enough, provide a hint about how large it
needs to be for whatever is running.

Link: https://lkml.kernel.org/r/20210831043723.13481-1-robbat2@gentoo.org

Signed-off-by: Robin H. Johnson <robbat2@gentoo.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
(cherry picked from commit 78b92d50fe6ab79d536f4b12c5bde15f2751414d)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-05-31 12:57:27 +00:00
Siddh Raman Pant
638cc92651 Revert "tracing/trigger: Fix to return error if failed to alloc snapshot"
This reverts commit bcf4a115a5068f3331fafb8c176c1af0da3d8b19 which is
commit 0958b33ef5a04ed91f61cef4760ac412080c4e08 upstream.

The change has an incorrect assumption about the return value because
in the current stable trees for versions 5.15 and before, the following
commit responsible for making 0 a success value is not present:
b8cc44a4d3c1 ("tracing: Remove logic for registering multiple event triggers at a time")

The return value should be 0 on failure in the current tree, because in
the functions event_trigger_callback() and event_enable_trigger_func(),
we have:

	ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);
	/*
	 * The above returns on success the # of functions enabled,
	 * but if it didn't find any functions it returns zero.
	 * Consider no functions a failure too.
	 */
	if (!ret) {
		ret = -ENOENT;

Cc: stable@kernel.org # 5.15, 5.10, 5.4, 4.19
Signed-off-by: Siddh Raman Pant <siddh.raman.pant@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 34925d01baf3ee62ab21c21efd9e2c44c24c004a)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-05-31 12:30:39 +00:00
Zheng Yejian
b13c030fd2 kprobes: Fix possible use-after-free issue on kprobe registration
commit 325f3fb551f8cd672dbbfc4cf58b14f9ee3fc9e8 upstream.

When unloading a module, its state is changing MODULE_STATE_LIVE ->
 MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change will take
a time. `is_module_text_address()` and `__module_text_address()`
works with MODULE_STATE_LIVE and MODULE_STATE_GOING.
If we use `is_module_text_address()` and `__module_text_address()`
separately, there is a chance that the first one is succeeded but the
next one is failed because module->state becomes MODULE_STATE_UNFORMED
between those operations.

In `check_kprobe_address_safe()`, if the second `__module_text_address()`
is failed, that is ignored because it expected a kernel_text address.
But it may have failed simply because module->state has been changed
to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify
non-exist module text address (use-after-free).

To fix this problem, we should not use separated `is_module_text_address()`
and `__module_text_address()`, but use only `__module_text_address()`
once and do `try_module_get(module)` which is only available with
MODULE_STATE_LIVE.

Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/

Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
[Fix conflict due to lack dependency
commit 223a76b268c9 ("kprobes: Fix coding style issues")]
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b5808d40093403334d939e2c3c417144d12a6f33)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-05-31 12:30:39 +00:00
Arnd Bergmann
39f0520cea tracing: hide unused ftrace_event_id_fops
[ Upstream commit 5281ec83454d70d98b71f1836fb16512566c01cd ]

When CONFIG_PERF_EVENTS, a 'make W=1' build produces a warning about the
unused ftrace_event_id_fops variable:

kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=]
 2155 | static const struct file_operations ftrace_event_id_fops = {

Hide this in the same #ifdef as the reference to it.

Link: https://lore.kernel.org/linux-trace-kernel/20240403080702.3509288-7-arnd@kernel.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ajay Kaher <akaher@vmware.com>
Cc: Jinjie Ruan <ruanjinjie@huawei.com>
Cc: Clément Léger <cleger@rivosinc.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Fixes: 620a30e97f ("tracing: Don't pass file_operations array to event_create_dir()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 8bfa576fe3c6df875a16f3eb27f7ec3fdd7f3168)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-05-31 12:30:39 +00:00
Yang Jihong
d1bcca1f15 perf/core: Fix reentry problem in perf_output_read_group()
commit 6b959ba22d34ca793ffdb15b5715457c78e38b1a upstream.

perf_output_read_group may respond to IPI request of other cores and invoke
__perf_install_in_context function. As a result, hwc configuration is modified.
causing inconsistency and unexpected consequences.

Interrupts are not disabled when perf_output_read_group reads PMU counter.
In this case, IPI request may be received from other cores.
As a result, PMU configuration is modified and an error occurs when
reading PMU counter:

		     CPU0                                         CPU1
						      __se_sys_perf_event_open
							perf_install_in_context
  perf_output_read_group                                  smp_call_function_single
    for_each_sibling_event(sub, leader) {                   generic_exec_single
      if ((sub != event) &&                                   remote_function
	  (sub->state == PERF_EVENT_STATE_ACTIVE))                    |
  <enter IPI handler: __perf_install_in_context>   <----RAISE IPI-----+
  __perf_install_in_context
    ctx_resched
      event_sched_out
	armpmu_del
	  ...
	  hwc->idx = -1; // event->hwc.idx is set to -1
  ...
  <exit IPI>
	      sub->pmu->read(sub);
		armpmu_read
		  armv8pmu_read_counter
		    armv8pmu_read_hw_counter
		      int idx = event->hw.idx; // idx = -1
		      u64 val = armv8pmu_read_evcntr(idx);
			u32 counter = ARMV8_IDX_TO_COUNTER(idx); // invalid counter = 30
			read_pmevcntrn(counter) // undefined instruction

Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220902082918.179248-1-yangjihong1@huawei.com
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a2039c87d30177f0fd349ab000e6af25a0d48de8)
[Vegard: fix conflict in context due to missing commit
 ece0857258cbaf20b9828157035999f46ca060c8 ("perf/core: Add a new read
 format to get a number of lost samples").]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
2024-05-30 09:00:41 +00:00