44 Commits

Author SHA1 Message Date
Daniel Borkmann
324bf264fc BACKPORT: bpf: Fix 32 bit src register truncation on div/mod
commit e88b2c6e5a4d9ce30d75391e4d950da74bb2bd90 upstream.

While reviewing a different fix, John and I noticed an oddity in one of the
BPF program dumps that stood out, for example:

  # bpftool p d x i 13
   0: (b7) r0 = 808464450
   1: (b4) w4 = 808464432
   2: (bc) w0 = w0
   3: (15) if r0 == 0x0 goto pc+1
   4: (9c) w4 %= w0
  [...]

In line 2 we noticed that the mov32 would 32-bit truncate the original src
register for the div/mod operation. While for the two operations the dst
register is typically marked unknown, e.g. from adjust_scalar_min_max_vals(),
the src register is not, and thus the verifier keeps tracking the original
bounds, simplified:

  0: R1=ctx(id=0,off=0,imm=0) R10=fp0
  0: (b7) r0 = -1
  1: R0_w=invP-1 R1=ctx(id=0,off=0,imm=0) R10=fp0
  1: (b7) r1 = -1
  2: R0_w=invP-1 R1_w=invP-1 R10=fp0
  2: (3c) w0 /= w1
  3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP-1 R10=fp0
  3: (77) r1 >>= 32
  4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1_w=invP4294967295 R10=fp0
  4: (bf) r0 = r1
  5: R0_w=invP4294967295 R1_w=invP4294967295 R10=fp0
  5: (95) exit
  processed 6 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

The runtime result of r0 at exit is 0 instead of the expected -1. Remove the
verifier mov32 src rewrite in div/mod and replace it with a jmp32 test
instead. After the fix, the following code is generated for
dividend r1 and divisor r6:

  div, 64 bit:                             div, 32 bit:

   0: (b7) r6 = 8                           0: (b7) r6 = 8
   1: (b7) r1 = 8                           1: (b7) r1 = 8
   2: (55) if r6 != 0x0 goto pc+2           2: (56) if w6 != 0x0 goto pc+2
   3: (ac) w1 ^= w1                         3: (ac) w1 ^= w1
   4: (05) goto pc+1                        4: (05) goto pc+1
   5: (3f) r1 /= r6                         5: (3c) w1 /= w6
   6: (b7) r0 = 0                           6: (b7) r0 = 0
   7: (95) exit                             7: (95) exit

  mod, 64 bit:                             mod, 32 bit:

   0: (b7) r6 = 8                           0: (b7) r6 = 8
   1: (b7) r1 = 8                           1: (b7) r1 = 8
   2: (15) if r6 == 0x0 goto pc+1           2: (16) if w6 == 0x0 goto pc+1
   3: (9f) r1 %= r6                         3: (9c) w1 %= w6
   4: (b7) r0 = 0                           4: (b7) r0 = 0
   5: (95) exit                             5: (95) exit

x86 in particular can throw a 'divide error' exception for the div
instruction not only when the divisor is zero, but also when
the quotient is too large for the designated register. For the
edx:eax and rdx:rax dividend pair it is not an issue in the x86 BPF JIT,
since we always zero edx (rdx). Hence the only protection really
needed is against the divisor being zero.

Also add some other code missed when backporting.

Fixes: 68fda450a7df ("bpf: fix 32-bit divide by zero")
Co-developed-by: John Fastabend <john.fastabend@gmail.com>
Change-Id: I35a7f4f346bbcbc2f01003e607f2b00b7abe92ae
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02 22:15:38 +08:00
Yuntao Wang
eac489c955 UPSTREAM: bpf: Fix incorrect memory charge cost calculation in stack_map_alloc()
commit b45043192b3e481304062938a6561da2ceea46a6 upstream.

This is a backport of the original upstream patch for 5.4/5.10.

The original upstream patch has been applied to 5.4/5.10 branches, which
simply removed the line:

  cost += n_buckets * (value_size + sizeof(struct stack_map_bucket));

This is correct for the upstream branch but incorrect for the 5.4/5.10
branches, as they do not have commit 370868107bf6 ("bpf:
Eliminate rlimit-based memory accounting for stackmap maps"), so the
bpf_map_charge_init() function has not been removed.

As a result, bpf_map_charge_init() in the 5.4/5.10 branches is passed a
wrong memory charge cost; the

  attr->max_entries * (sizeof(struct stack_map_bucket) + (u64)value_size))

part is missing. Let's fix it.

Cc: <stable@vger.kernel.org> # 5.4.y
Cc: <stable@vger.kernel.org> # 5.10.y
Change-Id: I91bcb932cab87a23f16a85db2e2f9269b5be8638
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02 22:15:10 +08:00
Yuntao Wang
1082b45db8 UPSTREAM: bpf: Fix excessive memory allocation in stack_map_alloc()
[ Upstream commit b45043192b3e481304062938a6561da2ceea46a6 ]

The 'n_buckets * (value_size + sizeof(struct stack_map_bucket))' part of the
allocated memory for 'smap' is never used after the memlock accounting was
removed, thus get rid of it.

[ Note, Daniel:

Commit b936ca643ade ("bpf: rework memlock-based memory accounting for maps")
moved `cost += n_buckets * (value_size + sizeof(struct stack_map_bucket))`
up and therefore before the bpf_map_area_alloc() allocation, sigh. In a later
step, commit c85d69135a91 ("bpf: move memory size checks to bpf_map_charge_init()")
moved the `cost >= U32_MAX - PAGE_SIZE` overflow checks into
bpf_map_charge_init(). And then 370868107bf6 ("bpf: Eliminate rlimit-based
memory accounting for stackmap maps") finally removed the bpf_map_charge_init()
call. Anyway, the original code did the allocation the same way as /after/ this fix. ]

Fixes: b936ca643ade ("bpf: rework memlock-based memory accounting for maps")
Change-Id: I4a8febd929f09ff4e328ca099e1c47894c92d12a
Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220407130423.798386-1-ytcoode@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02 22:15:10 +08:00
Roman Gushchin
1a371f6225 UPSTREAM: bpf: move memory size checks to bpf_map_charge_init()
Most bpf map types do similar checks and bytes-to-pages
conversions during memory allocation and charging.

Let's unify these checks by moving them into bpf_map_charge_init().

Change-Id: I55ceded2303102feba9e485042e8f5169f490609
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:45 +08:00
Roman Gushchin
632d849a6d UPSTREAM: bpf: rework memlock-based memory accounting for maps
In order to unify the existing memlock charging code with the
memcg-based memory accounting, which will be added later, let's
rework the current scheme.

Currently the following design is used:
  1) .alloc() callback optionally checks if the allocation will likely
     succeed using bpf_map_precharge_memlock()
  2) .alloc() performs actual allocations
  3) .alloc() callback calculates map cost and sets map.memory.pages
  4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
     and performs actual charging; in case of failure the map is
     destroyed
  <map is in use>
  1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
     performs uncharge and releases the user
  2) .map_free() callback releases the memory

The scheme can be simplified and made more robust:
  1) .alloc() calculates map cost and calls bpf_map_charge_init()
  2) bpf_map_charge_init() sets map.memory.user and performs actual
    charge
  3) .alloc() performs actual allocations
  <map is in use>
  1) .map_free() callback releases the memory
  2) bpf_map_charge_finish() performs uncharge and releases the user

The new scheme also allows reusing the bpf_map_charge_init()/finish()
functions for memcg-based accounting. Because charges are performed
before actual allocations and uncharges after freeing the memory,
no bogus memory pressure can be created.

In cases when the map structure is not available (e.g. it's not
created yet, or is already destroyed), an on-stack bpf_map_memory
structure is used. The charge can be transferred with the
bpf_map_charge_move() function.

Change-Id: I299bfa9d3e74f366861b6de3bf17951a1374824b
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:44 +08:00
Roman Gushchin
7032f89546 UPSTREAM: bpf: group memory related fields in struct bpf_map_memory
Group "user" and "pages" fields of bpf_map into the bpf_map_memory
structure. Later it can be extended with "memcg" and other related
information.

The main reason for such a change (besides cosmetics) is to pass the
bpf_map_memory structure to charging functions before the actual
allocation of bpf_map.

Change-Id: I04e4edf805bfe4c26fce45f7166317fe00dd0dfa
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:44 +08:00
Mauricio Vasquez B
74d966c14e UPSTREAM: bpf: rename stack trace map operations
In the following patches, queue and stack maps (FIFO and LIFO
data structures) will be implemented. In order to avoid confusion and
a possible name clash, rename stack_map_ops to stack_trace_map_ops.

Change-Id: I4083e53979275e3f710fca7aa60da879416afcf5
Signed-off-by: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:21 +08:00
Prashant Bhole
b61bfd4593 UPSTREAM: bpf: return EOPNOTSUPP when map lookup isn't supported
Return ERR_PTR(-EOPNOTSUPP) from the map_lookup_elem() methods of the
following map types:
- BPF_MAP_TYPE_PROG_ARRAY
- BPF_MAP_TYPE_STACK_TRACE
- BPF_MAP_TYPE_XSKMAP
- BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH

Change-Id: I13937c36055b419f4446d8bfa06f139c757480c9
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:21 +08:00
Song Liu
ab2aea5bf5 UPSTREAM: bpf/stackmap: Fix deadlock with rq_lock in bpf_get_stack()
[ Upstream commit eac9153f2b584c702cea02c1f1a57d85aa9aea42 ]

bpf stackmap with build-id lookup (BPF_F_STACK_BUILD_ID) can trigger A-A
deadlock on rq_lock():

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[...]
Call Trace:
 try_to_wake_up+0x1ad/0x590
 wake_up_q+0x54/0x80
 rwsem_wake+0x8a/0xb0
 bpf_get_stack+0x13c/0x150
 bpf_prog_fbdaf42eded9fe46_on_event+0x5e3/0x1000
 bpf_overflow_handler+0x60/0x100
 __perf_event_overflow+0x4f/0xf0
 perf_swevent_overflow+0x99/0xc0
 ___perf_sw_event+0xe7/0x120
 __schedule+0x47d/0x620
 schedule+0x29/0x90
 futex_wait_queue_me+0xb9/0x110
 futex_wait+0x139/0x230
 do_futex+0x2ac/0xa50
 __x64_sys_futex+0x13c/0x180
 do_syscall_64+0x42/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This can be reproduced by:
1. Start a multi-thread program that does parallel mmap() and malloc();
2. taskset the program to 2 CPUs;
3. Attach bpf program to trace_sched_switch and gather stackmap with
   build-id, e.g. with trace.py from bcc tools:
   trace.py -U -p <pid> -s <some-bin,some-lib> t:sched:sched_switch

A sample reproducer is attached at the end.

This could also trigger deadlock with other locks that are nested with
rq_lock.

Fix this by checking whether irqs are disabled. Since rq_lock and all
other nested locks are irq safe, it is safe to do up_read() when irqs are
not disabled. If irqs are disabled, postpone the up_read() in irq_work.

Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191014171223.357174-1-songliubraving@fb.com

Reproducer:
============================ 8< ============================

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define THREAD_COUNT 10 /* thread count not given in the original; pick one */

char *filename;

void *worker(void *p)
{
        void *ptr;
        int fd;
        char *pptr;

        fd = open(filename, O_RDONLY);
        if (fd < 0)
                return NULL;
        while (1) {
                struct timespec ts = {0, 1000 + rand() % 2000};

                ptr = mmap(NULL, 4096 * 64, PROT_READ, MAP_PRIVATE, fd, 0);
                usleep(1);
                if (ptr == MAP_FAILED) {
                        printf("failed to mmap\n");
                        break;
                }
                munmap(ptr, 4096 * 64);
                usleep(1);
                pptr = malloc(1);
                usleep(1);
                pptr[0] = 1;
                usleep(1);
                free(pptr);
                usleep(1);
                nanosleep(&ts, NULL);
        }
        close(fd);
        return NULL;
}

int main(int argc, char *argv[])
{
        int i;
        pthread_t threads[THREAD_COUNT];

        if (argc < 2)
                return 0;

        filename = argv[1];

        for (i = 0; i < THREAD_COUNT; i++) {
                if (pthread_create(threads + i, NULL, worker, NULL)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 0;
                }
        }

        for (i = 0; i < THREAD_COUNT; i++)
                pthread_join(threads[i], NULL);
        return 0;
}
============================ 8< ============================

Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02 22:14:11 +08:00
Alexei Starovoitov
136b0faa80 UPSTREAM: bpf: fix lockdep false positive in stackmap
[ Upstream commit 3defaf2f15b2bfd86c6664181ac009e91985f8ac ]

Lockdep warns about false positive:
[   11.211460] ------------[ cut here ]------------
[   11.211936] DEBUG_LOCKS_WARN_ON(depth <= 0)
[   11.211985] WARNING: CPU: 0 PID: 141 at ../kernel/locking/lockdep.c:3592 lock_release+0x1ad/0x280
[   11.213134] Modules linked in:
[   11.214954] RIP: 0010:lock_release+0x1ad/0x280
[   11.223508] Call Trace:
[   11.223705]  <IRQ>
[   11.223874]  ? __local_bh_enable+0x7a/0x80
[   11.224199]  up_read+0x1c/0xa0
[   11.224446]  do_up_read+0x12/0x20
[   11.224713]  irq_work_run_list+0x43/0x70
[   11.225030]  irq_work_run+0x26/0x50
[   11.225310]  smp_irq_work_interrupt+0x57/0x1f0
[   11.225662]  irq_work_interrupt+0xf/0x20

since rw_semaphore is released in a different task vs task that locked the sema.
It is expected behavior.
Fix the warning with up_read_non_owner() and rwsem_release() annotation.

Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02 22:14:07 +08:00
Stanislav Fomichev
2cb707d6ee UPSTREAM: bpf: zero out build_id for BPF_STACK_BUILD_ID_IP
[ Upstream commit 4af396ae4836c4ecab61e975b8e61270c551894d ]

When returning BPF_STACK_BUILD_ID_IP from stack_map_get_build_id_offset,
make sure that the build_id field is empty. Since we are using a percpu
free list, there is a possibility that we might reuse a previous
bpf_stack_build_id with a non-zero build_id.

Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02 22:14:06 +08:00
Stanislav Fomichev
2573790af4 UPSTREAM: bpf: don't assume build-id length is always 20 bytes
[ Upstream commit 0b698005a9d11c0e91141ec11a2c4918a129f703 ]

Build-id length is not fixed to 20, it can be (`man ld` /--build-id):
  * 128-bit (uuid)
  * 160-bit (sha1)
  * any length specified in ld --build-id=0xhexstring

To fix the issue of missing BPF_STACK_BUILD_ID_VALID for shorter build-ids,
assume that build-id is somewhere in the range of 1 .. 20.
Set the remaining bytes to zero.

v2:
* don't introduce new "len = min(BPF_BUILD_ID_SIZE, nhdr->n_descsz)",
  we already know that nhdr->n_descsz <= BPF_BUILD_ID_SIZE if we enter
  this 'if' condition

Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02 22:14:06 +08:00
Song Liu
db6f6dcbad UPSTREAM: bpf: fix panic in stack_map_get_build_id() on i386 and arm32
[ Upstream commit beaf3d1901f4ea46fbd5c9d857227d99751de469 ]

As Naresh reported, test_stacktrace_build_id() causes a panic on i386 and
arm32 systems. This is caused by page_address() returning NULL in certain
cases.

This patch fixes the error by using kmap_atomic/kunmap_atomic instead
of page_address.

Fixes: 615755a77b24 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02 22:14:06 +08:00
Daniel Borkmann
716ee03915 UPSTREAM: bpf: decouple btf from seq bpf fs dump and enable more maps
Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru maps. However, both
can be decoupled from each other such that regular BPF maps
can support attaching BTF key/value information,
while not all maps necessarily need to dump via the
map_seq_show_elem() callback.

The basic sanity check which is a prerequisite for all maps
is that key/value size has to match in any case, and some maps
can have extra checks via map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that we can also enable retrieving BTF info for per-cpu map
types and lpm.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
2025-10-02 22:14:02 +08:00
Arnd Bergmann
8ddcd6ae9b UPSTREAM: bpf: avoid -Wmaybe-uninitialized warning
The stack_map_get_build_id_offset() function is too long for gcc to track
whether 'work' may or may not be initialized at the end of it, leading
to a false-positive warning:

kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset':
kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this function [-Werror=maybe-uninitialized]

This removes the 'in_nmi_ctx' flag and uses the state of the 'work'
variable itself to see whether it got initialized.

Fixes: bae77c5eb5b2 ("bpf: enable stackmap with build_id in nmi context")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:13:23 +08:00
Song Liu
9d381cb5af BACKPORT: bpf: enable stackmap with build_id in nmi context
Currently, we cannot parse build_id in nmi context because of
up_read(&current->mm->mmap_sem); this makes stackmap with build_id
less useful. This patch enables parsing build_id in nmi by putting
the up_read() call in irq_work. To avoid memory allocation in nmi
context, we use a per-cpu variable for the irq_work. As a result, only
one irq_work per cpu is allowed. If the irq_work is in use, we
fall back to only reporting ips.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:13:19 +08:00
Yonghong Song
cf168d7e20 BACKPORT: bpf: add bpf_get_stack helper
Currently, stackmap and the bpf_get_stackid helper are provided
for bpf programs to get the stack trace. This approach has
a limitation though: if two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from the user's perspective.

This patch implements a new helper, bpf_get_stack, which sends
stack traces directly to the bpf program. The bpf program
is able to see all stack traces, and can then do in-kernel
processing or send stack traces to user space through a
shared map or bpf_perf_event_output.

Acked-by: Alexei Starovoitov <ast@fb.com>
Change-Id: I7dbdcba1a8ceda4c3626a07c436b33d9f35b3c0e
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:13:01 +08:00
Yonghong Song
71422b4ee7 UPSTREAM: bpf: change prototype for stack_map_get_build_id_offset
This patch does not change functionality. The function prototype
is changed so that the same function can be reused later.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:13:01 +08:00
Song Liu
1854029ed8 BACKPORT: bpf: extend stackmap to save binary_build_id+offset instead of address
Currently, the bpf stackmap stores an address for each entry in the call trace.
To map these addresses to user space files, it is necessary to maintain
the mapping from these virtual address to symbols in the binary. Usually,
the user space profiler (such as perf) has to scan /proc/pid/maps at the
beginning of profiling, and monitor mmap2() calls afterwards. Given the
cost of maintaining the address map, this solution is not practical for
system wide profiling that is always on.

This patch tries to solve this problem with a variation of stackmap. This
variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing
addresses, the variation stores ELF file build_id + offset.

Build ID is a 20-byte unique identifier for ELF files. The following
command shows the Build ID of /bin/bash:

  [user@]$ readelf -n /bin/bash
  ...
    Build ID: XXXXXXXXXX
  ...

With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID
for each entry in the call trace, and translate it into the following
struct:

  struct bpf_stack_build_id_offset {
          __s32           status;
          unsigned char   build_id[BPF_BUILD_ID_SIZE];
          union {
                  __u64   offset;
                  __u64   ip;
          };
  };

The search for the build_id is limited to the first page of the file, and this
page should be in the page cache. Otherwise, we fall back to storing the ip for
this entry (the ip field in struct bpf_stack_build_id_offset). This requires
the build_id to be stored in the first page. A quick survey of binary and
dynamic library files on a few different systems shows that almost all
binary and dynamic library files have the build_id in the first page.

Build_id is only meaningful for user stacks. If a kernel stack is added to
a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fall back to
storing only the ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if the
build_id lookup fails for some reason, it will also fall back to storing the ip.

User space can access struct bpf_stack_build_id_offset with the bpf
syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to
maintain a mapping from build ids to binary files. This mostly static
mapping is much easier to maintain than per-process address maps.

Note: Stackmap with build_id only works in non-nmi context at this time.
This is because we need to take mm->mmap_sem for find_vma(). If this
changes, we would like to allow build_id lookup in nmi context.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:12:51 +08:00
Jakub Kicinski
2aea5b0ff9 UPSTREAM: bpf: add helper for copying attrs to struct bpf_map
All map types reimplement the field-by-field copy of union bpf_attr
members into struct bpf_map.  Add a helper to perform this operation.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:12:46 +08:00
Yonghong Song
fd19f8c911 UPSTREAM: bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace map
Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not
supported for stacktrace map. However, there are use cases where
user space wants to enumerate all stacktrace map entries where
BPF_MAP_GET_NEXT_KEY command will be really helpful.
In addition, if user space wants to delete all map entries
in order to save memory and does not want to close the
map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve
performance if map entries are sparsely populated.

The implementation behaves similarly to the BPF_MAP_GET_NEXT_KEY
implementation in hashtab. If the user provides
a NULL key pointer or an invalid key, the first key is returned.
Otherwise, the first valid key after the input parameter "key"
is returned, or -ENOENT if no valid key can be found.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:12:45 +08:00
Tim Zimmermann
3996f04715 Squashed revert of BPF backports
Revert "Partially revert "fixup: add back code missed during BPF picking""

This reverts commit cc477455f73d317733850a9e4818dfd90be4d33d.

Revert "bpf: lpm_trie: check left child of last leftmost node for NULL"

This reverts commit e89007b7df49292c5ae52b3d165c0d815a61cd10.

Revert "BACKPORT: bpf: Fix out-of-bounds write in trie_get_next_key()"

This reverts commit a1c4f565bb00b05ab3734a64451c08b0b965ce42.

Revert "bpf: Fix exact match conditions in trie_get_next_key()"

This reverts commit 4356a64dad3d38372147457b3004930c6e2e9c51.

Revert "bpf: fix kernel page fault in lpm map trie_get_next_key"

This reverts commit df4649b5d6cb374edbb67e5a5ecbd102a2e6c897.

Revert "bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map"

This reverts commit fe6656a5d48df6144fe9929399c648957166edd0.

Revert "bpf: allow helpers to return PTR_TO_SOCK_COMMON"

This reverts commit b24d1ae9ccbf3ebe6f4baa50d2d48c03be02bc17.

Revert "bpf: implement lookup-free direct value access for maps"

This reverts commit de1959fcd3df0629380894d9c47ebb253c920ad1.

Revert "bpf: Add bpf_verifier_vlog() and bpf_verifier_log_needed()"

This reverts commit b777824607bd3eb8c9130f4639d97d15bcac9af5.

Revert "bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE"

This reverts commit 4cfef728c1eac6cce34f4fff1fbab3e66dc430d9.

Revert "bpf: always allocate at least 16 bytes for setsockopt hook"

This reverts commit 59817f83c964c753e93a75128ecaad4eeaa769fc.

Revert "bpf, sockmap: convert to generic sk_msg interface"

This reverts commit fe4ef742e22924b21749de333211941d0205501e.

Revert "bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb"

This reverts commit d17c8c2c2f623e087d6c297de50c173a006e6e55.

Revert "bpf: sockmap: fix typos"

This reverts commit 07e31378d7795371cdbccce06b4125b27ffce536.

Revert "sockmap: convert refcnt to an atomic refcnt"

This reverts commit c1fa11ec9da5dc0e8cae4334c550264cff77eef9.

Revert "bpf: sockmap, add hash map support"

This reverts commit 3f43379c38e329e9a7d4b5a1640670de37ba317b.

Revert "bpf: sockmap, refactor sockmap routines to work with hashmap"

This reverts commit 41a2b6e925db031978eb2484835f60908de884d7.

Revert "bpf: implement getsockopt and setsockopt hooks"

This reverts commit 9526fe6ff3e06939c12bb781e0dda01a8f3017ec.

Revert "bpf: Introduce bpf sk local storage"

This reverts commit ffedc38a46ddaca40de672fafe78c45fbfae9839.

Revert "bpf: introduce BPF_F_LOCK flag"

This reverts commit e7f5758fbcb1674e17c645837f7bff3b1febbad5.

Revert "bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types"

This reverts commit e29b4e3c2bdd3b5d0d34668836ae8e5115cb31af.

Revert "bpf/verifier: add ARG_PTR_TO_UNINIT_MAP_VALUE"

This reverts commit f25c66c27cd6a774fb73769d804f91e969dd5f7b.

Revert "bpf: allow map helpers access to map values directly"

This reverts commit 7af696635219d0c5cdf1a166bb7543cae9e50328.

Revert "bpf: add writable context for raw tracepoints"

This reverts commit a546d8f0433039cee0de6ce96d5d35c4033a7b98.

Revert "bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock"

This reverts commit 03093478c52e79c94791a04f8138d5c019119087.

Revert "bpf: Support socket lookup in CGROUP_SOCK_ADDR progs"

This reverts commit 8047013945361fbff0e449c8a212cb6fc93a5245.

Revert "bpf: Extend the sk_lookup() helper to XDP hookpoint."

This reverts commit 8315368983086e70ccc6f103d710903c63cca7df.

Revert "xdp: generic XDP handling of xdp_rxq_info"

This reverts commit 11d9514e6e6801941abf1c0485fd4ef53082d970.

Revert "xdp: move struct xdp_buff from filter.h to xdp.h"

This reverts commit a1795f54e4d99e02d5cb84a46fac0240cf29e206.

Revert "net: avoid including xdp.h in filter.h"

This reverts commit a39c59398f3ab64de44e5953ee0bd23c5136bb48.

Revert "xdp: base API for new XDP rx-queue info concept"

This reverts commit 49fb5bae77ab2041a2ad9f9f87ad7e0a6e215fdf.

Revert "net: Add asynchronous callbacks for xfrm on layer 2."

This reverts commit d0656f64d7719993d5634a9fc6600026e9a805ee.

Revert "xfrm: Separate ESP handling from segmentation for GRO packets."

This reverts commit c8afadf7f5ed8786652d307558345ef90ea91726.

Revert "net: move secpath_exist helper to sk_buff.h"

This reverts commit 0e5483057121dad47567b01845c656955e51989e.

Revert "sk_buff: add skb extension infrastructure"

This reverts commit 3a9ae74b075757495c4becf4dd1eec056d364801.

Revert "fixup: add back code missed during BPF picking"

This reverts commit 74ec8cef7051b5af72f2a6d83ca8c51c3c61c444.

Revert "bpf: undo prog rejection on read-only lock failure"

This reverts commit af2dc6e4993c4221603dbe6e81a3d0c8269f3171.

Revert "bpf: Add helper to retrieve socket in BPF"

This reverts commit 53495e3bc33cb46d9961ea122f576faded058aa1.

Revert "SQUASH! bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helpe"

This reverts commit 3b25fbf81c041af954d9f5ac1c7867eb07c40b07.

Revert "bpf: introduce bpf_spin_lock"

This reverts commit 0095fb54160e4f8b326fa8df103e334f90c5ab56.

Revert "bpf: enable cgroup local storage map pretty print with kind_flag"

This reverts commit 3fe92cb79b5eae557b113c37b03e78efee2280db.

Revert "bpf: btf: fix struct/union/fwd types with kind_flag"

This reverts commit 2bd4856277f459974dd6234a849cbe20fd475b8f.

Revert "bpf: add bpffs pretty print for cgroup local storage maps"

This reverts commit e07d8c8279f37cee8471846a63acc51f1ab7ce03.

Revert "bpf: pass struct btf pointer to the map_check_btf() callback"

This reverts commit 78a8140faf32710799c19495db28d71693c98030.

Revert "bpf: Define cgroup_bpf_enabled for CONFIG_CGROUP_BPF=n"

This reverts commit aada945d89950c67099e490af1c4c25eef7f31e6.

Revert "bpf: introduce per-cpu cgroup local storage"

This reverts commit d37432968663559f06c7fd7df44197a807fb84ca.

Revert "bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info"

This reverts commit 063c5a25e5f47e8b82b6c43a44ed7be851884abb.

Revert "bpf: fix a compilation error when CONFIG_BPF_SYSCALL is not defined"

This reverts commit bcf5bfaf50bb6f1f981d5c538f87e6da7aab78f2.

Revert "bpf: Create a new btf_name_by_offset() for non type name use case"

This reverts commit 52b4739d0bdd763e1b00feb50bef8a821f5c7570.

Revert "bpf: reject any prog that failed read-only lock"

This reverts commit 30d1bfec06a3bcaa773213113904580e3046a57a.

Revert "bpf: Add bpf_line_info support"

This reverts commit 50b094eeeb1ced32c62b3a10045bbf43126de760.

Revert "bpf: don't leave partial mangled prog in jit_subprogs error path"

This reverts commit a466f85be89f5daab4bd748f92915ea713d63934.

Revert "bpf: btf: support proper non-jit func info"

This reverts commit 492a556de94c502376ec3b0d5a724ec9fe9f6996.

Revert "bpf: Introduce bpf_func_info"

This reverts commit 39cade88686b0d9b7befc1f14e9d2c2cad19a769.

Revert "bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO"

This reverts commit 2010b6bacc271a48e74942506f3cf45268b6c264.

Revert "bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv"

This reverts commit a0ea14ac88a0f5529a635fc6e20277942fc6bb99.

Revert "bpf: Expose check_uarg_tail_zero()"

This reverts commit 1190aaae686534c2854838b3d642dac45d26b1f4.

Revert "bpf: Append prog->aux->name in bpf_get_prog_name()"

This reverts commit 8b82528df4a11a8501393c854978662fc218014e.

Revert "bpf: get JITed image lengths of functions via syscall"

This reverts commit 0722dbc626915fcb9acb952ebc1fcb0c4554cb07.

Revert "bpf: get kernel symbol addresses via syscall"

This reverts commit 6736ec7558dd262fef6669eec02a9797c7c4ecb7.

Revert "bpf: Add gpl_compatible flag to struct bpf_prog_info"

This reverts commit b60c7a51fd3692259c93413f3e87150078be1dac.

Revert "bpf: centre subprog information fields"

This reverts commit b5186fdf6f3e1bb38d7e4abfed5bf7dd6f85a6c3.

Revert "bpf: unify main prog and subprog"

This reverts commit e8e2ad5d9ae98bc7b85b99c0712a5dfbfc151a41.

Revert "bpf: fix maximum stack depth tracking logic"

This reverts commit 10c7127615dc2c00b724069a1620b2232d905113.

Revert "bpf, x64: fix memleak when not converging on calls"

This reverts commit 6bc867f718ef2656266f984b605151971026cc98.

Revert "bpf: decouple btf from seq bpf fs dump and enable more maps"

This reverts commit 3036e2c4384d3f43c695b88c8a1cf97b8337e3bd.

Revert "bpf: Add reference tracking to verifier"

This reverts commit 3a4900a188ac4de817dc6f114f01159d7bdd2f3e.

Revert "bpf: properly enforce index mask to prevent out-of-bounds speculation"

This reverts commit ef85925d5c07b46f7447487605da601fc7be026e.

Revert "bpf, verifier: detect misconfigured mem, size argument pair"

This reverts commit c3853ee3cb96833e907f18bf90e78040fe4cf06f.

Revert "bpf: introduce ARG_PTR_TO_MEM_OR_NULL"

This reverts commit 58560e13f545f2a079bbce17ac1b731d8b94fec7.

Revert "bpf: Macrofy stack state copy"

This reverts commit 88d98d8c2ae320ab248150eb86e1c89427e5017c.

Revert "bpf: Generalize ptr_or_null regs check"

This reverts commit d2cbc2e57b8624699a1548e67b7b3ce992b396fc.

Revert "bpf: Add iterator for spilled registers"

This reverts commit d956e1ba51a7e5ce86bb35002e26d4c1e0a2497c.

Revert "bpf/verifier: refine retval R0 state for bpf_get_stack helper"

This reverts commit ceaf6d678ccb60da107b0455da64c7bf90c5102d.

Revert "bpf: Remove struct bpf_verifier_env argument from print_bpf_insn"

This reverts commit 058fd54c07a289f9b506f2d2326434e411fa65fe.

Revert "bpf: annotate bpf_insn_print_t with __printf"

This reverts commit 9b07d2ccf07855d62446e274d817672713f15be4.

Revert "bpf: allow for correlation of maps and helpers in dump"

This reverts commit af690c2e2d177352f7270f77d8a6bc9e9f60c98c.

Revert "bpf: Add bpf_patch_call_args prototype to include/linux/bpf.h"

This reverts commit 8a2c588b3ab98916147fe4a449312ce8db70c471.

Revert "bpf: x64: add JIT support for multi-function programs"

This reverts commit 752f261e545f80942272c6becf82def1729f84be.

Revert "bpf: fix net.core.bpf_jit_enable race"

This reverts commit 4720901114c20204aa3ffa2076265d2c8cc9e81b.

Revert "bpf: add support for bpf_call to interpreter"

This reverts commit c79b2e547adc8e50dabc72244370cfd37ac6a6bd.

Revert "bpf: introduce function calls (verification)"

This reverts commit f779fda96c7d9e921525f48d67fa2e9c68b4bd48.

Revert "bpf: cleanup register_is_null()"

This reverts commit 1c81f751670b4feb3102e4de136e25fa24e303fe.

Revert "bpf: print liveness info to verifier log"

This reverts commit fdc851301b33b9d646bd1d37124cbd45cedcd62b.

Revert "bpf: also improve pattern matches for meta access"

This reverts commit 9aa150d07927b911f26e0db2af0efd6aa07b8707.

Revert "bpf: add meta pointer for direct access"

This reverts commit 94f3f502ef9ef150ed687113cfbd38e91b5edc44.

Revert "bpf: rename bpf_compute_data_end into bpf_compute_data_pointers"

This reverts commit 9573c6feb301346cd1493eea4e363c6d8345e899.

Revert "bpf: squash of log related commits"

This reverts commit b08f2111e030a72a92eec4ebd6201165d03a20b8.

Revert "bpf: move instruction printing into a separate file"

This reverts commit 8fcbd39afb58847914f3f84d9c076000e09d2fb9.

Revert "bpf: btf: Introduce BTF ID"

This reverts commit 423c40d67dfc783c3b0cb227d9da53e725e0f35c.

Revert "bpf: btf: Add pretty print support to the basic arraymap"

This reverts commit 6cd4d5bba662ca0d8980e5806ef37e0341eab929.

Revert "nsfs: clean-up ns_get_path() signature to return int"

This reverts commit ec1ce41701f411c5dee396cec2931fb651f447cc.

Revert "bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()"

This reverts commit 8fbcb4ebf5a751f4685cdd2757cff2264032a5d9.

Revert "bpf: offload: report device information about offloaded maps"

This reverts commit 1105e63f25a9db675671288b583a5ce2c7d10b1f.

Revert "bpf: offload: add map offload infrastructure"

This reverts commit 20cdf9df3d5bd010d799ea3c80219f625c998307.

Revert "bpf: add map_alloc_check callback"

This reverts commit 6feb4121ea083053ac9587ac426195efe9fb143d.

Revert "bpf: offload: factor out netdev checking at allocation time"

This reverts commit 1425fb5676b8fe9d761f2f6545e4be8880ce0ac8.

Revert "bpf: rename bpf_dev_offload -> bpf_prog_offload"

This reverts commit a03ae0ec508200433fd6c35b87e342df4de0b320.

Revert "bpf: offload: allow netdev to disappear while verifier is running"

This reverts commit f6cf7214fd1ff3a018009ba90c33eac1d8de21de.

Revert "bpf: offload: free program id when device disappears"

This reverts commit b12b5e56b799cfe900ab8f0ee4177c6c08a904c6.

Revert "bpf: offload: report device information for offloaded programs"

This reverts commit c73c9a0ffa332eeb49927a48780f5537597e2d42.

Revert "bpf: offload: don't require rtnl for dev list manipulation"

This reverts commit 1993f08662f07581a370899a2da209ba0c996dbb.

Revert "bpf: offload: ignore namespace moves"

This reverts commit 9fefb21d8aa2691019f9c4f0b8025fb45ba60b49.

Revert "bpf: Add PTR_TO_SOCKET verifier type"

This reverts commit 55fdbc844801cd4007237fa6c5842b46985a5c9a.

Revert "bpf: extend cgroup bpf core to allow multiple cgroup storage types"

This reverts commit a6d82e371ef32fb24d493cff32765b4607581dd4.

Revert "bpf: permit CGROUP_DEVICE programs accessing helper bpf_get_current_cgroup_id()"

This reverts commit 1bfd0a07a8317004a89d6de736e24861db8281b5.

Revert "bpf: implement bpf_get_current_cgroup_id() helper"

This reverts commit 23603ed6d7df86392701a7ea7d9a1dba66f28d4b.

Revert "bpf: introduce the bpf_get_local_storage() helper function"

This reverts commit 3d777256b1c9f34975c5230d836023ea3e0d4cfd.

Revert "bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE"

This reverts commit 93c12733dc97984f7bf57a77160eacc480bfc3de.

Revert "bpf: extend bpf_prog_array to store pointers to the cgroup storage"

This reverts commit b26baff1fb34607938c9ac0e421e3f4b5fedad4d.

Revert "BACKPORT: bpf: allocate cgroup storage entries on attaching bpf programs"

This reverts commit 804605c21a3be3277c0031504dcd3fdd1be64290.

Revert "bpf: include errno.h from bpf-cgroup.h"

This reverts commit 6b4df332b357e9a5942ca4c6f985cd33dfc30e25.

Revert "bpf: pass a pointer to a cgroup storage using pcpu variable"

This reverts commit c8af92dc9fc00e49f06f6997969284ef5e5c5af5.

Revert "bpf: introduce cgroup storage maps"

This reverts commit c61c2271cb8a1e47678bddc8cdfae83035a07fec.

Revert "bpf: add ability to charge bpf maps memory dynamically"

This reverts commit 3a430745e9f675b450477fffead5568046432f29.

Revert "bpf: add helper for copying attrs to struct bpf_map"

This reverts commit 6d7be0ae93371692e564c00003ce184cbaefbb8d.

Revert "bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP"

This reverts commit 15f584d2d3d4814cfbd3059ab810db02af8773a0.

Revert "bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog"

This reverts commit fc9bf5e48985f7c3a39bf34a27477a2607a5dc6d.

Revert "bpf: set maximum number of attached progs to 64 for a single perf tp"

This reverts commit 0d5fc9795d824fbca21b81c8d91748ba21313d4c.

Revert "bpf: avoid rcu_dereference inside bpf_event_mutex lock region"

This reverts commit 948e200e3173dd959de907e326f2a2c90eda4b28.

Revert "bpf: fix bpf_prog_array_copy_to_user() issues"

This reverts commit 66811698b8de9b3cf13c09730d287b6d1d5d3699.

Revert "bpf: fix pointer offsets in context for 32 bit"

This reverts commit 99661813c136c52e56b328a2a8ecd2bc0e187eba.

Revert "BACKPORT: bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data"

This reverts commit 36f0ea00dd121b13f80617e5b2eb93ba160df85a.

Revert "BACKPORT: bpf: Sysctl hook"

This reverts commit 4a543990e03b5de4a2c23777abd0f77afd61cc2d.

Revert "BACKPORT: flow_dissector: implements flow dissector BPF hook"

This reverts commit de610a8a4324170a0deaf12e2e64c2ff068785fb.

Revert "BACKPORT: bpf: Add base proto function for cgroup-bpf programs"

This reverts commit f3ac0a6cbec3472ff2e3808a436891881f3cbf87.

Revert "FROMLIST: [net-next,v2,1/2] bpf: Allow CGROUP_SKB eBPF program to access sk_buff"

This reverts commit 6d4dcc0e3de628003d91075e4b1ab1a128b8892e.

Revert "BACKPORT: bpf: introduce BPF_RAW_TRACEPOINT"

This reverts commit b2a5c6b4958c8250e58ddb6c334018a5f7ee5437.

Revert "bpf/tracing: fix kernel/events/core.c compilation error"

This reverts commit 70249d4eb7359e9dc59e044951beb99d0d8725cd.

Revert "BACKPORT: bpf/tracing: allow user space to query prog array on the same tp"

This reverts commit 08a6d8c01372940bfec78fdc6cb8a47e08c745b0.

Revert "bpf: sockmap, add sock close() hook to remove socks"

This reverts commit e6b363b8d09d9740dff309fb4dc88e7a1e90726b.

Revert "BACKPORT: bpf: remove the verifier ops from program structure"

This reverts commit 94c2f61efa741bf6a97415f42cfbfb9ec83dfd8e.

Revert "bpf, cgroup: implement eBPF-based device controller for cgroup v2"

This reverts commit 22faa9c56550a34488e607ca3aca59c68b1f7938.

Revert "BACKPORT: bpf: split verifier and program ops"

This reverts commit d2b1388504c1129d5756bb9b20af9bd64e75d015.

Revert "bpf: btf: Break up btf_type_is_void()"

This reverts commit 052989c47b68feaf381d371ec1e6a169edc26d30.

Revert "bpf: btf: refactor btf_int_bits_seq_show()"

This reverts commit 8cc3fb30656cfab91205194a8ee7661bdd95e005.

Revert "BACKPORT: bpf: fix unconnected udp hooks"

This reverts commit b108e725aa70e39cfd37296d1a1d31e8896fa7b7.

Revert "BACKPORT: bpf: enforce return code for cgroup-bpf programs"

This reverts commit 10215080915bfbdaa9f666a95ffda02cc1ef7a29.

Revert "bpf: Hooks for sys_sendmsg"

This reverts commit cd847db1be8a37e0e7e9c813b5d8f93697dc5af0.

Revert "BACKPORT: devmap: Allow map lookups from eBPF"

This reverts commit 37da95fde647e8967b362e0769136bfbebc03628.

Revert "BACKPORT: xdp: Add devmap_hash map type for looking up devices by hashed index"

This reverts commit ae6a87f44c4ef20ac290ce68c4d5b542cf46f3d7.

Revert "kernel: bpf: devmap: Create __dev_map_alloc_node"

This reverts commit 15928a97ed93cf9f606a21bf869ff421b997a2c5.

Revert "BACKPORT: bpf: Post-hooks for sys_bind"

This reverts commit c221d44e76c3ab69285c9986680e5eb726cf157b.

Revert "BACKPORT: bpf: Hooks for sys_connect"

This reverts commit 003311ea43163c77e4e0c1921b81438286925baa.

Revert "BACKPORT: net: Introduce __inet_bind() and __inet6_bind"

This reverts commit 74f1eb60012c13bd606e4dc718e63aec7f8cce8f.

Revert "BACKPORT: bpf: Hooks for sys_bind"

This reverts commit cef0bd97f2fec8363c3ef58b2cb508deaa9bc5b2.

Revert "BACKPORT: bpf: introduce BPF_PROG_QUERY command"

This reverts commit a4ef81ce48cb25843ddb4d4331dacf2742215909.

Revert "BACKPORT: bpf: Check attach type at prog load time"

This reverts commit 750a3f976c75797e572a6dfdd2e8865b8b49964a.

Revert "bpf: offload: rename the ifindex field"

This reverts commit 921e6becfb28fbe505603bf927f195d1d72a0eea.

Revert "BACKPORT: bpf: offload: add infrastructure for loading programs for a specific netdev"

This reverts commit cb1607a58d026a4ac1d9e71f6c3cd1dc23820e2f.

Revert "BACKPORT: net: bpf: rename ndo_xdp to ndo_bpf"

This reverts commit 932d47ebc5910bb1ec954002206b1ce8749a9cd6.

Revert "bpf: btf: fix truncated last_member_type_id in btf_struct_resolve"

This reverts commit e7af669fe00a8e2030913088836189a9f65a04d8.

Revert "bpf/btf: Fix BTF verification of enum members in struct/union"

This reverts commit a098516b98fe35e8f0e89709443fff8b37eb04b8.

Revert "bpf: fix BTF limits"

This reverts commit 794ad07fab9540989f96351c11b039e2229c2a8e.

Revert "bpf, btf: fix a missing check bug in btf_parse"

This reverts commit 27c4178ecc8edbb2306fa479f275ffd35f5b57c9.

Revert "bpf: btf: Fix a missing check bug"

This reverts commit 71f5a7d140aa5a37d164e217b2fefcb2d409b894.

Revert "bpf: btf: Fix end boundary calculation for type section"

This reverts commit 549615befd671b6877677acb009b66cd374408d3.

Revert "bpf: fix bpf_skb_load_bytes_relative pkt length check"

This reverts commit 5f3d68c4da18dfbcde4c02cb34c63599709fcf3c.

Revert "bpf: btf: Ensure the member->offset is in the right order"

This reverts commit 4f9d26cbc747a4728c4944b7dc9725fc2737f892.

Revert "bpf: btf: Clean up BTF_INT_BITS() in uapi btf.h"

This reverts commit 480c6f80a14431f6d680a687363dcb0d9cd1d7a8.

Revert "bpf: btf: Fix bitfield extraction for big endian"

This reverts commit 0463c259aa21e99d1bf798c8cf54da18b5906938.

Revert "bpf: btf: Ensure t->type == 0 for BTF_KIND_FWD"

This reverts commit ecc54be6970a3484eb163ac09996856c9ece5727.

Revert "bpf: btf: Check array t->size"

This reverts commit 3cda848b9be9fbb6dfa8912a425801c263bcbff7.

Revert "bpf: btf: avoid -Wreturn-type warning"

This reverts commit fd7fede5952004dcacb39f318249c4cf8e5c51e0.

Revert "bpf: btf: Avoid variable length array"

This reverts commit 2826641eb171c705d0b2db86d8834eff33945d0e.

Revert "bpf: btf: Remove unused bits from uapi/linux/btf.h"

This reverts commit 2d9e7a574f7e47a027974ec616ac812ad6a2d086.

Revert "bpf: btf: Check array->index_type"

This reverts commit f9ee68f7e8a471450536a70b43bd96d4bdfbfb81.

Revert "bpf: btf: Change how section is supported in btf_header"

This reverts commit 63a4474da4bf56c8a700d542bcf3a57a4b737ed6.

Revert "bpf: Fix compiler warning on info.map_ids for 32bit platform"

This reverts commit a4f706ea7d2b874ef739168a12a30ae5454487a6.

Revert "BACKPORT: bpf: Use char in prog and map name"

This reverts commit 8d4ad88eabb5d1500814c5f5b76a11f80346669c.

Revert "bpf: Change bpf_obj_name_cpy() to better ensure map's name is init by 0"

This reverts commit c4acfd3c9f5a97123c240676750f3e4ae2a2c24c.

Revert "BACKPORT: bpf: Add map_name to bpf_map_info"

This reverts commit 0e03a4e584eabe3f4c448f06f271753cdaae3aab.

Revert "BACKPORT: bpf: Add name, load_time, uid and map_ids to bpf_prog_info"

This reverts commit 16872f60e6c1fc6b10e905ff18c14d8aaeb4e09d.

Revert "bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=y"

This reverts commit 0b618ec6e162e650aaa583a31f4de4c4558148bf.

Revert "BACKPORT: bpf: btf: Clean up btf.h in uapi"

This reverts commit ea0c0ad08c18ddf62dbb6c8edc814c75cbb3e8b9.

Revert "bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd"

This reverts commit f51fe1d1edb742176c622bc93301e98a1cbf2e63.

Revert "BACKPORT: bpf: btf: Add BPF_BTF_LOAD command"

This reverts commit 85db8f764069f15d1b181bea67336ce4d66a58c1.

Revert "bpf: btf: Add pretty print capability for data with BTF type info"

This reverts commit 0a8aae433c53b1f441cab70979517660fb6a6038.

Revert "bpf: btf: Check members of struct/union"

This reverts commit ce2e8103ac1a977ce32db51ec042faea6f100a3d.

Revert "bpf: btf: Validate type reference"

This reverts commit a1aa96e6dae2b4c8c0b0a4dedab3006d3f697460.

Revert "bpf: Update logging functions to work with BTF"

This reverts commit b9289460f0a6b5c261ec0b6dcafa6fcd09d4957e.

Revert "BACKPORT: bpf: btf: Introduce BPF Type Format (BTF)"

This reverts commit ceebd58f6470e8ec6d9d694ab382fe88f43b998b.

Revert "BACKPORT: bpf: Rename bpf_verifer_log"

This reverts commit 50bdc7513d966811fb418d24a0e5797ffd8c907c.

Revert "BACKPORT: bpf: encapsulate verifier log state into a structure"

This reverts commit 0bcb397bde4675fdeb977d9debed20ed213f9ecd.

Change-Id: Iecaa276b078c6d2db773a8071e7da9e6195277d6
2025-10-02 22:12:00 +08:00
Toke Høiland-Jørgensen
d7e805ca07 bpf: Fix stackmap overflow check on 32-bit arches
[ Upstream commit 7a4b21250bf79eef26543d35bd390448646c536b ]

The stackmap code relies on roundup_pow_of_two() to compute the number
of hash buckets, and contains an overflow check by checking if the
resulting value is 0. However, on 32-bit arches, the roundup code itself
can overflow by doing a 32-bit left-shift of an unsigned long value,
which is undefined behaviour, so it is not guaranteed to truncate
neatly. This was triggered by syzbot on the DEVMAP_HASH type, which
contains the same check, copied from the hashtab code.

The commit in the fixes tag actually attempted to fix this, but the fix
did not account for the UB, so the fix only works on CPUs where an
overflow does result in a neat truncation to zero, which is not
guaranteed. Checking the value before rounding does not have this
problem.
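The check-before-rounding pattern can be sketched in userspace C. Everything here is illustrative, not the kernel code: `roundup_pow_of_two64()` models the rounding in 64 bits, and `check_and_round()` rejects oversized entry counts *before* rounding, so the shift width can never reach the word size on a 32-bit machine.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit rounding helper; wide enough that the loop never needs a
 * full-word shift for any 32-bit input. Illustrative name. */
static uint64_t roundup_pow_of_two64(uint64_t n)
{
    uint64_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Sketch of the fixed pattern: validate max_entries before rounding.
 * Anything above 1U << 31 cannot be rounded up within 32 bits, so
 * reject it up front instead of checking the (possibly UB) result. */
static int check_and_round(uint32_t max_entries, uint32_t *n_buckets)
{
    if (max_entries > (1U << 31))
        return -1;              /* would overflow a 32-bit word */
    *n_buckets = (uint32_t)roundup_pow_of_two64(max_entries);
    return 0;
}
```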

Fixes: 6183f4d3a0a2 ("bpf: Check for integer overflow when using roundup_pow_of_two()")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Bui Quang Minh <minhquangbui99@gmail.com>
Message-ID: <20240307120340.99577-4-toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit d0e214acc59145ce25113f617311aa79dda39cb3)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
2025-09-17 16:58:08 +08:00
Daniel Borkmann
ebe70c600b bpf: decouple btf from seq bpf fs dump and enable more maps
Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru map. However, both
can be decoupled from each other such that regular BPF maps
can be supported for attaching BTF key/value information,
while not all maps necessarily need to dump via map_seq_show_elem()
callback.

The basic sanity check which is a prerequisite for all maps
is that key/value size has to match in any case, and some maps
can have extra checks via map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that we can also enable retrieving BTF info for per-cpu map
types and lpm.
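The basic sanity check described above can be sketched as a pure function. Names and the return convention are illustrative (the kernel returns -EINVAL and takes btf/map objects, not raw sizes):

```c
#include <assert.h>

/* Minimal sketch of the prerequisite for all maps: the BTF key and
 * value sizes must match the map's own key/value sizes. Map types
 * with extra constraints would layer a map_check_btf()-style
 * callback on top of this. */
static int map_check_btf_sizes(unsigned int map_key_size,
                               unsigned int map_value_size,
                               unsigned int btf_key_size,
                               unsigned int btf_value_size)
{
    if (map_key_size != btf_key_size || map_value_size != btf_value_size)
        return -1;              /* -EINVAL in the kernel */
    return 0;
}
```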

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
2025-09-17 16:57:58 +08:00
Jakub Kicinski
2b54e342d1 bpf: add helper for copying attrs to struct bpf_map
All map types reimplement the field-by-field copy of union bpf_attr
members into struct bpf_map.  Add a helper to perform this operation.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-09-17 16:57:52 +08:00
Tatsuhiko Yasumatsu
cb63e314f2 bpf: Fix integer overflow in prealloc_elems_and_freelist()
[ Upstream commit 30e29a9a2bc6a4888335a6ede968b75cd329657a ]

In prealloc_elems_and_freelist(), the multiplication to calculate the
size passed to bpf_map_area_alloc() could lead to an integer overflow.
As a result, out-of-bounds write could occur in pcpu_freelist_populate()
as reported by KASAN:

[...]
[   16.968613] BUG: KASAN: slab-out-of-bounds in pcpu_freelist_populate+0xd9/0x100
[   16.969408] Write of size 8 at addr ffff888104fc6ea0 by task crash/78
[   16.970038]
[   16.970195] CPU: 0 PID: 78 Comm: crash Not tainted 5.15.0-rc2+ #1
[   16.970878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[   16.972026] Call Trace:
[   16.972306]  dump_stack_lvl+0x34/0x44
[   16.972687]  print_address_description.constprop.0+0x21/0x140
[   16.973297]  ? pcpu_freelist_populate+0xd9/0x100
[   16.973777]  ? pcpu_freelist_populate+0xd9/0x100
[   16.974257]  kasan_report.cold+0x7f/0x11b
[   16.974681]  ? pcpu_freelist_populate+0xd9/0x100
[   16.975190]  pcpu_freelist_populate+0xd9/0x100
[   16.975669]  stack_map_alloc+0x209/0x2a0
[   16.976106]  __sys_bpf+0xd83/0x2ce0
[...]

The possibility of this overflow was originally discussed in [0], but
was overlooked.

Fix the integer overflow by changing elem_size to u64 from u32.

  [0] https://lore.kernel.org/bpf/728b238e-a481-eb50-98e9-b0f430ab01e7@gmail.com/
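The wrap can be reproduced in a userspace sketch (function names are illustrative, not the kernel's): with a u32 elem_size, the product `elem_size * num_elems` is computed modulo 2^32 before it ever reaches the allocator, while widening elem_size to u64 keeps the product exact.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-fix shape: both operands are 32-bit, so the product wraps. */
static uint32_t alloc_size_u32(uint32_t elem_size, uint32_t num)
{
    return elem_size * num;     /* mod-2^32 arithmetic: may wrap to a
                                 * small value, causing an undersized
                                 * allocation and OOB writes later */
}

/* Post-fix shape: elem_size is u64, so the multiplication is done in
 * 64 bits and the full product survives. */
static uint64_t alloc_size_u64(uint64_t elem_size, uint32_t num)
{
    return elem_size * num;
}
```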

Fixes: 557c0c6e7d ("bpf: convert stackmap to pre-allocation")
Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210930135545.173698-1-th.yasumatsu@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Change-Id: Id80935bb0cb5f690974e9b26e248fd2b480a742e
2024-07-31 15:15:56 +02:00
Bui Quang Minh
dba2a24b6a bpf: Check for integer overflow when using roundup_pow_of_two()
[ Upstream commit 6183f4d3a0a2ad230511987c6c362ca43ec0055f ]

On 32-bit architectures, roundup_pow_of_two() can return 0 when the argument
has its uppermost bit set, due to the resulting 1UL << 32. Add a check for this case.
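A userspace model of the failure mode, assuming wrapping truncation: `roundup32_wrapping()` below imitates a 32-bit `roundup_pow_of_two()` by computing `1 << fls(n - 1)` and truncating the shift result the way a wrapping machine would. (The real 32-bit shift by 32 is undefined behaviour, which is exactly why the later fix validates the value before rounding rather than after.)

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of 32-bit roundup_pow_of_two(): compute the
 * bit position of n - 1, then shift. The shift is done in 64 bits
 * and truncated, simulating a machine that wraps rather than the
 * undefined behaviour a real 32-bit shift-by-32 would be. */
static uint32_t roundup32_wrapping(uint32_t n)
{
    unsigned int shift = 0;
    uint32_t m = n - 1;

    while (m) {
        m >>= 1;
        shift++;
    }
    /* when n has its top bit set, shift == 32 and the result
     * truncates to 0 -- the case the check has to catch */
    return (uint32_t)((uint64_t)1 << shift);
}
```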

Fixes: d5a3b1f691 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210127063653.3576-1-minhquangbui99@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Change-Id: Ia803d064bfcffce8a767590cd4de66be000abaab
2024-07-31 15:15:55 +02:00
Chenbo Feng
cace572e16 BACKPORT: bpf: Add file mode configuration into bpf maps
Introduce the map read/write flags to the eBPF syscalls that return the
map fd. The flags are used to set up the file mode when constructing a new
file descriptor for bpf maps. To not break backward compatibility, the
f_flags is set to O_RDWR if the flag passed by syscall is 0. Otherwise
it should be O_RDONLY or O_WRONLY. When userspace wants to modify or
read the map content, the kernel checks the file mode to see if the
operation is allowed.
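The flag-to-file-mode mapping can be sketched as follows. The MAP_F_* values here are illustrative stand-ins (they mirror BPF_F_RDONLY/BPF_F_WRONLY in the uapi header, but the exact values and helper name are assumptions, not the kernel's):

```c
#include <assert.h>
#include <fcntl.h>

#define MAP_F_RDONLY (1U << 3)   /* illustrative, mirrors BPF_F_RDONLY */
#define MAP_F_WRONLY (1U << 4)   /* illustrative, mirrors BPF_F_WRONLY */

/* Sketch of the described rule: zero flags keep the old O_RDWR
 * behaviour for backward compatibility; otherwise the caller asks
 * for read-only or write-only access explicitly. */
static int map_flags_to_fmode(unsigned int flags)
{
    if (flags == 0)
        return O_RDWR;           /* backward-compatible default */
    if (flags & MAP_F_RDONLY)
        return O_RDONLY;
    if (flags & MAP_F_WRONLY)
        return O_WRONLY;
    return -1;                   /* invalid combination */
}
```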

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Bug: 30950746
Change-Id: Icfad20f1abb77f91068d244fb0d87fa40824dd1b

(cherry picked from commit 6e71b04a82248ccf13a94b85cbc674a9fefe53f5)
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2017-12-18 21:11:22 +05:30
Martin KaFai Lau
96eabe7a40 bpf: Allow selecting numa node during map creation
The current map creation API does not allow providing a numa-node
preference.  The memory usually comes from the node where the map-creation
process is running.  The performance is not ideal if the bpf_prog is known to
always run in a numa node different from that of the map-creation process.

One of the use case is sharding on CPU to different LRU maps (i.e.
an array of LRU maps).  Here is the test result of map_perf_test on
the INNER_LRU_HASH_PREALLOC test if we force the lru map used by
CPU0 to be allocated from a remote numa node:

[ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]

># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<

After specifying numa node:
># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<

This patch adds one field, numa_node, to the bpf_attr.  Since numa node 0
is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The numa_node
field is honored if and only if the BPF_F_NUMA_NODE flag is set.

Numa node selection is not supported for percpu map.

This patch does not change every kmalloc call. For example,
'htab = kzalloc()' is left unchanged since the object
is small enough to stay in the cache.
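The honoured-iff-flag-set rule can be sketched in a few lines. F_NUMA_NODE is an illustrative stand-in for BPF_F_NUMA_NODE; the helper name is an assumption. The point is that numa node 0 is a valid node, so a zero field cannot double as "no preference" — hence the explicit flag:

```c
#include <assert.h>

#define F_NUMA_NODE (1U << 2)   /* illustrative, mirrors BPF_F_NUMA_NODE */
#define NUMA_NO_NODE (-1)

/* Sketch: the numa_node field is honoured if and only if the flag
 * is set; without the flag, fall back to "no preference". */
static int effective_numa_node(unsigned int flags, int numa_node)
{
    return (flags & F_NUMA_NODE) ? numa_node : NUMA_NO_NODE;
}
```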

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19 21:35:43 -07:00
Daniel Borkmann
a316338cb7 bpf: fix wrong exposure of map_flags into fdinfo for lpm
trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
attr->map_flags, since it does not support preallocation yet. We
check the flag, but we never copy the flag into trie->map.map_flags,
which is later on exposed into fdinfo and used by loaders such as
iproute2. The latter uses this in bpf_map_selfcheck_pinned() to test
whether a pinned map has the same spec as the one from the BPF obj
file and, if not, bails out, which is currently the case for lpm
since it always exposes 0 as flags.

Also copy over flags in array_map_alloc() and stack_map_alloc().
They always have to be 0 right now, but we should make sure not to
miss copying them over at a later point in time when we add actual
flags for them to use.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reported-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 13:44:28 -04:00
Johannes Berg
40077e0cf6 bpf: remove struct bpf_map_type_list
There's no need to have struct bpf_map_type_list since
it just contains a list_head, the type, and the ops
pointer. Since the types are densely packed and not
actually dynamically registered, it's much easier and
smaller to have an array of type->ops pointers. Also
initialize this array statically to remove code needed
to initialize it.

In order to save duplicating the list, move it to the
types header file added by the previous patch and
include it in the same fashion.
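The list-to-array change can be sketched like this. All names here are illustrative (the kernel indexes by the BPF map type enum and the ops structs are far richer); the shape of the change is what matters: a statically initialised, densely indexed array replaces registration onto a linked list.

```c
#include <assert.h>
#include <stddef.h>

enum map_type { MAP_ARRAY, MAP_HASH, MAP_TYPE_MAX };

struct map_ops { const char *name; };

static const struct map_ops array_ops = { "array" };
static const struct map_ops hash_ops  = { "hash" };

/* Statically initialised type -> ops table: no registration code,
 * no list walk at lookup time. */
static const struct map_ops *map_ops_by_type[MAP_TYPE_MAX] = {
    [MAP_ARRAY] = &array_ops,
    [MAP_HASH]  = &hash_ops,
};

static const struct map_ops *find_ops(unsigned int type)
{
    return type < MAP_TYPE_MAX ? map_ops_by_type[type] : NULL;
}
```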

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 14:38:43 -04:00
Daniel Borkmann
c78f8bdfa1 bpf: mark all registered map/prog types as __ro_after_init
All map types and prog types are registered to the BPF core through
bpf_register_map_type() and bpf_register_prog_type() during init and
remain unchanged thereafter. As by design we don't (and never will)
have any pluggable code that can register to that at any later point
in time, let's mark all the existing bpf_{map,prog}_type_list objects
in the tree as __ro_after_init, so they can be moved to read-only
section from then onwards.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-17 13:40:04 -05:00
Daniel Borkmann
d407bd25a2 bpf: don't trigger OOM killer under pressure with map alloc
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.

Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.

Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.
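A userspace model of the two-step policy, with malloc() standing in for both backends and an illustrative size threshold (the kernel uses kmalloc with GFP_USER | __GFP_NOWARN | __GFP_NORETRY first and vmalloc() as the fallback; names here are assumptions):

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL
#define COSTLY_THRESHOLD (PAGE_SIZE << 3)   /* illustrative cutoff */

/* Sketch: small requests try the slab-style path first with
 * no-retry/no-warn semantics; large requests (or slab failures)
 * go straight to the vmalloc-style fallback. */
static void *map_area_alloc(size_t size, int *used_fallback)
{
    *used_fallback = 0;
    if (size <= COSTLY_THRESHOLD) {
        void *area = malloc(size);   /* kmalloc() stand-in */
        if (area)
            return area;
    }
    *used_fallback = 1;
    return malloc(size);             /* vmalloc() stand-in */
}
```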

Note, replacing the one kmalloc_array() is fine as overflow checks happen
earlier in htab_map_alloc(), since it must also protect the multiplication
for vmalloc() should kmalloc_array() fail.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-18 17:12:26 -05:00
Daniel Borkmann
f3694e0012 bpf: add BPF_CALL_x macros for declaring helpers
This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
to use them, in a similar fashion to how we do it with SYSCALL_DEFINE<n>() macros
that are used today. Motivation for this is to hide all the register handling
and all necessary casts from the user, so that it is done automatically in the
background when adding a BPF_CALL_<n>() call.

This makes current helpers easier to review, eases to write future helpers,
avoids getting the casting mess wrong, and allows for extending all helpers at
once (f.e. build time checks, etc). It also helps to detect more easily in
code reviews that unused registers are not instrumented in the code by accident,
breaking compatibility with existing programs.

BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
fundamental differences, for example, for generating the actual helper function
that carries all u64 regs, we need to fill unused regs, so that we always end up
with 5 u64 regs as an argument.
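A much-simplified userspace rendition of the idea (the real kernel macros handle 0 through 5 arguments and more bookkeeping; this two-argument version is a sketch): the macro generates a wrapper taking five u64 registers and performs the casts to the typed arguments, so the helper author never writes them.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Simplified BPF_CALL_2-style macro: declares the typed helper body,
 * emits a five-u64-register wrapper that casts down to the typed
 * arguments, then lets the author define the body naturally. */
#define BPF_CALL_2(name, t1, a1, t2, a2)                          \
    static u64 ____##name(t1 a1, t2 a2);                          \
    static u64 name(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)       \
    { (void)r3; (void)r4; (void)r5;                               \
      return ____##name((t1)r1, (t2)r2); }                        \
    static u64 ____##name(t1 a1, t2 a2)

/* Hypothetical helper defined via the macro: no casts in sight. */
BPF_CALL_2(helper_add, u64, a, u64, b)
{
    return a + b;
}
```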

I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
they look all as expected. No sparse issue spotted. We let this also sit for a
few days with Fengguang's kbuild test robot, and there were no issues seen. On
s390, it barked on the "uses dynamic stack allocation" notice, which is an old
one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
to the call wrapper, just telling that the perf raw record/frag sits on stack
(gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
and they were fine as well. All eBPF helpers are now converted to use these
macros, getting rid of a good chunk of all the raw castings.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-09 19:36:04 -07:00
Ingo Molnar
616d1c1b98 Merge branch 'linus' into perf/core, to refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 09:26:46 +02:00
Arnaldo Carvalho de Melo
97c79a38cd perf core: Per event callchain limit
In addition to being able to control the system-wide maximum depth via
/proc/sys/kernel/perf_event_max_stack, now we are able to ask for
different depths per event, using perf_event_attr.sample_max_stack for
that.

This uses a u16 hole at the end of perf_event_attr: when
perf_event_attr.sample_type has PERF_SAMPLE_CALLCHAIN set, a
sample_max_stack of zero means use perf_event_max_stack; otherwise
it'll be bounds checked under callchain_mutex.
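The zero-means-sysctl fallback reduces to a one-liner; the helper name is illustrative, not the kernel's:

```c
#include <assert.h>

/* Sketch of the per-event depth selection: a zero sample_max_stack
 * falls back to the system-wide /proc/sys/kernel/perf_event_max_stack
 * value; a non-zero value is used as the per-event depth (subject to
 * the bounds checking mentioned above). */
static unsigned int effective_max_stack(unsigned int sample_max_stack,
                                        unsigned int sysctl_max)
{
    return sample_max_stack ? sample_max_stack : sysctl_max;
}
```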

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-05-30 12:41:44 -03:00
Linus Torvalds
bdc6b758e4 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Mostly tooling and PMU driver fixes, but also a number of late updates
  such as the reworking of the call-chain size limiting logic to make
  call-graph recording more robust, plus tooling side changes for the
  new 'backwards ring-buffer' extension to the perf ring-buffer"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  perf record: Read from backward ring buffer
  perf record: Rename variable to make code clear
  perf record: Prevent reading invalid data in record__mmap_read
  perf evlist: Add API to pause/resume
  perf trace: Use the ptr->name beautifier as default for "filename" args
  perf trace: Use the fd->name beautifier as default for "fd" args
  perf report: Add srcline_from/to branch sort keys
  perf evsel: Record fd into perf_mmap
  perf evsel: Add overwrite attribute and check write_backward
  perf tools: Set buildid dir under symfs when --symfs is provided
  perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
  perf annotate: Sort list of recognised instructions
  perf annotate: Fix identification of ARM blt and bls instructions
  perf tools: Fix usage of max_stack sysctl
  perf callchain: Stop validating callchains by the max_stack sysctl
  perf trace: Fix exit_group() formatting
  perf top: Use machine->kptr_restrict_warned
  perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
  perf machine: Do not bail out if not managing to read ref reloc symbol
  perf/x86/intel/p4: Trival indentation fix, remove space
  ...
2016-05-25 17:05:40 -07:00
Linus Torvalds
a7fd20d1c4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support SPI based w5100 devices, from Akinobu Mita.

   2) Partial Segmentation Offload, from Alexander Duyck.

   3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

   4) Allow cls_flower stats offload, from Amir Vadai.

   5) Implement bpf blinding, from Daniel Borkmann.

   6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
      actually using FASYNC these atomics are superfluous.  From Eric
      Dumazet.

   7) Run TCP more preemptibly, also from Eric Dumazet.

   8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
      driver, from Gal Pressman.

   9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

  10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

  11) Support tunneling offloads in qed, from Manish Chopra.

  12) aRFS offloading in mlx5e, from Maor Gottlieb.

  13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
      Leitner.

  14) Add MSG_EOR support to TCP, this allows controlling packet
      coalescing on application record boundaries for more accurate
      socket timestamp sampling.  From Martin KaFai Lau.

  15) Fix alignment of 64-bit netlink attributes across the board, from
      Nicolas Dichtel.

  16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

  17) Several conversions of drivers to ethtool ksettings, from Philippe
      Reynes.

  18) Checksum neutral ILA in ipv6, from Tom Herbert.

  19) Factorize all of the various marvell dsa drivers into one, from
      Vivien Didelot

  20) Add VF support to qed driver, from Yuval Mintz"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
  Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
  Revert "phy dp83867: Make rgmii parameters optional"
  r8169: default to 64-bit DMA on recent PCIe chips
  phy dp83867: Make rgmii parameters optional
  phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
  bpf: arm64: remove callee-save registers use for tmp registers
  asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
  switchdev: pass pointer to fib_info instead of copy
  net_sched: close another race condition in tcf_mirred_release()
  tipc: fix nametable publication field in nl compat
  drivers: net: Don't print unpopulated net_device name
  qed: add support for dcbx.
  ravb: Add missing free_irq() calls to ravb_close()
  qed: Remove a stray tab
  net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
  net: ethernet: fec-mpc52xx: use phydev from struct net_device
  bpf, doc: fix typo on bpf_asm descriptions
  stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
  net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
  net: ethernet: fs-enet: use phydev from struct net_device
  ...
2016-05-17 16:26:30 -07:00
Arnaldo Carvalho de Melo
cfbcf46845 perf core: Pass max stack as a perf_callchain_entry context
This makes perf_callchain_{user,kernel}() receive the max stack
as context for the perf_callchain_entry, instead of accessing
the global sysctl_perf_event_max_stack.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-05-16 23:11:50 -03:00
Arnaldo Carvalho de Melo
c5dfd78eb7 perf core: Allow setting up max frame stack depth via sysctl
The default remains 127, which is good for most cases, and not even hit
most of the time, but in some cases, as reported by Brendan, 1024+ deep
frames are appearing on the radar for things like groovy and ruby.

And in some workloads putting a _lower_ cap on this may make sense. A
per-event control still needs to be put in place, though.

The new file is:

  # cat /proc/sys/kernel/perf_event_max_stack
  127

Changing it:

  # echo 256 > /proc/sys/kernel/perf_event_max_stack
  # cat /proc/sys/kernel/perf_event_max_stack
  256

But as soon as there is some event using callchains we get:

  # echo 512 > /proc/sys/kernel/perf_event_max_stack
  -bash: echo: write error: Device or resource busy
  #

Because we only allocate the callchain percpu data structures when there
is a user, the max can be changed easily; it's just a matter of having
no callchain users at that point.

Reported-and-Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-04-27 10:20:39 -03:00
Alexei Starovoitov
9940d67c93 bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs
This needs two wrapper functions that fetch 'struct pt_regs *' to
convert the tracepoint bpf context into a kprobe bpf context, so the
existing helper functions can be reused.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 21:04:26 -04:00
Alexei Starovoitov
557c0c6e7d bpf: convert stackmap to pre-allocation
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes a deadlock similar to the one seen with
hashmap, therefore convert stackmap to use pre-allocated memory.

call_rcu is no longer a feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when the number of
actual stacks is significantly less than the user-requested max_entries.
Since elements are no longer freed into slub, we can push elements into
the freelist immediately and let them be recycled.
However, a very unlikely race between user space map_lookup() and
program-side recycling is possible:
     cpu0                          cpu1
     ----                          ----
user does lookup(stackidX)
starts copying ips into buffer
                                   delete(stackidX)
                                   calls bpf_get_stackid()
                                   which recycles the element and
                                   overwrites with new stack trace

To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move the memset(,0) of the left-over element value from the
critical path of bpf_get_stackid() into the slow path of user space
lookup. Also disallow lookup() from bpf programs, since it's useless
and a program shouldn't be messing with the collected stack trace.
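
A userspace sketch of the lookup-side xchg dance described above (the
types and names here are stand-ins; the kernel performs the xchg() on
the stackmap's per-stackid bucket pointers):

```c
/* Reader takes exclusive ownership of the bucket via xchg, copies it,
 * then publishes it back.  A concurrent recycler that sees NULL picks
 * a fresh element instead of overwriting the one being copied. */
#include <stdatomic.h>
#include <string.h>
#include <errno.h>

struct stack_bucket {
	unsigned int nr;		/* number of ips recorded      */
	unsigned long ips[127];		/* stack trace instruction ptrs */
};

/* One slot per stackid; NULL while a reader owns the bucket. */
static _Atomic(struct stack_bucket *) buckets[1024];

static int lookup_stackid(unsigned int id, struct stack_bucket *out)
{
	struct stack_bucket *b = atomic_exchange(&buckets[id], NULL);

	if (!b)
		return -ENOENT;
	memcpy(out, b, sizeof(*out));
	/* Publish the bucket again once the copy is consistent. */
	atomic_exchange(&buckets[id], b);
	return 0;
}
```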

Note that a similar race between user space lookup and kernel-side
updates is also present in hashmap, but it's not a new race: bpf
programs were always allowed to modify hash and array map elements
while user space is copying them.

Fixes: d5a3b1f691 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
823707b68d bpf: check for reserved flag bits in array and stack maps
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-08 15:28:31 -05:00
Alexei Starovoitov
d5a3b1f691 bpf: introduce BPF_MAP_TYPE_STACK_TRACE
add new map type to store stack traces and corresponding helper
bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
@ctx: struct pt_regs*
@map: pointer to stack_trace map
@flags: bits 0-7 - number of stack frames to skip
        bit 8 - collect user stack instead of kernel
        bit 9 - compare stacks by hash only
        bit 10 - if two different stacks hash into the same stackid
                 discard old
        other bits - reserved
Return: >= 0 stackid on success or negative error

stackid is a 32-bit integer handle that can be further combined with
other data (including other stackids) and used as a key into maps.

Userspace will access the stackmap using the standard lookup/delete
syscall commands to retrieve the full stack trace for a given stackid.
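
The flags word layout above can be packed like this (the macro and
function names here are made up for illustration; the kernel later
exposes BPF_F_USER_STACK and friends in the uapi headers):

```c
/* Illustrative packing of the bpf_get_stackid() flags word, following
 * the bit layout documented in this commit. */
#include <stdint.h>

#define STACK_SKIP_MASK   0xffULL      /* bits 0-7: frames to skip        */
#define STACK_USER_STACK  (1ULL << 8)  /* collect user stack, not kernel  */
#define STACK_FAST_CMP    (1ULL << 9)  /* compare stacks by hash only     */
#define STACK_REUSE_ID    (1ULL << 10) /* on hash collision, discard old  */

static uint64_t stackid_flags(uint8_t skip, int user, int fast, int reuse)
{
	uint64_t flags = skip & STACK_SKIP_MASK;

	if (user)
		flags |= STACK_USER_STACK;
	if (fast)
		flags |= STACK_FAST_CMP;
	if (reuse)
		flags |= STACK_REUSE_ID;
	return flags;
}
```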

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-20 00:21:44 -05:00