Commit Graph

1655 Commits

Author SHA1 Message Date
Daniel Borkmann
d4414bc0e9 UPSTREAM: bpf, net: Rework cookie generator as per-cpu one
With its use in BPF, the cookie generator can be called very frequently
in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
and attached to the root cgroup, for example, when used in v1/v2 mixed
environments. In particular, when there's a high churn on sockets in the
system there can be many parallel requests to the bpf_get_socket_cookie()
and bpf_get_netns_cookie() helpers which then cause contention on the
atomic counter.

As similarly done in f991bd2e14 ("fs: introduce a per-cpu last_ino
allocator"), add a small helper library that both can use for the 64 bit
counters. Given this can be called from different contexts, we also need
to deal with potential nested calls even though in practice they are
considered extremely rare. One idea as suggested by Eric Dumazet was
to use a reverse counter for this situation since we don't expect 64 bit
overflows anyways; that way, we can avoid bigger gaps in the 64 bit
counter space compared to just batch-wise increase. Even on machines
with small number of cores (e.g. 4) the cookie generation shrinks from
min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run
in parallel from multiple CPUs.

Change-Id: I964c74107f7b39b2de42afedfe8568c5d03d0fb5
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net
2025-09-08 17:28:17 +03:00
Toke Høiland-Jørgensen
3e7d25a3c0 UPSTREAM: bpf: Fix context type resolving for extension programs
Eelco reported we can't properly access arguments if the tracing
program is attached to extension program.

Having following program:

  SEC("classifier/test_pkt_md_access")
  int test_pkt_md_access(struct __sk_buff *skb)

with its extension:

  SEC("freplace/test_pkt_md_access")
  int test_pkt_md_access_new(struct __sk_buff *skb)

and tracing that extension with:

  SEC("fentry/test_pkt_md_access_new")
  int BPF_PROG(fentry, struct sk_buff *skb)

It's not possible to access skb argument in the fentry program,
with following error from verifier:

  ; int BPF_PROG(fentry, struct sk_buff *skb)
  0: (79) r1 = *(u64 *)(r1 +0)
  invalid bpf_context access off=0 size=8

The problem is that btf_ctx_access gets the context type for the
traced program, which is in this case the extension.

But when we trace extension program, we want to get the context
type of the program that the extension is attached to, so we can
access the argument properly in the trace program.

This version of the patch is tweaked slightly from Jiri's original one,
since the refactoring in the previous patches means we have to get the
target prog type from the new variable in prog->aux instead of directly
from the target prog.

Reported-by: Eelco Chaudron <echaudro@redhat.com>
Suggested-by: Jiri Olsa <jolsa@kernel.org>
Change-Id: I5f449000d53a1a6b74d79ff02c82163deade04bb
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/160138355278.48470.17057040257274725638.stgit@toke.dk
2025-09-08 17:28:16 +03:00
Toke Høiland-Jørgensen
4fc2b7ee10 UPSTREAM: bpf: Support attaching freplace programs to multiple attach points
This enables support for attaching freplace programs to multiple attach
points. It does this by amending the UAPI for bpf_link_Create with a target
btf ID that can be used to supply the new attachment point along with the
target program fd. The target must be compatible with the target that was
supplied at program load time.

The implementation reuses the checks that were factored out of
check_attach_btf_id() to ensure compatibility between the BTF types of the
old and new attachment. If these match, a new bpf_tracing_link will be
created for the new attach target, allowing multiple attachments to
co-exist simultaneously.

The code could theoretically support multiple-attach of other types of
tracing programs as well, but since I don't have a use case for any of
those, there is no API support for doing so.

Change-Id: Ifeca634ba0c7ae9b1c5bc3c11a9d2b83fb0b5d23
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/160138355169.48470.17165680973640685368.stgit@toke.dk
2025-09-08 17:28:15 +03:00
Toke Høiland-Jørgensen
646e8c59cf UPSTREAM: bpf: Move prog->aux->linked_prog and trampoline into bpf_link on attach
In preparation for allowing multiple attachments of freplace programs, move
the references to the target program and trampoline into the
bpf_tracing_link structure when that is created. To do this atomically,
introduce a new mutex in prog->aux to protect writing to the two pointers
to target prog and trampoline, and rename the members to make it clear that
they are related.

With this change, it is no longer possible to attach the same tracing
program multiple times (detaching in-between), since the reference from the
tracing program to the target disappears on the first attach. However,
since the next patch will let the caller supply an attach target, that will
also make it possible to attach to the same place multiple times.

Change-Id: I42fa3ca6dbc455eb3015b20d5f9d56f1a80e693a
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/160138355059.48470.2503076992210324984.stgit@toke.dk
2025-09-08 17:28:15 +03:00
Toke Høiland-Jørgensen
81cfabcda1 UPSTREAM: bpf/preload: Make sure Makefile cleans up after itself, and add .gitignore
The Makefile in bpf/preload builds a local copy of libbpf, but does not
properly clean up after itself. This can lead to subsequent compilation
failures, since the feature detection cache is kept around which can lead
subsequent detection to fail.

Fix this by properly setting clean-files, and while we're at it, also add a
.gitignore for the directory to ignore the build artifacts.

Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates bpffs.")
Change-Id: I7da3cc7799b8ade2b8f9cbfccd50256a3988a2b8
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200927193005.8459-1-toke@redhat.com
2025-09-08 17:28:15 +03:00
Alan Maguire
879c8a8cee UPSTREAM: bpf: Add bpf_seq_printf_btf helper
A helper is added to allow seq file writing of kernel data
structures using vmlinux BTF.  Its signature is

long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr,
                        u32 btf_ptr_size, u64 flags);

Flags and struct btf_ptr definitions/use are identical to the
bpf_snprintf_btf helper, and the helper returns 0 on success
or a negative error value.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Change-Id: Ief0f9b8b9d9ed5f725159d71d1f3eb26f28c27c1
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-8-git-send-email-alan.maguire@oracle.com
2025-09-08 17:28:15 +03:00
Alan Maguire
ced9385a8a UPSTREAM: bpf: Bump iter seq size to support BTF representation of large data structures
BPF iter size is limited to PAGE_SIZE; if we wish to display BTF-based
representations of larger kernel data structures such as task_struct,
this will be insufficient.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Change-Id: Ica24b1c3e6b5b4249e9ea7d01f0f3cb37f5b07e4
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-6-git-send-email-alan.maguire@oracle.com
2025-09-08 17:28:14 +03:00
Alan Maguire
f735d7727c UPSTREAM: bpf: Add bpf_snprintf_btf helper
A helper is added to support tracing kernel type information in BPF
using the BPF Type Format (BTF).  Its signature is

long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
		      u32 btf_ptr_size, u64 flags);

struct btf_ptr * specifies

- a pointer to the data to be traced
- the BTF id of the type of data pointed to
- a flags field is provided for future use; these flags
  are not to be confused with the BTF_F_* flags
  below that control how the btf_ptr is displayed; the
  flags member of the struct btf_ptr may be used to
  disambiguate types in kernel versus module BTF, etc;
  the main distinction is the flags relate to the type
  and information needed in identifying it; not how it
  is displayed.

For example a BPF program with a struct sk_buff *skb
could do the following:

	static struct btf_ptr b = { };

	b.ptr = skb;
	b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
	bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0, 0);

Default output looks like this:

(struct sk_buff){
 .transport_header = (__u16)65535,
 .mac_header = (__u16)65535,
 .end = (sk_buff_data_t)192,
 .head = (unsigned char *)0x000000007524fd8b,
 .data = (unsigned char *)0x000000007524fd8b,
 .truesize = (unsigned int)768,
 .users = (refcount_t){
  .refs = (atomic_t){
   .counter = (int)1,
  },
 },
}

Flags modifying display are as follows:

- BTF_F_COMPACT:	no formatting around type information
- BTF_F_NONAME:		no struct/union member names/types
- BTF_F_PTR_RAW:	show raw (unobfuscated) pointer values;
			equivalent to %px.
- BTF_F_ZERO:		show zero-valued struct/union members;
			they are not displayed by default

Change-Id: I77f2c2a0d41aee2f4f10e3288a36de475fd2cb46
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com
2025-09-08 17:28:14 +03:00
Alan Maguire
daf1e73239 BACKPORT: bpf: Move to generic BTF show support, apply it to seq files/strings
generalize the "seq_show" seq file support in btf.c to support
a generic show callback of which we support two instances; the
current seq file show, and a show with snprintf() behaviour which
instead writes the type data to a supplied string.

Both classes of show function call btf_type_show() with different
targets; the seq file or the string to be written.  In the string
case we need to track additional data - length left in string to write
and length to return that we would have written (a la snprintf).

By default show will display type information, field members and
their types and values etc, and the information is indented
based upon structure depth. Zeroed fields are omitted.

Show however supports flags which modify its behaviour:

BTF_SHOW_COMPACT - suppress newline/indent.
BTF_SHOW_NONAME - suppress show of type and member names.
BTF_SHOW_PTR_RAW - do not obfuscate pointer values.
BTF_SHOW_UNSAFE - do not copy data to safe buffer before display.
BTF_SHOW_ZERO - show zeroed values (by default they are not shown).

Use the old name for copy_from_kernel_nofault since backporting the
commit is too much effort and their is no functional change.

Change-Id: I1165ad6ea0d2b3620c6d920701160dc1a4472bb3
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-3-git-send-email-alan.maguire@oracle.com
2025-09-08 17:28:14 +03:00
Alan Maguire
5e8f1e99ee UPSTREAM: bpf: Provide function to get vmlinux BTF information
It will be used later for BPF structure display support

Change-Id: Ib804bccf1c37fc3c55d1eb042a063bc37eebdb74
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-2-git-send-email-alan.maguire@oracle.com
2025-09-08 17:28:14 +03:00
Toke Høiland-Jørgensen
707191ce21 UPSTREAM: bpf: verifier: refactor check_attach_btf_id()
The check_attach_btf_id() function really does three things:

1. It performs a bunch of checks on the program to ensure that the
   attachment is valid.

2. It stores a bunch of state about the attachment being requested in
   the verifier environment and struct bpf_prog objects.

3. It allocates a trampoline for the attachment.

This patch splits out (1.) and (3.) into separate functions which will
perform the checks, but return the computed values instead of directly
modifying the environment. This is done in preparation for reusing the
checks when the actual attachment is happening, which will allow tracing
programs to have multiple (compatible) attachments.

This also fixes a bug where a bunch of checks were skipped if a trampoline
already existed for the tracing target.

Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
Fixes: 1e6c62a88215 ("bpf: Introduce sleepable BPF programs")
Acked-by: Andrii Nakryiko <andriin@fb.com>
Change-Id: Ie71636c6681afd4c3d3615a090e2e75742200a3b
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:13 +03:00
Toke Høiland-Jørgensen
ee98705a02 UPSTREAM: bpf: change logging calls from verbose() to bpf_log() and use log pointer
In preparation for moving code around, change a bunch of references to
env->log (and the verbose() logging helper) to use bpf_log() and a direct
pointer to struct bpf_verifier_log. While we're touching the function
signature, mark the 'prog' argument to bpf_check_type_match() as const.

Also enhance the bpf_verifier_log_needed() check to handle NULL pointers
for the log struct so we can re-use the code with logging disabled.

Acked-by: Andrii Nakryiko <andriin@fb.com>
Change-Id: Iefc10bf5bd36b29b6423dd8e9ee6bd9fb3d4e042
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:13 +03:00
Toke Høiland-Jørgensen
104cb8f9e0 UPSTREAM: bpf: disallow attaching modify_return tracing functions to other BPF programs
From the checks and commit messages for modify_return, it seems it was
never the intention that it should be possible to attach a tracing program
with expected_attach_type == BPF_MODIFY_RETURN to another BPF program.
However, check_attach_modify_return() will only look at the function name,
so if the target function starts with "security_", the attach will be
allowed even for bpf2bpf attachment.

Fix this oversight by also blocking the modification if a target program is
supplied.

Fixes: 18644cec714a ("bpf: Fix use-after-free in fmod_ret check")
Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
Acked-by: Andrii Nakryiko <andriin@fb.com>
Change-Id: I40e2002414d9f73aa5326798d9c5d299ac7da03b
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:13 +03:00
Lorenz Bauer
d6b319ea3c UPSTREAM: bpf: sockmap: Enable map_update_elem from bpf_iter
Allow passing a pointer to a BTF struct sock_common* when updating
a sockmap or sockhash. Since BTF pointers can fault and therefore be
NULL at runtime we need to add an additional !sk check to
sock_map_update_elem. Since we may be passed a request or timewait
socket we also need to check sk_fullsock. Doing this allows calling
map_update_elem on sockmap from bpf_iter context, which uses
BTF pointers.

Change-Id: Ifc6771f27753316add13e847cbcccbcb93c12965
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200928090805.23343-2-lmb@cloudflare.com
2025-09-08 17:28:13 +03:00
Martin KaFai Lau
f81fa45a60 UPSTREAM: bpf: Enable bpf_skc_to_* sock casting helper to networking prog type
There is a constant need to add more fields into the bpf_tcp_sock
for the bpf programs running at tc, sock_ops...etc.

A current workaround could be to use bpf_probe_read_kernel().  However,
other than making another helper call for reading each field and missing
CO-RE, it is also not as intuitive to use as directly reading
"tp->lsndtime" for example.  While already having perfmon cap to do
bpf_probe_read_kernel(), it will be much easier if the bpf prog can
directly read from the tcp_sock.

This patch tries to do that by using the existing casting-helpers
bpf_skc_to_*() whose func_proto returns a btf_id.  For example, the
func_proto of bpf_skc_to_tcp_sock returns the btf_id of the
kernel "struct tcp_sock".

These helpers are also added to is_ptr_cast_function().
It ensures the returning reg (BPF_REF_0) will also carries the ref_obj_id.
That will keep the ref-tracking works properly.

The bpf_skc_to_* helpers are made available to most of the bpf prog
types in filter.c. The bpf_skc_to_* helpers will be limited by
perfmon cap.

This patch adds a ARG_PTR_TO_BTF_ID_SOCK_COMMON.  The helper accepting
this arg can accept a btf-id-ptr (PTR_TO_BTF_ID + &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON])
or a legacy-ctx-convert-skc-ptr (PTR_TO_SOCK_COMMON).  The bpf_skc_to_*()
helpers are changed to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that
they will accept pointer obtained from skb->sk.

Instead of specifying both arg_type and arg_btf_id in the same func_proto
which is how the current ARG_PTR_TO_BTF_ID does, the arg_btf_id of
the new ARG_PTR_TO_BTF_ID_SOCK_COMMON is specified in the
compatible_reg_types[] in verifier.c.  The reason is the arg_btf_id is
always the same.  Discussion in this thread:
https://lore.kernel.org/bpf/20200922070422.1917351-1-kafai@fb.com/

The ARG_PTR_TO_BTF_ID_ part gives a clear expectation that the helper is
expecting a PTR_TO_BTF_ID which could be NULL.  This is the same
behavior as the existing helper taking ARG_PTR_TO_BTF_ID.

The _SOCK_COMMON part means the helper is also expecting the legacy
SOCK_COMMON pointer.

By excluding the _OR_NULL part, the bpf prog cannot call helper
with a literal NULL which doesn't make sense in most cases.
e.g. bpf_skc_to_tcp_sock(NULL) will be rejected.  All PTR_TO_*_OR_NULL
reg has to do a NULL check first before passing into the helper or else
the bpf prog will be rejected.  This behavior is nothing new and
consistent with the current expectation during bpf-prog-load.

[ ARG_PTR_TO_BTF_ID_SOCK_COMMON will be used to replace
  ARG_PTR_TO_SOCK* of other existing helpers later such that
  those existing helpers can take the PTR_TO_BTF_ID returned by
  the bpf_skc_to_*() helpers.

  The only special case is bpf_sk_lookup_assign() which can accept a
  literal NULL ptr.  It has to be handled specially in another follow
  up patch if there is a need (e.g. by renaming ARG_PTR_TO_SOCKET_OR_NULL
  to ARG_PTR_TO_BTF_ID_SOCK_COMMON_OR_NULL). ]

[ When converting the older helpers that take ARG_PTR_TO_SOCK* in
  the later patch, if the kernel does not support BTF,
  ARG_PTR_TO_BTF_ID_SOCK_COMMON will behave like ARG_PTR_TO_SOCK_COMMON
  because no reg->type could have PTR_TO_BTF_ID in this case.

  It is not a concern for the newer-btf-only helper like the bpf_skc_to_*()
  here though because these helpers must require BTF vmlinux to begin
  with. ]

Change-Id: I8463fd1c8f7843a7548b470c9a2555466a3f5fa7
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200925000350.3855720-1-kafai@fb.com
2025-09-08 17:28:12 +03:00
Lorenzo Bianconi
6fa5c7864e UPSTREAM: bpf, cpumap: Remove rcpu pointer from cpu_map_build_skb signature
Get rid of bpf_cpu_map_entry pointer in cpu_map_build_skb routine
signature since it is no longer needed.

Change-Id: Idcfb2814fc425d3c63c866727e59dc62e3c52041
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/33cb9b7dc447de3ea6fd6ce713ac41bca8794423.1601292015.git.lorenzo@kernel.org
2025-09-08 17:28:12 +03:00
Song Liu
8f848e3e7e UPSTREAM: bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint
Add .test_run for raw_tracepoint. Also, introduce a new feature that runs
the target program on a specific CPU. This is achieved by a new flag in
bpf_attr.test, BPF_F_TEST_RUN_ON_CPU. When this flag is set, the program
is triggered on cpu with id bpf_attr.test.cpu. This feature is needed for
BPF programs that handle perf_event and other percpu resources, as the
program can access these resource locally.

Change-Id: Id8f992caff30d7b65df8195f8934bcb2d8b658cb
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200925205432.1777-2-songliubraving@fb.com
2025-09-08 17:28:11 +03:00
John Fastabend
7124049d74 UPSTREAM: bpf, verifier: Remove redundant var_off.value ops in scalar known reg cases
In BPF_AND and BPF_OR alu cases we have this pattern when the src and dst
tnum is a constant.

 1 dst_reg->var_off = tnum_[op](dst_reg->var_off, src_reg.var_off)
 2 scalar32_min_max_[op]
 3       if (known) return
 4 scalar_min_max_[op]
 5       if (known)
 6          __mark_reg_known(dst_reg,
                   dst_reg->var_off.value [op] src_reg.var_off.value)

The result is in 1 we calculate the var_off value and store it in the
dst_reg. Then in 6 we duplicate this logic doing the op again on the
value.

The duplication comes from the the tnum_[op] handlers because they have
already done the value calcuation. For example this is tnum_and().

 struct tnum tnum_and(struct tnum a, struct tnum b)
 {
	u64 alpha, beta, v;

	alpha = a.value | a.mask;
	beta = b.value | b.mask;
	v = a.value & b.value;
	return TNUM(v, alpha & beta & ~v);
 }

So lets remove the redundant op calculation. Its confusing for readers
and unnecessary. Its also not harmful because those ops have the
property, r1 & r1 = r1 and r1 | r1 = r1.

Change-Id: Ibff1ff7d05a8dc1a4cbb3b2efbc75bbd0dbc377c
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:11 +03:00
Martin KaFai Lau
ea103760ae UPSTREAM: bpf: Change bpf_sk_storage_*() to accept ARG_PTR_TO_BTF_ID_SOCK_COMMON
This patch changes the bpf_sk_storage_*() to take
ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer
returned by the bpf_skc_to_*() helpers also.

A micro benchmark has been done on a "cgroup_skb/egress" bpf program
which does a bpf_sk_storage_get().  It was driven by netperf doing
a 4096 connected UDP_STREAM test with 64bytes packet.
The stats from "kernel.bpf_stats_enabled" shows no meaningful difference.

The sk_storage_get_btf_proto, sk_storage_delete_btf_proto,
btf_sk_storage_get_proto, and btf_sk_storage_delete_proto are
no longer needed, so they are removed.

Change-Id: Ifef1303f11952714d181a3c369e154a70a864fd0
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200925000402.3856307-1-kafai@fb.com
2025-09-08 17:28:10 +03:00
Martin KaFai Lau
fd25ce5ed8 UPSTREAM: bpf: Move the PTR_TO_BTF_ID check to check_reg_type()
check_reg_type() checks whether a reg can be used as an arg of a
func_proto.  For PTR_TO_BTF_ID, the check is actually not
completely done until the reg->btf_id is pointing to a
kernel struct that is acceptable by the func_proto.

Thus, this patch moves the btf_id check into check_reg_type().
"arg_type" and "arg_btf_id" are passed to check_reg_type() instead of
"compatible".  The compatible_reg_types[] usage is localized in
check_reg_type() now.

The "if (!btf_id) verbose(...); " is also removed since it won't happen.

Change-Id: I729be72b179095f88e0edb433e24586b76bdaae9
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200925000344.3854828-1-kafai@fb.com
2025-09-08 17:28:09 +03:00
Alexei Starovoitov
7792056354 UPSTREAM: Revert "bpf: Fix potential call bpf_link_free() in atomic context"
This reverts commit 31f23a6a181c81543b10a1a9056b0e6c7ef1c747.

This change made many selftests/bpf flaky: flow_dissector, sk_lookup, sk_assign and others.
There was no issue in the code.

Change-Id: Ie4d28d711448cf7443ebd9dbd7d8169f94fff0ae
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:09 +03:00
Lorenz Bauer
c6b2930bb2 UPSTREAM: bpf: Explicitly size compatible_reg_types
Arrays with designated initializers have an implicit length of the highest
initialized value plus one. I used this to ensure that newly added entries
in enum bpf_reg_type get a NULL entry in compatible_reg_types.

This is difficult to understand since it requires knowledge of the
peculiarities of designated initializers. Use __BPF_ARG_TYPE_MAX to size
the array instead.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Change-Id: I2f69d60ae26f495de801e0be5b24a7a95a10da15
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200923160156.80814-1-lmb@cloudflare.com
2025-09-08 17:28:09 +03:00
Lorenz Bauer
23945da090 UPSTREAM: bpf: Use a table to drive helper arg type checks
The mapping between bpf_arg_type and bpf_reg_type is encoded in a big
hairy if statement that is hard to follow. The debug output also leaves
to be desired: if a reg_type doesn't match we only print one of the
options, instead printing all the valid ones.

Convert the if statement into a table which is then used to drive type
checking. If none of the reg_types match we print all options, e.g.:

    R2 type=rdonly_buf expected=fp, pkt, pkt_meta, map_value

Change-Id: Ic6330a681b3ca2d3aa609173d5746f7ffcc4d5b8
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-12-lmb@cloudflare.com
2025-09-08 17:28:09 +03:00
Lorenz Bauer
99e29c3309 UPSTREAM: bpf: Hoist type checking for nullable arg types
check_func_arg has a plethora of weird if statements with empty branches.
They work around the fact that *_OR_NULL argument types should accept a
SCALAR_VALUE register, as long as it's value is 0. These statements make
it difficult to reason about the type checking logic.

Instead, skip more detailed type checking logic iff the register is 0,
and the function expects a nullable type. This allows simplifying the type
checking itself.

Change-Id: I4e783d07cb9cfe231a16ebf3d27a55b438711006
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-11-lmb@cloudflare.com
2025-09-08 17:28:08 +03:00
Lorenz Bauer
23f66cd492 UPSTREAM: bpf: Check ARG_PTR_TO_SPINLOCK register type in check_func_arg
Move the check for PTR_TO_MAP_VALUE to check_func_arg, where all other
checking is done as well. Move the invocation of process_spin_lock away
from the register type checking, to allow a future refactoring.

Change-Id: I185ed6e4c44be2bdfa6f5bab3ec57fd2c9e58381
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-10-lmb@cloudflare.com
2025-09-08 17:28:08 +03:00
Lorenz Bauer
ac99e8c348 UPSTREAM: bpf: Set meta->raw_mode for pointers close to use
If we encounter a pointer to memory, we set meta->raw_mode depending
on the type of memory we point at. What isn't obvious is that this
information is only used when the next memory size argument is
encountered.

Move the assignment closer to where it's used, and add a comment that
explains what is going on.

Change-Id: Iec0b6ade10c7475d72ac7dd8fff4c7e73deef728
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-9-lmb@cloudflare.com
2025-09-08 17:28:08 +03:00
Lorenz Bauer
fd2bd3a82f UPSTREAM: bpf: Make context access check generic
Always check context access if the register we're operating on is
PTR_TO_CTX, rather than relying on ARG_PTR_TO_CTX. This allows
simplifying the arg_type checking section of the function.

Change-Id: Ic9ab0542797d42272d61a3e0e71aead2adc89a4e
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-8-lmb@cloudflare.com
2025-09-08 17:28:08 +03:00
Lorenz Bauer
8d56713572 UPSTREAM: bpf: Make reference tracking generic
Instead of dealing with reg->ref_obj_id individually for every arg type that
needs it, rely on the fact that ref_obj_id is zero if the register is not
reference tracked.

Change-Id: I68181c83545a3f1de73eb863986fe4165df10a87
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-7-lmb@cloudflare.com
2025-09-08 17:28:07 +03:00
Lorenz Bauer
44ea176c2e UPSTREAM: bpf: Make BTF pointer type checking generic
Perform BTF type checks if the register we're working on contains a BTF
pointer, rather than if the argument is for a BTF pointer. This is easier
to understand, and allows removing the code from the arg_type checking
section of the function.

Change-Id: I1b8f522a2f4438a1c86a6512d733b7b8f2bb1149
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-6-lmb@cloudflare.com
2025-09-08 17:28:07 +03:00
Lorenz Bauer
7548dd425c UPSTREAM: bpf: Allow specifying a BTF ID per argument in function protos
Function prototypes using ARG_PTR_TO_BTF_ID currently use two ways to signal
which BTF IDs are acceptable. First, bpf_func_proto.btf_id is an array of
IDs, one for each argument. This array is only accessed up to the highest
numbered argument that uses ARG_PTR_TO_BTF_ID and may therefore be less than
five arguments long. It usually points at a BTF_ID_LIST. Second, check_btf_id
is a function pointer that is called by the verifier if present. It gets the
actual BTF ID of the register, and the argument number we're currently checking.
It turns out that the only user check_arg_btf_id ignores the argument, and is
simply used to check whether the BTF ID has a struct sock_common at it's start.

Replace both of these mechanisms with an explicit BTF ID for each argument
in a function proto. Thanks to btf_struct_ids_match this is very flexible:
check_arg_btf_id can be replaced by requiring struct sock_common.

Change-Id: I04d5adc4574380e9a67e7a688b98611158bb7102
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-5-lmb@cloudflare.com
2025-09-08 17:28:07 +03:00
Lorenz Bauer
8b1ebefb81 UPSTREAM: bpf: Check scalar or invalid register in check_helper_mem_access
Move the check for a NULL or zero register to check_helper_mem_access. This
makes check_stack_boundary easier to understand.

Change-Id: I92619e8d6144ecb79ebf44ef29ff9e80a4f6c1e0
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-3-lmb@cloudflare.com
2025-09-08 17:28:06 +03:00
Lorenz Bauer
5ca7018995 UPSTREAM: btf: Make btf_set_contains take a const pointer
bsearch doesn't modify the contents of the array, so we can take a const pointer.

Change-Id: I9ba6e9b56b0e48b0a4240ad4192def77486f4c32
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-2-lmb@cloudflare.com
2025-09-08 17:28:06 +03:00
Muchun Song
66bdb024c9 UPSTREAM: bpf: Fix potential call bpf_link_free() in atomic context
The in_atomic() macro cannot always detect atomic context, in particular,
it cannot know about held spinlocks in non-preemptible kernels. Although,
there is no user call bpf_link_put() with holding spinlock now, be on the
safe side, so we can avoid this in the future.

Change-Id: I02f0190a342ba26a760e8a11f8e39f7d5bc60473
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200917074453.20621-1-songmuchun@bytedance.com
2025-09-08 17:28:06 +03:00
Martin KaFai Lau
260d4107d8 UPSTREAM: bpf: Use hlist_add_head_rcu when linking to local_storage
The local_storage->list will be traversed by rcu reader in parallel.
Thus, hlist_add_head_rcu() is needed in bpf_selem_link_storage_nolock().
This patch fixes it.

This part of the code has recently been refactored in bpf-next
and this patch makes changes to the new file "bpf_local_storage.c".
Instead of using the original offending commit in the Fixes tag,
the commit that created the file "bpf_local_storage.c" is used.

A separate fix has been provided to the bpf tree.

Fixes: 450af8d0f6be ("bpf: Split bpf_local_storage to bpf_sk_storage")
Change-Id: Ia5fdd4c4c1a96f99127c7cfe504bf075bfea5bf1
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200916204453.2003915-1-kafai@fb.com
2025-09-08 17:28:05 +03:00
Alexei Starovoitov
593b444f07 UPSTREAM: bpf: Add abnormal return checks.
LD_[ABS|IND] instructions may return from the function early. bpf_tail_call
pseudo instruction is either fallthrough or return. Allow them in the
subprograms only when subprograms are BTF annotated and have scalar return
types. Allow ld_abs and tail_call in the main program even if it calls into
subprograms. In the past that was not ok to do for ld_abs, since it was JITed
with special exit sequence. Since bpf_gen_ld_abs() was introduced the ld_abs
looks like normal exit insn from JIT point of view, so it's safe to allow them
in the main program.

Change-Id: I8ab267d2cdfd4a8fb8c99314039e7e261fd7f9f0
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:05 +03:00
Maciej Fijalkowski
3a59c4ecef UPSTREAM: bpf: allow for tailcalls in BPF subprograms for x64 JIT
Relax verifier's restriction that was meant to forbid tailcall usage
when subprog count was higher than 1.

Also, do not max out the stack depth of program that utilizes tailcalls.

Change-Id: I85723d4b31b0403c853d00350e54214d455e5afb
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:05 +03:00
Maciej Fijalkowski
877246b5b7 UPSTREAM: bpf, x64: rework pro/epilogue and tailcall handling in JIT
This commit serves two things:
1) it optimizes BPF prologue/epilogue generation
2) it makes possible to have tailcalls within BPF subprogram

Both points are related to each other since without 1), 2) could not be
achieved.

In [1], Alexei says:
"The prologue will look like:
nop5
xor eax,eax  // two new bytes if bpf_tail_call() is used in this
             // function
push rbp
mov rbp, rsp
sub rsp, rounded_stack_depth
push rax // zero init tail_call counter
variable number of push rbx,r13,r14,r15

Then bpf_tail_call will pop variable number rbx,..
and final 'pop rax'
Then 'add rsp, size_of_current_stack_frame'
jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
rbp, rsp'

This way new function will set its own stack size and will init tail
call
counter with whatever value the parent had.

If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
Instead it would need to have 'nop2' in there."

Implement that suggestion.

Since the layout of stack is changed, tail call counter handling can not
rely anymore on popping it to rbx just like it have been handled for
constant prologue case and later overwrite of rbx with actual value of
rbx pushed to stack. Therefore, let's use one of the register (%rcx) that
is considered to be volatile/caller-saved and pop the value of tail call
counter in there in the epilogue.

Drop the BUILD_BUG_ON in emit_prologue and in
emit_bpf_tail_call_indirect where instruction layout is not constant
anymore.

Introduce new poke target, 'tailcall_bypass' to poke descriptor that is
dedicated for skipping the register pops and stack unwind that are
generated right before the actual jump to target program.
For case when the target program is not present, BPF program will skip
the pop instructions and nop5 dedicated for jmpq $target. An example of
such state when only R6 of callee saved registers is used by program:

ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
ffffffffc0513aa6:       5b                      pop    %rbx
ffffffffc0513aa7:       58                      pop    %rax
ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi

When target program is inserted, the jump that was there to skip
pops/nop5 will become the nop5, so CPU will go over pops and do the
actual tailcall.

One might ask why there simply can not be pushes after the nop5?
In the following example snippet:

ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
(...)
ffffffffc0370332:       5b                      pop    %rbx
ffffffffc0370333:       58                      pop    %rax
ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
ffffffffc0370347:       50                      push   %rax
ffffffffc0370348:       53                      push   %rbx
ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548

There is the bpf2bpf call (at ffffffffc037034c) right after the tailcall
and jump target is not present. ctx is in %rbx register and BPF
subprogram that we will call into on ffffffffc037034c is relying on it,
e.g. it will pick ctx from there. Such code layout is therefore broken
as we would overwrite the content of %rbx with the value that was pushed
on the prologue. That is the reason for the 'bypass' approach.

Special care needs to be taken during the install/update/remove of
tailcall target. In case when target program is not present, the CPU
must not execute the pop instructions that precede the tailcall.

To address that, the following states can be defined:
A nop, unwind, nop
B nop, unwind, tail
C skip, unwind, nop
D skip, unwind, tail

A is forbidden (lead to incorrectness). The state transitions between
tailcall install/update/remove will work as follows:

First install tail call f: C->D->B(f)
 * poke the tailcall, after that get rid of the skip
Update tail call f to f': B(f)->B(f')
 * poke the tailcall (poke->tailcall_target) and do NOT touch the
   poke->tailcall_bypass
Remove tail call: B(f')->C(f')
 * poke->tailcall_bypass is poked back to jump, then we wait the RCU
   grace period so that other programs will finish its execution and
   after that we are safe to remove the poke->tailcall_target
Install new tail call (f''): C(f')->D(f'')->B(f'').
 * same as first step

This way CPU can never be exposed to "unwind, tail" state.

Last but not least, when tailcalls get mixed with bpf2bpf calls, it
would be possible to encounter the endless loop due to clearing the
tailcall counter if for example we would use the tailcall3-like from BPF
selftests program that would be subprogram-based, meaning the tailcall
would be present within the BPF subprogram.

This test, broken down to particular steps, would do:
entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
func0 -> call subprog_tail
(we are NOT skipping the first 11 bytes of prologue and this subprogram
has a tailcall, therefore we clear the counter...)
subprog -> do the same thing as entry

and then loop forever.

To address this, the idea is to go through the call chain of bpf2bpf progs
and look for a tailcall presence throughout whole chain. If we saw a single
tail call then each node in this call chain needs to be marked as a subprog
that can reach the tailcall. We would later feed the JIT with this info
and:
- set eax to 0 only when tailcall is reachable and this is the entry prog
- if tailcall is reachable but there's no tailcall in insns of currently
  JITed prog then push rax anyway, so that it will be possible to
  propagate further down the call chain
- finally if tailcall is reachable, then we need to precede the 'call'
  insn with mov rax, [rbp - (stack_depth + 8)]

Tail call related cases from test_verifier kselftest are also working
fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
work properly as well.

[1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Change-Id: I0c252ca000361887bf2087b51011b9e3c961ce35
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:05 +03:00
Maciej Fijalkowski
d584a39a9b UPSTREAM: bpf: rename poke descriptor's 'ip' member to 'tailcall_target'
Reflect the actual purpose of poke->ip and rename it to
poke->tailcall_target so that it will not the be confused with another
poke target that will be introduced in next commit.

While at it, do the same thing with poke->ip_stable - rename it to
poke->tailcall_target_stable.

Change-Id: I574e7dcf5412474a68d410401f84597ae5047873
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:04 +03:00
Maciej Fijalkowski
948e15049d UPSTREAM: bpf: propagate poke descriptors to subprograms
Previously, there was no need for poke descriptors being present in
subprogram's bpf_prog_aux struct since tailcalls were simply not allowed
in them. Each subprog is JITed independently so in order to enable
JITing subprograms that use tailcalls, do the following:

- in fixup_bpf_calls() store the index of tailcall insn onto the generated
  poke descriptor,
- in case when insn patching occurs, adjust the tailcall insn idx from
  bpf_patch_insn_data,
- then in jit_subprogs() check whether the given poke descriptor belongs
  to the current subprog by checking if that previously stored absolute
  index of tail call insn is in the scope of the insns of given subprog,
- update the insn->imm with new poke descriptor slot so that while JITing
  the proper poke descriptor will be grabbed

This way each of the main program's poke descriptors are distributed
across the subprograms poke descriptor array, so main program's
descriptors can be untracked out of the prog array map.

Add also subprog's aux struct to the BPF map poke_progs list by calling
on it map_poke_track().

In case of any error, call the map_poke_untrack() on subprog's aux
structs that have already been registered to prog array map.

Change-Id: Ic622a47cd5fb4e339a42f0f08aafd2bf34c915f0
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-08 17:28:04 +03:00
YiFei Zhu
22b4a11da1 UPSTREAM: bpf: Add BPF_PROG_BIND_MAP syscall
This syscall binds a map to a program. Returns success if the map is
already bound to the program.

Change-Id: I3899904e13ff5cddb35d41ffa9abe67d29a55ca9
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: YiFei Zhu <zhuyifei1999@gmail.com>
Link: https://lore.kernel.org/bpf/20200915234543.3220146-3-sdf@google.com
2025-09-08 17:28:04 +03:00
YiFei Zhu
ab4f7ba5a3 UPSTREAM: bpf: Mutex protect used_maps array and count
To support modifying the used_maps array, we use a mutex to protect
the use of the counter and the array. The mutex is initialized right
after the prog aux is allocated, and destroyed right before prog
aux is freed. This way we guarantee it's initialized for both cBPF
and eBPF.

Change-Id: I5a5a46193377945bf8bded1a5d8e2effc426555e
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: YiFei Zhu <zhuyifei1999@gmail.com>
Link: https://lore.kernel.org/bpf/20200915234543.3220146-2-sdf@google.com
2025-09-08 17:28:04 +03:00
Matthew Wilcox (Oracle)
a10d5ec4e9 UPSTREAM: sysctl: Convert to iter interfaces
Using the read_iter/write_iter interfaces allows for in-kernel users
to set sysctls without using set_fs().  Also, the buffer is a string,
so give it the real type of 'char *', not void *.

[AV: Christoph's fixup folded in]

Change-Id: I9925d71d0c27b6712467085ec446ec7fca8e6db4
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-08 17:28:02 +03:00
Yonghong Song
ca76009e93 UPSTREAM: bpf: Do not use bucket_lock for hashmap iterator
Currently, for hashmap, the bpf iterator will grab a bucket lock, a
spinlock, before traversing the elements in the bucket. This can ensure
all bpf visted elements are valid. But this mechanism may cause
deadlock if update/deletion happens to the same bucket of the
visited map in the program. For example, if we added bpf_map_update_elem()
call to the same visited element in selftests bpf_iter_bpf_hash_map.c,
we will have the following deadlock:

  ============================================
  WARNING: possible recursive locking detected
  5.9.0-rc1+ #841 Not tainted
  --------------------------------------------
  test_progs/1750 is trying to acquire lock:
  ffff9a5bb73c5e70 (&htab->buckets[i].raw_lock){....}-{2:2}, at: htab_map_update_elem+0x1cf/0x410

  but task is already holding lock:
  ffff9a5bb73c5e20 (&htab->buckets[i].raw_lock){....}-{2:2}, at: bpf_hash_map_seq_find_next+0x94/0x120

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&htab->buckets[i].raw_lock);
    lock(&htab->buckets[i].raw_lock);

   *** DEADLOCK ***
   ...
  Call Trace:
   dump_stack+0x78/0xa0
   __lock_acquire.cold.74+0x209/0x2e3
   lock_acquire+0xba/0x380
   ? htab_map_update_elem+0x1cf/0x410
   ? __lock_acquire+0x639/0x20c0
   _raw_spin_lock_irqsave+0x3b/0x80
   ? htab_map_update_elem+0x1cf/0x410
   htab_map_update_elem+0x1cf/0x410
   ? lock_acquire+0xba/0x380
   bpf_prog_ad6dab10433b135d_dump_bpf_hash_map+0x88/0xa9c
   ? find_held_lock+0x34/0xa0
   bpf_iter_run_prog+0x81/0x16e
   __bpf_hash_map_seq_show+0x145/0x180
   bpf_seq_read+0xff/0x3d0
   vfs_read+0xad/0x1c0
   ksys_read+0x5f/0xe0
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ...

The bucket_lock first grabbed in seq_ops->next() called by bpf_seq_read(),
and then grabbed again in htab_map_update_elem() in the bpf program, causing
deadlocks.

Actually, we do not need bucket_lock here, we can just use rcu_read_lock()
similar to netlink iterator where the rcu_read_{lock,unlock} likes below:
 seq_ops->start():
     rcu_read_lock();
 seq_ops->next():
     rcu_read_unlock();
     /* next element */
     rcu_read_lock();
 seq_ops->stop();
     rcu_read_unlock();

Compared to old bucket_lock mechanism, if concurrent updata/delete happens,
we may visit stale elements, miss some elements, or repeat some elements.
I think this is a reasonable compromise. For users wanting to avoid
stale, missing/repeated accesses, bpf_map batch access syscall interface
can be used.

Change-Id: Id1ba4dedc71723382029e05d83ecd1c4dcd98bb5
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200902235340.2001375-1-yhs@fb.com
2025-09-08 17:28:01 +03:00
Yonghong Song
41aaa22e13 UPSTREAM: bpf: Avoid iterating duplicated files for task_file iterator
Currently, task_file iterator iterates all files from all tasks.
This may potentially visit a lot of duplicated files if there are
many tasks sharing the same files, e.g., typical pthreads
where these pthreads and the main thread are sharing the same files.

This patch changed task_file iterator to skip a particular task
if that task shares the same files as its group_leader (the task
having the same tgid and also task->tgid == task->pid).
This will preserve the same result, visiting all files from all
tasks, and will reduce runtime cost significantl, e.g., if there are
a lot of pthreads and the process has a lot of open files.

Suggested-by: Andrii Nakryiko <andriin@fb.com>
Change-Id: I0aa5a478b70f31487da8bd57446ffd3d9b9b3eb4
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/bpf/20200902023112.1672792-1-yhs@fb.com
2025-09-08 17:28:01 +03:00
Björn Töpel
72d818ef7f UPSTREAM: bpf: {cpu,dev}map: Change various functions return type from int to void
The functions bq_enqueue(), bq_flush_to_queue(), and bq_xmit_all() in
{cpu,dev}map.c always return zero. Changing the return type from int
to void makes the code easier to follow.

Suggested-by: David Ahern <dsahern@gmail.com>
Change-Id: I405d72c40b4c2d2530d7879176fea23ea3d7dbb9
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200901083928.6199-1-bjorn.topel@gmail.com
2025-09-08 17:28:01 +03:00
Alexei Starovoitov
3288385861 UPSTREAM: bpf: Remove bpf_lsm_file_mprotect from sleepable list.
Technically the bpf programs can sleep while attached to bpf_lsm_file_mprotect,
but such programs need to access user memory. So they're in might_fault()
category. Which means they cannot be called from file_mprotect lsm hook that
takes write lock on mm->mmap_lock.
Adjust the test accordingly.

Also add might_fault() to __bpf_prog_enter_sleepable() to catch such deadlocks early.

Fixes: 1e6c62a88215 ("bpf: Introduce sleepable BPF programs")
Fixes: e68a144547fc ("selftests/bpf: Add sleepable tests")
Reported-by: Yonghong Song <yhs@fb.com>
Change-Id: I9fbc55e149205e27c86105560ceabc162679d0aa
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200831201651.82447-1-alexei.starovoitov@gmail.com
2025-09-08 17:28:01 +03:00
Alexei Starovoitov
9221f9b257 UPSTREAM: bpf: Fix build without BPF_LSM.
resolve_btfids doesn't like empty set. Add unused ID when BPF_LSM is off.

Fixes: 1e6c62a88215 ("bpf: Introduce sleepable BPF programs")
Reported-by: Björn Töpel <bjorn.topel@gmail.com>
Change-Id: Ia9572a7876cdd0bdddc96495b49d1faf9e9729d7
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Song Liu <songliubraving@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200831163132.66521-1-alexei.starovoitov@gmail.com
2025-09-08 17:28:00 +03:00
Alexei Starovoitov
f60488663e UPSTREAM: bpf: Add bpf_copy_from_user() helper.
Sleepable BPF programs can now use copy_from_user() to access user memory.

Change-Id: Ifaf2da741d14dc56b121163d7c486e5c15d1d5a4
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-4-alexei.starovoitov@gmail.com
2025-09-08 17:28:00 +03:00
Alexei Starovoitov
b029758e8a UPSTREAM: bpf: Introduce sleepable BPF programs
Introduce sleepable BPF programs that can request such property for themselves
via BPF_F_SLEEPABLE flag at program load time. In such case they will be able
to use helpers like bpf_copy_from_user() that might sleep. At present only
fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only
when they are attached to kernel functions that are known to allow sleeping.

The non-sleepable programs are relying on implicit rcu_read_lock() and
migrate_disable() to protect life time of programs, maps that they use and
per-cpu kernel structures used to pass info between bpf programs and the
kernel. The sleepable programs cannot be enclosed into rcu_read_lock().
migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
should not be enclosed in migrate_disable() as well. Therefore
rcu_read_lock_trace is used to protect the life time of sleepable progs.

There are many networking and tracing program types. In many cases the
'struct bpf_prog *' pointer itself is rcu protected within some other kernel
data structure and the kernel code is using rcu_dereference() to load that
program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
Instead sleepable bpf programs are allowed with bpf trampoline only. The
program pointers are hard-coded into generated assembly of bpf trampoline and
synchronize_rcu_tasks_trace() is used to protect the life time of the program.
The same trampoline can hold both sleepable and non-sleepable progs.

When rcu_read_lock_trace is held it means that some sleepable bpf program is
running from bpf trampoline. Those programs can use bpf arrays and preallocated
hash/lru maps. These map types are waiting on programs to complete via
synchronize_rcu_tasks_trace();

Updates to trampoline now has to do synchronize_rcu_tasks_trace() and
synchronize_rcu_tasks() to wait for sleepable progs to finish and for
trampoline assembly to finish.

This is the first step of introducing sleepable progs. Eventually dynamically
allocated hash maps can be allowed and networking program types can become
sleepable too.

Change-Id: I281471010348f21de4e0832dfc0265dfe85b4a0d
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
2025-09-08 17:28:00 +03:00
Martin KaFai Lau
6fcd6e4906 UPSTREAM: bpf: Relax max_entries check for most of the inner map types
Most of the maps do not use max_entries during verification time.
Thus, those map_meta_equal() do not need to enforce max_entries
when it is inserted as an inner map during runtime.  The max_entries
check is removed from the default implementation bpf_map_meta_equal().

The prog_array_map and xsk_map are exception.  Its map_gen_lookup
uses max_entries to generate inline lookup code.  Thus, they will
implement its own map_meta_equal() to enforce max_entries.
Since there are only two cases now, the max_entries check
is not refactored and stays in its own .c file.

Change-Id: I5054274bbd6f0944078b5bf52de072077d37aa34
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011813.1970516-1-kafai@fb.com
2025-09-08 17:27:59 +03:00