78 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
b758102651 UPSTREAM: Revert "bpf: Add map and need_defer parameters to .map_fd_put_ptr()"
This reverts commit eb6f68ec92ab60b0540ebf64fe851e99d846e086 which is
commit 20c20bd11a0702ce4dc9300c3da58acf551d9725 upstream.

It breaks the Android kernel abi and can be brought back in the future
in an abi-safe way if it is really needed.

Bug: 161946584
Change-Id: I4611eed3677738ab29469733e2b4f6734ef3d605
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2025-10-02 22:15:11 +08:00
Thomas Gleixner
3deb30fb74 BACKPORT: treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 295
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of version 2 of the gnu general public license as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 64 file(s).

Change-Id: Ic7cca08bbba3c38e0d53d3374c43ee8bf1e24172
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-02 22:14:45 +08:00
Roman Gushchin
1a371f6225 UPSTREAM: bpf: move memory size checks to bpf_map_charge_init()
Most bpf map types do similar checks and a bytes-to-pages
conversion during memory allocation and charging.

Let's unify these checks by moving them into bpf_map_charge_init().

Change-Id: I55ceded2303102feba9e485042e8f5169f490609
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:45 +08:00
Roman Gushchin
632d849a6d UPSTREAM: bpf: rework memlock-based memory accounting for maps
In order to unify the existing memlock charging code with the
memcg-based memory accounting, which will be added later, let's
rework the current scheme.

Currently the following design is used:
  1) .alloc() callback optionally checks if the allocation will likely
     succeed using bpf_map_precharge_memlock()
  2) .alloc() performs actual allocations
  3) .alloc() callback calculates map cost and sets map.memory.pages
  4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
     and performs actual charging; in case of failure the map is
     destroyed
  <map is in use>
  1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
     performs uncharge and releases the user
  2) .map_free() callback releases the memory

The scheme can be simplified and made more robust:
  1) .alloc() calculates map cost and calls bpf_map_charge_init()
  2) bpf_map_charge_init() sets map.memory.user and performs actual
     charge
  3) .alloc() performs actual allocations
  <map is in use>
  1) .map_free() callback releases the memory
  2) bpf_map_charge_finish() performs uncharge and releases the user

The new scheme also allows to reuse bpf_map_charge_init()/finish()
functions for memcg-based accounting. Because charges are performed
before actual allocations and uncharges after freeing the memory,
no bogus memory pressure can be created.

In cases when the map structure is not available (e.g. it's not
created yet, or is already destroyed), on-stack bpf_map_memory
structure is used. The charge can be transferred with the
bpf_map_charge_move() function.

Change-Id: I299bfa9d3e74f366861b6de3bf17951a1374824b
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:44 +08:00
Roman Gushchin
7032f89546 UPSTREAM: bpf: group memory related fields in struct bpf_map_memory
Group "user" and "pages" fields of bpf_map into the bpf_map_memory
structure. Later it can be extended with "memcg" and other related
information.

The main reason for such a change (besides cosmetics) is to pass
the bpf_map_memory structure to the charging functions before the
actual allocation of bpf_map.

Change-Id: I04e4edf805bfe4c26fce45f7166317fe00dd0dfa
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:44 +08:00
Daniel Borkmann
b0c33de60f UPSTREAM: bpf: allow for key-less BTF in array map
Given we'll be reusing BPF array maps for global data/bss/rodata
sections, we need a way to associate BTF DataSec type as its map
value type. In usual cases we have the ugly BPF_ANNOTATE_KV_PAIR()
macro hack, e.g. via 38d5d3b3d5db ("bpf: Introduce BPF_ANNOTATE_KV_PAIR"),
to get the initial map-to-type association going. While more use cases
for it are discouraged, this also won't work for global data, since
the use of an array map is a BPF loader detail and therefore unknown
at compilation time. For array maps with just a single entry we make
an exception in terms of BTF, in that the key type is declared optional
if the value type is of DataSec type. LLVM is guaranteed to emit the
latter, and it also aligns with how we regard global data maps as just
a plain buffer area, reusing existing map facilities to allow things
like introspection with existing tools.

Change-Id: I6fd7e20b453529e07aa1c77beacff4e62c7500bd
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:38 +08:00
Daniel Borkmann
6851cbec41 UPSTREAM: bpf: add program side {rd, wr}only support for maps
This work adds two new map creation flags BPF_F_RDONLY_PROG
and BPF_F_WRONLY_PROG in order to allow for read-only or
write-only BPF maps from a BPF program side.

Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
applies to system call side, meaning the BPF program has full
read/write access to the map as usual while bpf(2) calls with
map fd can either only read or write into the map depending
on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allow
for the exact opposite, such that the verifier is going to reject
program loads if a write into a read-only map or a read into a
write-only map is detected. For the read-only map case, some
helpers that would alter the map state, such as map deletion and
update, are also forbidden for programs. As opposed to the two
BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
as BPF_F_WRONLY_PROG really do apply over the whole map lifetime.

We've enabled this generic map extension to various non-special
maps holding normal user data: array, hash, lru, lpm, local
storage, queue and stack. Further generic map types could be
followed up in future depending on use-case. Main use case
here is to forbid writes into .rodata map values from verifier
side.

Change-Id: Iad96790cec92137902fe3ad12f53f1a94d58bc61
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:37 +08:00
Daniel Borkmann
5489474293 BACKPORT: bpf: implement lookup-free direct value access for maps
This generic extension to BPF maps allows for directly loading
an address residing inside a BPF map value as a single BPF
ldimm64 instruction!

The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
is a special src_reg flag for ldimm64 instruction that indicates
that inside the first part of the double insns's imm field is a
file descriptor which the verifier then replaces as a full 64bit
address of the map into both imm parts. For the newly added
BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
the first part of the double insns's imm field is again a file
descriptor corresponding to the map, and the second part of the
imm field is an offset into the value. The verifier will then
replace both imm parts with an address that points into the BPF
map value at the given value offset for maps that support this
operation. Currently supported is the array map with a single entry.
It is possible to support more than just a single map element by
reusing both 16-bit off fields of the insns as a map index, so a
full array map lookup could be expressed that way. It hasn't
been implemented here due to lack of a concrete use case, but it
could easily be done in the future in a compatible way, since
both off fields right now have to be 0 and would correctly
denote a map index of 0.

The BPF_PSEUDO_MAP_VALUE is a distinct flag because otherwise, with
BPF_PSEUDO_MAP_FD, we could not distinguish at offset 0 between a load
of the map pointer and a load of the map's value at offset 0, and
changing BPF_PSEUDO_MAP_FD's encoding into off by one to differ between
a regular map pointer and a map value pointer would add unnecessary
complexity and raise the barrier for debuggability, thus making it less
suitable. Using the second part of the imm field as an offset
into the value does /not/ come with limitations, since the maximum
possible value size is in the u32 universe anyway.

This optimization allows for efficiently retrieving an address
to a map value memory area without having to issue a helper call
which needs to prepare registers according to calling convention,
etc, without needing the extra NULL test, and without having to
add the offset in an additional instruction to the value base
pointer. The verifier then treats the destination register as
PTR_TO_MAP_VALUE with constant reg->off from the user passed
offset from the second imm field, and guarantees that this is
within bounds of the map value. Any subsequent operations are
normally treated as typical map value handling without anything
extra needed from verification side.

The two map operations for direct value access have been added to
array map for now. In future other types could be supported as
well depending on the use case. The main use case for this commit
is to allow for BPF loader support for global variables that
reside in .data/.rodata/.bss sections such that we can directly
load the address of them with minimal additional infrastructure
required. Loader support has been added in subsequent commits for
libbpf library.

Change-Id: I51974f2fe227ba837b338b8b3ebb44c145583673
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:36 +08:00
Alexei Starovoitov
eb322f919d UPSTREAM: bpf: introduce BPF_F_LOCK flag
Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands
and for map_update() helper function.
In all these cases take a lock of existing element (which was provided
in BTF description) before copying (in or out) the rest of map value.

Implementation details that are part of uapi:

Array:
The array map takes the element lock for lookup/update.

Hash:
The hash map also takes the lock for lookup/update and tries to avoid the bucket lock.
If an old element exists, it takes the element lock and updates the element in place.
If the element doesn't exist, it allocates a new one and inserts it into the hash table
while holding the bucket lock.
In rare cases the hashmap has to take both the bucket lock and the element lock
to update an old value in place.

Cgroup local storage:
It is similar to array: update in place and lookup are done with the lock taken.

Change-Id: I76b13e23e1f6241c1f919a1c24650530f7705d9e
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:14:32 +08:00
Alexei Starovoitov
eb0dfde540 BACKPORT: bpf: introduce bpf_spin_lock
Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
bpf program serialize access to other variables.

Example:
struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
};
struct hash_elem *val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
}

Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
  cause deadlocks.
- only one 'struct bpf_spin_lock' is allowed per map element.
  It drastically simplifies implementation yet allows bpf program to use
  any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via
  bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
  a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
  Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
  bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
  insufficient preemption checks

Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
  Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
  Other solutions to avoid nested bpf_spin_lock are possible.
  Like making sure that all networking progs run with softirq disabled.
  spin_lock_irqsave is the simplest and doesn't add overhead to the
  programs that don't use it.
- arch_spinlock_t is used when it's implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
  sizeof(arch_spinlock_t) != sizeof(__u32), a trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
  extra flag during map_create, but specifying it via BTF is cleaner.
  It provides introspection for map key/value and reduces user mistakes.

Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
  to request kernel to grab bpf_spin_lock before rewriting the value.
  That will serialize access to map elements.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Change-Id: Id03322189a8f05c006a05479f7078b23c8c020ea
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:14:32 +08:00
Roman Gushchin
7a056468d8 UPSTREAM: bpf: pass struct btf pointer to the map_check_btf() callback
If key_type or value_type are non-trivial data types
(e.g. a structure or typedef), it's not possible to check them without
additional information, which can't be obtained without a pointer
to the btf structure.

So, let's pass the btf pointer to the map_check_btf() callbacks.

Change-Id: I95716060b450288d4ffcbe231d1cf5fdb530e292
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:28 +08:00
Prashant Bhole
b61bfd4593 UPSTREAM: bpf: return EOPNOTSUPP when map lookup isn't supported
Return ERR_PTR(-EOPNOTSUPP) from map_lookup_elem() methods of below
map types:
- BPF_MAP_TYPE_PROG_ARRAY
- BPF_MAP_TYPE_STACK_TRACE
- BPF_MAP_TYPE_XSKMAP
- BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH

Change-Id: I13937c36055b419f4446d8bfa06f139c757480c9
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:21 +08:00
Yonghong Song
84ace6305a UPSTREAM: bpf: add bpffs pretty print for program array map
Added bpffs pretty print for the program array map. For a particular
array index, if the program array points to a valid program,
"<index>: <prog_id>" will be printed out, like
   0: 6
which means the bpf program with id "6" is installed at index "0".

Change-Id: Ibfeac1777df6dc8742debe574ba259d212e7ecea
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-02 22:14:19 +08:00
Yonghong Song
c71bbb5912 UPSTREAM: bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash
Added bpffs pretty print for percpu arraymap, percpu hashmap
and percpu lru hashmap.

For each map <key, value> pair, the format is:
   <key_value>: {
	cpu0: <value_on_cpu0>
	cpu1: <value_on_cpu1>
	...
	cpun: <value_on_cpun>
   }

For example, on my VM, there are 4 cpus, and
for test_btf test in the next patch:
   cat /sys/fs/bpf/pprint_test_percpu_hash

You may get:
   ...
   43602: {
	cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
	cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
	cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
	cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
   }
   72847: {
	cpu0: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
	cpu1: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
	cpu2: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
	cpu3: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
   }
   ...

Change-Id: I286e7505765aa92ea9a8919ddecf8434a24fc187
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:14:19 +08:00
Tao Chen
aac2e51438 UPSTREAM: bpf: Check percpu map value size first
[ Upstream commit 1d244784be6b01162b732a5a7d637dfc024c3203 ]

The percpu map is often used, but its map value size limit is often
ignored, as in this issue: https://github.com/iovisor/bcc/issues/2519.
Actually, the percpu map value size is bounded by PCPU_MIN_UNIT_SIZE, so
we can first check whether the value size exceeds PCPU_MIN_UNIT_SIZE,
like the percpu map of local_storage does. The resulting error message
is clearer than "cannot allocate memory".

Signed-off-by: Jinke Han <jinkehan@didiglobal.com>
Signed-off-by: Tao Chen <chen.dylane@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240910144111.1464912-2-chen.dylane@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-02 22:14:18 +08:00
Daniel Borkmann
716ee03915 UPSTREAM: bpf: decouple btf from seq bpf fs dump and enable more maps
Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru map. However, both
can be decoupled from each other such that regular BPF maps
can be supported for attaching BTF key/value information,
while not all maps necessarily need to dump via map_seq_show_elem()
callback.

The basic sanity check which is a prerequisite for all maps
is that key/value size has to match in any case, and some maps
can have extra checks via map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that we can also enable retrieving BTF info for per-cpu map
types and lpm.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
2025-10-02 22:14:02 +08:00
Martin KaFai Lau
54514c7d06 BACKPORT: bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.

To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision.  In this case, decide which
SO_REUSEPORT sk to serve the incoming request.

By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map.  That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).

For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map.  The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information.  This connection info contract/API stays within the
userspace.

Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process.  The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds.  [Note that
deleting a fd from a bpf map does not necessarily mean the fd is closed]

During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
ensured in "reuseport_array_update_check()".

A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map).  SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.

When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also.  It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.

The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]".  Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests, and that
freeing the map does not require the reuseport_lock.

The reuseport_array will also support lookup from the syscall
side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).

The lookup cookie is 64 bits, but it goes against the logical userspace
expectation of a 32-bit sizeof(fd) (as other fd-based bpf maps also use).
It may catch users by surprise if we enforce value_size=8 while
userspace still passes a 32-bit fd during update.  Supporting different
value_size between lookup and update seems unintuitive as well.

We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assumes users will usually use value_size=4.  The syscall's lookup
will return ENOSPC on value_size=4.  It will only
return the 64-bit value from sock_gen_cookie() when the user consciously
chooses value_size=8 (as a signal that lookup is desired), which then
requires a 64-bit value in both lookup and update.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:14:01 +08:00
Martin KaFai Lau
e3ecf4c219 BACKPORT: bpf: btf: Use exact btf value_size match in map_check_btf()
The current map_check_btf() in BPF_MAP_TYPE_ARRAY rejects
'> map->value_size' to ensure map_seq_show_elem() will not
access things beyond an array element.

Yonghong suggested that using '!=' is a more correct
check.  The 8 bytes round_up on value_size is stored
in array->elem_size.  Hence, using '!=' on map->value_size
is a proper check.

This patch also adds new tests to check the btf array
key type and value type.  Two of these new tests verify
the btf's value_size (the change in this patch).

It also fixes two existing tests that wrongly encoded
a btf type size (pprint_test) and a value_type_id (in one
of the raw_tests[]).  However, those mistakes did not affect these two
BTF verification tests before or after this patch's changes.
Those two tests mainly failed at array creation time after
this patch.

Fixes: a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap")
Suggested-by: Yonghong Song <yhs@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:13:59 +08:00
Martin KaFai Lau
7f82a95bf2 UPSTREAM: bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info
In "struct bpf_map_info", the names "btf_id", "btf_key_id" and "btf_value_id"
could cause confusion because the "id" in "btf_id" means the BPF obj id
given to the BTF object, while
"btf_key_id" and "btf_value_id" mean the BTF type id within
that BTF object.

To make it clear, btf_key_id and btf_value_id are
renamed to btf_key_type_id and btf_value_type_id.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:13:20 +08:00
Martin KaFai Lau
a469002199 UPSTREAM: bpf: btf: Add pretty print support to the basic arraymap
This patch adds pretty print support to the basic arraymap.
Support for other bpf maps can be added later.

This patch adds new attrs to the BPF_MAP_CREATE command to allow
specifying the btf_fd, btf_key_id and btf_value_id.  The
BPF_MAP_CREATE command can then associate the btf with the map if
the map being created supports BTF.

A BTF supported map needs to implement two new map ops,
map_seq_show_elem() and map_check_btf().  This patch has
implemented these new map ops for the basic arraymap.

It also adds file_operations, bpffs_map_fops, to the pinned
map such that the pinned map can be opened and read.
After that, the user has an intuitive way to do
"cat bpffs/pathto/a-pinned-map" instead of getting
an error.

bpffs_map_fops should not be extended further to support
other operations.  Other operations (e.g. write/key-lookup...)
should be realized by the userspace tools (e.g. bpftool) through
the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
Follow up patches will allow the userspace to obtain
the BTF from a map-fd.

Here is a sample output when reading a pinned arraymap
with the following map's value:

struct map_value {
	int count_a;
	int count_b;
};

cat /sys/fs/bpf/pinned_array_map:

0: {1,2}
1: {3,4}
2: {5,6}
...

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:12:58 +08:00
Jakub Kicinski
9dffd994d6 BACKPORT: bpf: arraymap: use bpf_map_init_from_attr()
Arraymap was not converted to use bpf_map_init_from_attr()
to avoid merge conflicts with emergency fixes.  Do it now.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:12:47 +08:00
Jakub Kicinski
0daa92a890 UPSTREAM: bpf: arraymap: move checks out of alloc function
Use the new callback to perform allocation checks for array maps.
The fd maps don't need a special allocation callback, they only
need a special check callback.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-10-02 22:12:47 +08:00
Yonghong Song
456de77985 BACKPORT: bpf: perf event change needed for subsequent bpf helpers
This patch does not impact existing functionalities.
It contains the changes in perf event area needed for
subsequent bpf_perf_event_read_value and
bpf_perf_prog_read_value helpers.

Change-Id: I066312fce9ebb0185b02ce6904e057d728473f90
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2025-10-02 22:12:32 +08:00
Tim Zimmermann
3996f04715 Squashed revert of BPF backports
Revert "Partially revert "fixup: add back code missed during BPF picking""

This reverts commit cc477455f73d317733850a9e4818dfd90be4d33d.

Revert "bpf: lpm_trie: check left child of last leftmost node for NULL"

This reverts commit e89007b7df49292c5ae52b3d165c0d815a61cd10.

Revert "BACKPORT: bpf: Fix out-of-bounds write in trie_get_next_key()"

This reverts commit a1c4f565bb00b05ab3734a64451c08b0b965ce42.

Revert "bpf: Fix exact match conditions in trie_get_next_key()"

This reverts commit 4356a64dad3d38372147457b3004930c6e2e9c51.

Revert "bpf: fix kernel page fault in lpm map trie_get_next_key"

This reverts commit df4649b5d6cb374edbb67e5a5ecbd102a2e6c897.

Revert "bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map"

This reverts commit fe6656a5d48df6144fe9929399c648957166edd0.

Revert "bpf: allow helpers to return PTR_TO_SOCK_COMMON"

This reverts commit b24d1ae9ccbf3ebe6f4baa50d2d48c03be02bc17.

Revert "bpf: implement lookup-free direct value access for maps"

This reverts commit de1959fcd3df0629380894d9c47ebb253c920ad1.

Revert "bpf: Add bpf_verifier_vlog() and bpf_verifier_log_needed()"

This reverts commit b777824607bd3eb8c9130f4639d97d15bcac9af5.

Revert "bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE"

This reverts commit 4cfef728c1eac6cce34f4fff1fbab3e66dc430d9.

Revert "bpf: always allocate at least 16 bytes for setsockopt hook"

This reverts commit 59817f83c964c753e93a75128ecaad4eeaa769fc.

Revert "bpf, sockmap: convert to generic sk_msg interface"

This reverts commit fe4ef742e22924b21749de333211941d0205501e.

Revert "bpf: sockmap, convert bpf_compute_data_pointers to bpf_*_sk_skb"

This reverts commit d17c8c2c2f623e087d6c297de50c173a006e6e55.

Revert "bpf: sockmap: fix typos"

This reverts commit 07e31378d7795371cdbccce06b4125b27ffce536.

Revert "sockmap: convert refcnt to an atomic refcnt"

This reverts commit c1fa11ec9da5dc0e8cae4334c550264cff77eef9.

Revert "bpf: sockmap, add hash map support"

This reverts commit 3f43379c38e329e9a7d4b5a1640670de37ba317b.

Revert "bpf: sockmap, refactor sockmap routines to work with hashmap"

This reverts commit 41a2b6e925db031978eb2484835f60908de884d7.

Revert "bpf: implement getsockopt and setsockopt hooks"

This reverts commit 9526fe6ff3e06939c12bb781e0dda01a8f3017ec.

Revert "bpf: Introduce bpf sk local storage"

This reverts commit ffedc38a46ddaca40de672fafe78c45fbfae9839.

Revert "bpf: introduce BPF_F_LOCK flag"

This reverts commit e7f5758fbcb1674e17c645837f7bff3b1febbad5.

Revert "bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types"

This reverts commit e29b4e3c2bdd3b5d0d34668836ae8e5115cb31af.

Revert "bpf/verifier: add ARG_PTR_TO_UNINIT_MAP_VALUE"

This reverts commit f25c66c27cd6a774fb73769d804f91e969dd5f7b.

Revert "bpf: allow map helpers access to map values directly"

This reverts commit 7af696635219d0c5cdf1a166bb7543cae9e50328.

Revert "bpf: add writable context for raw tracepoints"

This reverts commit a546d8f0433039cee0de6ce96d5d35c4033a7b98.

Revert "bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock"

This reverts commit 03093478c52e79c94791a04f8138d5c019119087.

Revert "bpf: Support socket lookup in CGROUP_SOCK_ADDR progs"

This reverts commit 8047013945361fbff0e449c8a212cb6fc93a5245.

Revert "bpf: Extend the sk_lookup() helper to XDP hookpoint."

This reverts commit 8315368983086e70ccc6f103d710903c63cca7df.

Revert "xdp: generic XDP handling of xdp_rxq_info"

This reverts commit 11d9514e6e6801941abf1c0485fd4ef53082d970.

Revert "xdp: move struct xdp_buff from filter.h to xdp.h"

This reverts commit a1795f54e4d99e02d5cb84a46fac0240cf29e206.

Revert "net: avoid including xdp.h in filter.h"

This reverts commit a39c59398f3ab64de44e5953ee0bd23c5136bb48.

Revert "xdp: base API for new XDP rx-queue info concept"

This reverts commit 49fb5bae77ab2041a2ad9f9f87ad7e0a6e215fdf.

Revert "net: Add asynchronous callbacks for xfrm on layer 2."

This reverts commit d0656f64d7719993d5634a9fc6600026e9a805ee.

Revert "xfrm: Separate ESP handling from segmentation for GRO packets."

This reverts commit c8afadf7f5ed8786652d307558345ef90ea91726.

Revert "net: move secpath_exist helper to sk_buff.h"

This reverts commit 0e5483057121dad47567b01845c656955e51989e.

Revert "sk_buff: add skb extension infrastructure"

This reverts commit 3a9ae74b075757495c4becf4dd1eec056d364801.

Revert "fixup: add back code missed during BPF picking"

This reverts commit 74ec8cef7051b5af72f2a6d83ca8c51c3c61c444.

Revert "bpf: undo prog rejection on read-only lock failure"

This reverts commit af2dc6e4993c4221603dbe6e81a3d0c8269f3171.

Revert "bpf: Add helper to retrieve socket in BPF"

This reverts commit 53495e3bc33cb46d9961ea122f576faded058aa1.

Revert "SQUASH! bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helpe"

This reverts commit 3b25fbf81c041af954d9f5ac1c7867eb07c40b07.

Revert "bpf: introduce bpf_spin_lock"

This reverts commit 0095fb54160e4f8b326fa8df103e334f90c5ab56.

Revert "bpf: enable cgroup local storage map pretty print with kind_flag"

This reverts commit 3fe92cb79b5eae557b113c37b03e78efee2280db.

Revert "bpf: btf: fix struct/union/fwd types with kind_flag"

This reverts commit 2bd4856277f459974dd6234a849cbe20fd475b8f.

Revert "bpf: add bpffs pretty print for cgroup local storage maps"

This reverts commit e07d8c8279f37cee8471846a63acc51f1ab7ce03.

Revert "bpf: pass struct btf pointer to the map_check_btf() callback"

This reverts commit 78a8140faf32710799c19495db28d71693c98030.

Revert "bpf: Define cgroup_bpf_enabled for CONFIG_CGROUP_BPF=n"

This reverts commit aada945d89950c67099e490af1c4c25eef7f31e6.

Revert "bpf: introduce per-cpu cgroup local storage"

This reverts commit d37432968663559f06c7fd7df44197a807fb84ca.

Revert "bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info"

This reverts commit 063c5a25e5f47e8b82b6c43a44ed7be851884abb.

Revert "bpf: fix a compilation error when CONFIG_BPF_SYSCALL is not defined"

This reverts commit bcf5bfaf50bb6f1f981d5c538f87e6da7aab78f2.

Revert "bpf: Create a new btf_name_by_offset() for non type name use case"

This reverts commit 52b4739d0bdd763e1b00feb50bef8a821f5c7570.

Revert "bpf: reject any prog that failed read-only lock"

This reverts commit 30d1bfec06a3bcaa773213113904580e3046a57a.

Revert "bpf: Add bpf_line_info support"

This reverts commit 50b094eeeb1ced32c62b3a10045bbf43126de760.

Revert "bpf: don't leave partial mangled prog in jit_subprogs error path"

This reverts commit a466f85be89f5daab4bd748f92915ea713d63934.

Revert "bpf: btf: support proper non-jit func info"

This reverts commit 492a556de94c502376ec3b0d5a724ec9fe9f6996.

Revert "bpf: Introduce bpf_func_info"

This reverts commit 39cade88686b0d9b7befc1f14e9d2c2cad19a769.

Revert "bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO"

This reverts commit 2010b6bacc271a48e74942506f3cf45268b6c264.

Revert "bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv"

This reverts commit a0ea14ac88a0f5529a635fc6e20277942fc6bb99.

Revert "bpf: Expose check_uarg_tail_zero()"

This reverts commit 1190aaae686534c2854838b3d642dac45d26b1f4.

Revert "bpf: Append prog->aux->name in bpf_get_prog_name()"

This reverts commit 8b82528df4a11a8501393c854978662fc218014e.

Revert "bpf: get JITed image lengths of functions via syscall"

This reverts commit 0722dbc626915fcb9acb952ebc1fcb0c4554cb07.

Revert "bpf: get kernel symbol addresses via syscall"

This reverts commit 6736ec7558dd262fef6669eec02a9797c7c4ecb7.

Revert "bpf: Add gpl_compatible flag to struct bpf_prog_info"

This reverts commit b60c7a51fd3692259c93413f3e87150078be1dac.

Revert "bpf: centre subprog information fields"

This reverts commit b5186fdf6f3e1bb38d7e4abfed5bf7dd6f85a6c3.

Revert "bpf: unify main prog and subprog"

This reverts commit e8e2ad5d9ae98bc7b85b99c0712a5dfbfc151a41.

Revert "bpf: fix maximum stack depth tracking logic"

This reverts commit 10c7127615dc2c00b724069a1620b2232d905113.

Revert "bpf, x64: fix memleak when not converging on calls"

This reverts commit 6bc867f718ef2656266f984b605151971026cc98.

Revert "bpf: decouple btf from seq bpf fs dump and enable more maps"

This reverts commit 3036e2c4384d3f43c695b88c8a1cf97b8337e3bd.

Revert "bpf: Add reference tracking to verifier"

This reverts commit 3a4900a188ac4de817dc6f114f01159d7bdd2f3e.

Revert "bpf: properly enforce index mask to prevent out-of-bounds speculation"

This reverts commit ef85925d5c07b46f7447487605da601fc7be026e.

Revert "bpf, verifier: detect misconfigured mem, size argument pair"

This reverts commit c3853ee3cb96833e907f18bf90e78040fe4cf06f.

Revert "bpf: introduce ARG_PTR_TO_MEM_OR_NULL"

This reverts commit 58560e13f545f2a079bbce17ac1b731d8b94fec7.

Revert "bpf: Macrofy stack state copy"

This reverts commit 88d98d8c2ae320ab248150eb86e1c89427e5017c.

Revert "bpf: Generalize ptr_or_null regs check"

This reverts commit d2cbc2e57b8624699a1548e67b7b3ce992b396fc.

Revert "bpf: Add iterator for spilled registers"

This reverts commit d956e1ba51a7e5ce86bb35002e26d4c1e0a2497c.

Revert "bpf/verifier: refine retval R0 state for bpf_get_stack helper"

This reverts commit ceaf6d678ccb60da107b0455da64c7bf90c5102d.

Revert "bpf: Remove struct bpf_verifier_env argument from print_bpf_insn"

This reverts commit 058fd54c07a289f9b506f2d2326434e411fa65fe.

Revert "bpf: annotate bpf_insn_print_t with __printf"

This reverts commit 9b07d2ccf07855d62446e274d817672713f15be4.

Revert "bpf: allow for correlation of maps and helpers in dump"

This reverts commit af690c2e2d177352f7270f77d8a6bc9e9f60c98c.

Revert "bpf: Add bpf_patch_call_args prototype to include/linux/bpf.h"

This reverts commit 8a2c588b3ab98916147fe4a449312ce8db70c471.

Revert "bpf: x64: add JIT support for multi-function programs"

This reverts commit 752f261e545f80942272c6becf82def1729f84be.

Revert "bpf: fix net.core.bpf_jit_enable race"

This reverts commit 4720901114c20204aa3ffa2076265d2c8cc9e81b.

Revert "bpf: add support for bpf_call to interpreter"

This reverts commit c79b2e547adc8e50dabc72244370cfd37ac6a6bd.

Revert "bpf: introduce function calls (verification)"

This reverts commit f779fda96c7d9e921525f48d67fa2e9c68b4bd48.

Revert "bpf: cleanup register_is_null()"

This reverts commit 1c81f751670b4feb3102e4de136e25fa24e303fe.

Revert "bpf: print liveness info to verifier log"

This reverts commit fdc851301b33b9d646bd1d37124cbd45cedcd62b.

Revert "bpf: also improve pattern matches for meta access"

This reverts commit 9aa150d07927b911f26e0db2af0efd6aa07b8707.

Revert "bpf: add meta pointer for direct access"

This reverts commit 94f3f502ef9ef150ed687113cfbd38e91b5edc44.

Revert "bpf: rename bpf_compute_data_end into bpf_compute_data_pointers"

This reverts commit 9573c6feb301346cd1493eea4e363c6d8345e899.

Revert "bpf: squash of log related commits"

This reverts commit b08f2111e030a72a92eec4ebd6201165d03a20b8.

Revert "bpf: move instruction printing into a separate file"

This reverts commit 8fcbd39afb58847914f3f84d9c076000e09d2fb9.

Revert "bpf: btf: Introduce BTF ID"

This reverts commit 423c40d67dfc783c3b0cb227d9da53e725e0f35c.

Revert "bpf: btf: Add pretty print support to the basic arraymap"

This reverts commit 6cd4d5bba662ca0d8980e5806ef37e0341eab929.

Revert "nsfs: clean-up ns_get_path() signature to return int"

This reverts commit ec1ce41701f411c5dee396cec2931fb651f447cc.

Revert "bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()"

This reverts commit 8fbcb4ebf5a751f4685cdd2757cff2264032a5d9.

Revert "bpf: offload: report device information about offloaded maps"

This reverts commit 1105e63f25a9db675671288b583a5ce2c7d10b1f.

Revert "bpf: offload: add map offload infrastructure"

This reverts commit 20cdf9df3d5bd010d799ea3c80219f625c998307.

Revert "bpf: add map_alloc_check callback"

This reverts commit 6feb4121ea083053ac9587ac426195efe9fb143d.

Revert "bpf: offload: factor out netdev checking at allocation time"

This reverts commit 1425fb5676b8fe9d761f2f6545e4be8880ce0ac8.

Revert "bpf: rename bpf_dev_offload -> bpf_prog_offload"

This reverts commit a03ae0ec508200433fd6c35b87e342df4de0b320.

Revert "bpf: offload: allow netdev to disappear while verifier is running"

This reverts commit f6cf7214fd1ff3a018009ba90c33eac1d8de21de.

Revert "bpf: offload: free program id when device disappears"

This reverts commit b12b5e56b799cfe900ab8f0ee4177c6c08a904c6.

Revert "bpf: offload: report device information for offloaded programs"

This reverts commit c73c9a0ffa332eeb49927a48780f5537597e2d42.

Revert "bpf: offload: don't require rtnl for dev list manipulation"

This reverts commit 1993f08662f07581a370899a2da209ba0c996dbb.

Revert "bpf: offload: ignore namespace moves"

This reverts commit 9fefb21d8aa2691019f9c4f0b8025fb45ba60b49.

Revert "bpf: Add PTR_TO_SOCKET verifier type"

This reverts commit 55fdbc844801cd4007237fa6c5842b46985a5c9a.

Revert "bpf: extend cgroup bpf core to allow multiple cgroup storage types"

This reverts commit a6d82e371ef32fb24d493cff32765b4607581dd4.

Revert "bpf: permit CGROUP_DEVICE programs accessing helper bpf_get_current_cgroup_id()"

This reverts commit 1bfd0a07a8317004a89d6de736e24861db8281b5.

Revert "bpf: implement bpf_get_current_cgroup_id() helper"

This reverts commit 23603ed6d7df86392701a7ea7d9a1dba66f28d4b.

Revert "bpf: introduce the bpf_get_local_storage() helper function"

This reverts commit 3d777256b1c9f34975c5230d836023ea3e0d4cfd.

Revert "bpf/verifier: introduce BPF_PTR_TO_MAP_VALUE"

This reverts commit 93c12733dc97984f7bf57a77160eacc480bfc3de.

Revert "bpf: extend bpf_prog_array to store pointers to the cgroup storage"

This reverts commit b26baff1fb34607938c9ac0e421e3f4b5fedad4d.

Revert "BACKPORT: bpf: allocate cgroup storage entries on attaching bpf programs"

This reverts commit 804605c21a3be3277c0031504dcd3fdd1be64290.

Revert "bpf: include errno.h from bpf-cgroup.h"

This reverts commit 6b4df332b357e9a5942ca4c6f985cd33dfc30e25.

Revert "bpf: pass a pointer to a cgroup storage using pcpu variable"

This reverts commit c8af92dc9fc00e49f06f6997969284ef5e5c5af5.

Revert "bpf: introduce cgroup storage maps"

This reverts commit c61c2271cb8a1e47678bddc8cdfae83035a07fec.

Revert "bpf: add ability to charge bpf maps memory dynamically"

This reverts commit 3a430745e9f675b450477fffead5568046432f29.

Revert "bpf: add helper for copying attrs to struct bpf_map"

This reverts commit 6d7be0ae93371692e564c00003ce184cbaefbb8d.

Revert "bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP"

This reverts commit 15f584d2d3d4814cfbd3059ab810db02af8773a0.

Revert "bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog"

This reverts commit fc9bf5e48985f7c3a39bf34a27477a2607a5dc6d.

Revert "bpf: set maximum number of attached progs to 64 for a single perf tp"

This reverts commit 0d5fc9795d824fbca21b81c8d91748ba21313d4c.

Revert "bpf: avoid rcu_dereference inside bpf_event_mutex lock region"

This reverts commit 948e200e3173dd959de907e326f2a2c90eda4b28.

Revert "bpf: fix bpf_prog_array_copy_to_user() issues"

This reverts commit 66811698b8de9b3cf13c09730d287b6d1d5d3699.

Revert "bpf: fix pointer offsets in context for 32 bit"

This reverts commit 99661813c136c52e56b328a2a8ecd2bc0e187eba.

Revert "BACKPORT: bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data"

This reverts commit 36f0ea00dd121b13f80617e5b2eb93ba160df85a.

Revert "BACKPORT: bpf: Sysctl hook"

This reverts commit 4a543990e03b5de4a2c23777abd0f77afd61cc2d.

Revert "BACKPORT: flow_dissector: implements flow dissector BPF hook"

This reverts commit de610a8a4324170a0deaf12e2e64c2ff068785fb.

Revert "BACKPORT: bpf: Add base proto function for cgroup-bpf programs"

This reverts commit f3ac0a6cbec3472ff2e3808a436891881f3cbf87.

Revert "FROMLIST: [net-next,v2,1/2] bpf: Allow CGROUP_SKB eBPF program to access sk_buff"

This reverts commit 6d4dcc0e3de628003d91075e4b1ab1a128b8892e.

Revert "BACKPORT: bpf: introduce BPF_RAW_TRACEPOINT"

This reverts commit b2a5c6b4958c8250e58ddb6c334018a5f7ee5437.

Revert "bpf/tracing: fix kernel/events/core.c compilation error"

This reverts commit 70249d4eb7359e9dc59e044951beb99d0d8725cd.

Revert "BACKPORT: bpf/tracing: allow user space to query prog array on the same tp"

This reverts commit 08a6d8c01372940bfec78fdc6cb8a47e08c745b0.

Revert "bpf: sockmap, add sock close() hook to remove socks"

This reverts commit e6b363b8d09d9740dff309fb4dc88e7a1e90726b.

Revert "BACKPORT: bpf: remove the verifier ops from program structure"

This reverts commit 94c2f61efa741bf6a97415f42cfbfb9ec83dfd8e.

Revert "bpf, cgroup: implement eBPF-based device controller for cgroup v2"

This reverts commit 22faa9c56550a34488e607ca3aca59c68b1f7938.

Revert "BACKPORT: bpf: split verifier and program ops"

This reverts commit d2b1388504c1129d5756bb9b20af9bd64e75d015.

Revert "bpf: btf: Break up btf_type_is_void()"

This reverts commit 052989c47b68feaf381d371ec1e6a169edc26d30.

Revert "bpf: btf: refactor btf_int_bits_seq_show()"

This reverts commit 8cc3fb30656cfab91205194a8ee7661bdd95e005.

Revert "BACKPORT: bpf: fix unconnected udp hooks"

This reverts commit b108e725aa70e39cfd37296d1a1d31e8896fa7b7.

Revert "BACKPORT: bpf: enforce return code for cgroup-bpf programs"

This reverts commit 10215080915bfbdaa9f666a95ffda02cc1ef7a29.

Revert "bpf: Hooks for sys_sendmsg"

This reverts commit cd847db1be8a37e0e7e9c813b5d8f93697dc5af0.

Revert "BACKPORT: devmap: Allow map lookups from eBPF"

This reverts commit 37da95fde647e8967b362e0769136bfbebc03628.

Revert "BACKPORT: xdp: Add devmap_hash map type for looking up devices by hashed index"

This reverts commit ae6a87f44c4ef20ac290ce68c4d5b542cf46f3d7.

Revert "kernel: bpf: devmap: Create __dev_map_alloc_node"

This reverts commit 15928a97ed93cf9f606a21bf869ff421b997a2c5.

Revert "BACKPORT: bpf: Post-hooks for sys_bind"

This reverts commit c221d44e76c3ab69285c9986680e5eb726cf157b.

Revert "BACKPORT: bpf: Hooks for sys_connect"

This reverts commit 003311ea43163c77e4e0c1921b81438286925baa.

Revert "BACKPORT: net: Introduce __inet_bind() and __inet6_bind"

This reverts commit 74f1eb60012c13bd606e4dc718e63aec7f8cce8f.

Revert "BACKPORT: bpf: Hooks for sys_bind"

This reverts commit cef0bd97f2fec8363c3ef58b2cb508deaa9bc5b2.

Revert "BACKPORT: bpf: introduce BPF_PROG_QUERY command"

This reverts commit a4ef81ce48cb25843ddb4d4331dacf2742215909.

Revert "BACKPORT: bpf: Check attach type at prog load time"

This reverts commit 750a3f976c75797e572a6dfdd2e8865b8b49964a.

Revert "bpf: offload: rename the ifindex field"

This reverts commit 921e6becfb28fbe505603bf927f195d1d72a0eea.

Revert "BACKPORT: bpf: offload: add infrastructure for loading programs for a specific netdev"

This reverts commit cb1607a58d026a4ac1d9e71f6c3cd1dc23820e2f.

Revert "BACKPORT: net: bpf: rename ndo_xdp to ndo_bpf"

This reverts commit 932d47ebc5910bb1ec954002206b1ce8749a9cd6.

Revert "bpf: btf: fix truncated last_member_type_id in btf_struct_resolve"

This reverts commit e7af669fe00a8e2030913088836189a9f65a04d8.

Revert "bpf/btf: Fix BTF verification of enum members in struct/union"

This reverts commit a098516b98fe35e8f0e89709443fff8b37eb04b8.

Revert "bpf: fix BTF limits"

This reverts commit 794ad07fab9540989f96351c11b039e2229c2a8e.

Revert "bpf, btf: fix a missing check bug in btf_parse"

This reverts commit 27c4178ecc8edbb2306fa479f275ffd35f5b57c9.

Revert "bpf: btf: Fix a missing check bug"

This reverts commit 71f5a7d140aa5a37d164e217b2fefcb2d409b894.

Revert "bpf: btf: Fix end boundary calculation for type section"

This reverts commit 549615befd671b6877677acb009b66cd374408d3.

Revert "bpf: fix bpf_skb_load_bytes_relative pkt length check"

This reverts commit 5f3d68c4da18dfbcde4c02cb34c63599709fcf3c.

Revert "bpf: btf: Ensure the member->offset is in the right order"

This reverts commit 4f9d26cbc747a4728c4944b7dc9725fc2737f892.

Revert "bpf: btf: Clean up BTF_INT_BITS() in uapi btf.h"

This reverts commit 480c6f80a14431f6d680a687363dcb0d9cd1d7a8.

Revert "bpf: btf: Fix bitfield extraction for big endian"

This reverts commit 0463c259aa21e99d1bf798c8cf54da18b5906938.

Revert "bpf: btf: Ensure t->type == 0 for BTF_KIND_FWD"

This reverts commit ecc54be6970a3484eb163ac09996856c9ece5727.

Revert "bpf: btf: Check array t->size"

This reverts commit 3cda848b9be9fbb6dfa8912a425801c263bcbff7.

Revert "bpf: btf: avoid -Wreturn-type warning"

This reverts commit fd7fede5952004dcacb39f318249c4cf8e5c51e0.

Revert "bpf: btf: Avoid variable length array"

This reverts commit 2826641eb171c705d0b2db86d8834eff33945d0e.

Revert "bpf: btf: Remove unused bits from uapi/linux/btf.h"

This reverts commit 2d9e7a574f7e47a027974ec616ac812ad6a2d086.

Revert "bpf: btf: Check array->index_type"

This reverts commit f9ee68f7e8a471450536a70b43bd96d4bdfbfb81.

Revert "bpf: btf: Change how section is supported in btf_header"

This reverts commit 63a4474da4bf56c8a700d542bcf3a57a4b737ed6.

Revert "bpf: Fix compiler warning on info.map_ids for 32bit platform"

This reverts commit a4f706ea7d2b874ef739168a12a30ae5454487a6.

Revert "BACKPORT: bpf: Use char in prog and map name"

This reverts commit 8d4ad88eabb5d1500814c5f5b76a11f80346669c.

Revert "bpf: Change bpf_obj_name_cpy() to better ensure map's name is init by 0"

This reverts commit c4acfd3c9f5a97123c240676750f3e4ae2a2c24c.

Revert "BACKPORT: bpf: Add map_name to bpf_map_info"

This reverts commit 0e03a4e584eabe3f4c448f06f271753cdaae3aab.

Revert "BACKPORT: bpf: Add name, load_time, uid and map_ids to bpf_prog_info"

This reverts commit 16872f60e6c1fc6b10e905ff18c14d8aaeb4e09d.

Revert "bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=y"

This reverts commit 0b618ec6e162e650aaa583a31f4de4c4558148bf.

Revert "BACKPORT: bpf: btf: Clean up btf.h in uapi"

This reverts commit ea0c0ad08c18ddf62dbb6c8edc814c75cbb3e8b9.

Revert "bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd"

This reverts commit f51fe1d1edb742176c622bc93301e98a1cbf2e63.

Revert "BACKPORT: bpf: btf: Add BPF_BTF_LOAD command"

This reverts commit 85db8f764069f15d1b181bea67336ce4d66a58c1.

Revert "bpf: btf: Add pretty print capability for data with BTF type info"

This reverts commit 0a8aae433c53b1f441cab70979517660fb6a6038.

Revert "bpf: btf: Check members of struct/union"

This reverts commit ce2e8103ac1a977ce32db51ec042faea6f100a3d.

Revert "bpf: btf: Validate type reference"

This reverts commit a1aa96e6dae2b4c8c0b0a4dedab3006d3f697460.

Revert "bpf: Update logging functions to work with BTF"

This reverts commit b9289460f0a6b5c261ec0b6dcafa6fcd09d4957e.

Revert "BACKPORT: bpf: btf: Introduce BPF Type Format (BTF)"

This reverts commit ceebd58f6470e8ec6d9d694ab382fe88f43b998b.

Revert "BACKPORT: bpf: Rename bpf_verifer_log"

This reverts commit 50bdc7513d966811fb418d24a0e5797ffd8c907c.

Revert "BACKPORT: bpf: encapsulate verifier log state into a structure"

This reverts commit 0bcb397bde4675fdeb977d9debed20ed213f9ecd.

Change-Id: Iecaa276b078c6d2db773a8071e7da9e6195277d6
2025-10-02 22:12:00 +08:00
Daniel Borkmann
3e0f9ad71f bpf: implement lookup-free direct value access for maps
This generic extension to BPF maps allows for directly loading
an address residing inside a BPF map value via a single BPF
ldimm64 instruction.

The idea is similar to what BPF_PSEUDO_MAP_FD does today: it is
a special src_reg flag for the ldimm64 instruction indicating
that the first part of the double insn's imm field holds a file
descriptor, which the verifier then replaces with the full 64-bit
address of the map across both imm parts. For the newly added
BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
the first part of the double insn's imm field is again a file
descriptor corresponding to the map, and the second part of the
imm field is an offset into the value. The verifier then replaces
both imm parts with an address that points into the BPF map value
at the given offset, for maps that support this operation.
Currently supported is the array map with a single entry.
It is possible to support more than just a single map element by
reusing both 16-bit off fields of the insns as a map index, so a
full array map lookup could be expressed that way. It hasn't
been implemented here due to the lack of a concrete use case, but
it could easily be done in the future in a compatible way, since
both off fields currently have to be 0 and would then correctly
denote map index 0.

BPF_PSEUDO_MAP_VALUE is a distinct flag because with
BPF_PSEUDO_MAP_FD alone we could not distinguish, at offset 0,
a load of the map pointer from a load of the map's value at
offset 0. Changing BPF_PSEUDO_MAP_FD's encoding to an off-by-one
scheme to tell a regular map pointer apart from a map value
pointer would add unnecessary complexity and raise the barrier
for debuggability, making it less suitable. Using the second part
of the imm field as an offset into the value does /not/ impose a
limitation, since the maximum possible value size fits within the
u32 universe anyway.
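The two-part ldimm64 encoding described above can be sketched in plain userspace C. The struct below is a simplified stand-in for the kernel's struct bpf_insn, and the pseudo-flag values mirror the uapi constants (BPF_PSEUDO_MAP_FD = 1, BPF_PSEUDO_MAP_VALUE = 2); everything else here is illustrative, not the kernel's code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct bpf_insn layout. */
struct insn {
    uint8_t  code;
    uint8_t  dst_reg:4;
    uint8_t  src_reg:4;
    int16_t  off;
    int32_t  imm;
};

#define LD_IMM64          0x18  /* BPF_LD | BPF_DW | BPF_IMM */
#define PSEUDO_MAP_FD     1     /* imm = map fd, replaced by map address */
#define PSEUDO_MAP_VALUE  2     /* imm = map fd, second imm = value offset */

/* Build the two-insn ldimm64 pair: the first imm carries the map fd,
 * the second imm carries the offset into the map value. */
static void ld_map_value(struct insn pair[2], int map_fd, int32_t value_off)
{
    memset(pair, 0, 2 * sizeof(*pair));
    pair[0].code = LD_IMM64;
    pair[0].src_reg = PSEUDO_MAP_VALUE;
    pair[0].imm = map_fd;
    pair[1].imm = value_off;  /* second half of the 64-bit imm */
    /* both off fields stay 0, leaving room for a future map-index
     * extension as described above */
}
```

The verifier would then rewrite both imm parts with the resolved 64-bit address of the value plus the offset.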

This optimization allows for efficiently retrieving an address
to a map value memory area without having to issue a helper call
which needs to prepare registers according to calling convention,
etc, without needing the extra NULL test, and without having to
add the offset in an additional instruction to the value base
pointer. The verifier then treats the destination register as
PTR_TO_MAP_VALUE with constant reg->off from the user passed
offset from the second imm field, and guarantees that this is
within bounds of the map value. Any subsequent operations are
normally treated as typical map value handling without anything
extra needed from verification side.

The two map operations for direct value access have been added to
array map for now. In future other types could be supported as
well depending on the use case. The main use case for this commit
is to allow for BPF loader support for global variables that
reside in .data/.rodata/.bss sections such that we can directly
load the address of them with minimal additional infrastructure
required. Loader support has been added in subsequent commits for
libbpf library.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-17 16:58:07 +08:00
Alexei Starovoitov
b795918c20 bpf: introduce BPF_F_LOCK flag
Introduce the BPF_F_LOCK flag for the map_lookup and map_update syscall
commands and for the map_update() helper function.
In all these cases, take the lock of the existing element (which was
provided in the BTF description) before copying the rest of the map
value in or out.

Implementation details that are part of uapi:

Array:
The array map takes the element lock for lookup/update.

Hash:
The hash map also takes the lock for lookup/update and tries to avoid the bucket lock.
If an old element exists, it takes the element lock and updates the element in place.
If the element doesn't exist, it allocates a new one and inserts it into the hash
table while holding the bucket lock.
In rare cases the hashmap has to take both the bucket lock and the element lock
to update an old value in place.

Cgroup local storage:
It is similar to the array map: update-in-place and lookup are done with the lock taken.

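The copy protocol above can be modeled in userspace with a plain mutex standing in for the element's embedded bpf_spin_lock; this is a toy sketch of the idea (take the element's lock, copy the rest of the value, release), not the kernel implementation:

```c
#include <pthread.h>
#include <string.h>

/* Toy model of a map element whose value embeds its own lock,
 * as required for BPF_F_LOCK; pthread_mutex_t stands in for
 * struct bpf_spin_lock. */
struct elem {
    pthread_mutex_t lock;
    int payload[4];
};

/* BPF_F_LOCK update: copy the new value in under the element lock. */
static void locked_update(struct elem *e, const int *src)
{
    pthread_mutex_lock(&e->lock);
    memcpy(e->payload, src, sizeof(e->payload));
    pthread_mutex_unlock(&e->lock);
}

/* BPF_F_LOCK lookup: copy the value out under the element lock,
 * so the reader never sees a torn value. */
static void locked_lookup(struct elem *e, int *dst)
{
    pthread_mutex_lock(&e->lock);
    memcpy(dst, e->payload, sizeof(e->payload));
    pthread_mutex_unlock(&e->lock);
}
```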
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-09-17 16:58:05 +08:00
Alexei Starovoitov
0b7048f2ba bpf: introduce bpf_spin_lock
Introduce 'struct bpf_spin_lock' and the bpf_spin_lock/unlock() helpers to let
bpf programs serialize access to other variables.

Example:
struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
};
struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
}

Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
  cause deadlocks.
- only one 'struct bpf_spin_lock' is allowed per map element.
  It drastically simplifies implementation yet allows bpf program to use
  any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via
  bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
  a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
  Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
  bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
  insufficient preemption checks.

Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
  Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
  Other solutions to avoid nested bpf_spin_lock are possible.
  Like making sure that all networking progs run with softirq disabled.
  spin_lock_irqsave is the simplest and doesn't add overhead to the
  programs that don't use it.
- arch_spinlock_t is used when it is implemented as a queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
  sizeof(arch_spinlock_t) != sizeof(__u32), a trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
  extra flag during map_create, but specifying it via BTF is cleaner.
  It provides introspection for map key/value and reduces user mistakes.

Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
  to request kernel to grab bpf_spin_lock before rewriting the value.
  That will serialize access to map elements.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-09-17 16:58:03 +08:00
Roman Gushchin
382aa77db5 bpf: pass struct btf pointer to the map_check_btf() callback
If key_type or value_type is a non-trivial data type
(e.g. a structure or a typedef), it's not possible to check it without
additional information, which can't be obtained without a pointer
to the btf structure.

So, let's pass btf pointer to the map_check_btf() callbacks.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-17 16:58:02 +08:00
Martin KaFai Lau
fa392d4082 bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info
In "struct bpf_map_info", the names "btf_id", "btf_key_id" and "btf_value_id"
could cause confusion, because the "id" in "btf_id" is the BPF obj id
given to the BTF object, while
"btf_key_id" and "btf_value_id" are BTF type ids within
that BTF object.

To make it clear, btf_key_id and btf_value_id are
renamed to btf_key_type_id and btf_value_type_id.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-09-17 16:58:01 +08:00
Daniel Borkmann
ebe70c600b bpf: decouple btf from seq bpf fs dump and enable more maps
Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
print for hash/lru_hash maps") enabled support for BTF and
dumping via BPF fs for array and hash/lru map. However, both
can be decoupled from each other such that regular BPF maps
can be supported for attaching BTF key/value information,
while not all maps necessarily need to dump via map_seq_show_elem()
callback.

The basic sanity check, which is a prerequisite for all maps,
is that the key/value sizes have to match in any case; some maps
can add extra checks via the map_check_btf() callback, e.g.
probing certain types or indicating no support in general. With
that, we can also enable retrieving BTF info for per-cpu map
types and lpm.
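The prerequisite size check can be sketched as follows; the function name and parameters are illustrative, assuming the BTF type sizes have already been resolved, and per-map extra checks would still go through map_check_btf():

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical sketch of the basic sanity check described above:
 * a map can only attach BTF key/value info if the map's declared
 * key/value sizes match the sizes derived from the BTF types. */
static int check_btf_sizes(uint32_t key_size, uint32_t value_size,
                           uint32_t btf_key_size, uint32_t btf_value_size)
{
    if (key_size != btf_key_size || value_size != btf_value_size)
        return -EINVAL;  /* mismatch: reject the BTF association */
    return 0;            /* ok; map-specific checks may still apply */
}
```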

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
2025-09-17 16:57:58 +08:00
Martin KaFai Lau
adceed2a9c bpf: btf: Add pretty print support to the basic arraymap
This patch adds pretty print support to the basic arraymap.
Support for other bpf maps can be added later.

This patch adds new attrs to the BPF_MAP_CREATE command to allow
specifying the btf_fd, btf_key_id and btf_value_id.  BPF_MAP_CREATE
can then associate the btf with the map if the map being created
supports BTF.

A BTF supported map needs to implement two new map ops,
map_seq_show_elem() and map_check_btf().  This patch has
implemented these new map ops for the basic arraymap.

It also adds file_operations, bpffs_map_fops, to the pinned
map such that the pinned map can be opened and read.
After that, the user has an intuitive way to do
"cat bpffs/pathto/a-pinned-map" instead of getting
an error.

bpffs_map_fops should not be extended further to support
other operations.  Other operations (e.g. write/key-lookup...)
should be realized by the userspace tools (e.g. bpftool) through
the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
Follow up patches will allow the userspace to obtain
the BTF from a map-fd.

Here is a sample output when reading a pinned arraymap
with the following map's value:

struct map_value {
	int count_a;
	int count_b;
};

cat /sys/fs/bpf/pinned_array_map:

0: {1,2}
1: {3,4}
2: {5,6}
...
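The per-element lines in the sample output above can be reproduced with a toy userspace formatter; in the kernel the real formatting is done by the arraymap's map_seq_show_elem() using the BTF type info, so the helper below is only an illustration:

```c
#include <stdio.h>

/* The value type from the sample above. */
struct map_value {
    int count_a;
    int count_b;
};

/* Toy formatter producing one "index: {a,b}" line per element,
 * matching the bpffs sample output shown above. */
static int show_elem(char *buf, size_t n, int idx,
                     const struct map_value *v)
{
    return snprintf(buf, n, "%d: {%d,%d}", idx, v->count_a, v->count_b);
}
```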

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2025-09-17 16:57:55 +08:00
Hou Tao
ffb6438211 bpf: Add map and need_defer parameters to .map_fd_put_ptr()
[ Upstream commit 20c20bd11a0702ce4dc9300c3da58acf551d9725 ]

map is a pointer to the outer map, and need_defer needs some explanation.
need_defer tells the implementation to defer releasing the reference on
the passed element and to ensure that the element stays alive until any
bpf program that may manipulate it has exited.

The following three cases will invoke map_fd_put_ptr() and different
need_defer values will be passed to these callers:

1) release the reference of the old element in the map during map update
   or map deletion. The release must be deferred, otherwise the bpf
   program may incur a use-after-free problem, so need_defer needs to be
   true.
2) release the reference of the to-be-added element in the error path of
   map update. The to-be-added element is not visible to any bpf
   program, so it is OK to pass false for the need_defer parameter.
3) release the references of all elements in the map during map release.
   Any bpf program which has access to the map must have exited and been
   released, so need_defer=false will be OK.

These two parameters will be used by the following patches to fix the
potential use-after-free problem for map-in-map.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 5aa1e7d3f6d0db96c7139677d9e898bbbd6a7dcf)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Change-Id: Ifb87b3a6a590d0deab4aaa4cf5510753a42ef9ce
2024-07-31 15:16:01 +02:00
Greg Kroah-Hartman
f9cf23e1ff Merge 4.14.79 into android-4.14-p
Changes in 4.14.79
	xfrm: Validate address prefix lengths in the xfrm selector.
	xfrm6: call kfree_skb when skb is toobig
	xfrm: reset transport header back to network header after all input transforms have been applied
	xfrm: reset crypto_done when iterating over multiple input xfrms
	mac80211: Always report TX status
	cfg80211: reg: Init wiphy_idx in regulatory_hint_core()
	mac80211: fix pending queue hang due to TX_DROP
	cfg80211: Address some corner cases in scan result channel updating
	mac80211: TDLS: fix skb queue/priority assignment
	mac80211: fix TX status reporting for ieee80211s
	xfrm: Fix NULL pointer dereference when skb_dst_force clears the dst_entry.
	ARM: 8799/1: mm: fix pci_ioremap_io() offset check
	xfrm: validate template mode
	netfilter: bridge: Don't sabotage nf_hook calls from an l3mdev
	arm64: hugetlb: Fix handling of young ptes
	ARM: dts: BCM63xx: Fix incorrect interrupt specifiers
	net: macb: Clean 64b dma addresses if they are not detected
	soc: fsl: qbman: qman: avoid allocating from non existing gen_pool
	soc: fsl: qe: Fix copy/paste bug in ucc_get_tdm_sync_shift()
	nl80211: Fix possible Spectre-v1 for NL80211_TXRATE_HT
	mac80211_hwsim: do not omit multicast announce of first added radio
	Bluetooth: SMP: fix crash in unpairing
	pxa168fb: prepare the clock
	qed: Avoid implicit enum conversion in qed_set_tunn_cls_info
	qed: Fix mask parameter in qed_vf_prep_tunn_req_tlv
	qed: Avoid implicit enum conversion in qed_roce_mode_to_flavor
	qed: Avoid constant logical operation warning in qed_vf_pf_acquire
	qed: Avoid implicit enum conversion in qed_iwarp_parse_rx_pkt
	nl80211: Fix possible Spectre-v1 for CQM RSSI thresholds
	asix: Check for supported Wake-on-LAN modes
	ax88179_178a: Check for supported Wake-on-LAN modes
	lan78xx: Check for supported Wake-on-LAN modes
	sr9800: Check for supported Wake-on-LAN modes
	r8152: Check for supported Wake-on-LAN Modes
	smsc75xx: Check for Wake-on-LAN modes
	smsc95xx: Check for Wake-on-LAN modes
	cfg80211: fix use-after-free in reg_process_hint()
	perf/core: Fix perf_pmu_unregister() locking
	perf/ring_buffer: Prevent concurent ring buffer access
	perf/x86/intel/uncore: Fix PCI BDF address of M3UPI on SKX
	perf/x86/amd/uncore: Set ThreadMask and SliceMask for L3 Cache perf events
	net: fec: fix rare tx timeout
	declance: Fix continuation with the adapter identification message
	net: qualcomm: rmnet: Skip processing loopback packets
	locking/ww_mutex: Fix runtime warning in the WW mutex selftest
	be2net: don't flip hw_features when VXLANs are added/deleted
	net: cxgb3_main: fix a missing-check bug
	yam: fix a missing-check bug
	ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
	iwlwifi: mvm: check for short GI only for OFDM
	iwlwifi: dbg: allow wrt collection before ALIVE
	iwlwifi: fix the ALIVE notification layout
	tools/testing/nvdimm: unit test clear-error commands
	usbip: vhci_hcd: update 'status' file header and format
	scsi: aacraid: address UBSAN warning regression
	IB/ipoib: Fix lockdep issue found on ipoib_ib_dev_heavy_flush
	IB/rxe: put the pool on allocation failure
	s390/qeth: fix error handling in adapter command callbacks
	net/mlx5: Fix mlx5_get_vector_affinity function
	powerpc/pseries: Add empty update_numa_cpu_lookup_table() for NUMA=n
	dm integrity: fail early if required HMAC key is not available
	net: phy: realtek: Use the dummy stubs for MMD register access for rtl8211b
	net: phy: Add general dummy stubs for MMD register access
	net/mlx5e: Refine ets validation function
	scsi: qla2xxx: Avoid double completion of abort command
	kbuild: set no-integrated-as before incl. arch Makefile
	IB/mlx5: Avoid passing an invalid QP type to firmware
	ARM: tegra: Fix ULPI regression on Tegra20
	l2tp: remove configurable payload offset
	cifs: Use ULL suffix for 64-bit constant
	test_bpf: Fix testing with CONFIG_BPF_JIT_ALWAYS_ON=y on other arches
	KVM: x86: Update the exit_qualification access bits while walking an address
	sparc64: Fix regression in pmdp_invalidate().
	tpm: move the delay_msec increment after sleep in tpm_transmit()
	bpf: sockmap, map_release does not hold refcnt for pinned maps
	tpm: tpm_crb: relinquish locality on error path.
	xen-netfront: Update features after registering netdev
	xen-netfront: Fix mismatched rtnl_unlock
	IB/usnic: Update with bug fixes from core code
	mmc: dw_mmc-rockchip: correct property names in debug
	MIPS: Workaround GCC __builtin_unreachable reordering bug
	lan78xx: Don't reset the interface on open
	enic: do not overwrite error code
	iio: buffer: fix the function signature to match implementation
	selftests/powerpc: Add ptrace hw breakpoint test
	scsi: ibmvfc: Avoid unnecessary port relogin
	scsi: sd: Remember that READ CAPACITY(16) succeeded
	btrfs: quota: Set rescan progress to (u64)-1 if we hit last leaf
	net: phy: phylink: Don't release NULL GPIO
	x86/paravirt: Fix some warning messages
	net: stmmac: mark PM functions as __maybe_unused
	kconfig: fix the rule of mainmenu_stmt symbol
	libertas: call into generic suspend code before turning off power
	perf tests: Fix indexing when invoking subtests
	compiler.h: Allow arch-specific asm/compiler.h
	ARM: dts: imx53-qsb: disable 1.2GHz OPP
	perf python: Use -Wno-redundant-decls to build with PYTHON=python3
	rxrpc: Don't check RXRPC_CALL_TX_LAST after calling rxrpc_rotate_tx_window()
	rxrpc: Only take the rwind and mtu values from latest ACK
	rxrpc: Fix connection-level abort handling
	net: ena: fix warning in rmmod caused by double iounmap
	net: ena: fix NULL dereference due to untimely napi initialization
	selftests: rtnetlink.sh explicitly requires bash.
	fs/fat/fatent.c: add cond_resched() to fat_count_free_clusters()
	sch_netem: restore skb->dev after dequeuing from the rbtree
	mtd: spi-nor: Add support for is25wp series chips
	kvm: x86: fix WARN due to uninitialized guest FPU state
	ARM: dts: r8a7790: Correct critical CPU temperature
	media: uvcvideo: Fix driver reference counting
	ALSA: usx2y: Fix invalid stream URBs
	Revert "netfilter: ipv6: nf_defrag: drop skb dst before queueing"
	perf tools: Disable parallelism for 'make clean'
	drm/i915/gvt: fix memory leak of a cmd_entry struct on error exit path
	bridge: do not add port to router list when receives query with source 0.0.0.0
	net: bridge: remove ipv6 zero address check in mcast queries
	ipv6: mcast: fix a use-after-free in inet6_mc_check
	ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called
	llc: set SOCK_RCU_FREE in llc_sap_add_socket()
	net: fec: don't dump RX FIFO register when not available
	net/ipv6: Fix index counter for unicast addresses in in6_dump_addrs
	net: sched: gred: pass the right attribute to gred_change_table_def()
	net: socket: fix a missing-check bug
	net: stmmac: Fix stmmac_mdio_reset() when building stmmac as modules
	net: udp: fix handling of CHECKSUM_COMPLETE packets
	r8169: fix NAPI handling under high load
	sctp: fix race on sctp_id2asoc
	udp6: fix encap return code for resubmitting
	vhost: Fix Spectre V1 vulnerability
	virtio_net: avoid using netif_tx_disable() for serializing tx routine
	ethtool: fix a privilege escalation bug
	bonding: fix length of actor system
	ip6_tunnel: Fix encapsulation layout
	openvswitch: Fix push/pop ethernet validation
	net/mlx5: Take only bit 24-26 of wqe.pftype_wq for page fault type
	net: sched: Fix for duplicate class dump
	net: drop skb on failure in ip_check_defrag()
	net: fix pskb_trim_rcsum_slow() with odd trim offset
	net/mlx5e: fix csum adjustments caused by RXFCS
	rtnetlink: Disallow FDB configuration for non-Ethernet device
	net: ipmr: fix unresolved entry dumps
	net: bcmgenet: Poll internal PHY for GENETv5
	net/sched: cls_api: add missing validation of netlink attributes
	net/mlx5: Fix build break when CONFIG_SMP=n
	Linux 4.14.79

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-11-08 07:44:15 -08:00
John Fastabend
3c0cff34e9 bpf: sockmap, map_release does not hold refcnt for pinned maps
[ Upstream commit ba6b8de423f8d0dee48d6030288ed81c03ddf9f0 ]

Relying on map_release hook to decrement the reference counts when a
map is removed only works if the map is not being pinned. In the
pinned case the ref is decremented immediately and the BPF programs
released. After this, the BPF programs may no longer be in use, which
is not what the user would expect.

This patch moves the release logic into bpf_map_put_uref() and brings
sockmap in-line with how a similar case is handled in prog array maps.

Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-04 14:52:44 +01:00
Greg Kroah-Hartman
4576e0eca9 Merge 4.14.26 into android-4.14
Changes in 4.14.26
	bpf: fix mlock precharge on arraymaps
	bpf: fix memory leak in lpm_trie map_free callback function
	bpf: fix rcu lockdep warning for lpm_trie map_free callback
	bpf, x64: implement retpoline for tail call
	bpf, arm64: fix out of bounds access in tail call
	bpf: add schedule points in percpu arrays management
	bpf: allow xadd only on aligned memory
	bpf, ppc64: fix out of bounds access in tail call
	KVM: x86: fix backward migration with async_PF
	Linux 4.14.26

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-03-11 17:37:01 +01:00
Eric Dumazet
e1760b3563 bpf: add schedule points in percpu arrays management
[ upstream commit 32fff239de37ef226d5b66329dd133f64d63b22d ]

syzbot managed to trigger RCU-detected stalls in
bpf_array_free_percpu()

It takes time to allocate a huge percpu map, but even more time to free
it.

Since we run in process context, use cond_resched() to yield cpu if
needed.

Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11 16:23:22 +01:00
Daniel Borkmann
d9fd73c60b bpf: fix mlock precharge on arraymaps
[ upstream commit 9c2d63b843a5c8a8d0559cc067b5398aa5ec3ffc ]

syzkaller recently triggered OOM during percpu map allocation;
while there is work in progress by Dennis Zhou to add __GFP_NORETRY
semantics for the percpu allocator under pressure, there also seems
to be a missing bpf_map_precharge_memlock() check in array map
allocation.

Given that today the actual bpf_map_charge_memlock() happens after
find_and_alloc_map() in the syscall path, bpf_map_precharge_memlock()
is there to bail out early, before we go and do the map setup work,
when we find that we would hit the limits anyway. Therefore add this
for the array map as well.

Fixes: 6c90598174 ("bpf: pre-allocate hash map elements")
Fixes: a10423b87a ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11 16:23:21 +01:00
Greg Kroah-Hartman
9b68347c35 Merge 4.14.14 into android-4.14
Changes in 4.14.14
	dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
	KVM: Fix stack-out-of-bounds read in write_mmio
	can: vxcan: improve handling of missing peer name attribute
	can: gs_usb: fix return value of the "set_bittiming" callback
	IB/srpt: Disable RDMA access by the initiator
	IB/srpt: Fix ACL lookup during login
	MIPS: Validate PR_SET_FP_MODE prctl(2) requests against the ABI of the task
	MIPS: Factor out NT_PRFPREG regset access helpers
	MIPS: Guard against any partial write attempt with PTRACE_SETREGSET
	MIPS: Consistently handle buffer counter with PTRACE_SETREGSET
	MIPS: Fix an FCSR access API regression with NT_PRFPREG and MSA
	MIPS: Also verify sizeof `elf_fpreg_t' with PTRACE_SETREGSET
	MIPS: Disallow outsized PTRACE_SETREGSET NT_PRFPREG regset accesses
	cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC
	kvm: vmx: Scrub hardware GPRs at VM-exit
	platform/x86: wmi: Call acpi_wmi_init() later
	iw_cxgb4: only call the cq comp_handler when the cq is armed
	iw_cxgb4: atomically flush the qp
	iw_cxgb4: only clear the ARMED bit if a notification is needed
	iw_cxgb4: reflect the original WR opcode in drain cqes
	iw_cxgb4: when flushing, complete all wrs in a chain
	x86/acpi: Handle SCI interrupts above legacy space gracefully
	ALSA: pcm: Remove incorrect snd_BUG_ON() usages
	ALSA: pcm: Workaround for weird PulseAudio behavior on rewind error
	ALSA: pcm: Add missing error checks in OSS emulation plugin builder
	ALSA: pcm: Abort properly at pending signal in OSS read/write loops
	ALSA: pcm: Allow aborting mutex lock at OSS read/write loops
	ALSA: aloop: Release cable upon open error path
	ALSA: aloop: Fix inconsistent format due to incomplete rule
	ALSA: aloop: Fix racy hw constraints adjustment
	x86/acpi: Reduce code duplication in mp_override_legacy_irq()
	8021q: fix a memory leak for VLAN 0 device
	ip6_tunnel: disable dst caching if tunnel is dual-stack
	net: core: fix module type in sock_diag_bind
	phylink: ensure we report link down when LOS asserted
	RDS: Heap OOB write in rds_message_alloc_sgs()
	RDS: null pointer dereference in rds_atomic_free_op
	net: fec: restore dev_id in the cases of probe error
	net: fec: defer probe if regulator is not ready
	net: fec: free/restore resource in related probe error pathes
	sctp: do not retransmit upon FragNeeded if PMTU discovery is disabled
	sctp: fix the handling of ICMP Frag Needed for too small MTUs
	sh_eth: fix TSU resource handling
	net: stmmac: enable EEE in MII, GMII or RGMII only
	sh_eth: fix SH7757 GEther initialization
	ipv6: fix possible mem leaks in ipv6_make_skb()
	ethtool: do not print warning for applications using legacy API
	mlxsw: spectrum_router: Fix NULL pointer deref
	net/sched: Fix update of lastuse in act modules implementing stats_update
	ipv6: sr: fix TLVs not being copied using setsockopt
	mlxsw: spectrum: Relax sanity checks during enslavement
	sfp: fix sfp-bus oops when removing socket/upstream
	membarrier: Disable preemption when calling smp_call_function_many()
	crypto: algapi - fix NULL dereference in crypto_remove_spawns()
	mmc: renesas_sdhi: Add MODULE_LICENSE
	rbd: reacquire lock should update lock owner client id
	rbd: set max_segments to USHRT_MAX
	iwlwifi: pcie: fix DMA memory mapping / unmapping
	x86/microcode/intel: Extend BDW late-loading with a revision check
	KVM: x86: Add memory barrier on vmcs field lookup
	KVM: PPC: Book3S PR: Fix WIMG handling under pHyp
	KVM: PPC: Book3S HV: Drop prepare_done from struct kvm_resize_hpt
	KVM: PPC: Book3S HV: Fix use after free in case of multiple resize requests
	KVM: PPC: Book3S HV: Always flush TLB in kvmppc_alloc_reset_hpt()
	drm/vmwgfx: Don't cache framebuffer maps
	drm/vmwgfx: Potential off by one in vmw_view_add()
	drm/i915/gvt: Clear the shadow page table entry after post-sync
	drm/i915: Whitelist SLICE_COMMON_ECO_CHICKEN1 on Geminilake.
	drm/i915: Move init_clock_gating() back to where it was
	drm/i915: Fix init_clock_gating for resume
	bpf: prevent out-of-bounds speculation
	bpf, array: fix overflow in max_entries and undefined behavior in index_mask
	bpf: arsh is not supported in 32 bit alu thus reject it
	USB: serial: cp210x: add IDs for LifeScan OneTouch Verio IQ
	USB: serial: cp210x: add new device ID ELV ALC 8xxx
	usb: misc: usb3503: make sure reset is low for at least 100us
	USB: fix usbmon BUG trigger
	USB: UDC core: fix double-free in usb_add_gadget_udc_release
	usbip: remove kernel addresses from usb device and urb debug msgs
	usbip: fix vudc_rx: harden CMD_SUBMIT path to handle malicious input
	usbip: vudc_tx: fix v_send_ret_submit() vulnerability to null xfer buffer
	staging: android: ashmem: fix a race condition in ASHMEM_SET_SIZE ioctl
	Bluetooth: Prevent stack info leak from the EFS element.
	uas: ignore UAS for Norelsys NS1068(X) chips
	mux: core: fix double get_device()
	kdump: write correct address of mem_section into vmcoreinfo
	apparmor: fix ptrace label match when matching stacked labels
	e1000e: Fix e1000_check_for_copper_link_ich8lan return value.
	x86/pti: Unbreak EFI old_memmap
	x86/Documentation: Add PTI description
	x86/cpufeatures: Add X86_BUG_SPECTRE_V[12]
	sysfs/cpu: Add vulnerability folder
	x86/cpu: Implement CPU vulnerabilites sysfs functions
	x86/tboot: Unbreak tboot with PTI enabled
	x86/mm/pti: Remove dead logic in pti_user_pagetable_walk*()
	x86/cpu/AMD: Make LFENCE a serializing instruction
	x86/cpu/AMD: Use LFENCE_RDTSC in preference to MFENCE_RDTSC
	sysfs/cpu: Fix typos in vulnerability documentation
	x86/alternatives: Fix optimize_nops() checking
	x86/pti: Make unpoison of pgd for trusted boot work for real
	objtool: Detect jumps to retpoline thunks
	objtool: Allow alternatives to be ignored
	x86/retpoline: Add initial retpoline support
	x86/spectre: Add boot time option to select Spectre v2 mitigation
	x86/retpoline/crypto: Convert crypto assembler indirect jumps
	x86/retpoline/entry: Convert entry assembler indirect jumps
	x86/retpoline/ftrace: Convert ftrace assembler indirect jumps
	x86/retpoline/hyperv: Convert assembler indirect jumps
	x86/retpoline/xen: Convert Xen hypercall indirect jumps
	x86/retpoline/checksum32: Convert assembler indirect jumps
	x86/retpoline/irq32: Convert assembler indirect jumps
	x86/retpoline: Fill return stack buffer on vmexit
	selftests/x86: Add test_vsyscall
	x86/pti: Fix !PCID and sanitize defines
	security/Kconfig: Correct the Documentation reference for PTI
	x86,perf: Disable intel_bts when PTI
	x86/retpoline: Remove compile time warning
	Linux 4.14.14

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-01-17 10:33:24 +01:00
Daniel Borkmann
67c05d9414 bpf, array: fix overflow in max_entries and undefined behavior in index_mask
commit bbeb6e4323dad9b5e0ee9f60c223dd532e2403b1 upstream.

syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
and thus unprivileged. With the recently added logic in b2157399cc98
("bpf: prevent out-of-bounds speculation") we round this up to the next
power of two value for max_entries for unprivileged such that we can
apply proper masking into potentially zeroed out map slots.

However, this will generate an index_mask of 0xffffffff, and therefore
a + 1 will let this overflow into new max_entries of 0. This will pass
allocation, etc, and later on map access we still enforce on the original
attr->max_entries value which was 0xfffffffd, therefore triggering GPF
all over the place. Thus bail out on overflow in such case.

Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
space is undefined. Therefore, do this by hand in a 64 bit variable.

This fixes all the issues triggered by syzkaller's reproducers.

Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:45:25 +01:00
Alexei Starovoitov
a5dbaf8768 bpf: prevent out-of-bounds speculation
commit b2157399cc9898260d6031c5bfe45fe137c1fbe7 upstream.

Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.

To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.

Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index & index_mask' operation.

If prog_array map is created by unpriv user replace
  bpf_tail_call(ctx, map, index);
with
  if (index >= max_entries) {
    index &= map->index_mask;
    bpf_tail_call(ctx, map, index);
  }
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index >= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.

Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.

That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.

v2->v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:45:25 +01:00
Chenbo Feng
cace572e16 BACKPORT: bpf: Add file mode configuration into bpf maps
Introduce map read/write flags to the eBPF syscalls that return a
map fd. The flags are used to set up the file mode when constructing
a new file descriptor for bpf maps. To not break backward
compatibility, f_flags is set to O_RDWR if the flag passed by the
syscall is 0. Otherwise it should be O_RDONLY or O_WRONLY. When
userspace wants to modify or read the map content, the kernel will
check the file mode to see if the operation is allowed.

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Bug: 30950746
Change-Id: Icfad20f1abb77f91068d244fb0d87fa40824dd1b

(cherry picked from commit 6e71b04a82248ccf13a94b85cbc674a9fefe53f5)
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2017-12-18 21:11:22 +05:30
Daniel Borkmann
bc6d5031b4 bpf: do not test for PCPU_MIN_UNIT_SIZE before percpu allocations
PCPU_MIN_UNIT_SIZE is an implementation detail of the percpu
allocator. Given we support __GFP_NOWARN now, let's just let
the allocation request fail naturally instead. The two call
sites from BPF mistakenly assumed __GFP_NOWARN would work, so
no changes needed to their actual __alloc_percpu_gfp() calls
which use the flag already.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-19 13:13:50 +01:00
Daniel Borkmann
7b0c2a0508 bpf: inline map in map lookup functions for array and htab
Avoid two successive function calls for the map-in-map lookup: the
first is the bpf_map_lookup_elem() helper call, and the second the
callback via map->ops->map_lookup_elem() to get to the map-in-map
implementation.
Implementation inlines array and htab flavor for map in map lookups.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19 21:56:34 -07:00
Martin KaFai Lau
96eabe7a40 bpf: Allow selecting numa node during map creation
The current map creation API does not allow providing a numa-node
preference.  The memory usually comes from the node where the
map-creation process is running.  The performance is not ideal if the
bpf_prog is known to always run on a numa node different from that of
the map-creation process.

One of the use case is sharding on CPU to different LRU maps (i.e.
an array of LRU maps).  Here is the test result of map_perf_test on
the INNER_LRU_HASH_PREALLOC test if we force the lru map used by
CPU0 to be allocated from a remote numa node:

[ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]

># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<

After specifying numa node:
># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<

This patch adds one field, numa_node, to the bpf_attr.  Since numa node 0
is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The numa_node
field is honored if and only if the BPF_F_NUMA_NODE flag is set.

Numa node selection is not supported for percpu map.

This patch does not change every kmalloc.  E.g.
'htab = kzalloc()' is not changed since the object
is small enough to stay in the cache.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-19 21:35:43 -07:00
Martin KaFai Lau
14dc6f04f4 bpf: Add syscall lookup support for fd array and htab
This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on
BPF_MAP_TYPE_PROG_ARRAY,
BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS.

The lookup returns a prog-id or map-id to userspace.
Userspace can then use BPF_PROG_GET_FD_BY_ID
or BPF_MAP_GET_FD_BY_ID to get an fd.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-29 13:13:25 -04:00
Alexei Starovoitov
f91840a32d perf, bpf: Add BPF support to all perf_event types
Allow BPF_PROG_TYPE_PERF_EVENT program types to attach to all
perf_event types, including HW_CACHE, RAW, and dynamic pmu events.
Only tracepoint/kprobe events are treated differently which require
BPF_PROG_TYPE_TRACEPOINT/BPF_PROG_TYPE_KPROBE program types accordingly.

Also add support for reading all event counters using
bpf_perf_event_read() helper.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-06-04 21:58:01 -04:00
Daniel Borkmann
a316338cb7 bpf: fix wrong exposure of map_flags into fdinfo for lpm
trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
attr->map_flags, since it does not support preallocation yet. We
check the flag, but we never copy the flag into trie->map.map_flags,
which is later on exposed into fdinfo and used by loaders such as
iproute2. The latter uses this in bpf_map_selfcheck_pinned() to test
whether a pinned map has the same spec as the one from the BPF obj
file and, if not, bails out, which is currently the case for lpm
since it always exposes 0 as flags.

Also copy over flags in array_map_alloc() and stack_map_alloc().
They always have to be 0 right now, but we should make sure not to
miss copying them over at a later point in time when we add actual
flags for them to use.

Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Reported-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-25 13:44:28 -04:00
Teng Qin
8fe4592438 bpf: map_get_next_key to return first key on NULL
When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.

This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.

Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25 11:57:45 -04:00
Johannes Berg
40077e0cf6 bpf: remove struct bpf_map_type_list
There's no need to have struct bpf_map_type_list since
it just contains a list_head, the type, and the ops
pointer. Since the types are densely packed and not
actually dynamically registered, it's much easier and
smaller to have an array of type->ops pointer. Also
initialize this array statically to remove code needed
to initialize it.

In order to save duplicating the list, move it to the
types header file added by the previous patch and
include it in the same fashion.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-11 14:38:43 -04:00
Martin KaFai Lau
56f668dfe0 bpf: Add array of maps support
This patch adds a few helper funcs to enable map-in-map
support (i.e. outer_map->inner_map).  The first outer_map type
BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch.
The next patch will introduce a hash of maps type.

Any bpf map type can act as an inner_map.  The exception
is BPF_MAP_TYPE_PROG_ARRAY because the extra level of
indirection makes it harder to verify the owner_prog_type
and owner_jited.

Multi-level map-in-map is not supported (i.e. map->map is ok
but not map->map->map).

When adding an inner_map to an outer_map, it currently checks the
map_type, key_size, value_size, map_flags, max_entries and ops.
The verifier also uses those map's properties to do static analysis.
map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
is using a preallocated hashtab for the inner_hash also.  ops and
max_entries are needed to generate inlined map-lookup instructions.
For simplicity, a simple '==' test is used for both map_flags
and max_entries.  The equality of ops is implied by the equality of
map_type.

During outer_map creation time, an inner_map_fd is needed to create an
outer_map.  However, the inner_map_fd's life time does not depend on the
outer_map.  The inner_map_fd is merely used to initialize
the inner_map_meta of the outer_map.

Also, for the outer_map:

* It allows element update and delete from syscall
* It allows element lookup from bpf_prog

The above is similar to the current fd_array pattern.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22 15:45:45 -07:00