Commit Graph

50857 Commits

Author SHA1 Message Date
Daniel Borkmann
a070594bc5 bpf: add bpf_skb_change_tail helper
This work adds a bpf_skb_change_tail() helper for tc BPF programs. The
basic idea is to expand or shrink the skb in a controlled manner. The
eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(),
bpf_lX_csum_replace() and others rather than passing a raw buffer for
writing here.

bpf_skb_change_tail() is really a slow path helper and intended for
replies with f.e. ICMP control messages. Concept is similar to other
helpers like bpf_skb_change_proto() helper to keep the helper without
protocol specifics and let the BPF program mangle the remaining parts.
A flags field has been added and is reserved for now should we extend
the helper in future.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:40 +01:00
Brenden Blanco
70a4dfa7e1 bpf: add XDP prog type for early driver filter
Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. So far only
expose the packet start and end pointers, and only in read mode.

An XDP program must return one of the well known enum values, all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered. Out
of bounds return codes should alias to XDP_ABORTED.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:37 +01:00
Alexei Starovoitov
9ee8edce2e bpf: wire in data and data_end for cls_act_bpf
allow cls_bpf and act_bpf programs access skb->data and skb->data_end pointers.
The bpf helpers that change skb->data need to update data_end pointer as well.
The verifier checks that programs always reload data, data_end pointers
after calls to such bpf helpers.
We cannot add 'data_end' pointer to struct qdisc_skb_cb directly,
since it's embedded as-is by infiniband ipoib, so wrapper struct is needed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:37 +01:00
Daniel Borkmann
b021603f08 bpf: cleanup bpf_prog_run_{save,clear}_cb helpers
Move the details behind the cb[] access into a small helper to decouple
and make them generic for bpf_prog_run_save_cb()/bpf_prog_run_clear_cb()
that was introduced via commit ff936a04e5 ("bpf: fix cb access in socket
filter programs"). Also add a comment to better clarify what is done in
bpf_skb_cb().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:37 +01:00
Brenden Blanco
9160405788 bpf: add bpf_prog_add api for bulk prog refcnt
A subsystem may need to store many copies of a bpf program, each
deserving its own reference. Rather than requiring the caller to loop
one by one (with possible mid-loop failure), add a bulk bpf_prog_add
api.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:37 +01:00
Daniel Borkmann
eec187ff36 bpf: avoid stack copy and use skb ctx for event output
This work addresses a couple of issues bpf_skb_event_output()
helper currently has: i) We need two copies instead of just a
single one for the skb data when it should be part of a sample.
The data can be non-linear and thus needs to be extracted via
bpf_skb_load_bytes() helper first, and then copied once again
into the ring buffer slot. ii) Since bpf_skb_load_bytes()
currently needs to be used first, the helper needs to see a
constant size on the passed stack buffer to make sure BPF
verifier can do sanity checks on it during verification time.
Thus, just passing skb->len (or any other non-constant value)
wouldn't work, but changing bpf_skb_load_bytes() is also not
the proper solution, since the two copies are generally still
needed. iii) bpf_skb_load_bytes() is just for rather small
buffers like headers, since they need to sit on the limited
BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
this work improves the bpf_skb_event_output() helper to address
all 3 at once.

We can make use of the passed in skb context that we have in
the helper anyway, and use some of the reserved flag bits as
a length argument. The helper will use the new __output_custom()
facility from perf side with bpf_skb_copy() as callback helper
to walk and extract the data. It will pass the data for setup
to bpf_event_output(), which generates and pushes the raw record
with an additional frag part. The linear data used in the first
frag of the record serves as programmatically defined meta data
passed along with the appended sample.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:37 +01:00
Daniel Borkmann
3011d8b5e5 bpf: refactor bpf_prog_get and type check into helper
Since bpf_prog_get() and program type check is used in a couple of places,
refactor this into a small helper function that we can make use of. Since
the non RO prog->aux part is not used in performance critical paths and a
program destruction via RCU is rather very unlikley when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:36 +01:00
Daniel Borkmann
34e63182fd bpf: generally move prog destruction to RCU deferral
Jann Horn reported following analysis that could potentially result
in a very hard to trigger (if not impossible) UAF race, to quote his
event timeline:

 - Set up a process with threads T1, T2 and T3
 - Let T1 set up a socket filter F1 that invokes another filter F2
   through a BPF map [tail call]
 - Let T1 trigger the socket filter via a unix domain socket write,
   don't wait for completion
 - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
 - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
 - Let T3 close the file descriptor for F2, dropping the reference
   count of F2 to 2
 - At this point, T1 should have looked up F2 from the map, but not
   finished executing it
 - Let T3 remove F2 from the BPF map, dropping the reference count of
   F2 to 1
 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
   the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
   via schedule_work()
 - At this point, the BPF program could be freed
 - BPF execution is still running in a freed BPF program

While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
event fd we're doing the syscall on doesn't disappear from underneath us
for whole syscall time, it may not be the case for the bpf fd used as
an argument only after we did the put. It needs to be a valid fd pointing
to a BPF program at the time of the call to make the bpf_prog_get() and
while T2 gets preempted, F2 must have dropped reference to 1 on the other
CPU. The fput() from the close() in T3 should also add additionally delay
to the reference drop via exit_task_work() when bpf_prog_release() gets
called as well as scheduling bpf_prog_free_deferred().

That said, it makes nevertheless sense to move the BPF prog destruction
generally after RCU grace period to guarantee that such scenario above,
but also others as recently fixed in ceb56070359b ("bpf, perf: delay release
of BPF prog after grace period") with regards to tail calls won't happen.
Integrating bpf_prog_free_deferred() directly into the RCU callback is
not allowed since the invocation might happen from either softirq or
process context, so we're not permitted to block. Reviewing all bpf_prog_put()
invocations from eBPF side (note, cBPF -> eBPF progs don't use this for
their destruction) with call_rcu() look good to me.

Since we don't know whether at the time of attaching the program, we're
already part of a tail call map, we need to use RCU variant. However, due
to this, there won't be severely more stress on the RCU callback queue:
situations with above bpf_prog_get() and bpf_prog_put() combo in practice
normally won't lead to releases, but even if they would, enough effort/
cycles have to be put into loading a BPF program into the kernel already.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:36 +01:00
Daniel Borkmann
434dd7d1e9 bpf, maps: flush own entries on perf map release
The behavior of perf event arrays are quite different from all
others as they are tightly coupled to perf event fds, f.e. shown
recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array
to use struct file") to make refcounting on perf event more robust.
A remaining issue that the current code still has is that since
additions to the perf event array take a reference on the struct
file via perf_event_get() and are only released via fput() (that
cleans up the perf event eventually via perf_event_release_kernel())
when the element is either manually removed from the map from user
space or automatically when the last reference on the perf event
map is dropped. However, this leads us to dangling struct file's
when the map gets pinned after the application owning the perf
event descriptor exits, and since the struct file reference will
in such case only be manually dropped or via pinned file removal,
it leads to the perf event living longer than necessary, consuming
needlessly resources for that time.

Relations between perf event fds and bpf perf event map fds can be
rather complex. F.e. maps can act as demuxers among different perf
event fds that can possibly be owned by different threads and based
on the index selection from the program, events get dispatched to
one of the per-cpu fd endpoints. One perf event fd (or, rather a
per-cpu set of them) can also live in multiple perf event maps at
the same time, listening for events. Also, another requirement is
that perf event fds can get closed from application side after they
have been attached to the perf event map, so that on exit perf event
map will take care of dropping their references eventually. Likewise,
when such maps are pinned, the intended behavior is that a user
application does bpf_obj_get(), puts its fds in there and on exit
when fd is released, they are dropped from the map again, so the map
acts rather as connector endpoint. This also makes perf event maps
inherently different from program arrays as described in more detail
in commit c9da161c65 ("bpf: fix clearing on persistent program
array maps").

To tackle this, map entries are marked by the map struct file that
added the element to the map. And when the last reference to that map
struct file is released from user space, then the tracked entries
are purged from the map. This is okay, because new map struct files
instances resp. frontends to the anon inode are provided via
bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
for retrieving a pinned map, but also when an initial instance is
created via map_create(). The rest is resolved by the vfs layer
automatically for us by keeping reference count on the map's struct
file. Any concurrent updates on the map slot are fine as well, it
just means that perf_event_fd_array_release() needs to delete less
of its own entires.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:36 +01:00
Daniel Borkmann
d991ed9658 bpf, maps: extend map_fd_get_ptr arguments
This patch extends map_fd_get_ptr() callback that is used by fd array
maps, so that struct file pointer from the related map can be passed
in. It's safe to remove map_update_elem() callback for the two maps since
this is only allowed from syscall side, but not from eBPF programs for these
two map types. Like in per-cpu map case, bpf_fd_array_map_update_elem()
needs to be called directly here due to the extra argument.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:36 +01:00
Daniel Borkmann
755ce69559 bpf, maps: add release callback
Add a release callback for maps that is invoked when the last
reference to its struct file is gone and the struct file about
to be released by vfs. The handler will be used by fd array maps.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:35 +01:00
Alexei Starovoitov
b0abf46b6a bpf: fix matching of data/data_end in verifier
The ctx structure passed into bpf programs is different depending on bpf
program type. The verifier incorrectly marked ctx->data and ctx->data_end
access based on ctx offset only. That caused loads in tracing programs
int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. }
to be incorrectly marked as PTR_TO_PACKET which later caused verifier
to reject the program that was actually valid in tracing context.
Fix this by doing program type specific matching of ctx offsets.

Fixes: 969bf05eb3ce ("bpf: direct packet access")
Reported-by: Sasha Goldshtein <goldshtn@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:35 +01:00
Arnaldo Carvalho de Melo
940cc3d4cd perf core: Per event callchain limit
Additionally to being able to control the system wide maximum depth via
/proc/sys/kernel/perf_event_max_stack, now we are able to ask for
different depths per event, using perf_event_attr.sample_max_stack for
that.

This uses an u16 hole at the end of perf_event_attr, that, when
perf_event_attr.sample_type has the PERF_SAMPLE_CALLCHAIN, if
sample_max_stack is zero, means use perf_event_max_stack, otherwise
it'll be bounds checked under callchain_mutex.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:35 +01:00
Arnaldo Carvalho de Melo
b40ef34114 perf core: Pass max stack as a perf_callchain_entry context
This makes perf_callchain_{user,kernel}() receive the max stack
as context for the perf_callchain_entry, instead of accessing
the global sysctl_perf_event_max_stack.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:35 +01:00
Daniel Borkmann
15a6a30d32 bpf: add generic constant blinding for use in jits
This work adds a generic facility for use from eBPF JIT compilers
that allows for further hardening of JIT generated images through
blinding constants. In response to the original work on BPF JIT
spraying published by Keegan McAllister [1], most BPF JITs were
changed to make images read-only and start at a randomized offset
in the page, where the rest was filled with trap instructions. We
have this nowadays in x86, arm, arm64 and s390 JIT compilers.
Additionally, later work also made eBPF interpreter images read
only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86,
arm, arm64 and s390 archs as well currently. This is done by
default for mentioned JITs when JITing is enabled. Furthermore,
we had a generic and configurable constant blinding facility on our
todo for quite some time now to further make spraying harder, and
first implementation since around netconf 2016.

We found that for systems where untrusted users can load cBPF/eBPF
code where JIT is enabled, start offset randomization helps a bit
to make jumps into crafted payload harder, but in case where larger
programs that cross page boundary are injected, we again have some
part of the program opcodes at a page start offset. With improved
guessing and more reliable payload injection, chances can increase
to jump into such payload. Elena Reshetova recently wrote a test
case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which
can leave some more room for payloads. Note that for all this,
additional bugs in the kernel are still required to make the jump
(and of course to guess right, to not jump into a trap) and naturally
the JIT must be enabled, which is disabled by default.

For helping mitigation, the general idea is to provide an option
bpf_jit_harden that admins can tweak along with bpf_jit_enable, so
that for cases where JIT should be enabled for performance reasons,
the generated image can be further hardened with blinding constants
for unpriviledged users (bpf_jit_harden == 1), with trading off
performance for these, but not for privileged ones. We also added
the option of blinding for all users (bpf_jit_harden == 2), which
is quite helpful for testing f.e. with test_bpf.ko. There are no
further e.g. hardening levels of bpf_jit_harden switch intended,
rationale is to have it dead simple to use as on/off. Since this
functionality would need to be duplicated over and over for JIT
compilers to use, which are already complex enough, we provide a
generic eBPF byte-code level based blinding implementation, which is
then just transparently JITed. JIT compilers need to make only a few
changes to integrate this facility and can be migrated one by one.

This option is for eBPF JITs and will be used in x86, arm64, s390
without too much effort, and soon ppc64 JITs, thus that native eBPF
can be blinded as well as cBPF to eBPF migrations, so that both can
be covered with a single implementation. The rule for JITs is that
bpf_jit_blind_constants() must be called from bpf_int_jit_compile(),
and in case blinding is disabled, we follow normally with JITing the
passed program. In case blinding is enabled and we fail during the
process of blinding itself, we must return with the interpreter.
Similarly, in case the JITing process after the blinding failed, we
return normally to the interpreter with the non-blinded code. Meaning,
interpreter doesn't change in any way and operates on eBPF code as
usual. For doing this pre-JIT blinding step, we need to make use of
a helper/auxiliary register, here BPF_REG_AX. This is strictly internal
to the JIT and not in any way part of the eBPF architecture. Just like
in the same way as JITs internally make use of some helper registers
when emitting code, only that here the helper register is one
abstraction level higher in eBPF bytecode, but nevertheless in JIT
phase. That helper register is needed since f.e. manually written
program can issue loads to all registers of eBPF architecture.

The core concept with the additional register is: blind out all 32
and 64 bit constants by converting BPF_K based instructions into a
small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this
is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND,
and REG <OP> BPF_REG_AX, so actual operation on the target register
is translated from BPF_K into BPF_X one that is operating on
BPF_REG_AX's content. During rewriting phase when blinding, RND is
newly generated via prandom_u32() for each processed instruction.
64 bit loads are split into two 32 bit loads to make translation and
patching not too complex. Only basic thing required by JITs is to
call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other()
pair, and to map BPF_REG_AX into an unused register.

Small bpf_jit_disasm extract from [2] when applied to x86 JIT:

echo 0 > /proc/sys/net/core/bpf_jit_harden

  ffffffffa034f5e9 + <x>:
  [...]
  39:   mov    $0xa8909090,%eax
  3e:   mov    $0xa8909090,%eax
  43:   mov    $0xa8ff3148,%eax
  48:   mov    $0xa89081b4,%eax
  4d:   mov    $0xa8900bb0,%eax
  52:   mov    $0xa810e0c1,%eax
  57:   mov    $0xa8908eb4,%eax
  5c:   mov    $0xa89020b0,%eax
  [...]

echo 1 > /proc/sys/net/core/bpf_jit_harden

  ffffffffa034f1e5 + <x>:
  [...]
  39:   mov    $0xe1192563,%r10d
  3f:   xor    $0x4989b5f3,%r10d
  46:   mov    %r10d,%eax
  49:   mov    $0xb8296d93,%r10d
  4f:   xor    $0x10b9fd03,%r10d
  56:   mov    %r10d,%eax
  59:   mov    $0x8c381146,%r10d
  5f:   xor    $0x24c7200e,%r10d
  66:   mov    %r10d,%eax
  69:   mov    $0xeb2a830e,%r10d
  6f:   xor    $0x43ba02ba,%r10d
  76:   mov    %r10d,%eax
  79:   mov    $0xd9730af,%r10d
  7f:   xor    $0xa5073b1f,%r10d
  86:   mov    %r10d,%eax
  89:   mov    $0x9a45662b,%r10d
  8f:   xor    $0x325586ea,%r10d
  96:   mov    %r10d,%eax
  [...]

As can be seen, original constants that carry payload are hidden
when enabled, actual operations are transformed from constant-based
to register-based ones, making jumps into constants ineffective.
Above extract/example uses single BPF load instruction over and
over, but of course all instructions with constants are blinded.

Performance wise, JIT with blinding performs a bit slower than just
JIT and faster than interpreter case. This is expected, since we
still get all the performance benefits from JITing and in normal
use-cases not every single instruction needs to be blinded. Summing
up all 296 test cases averaged over multiple runs from test_bpf.ko
suite, interpreter was 55% slower than JIT only and JIT with blinding
was 8% slower than JIT only. Since there are also some extremes in
the test suite, I expect for ordinary workloads that the performance
for the JIT with blinding case is even closer to JIT only case,
f.e. nmap test case from suite has averaged timings in ns 29 (JIT),
35 (+ blinding), and 151 (interpreter).

BPF test suite, seccomp test suite, eBPF sample code and various
bigger networking eBPF programs have been tested with this and were
running fine. For testing purposes, I also adapted interpreter and
redirected blinded eBPF image to interpreter and also here all tests
pass.

  [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html
  [2] https://github.com/01org/jit-spray-poc-for-ksp/
  [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Elena Reshetova <elena.reshetova@intel.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:35 +01:00
Daniel Borkmann
ff58fbe811 bpf: move bpf_jit_enable declaration
Move the bpf_jit_enable declaration to the filter.h file where
most other core code is declared, also since we're going to add
a second knob there.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:35 +01:00
Daniel Borkmann
c49b112102 bpf: prepare bpf_int_jit_compile/bpf_prog_select_runtime apis
Since the blinding is strictly only called from inside eBPF JITs,
we need to change signatures for bpf_int_jit_compile() and
bpf_prog_select_runtime() first in order to prepare that the
eBPF program we're dealing with can change underneath. Hence,
for call sites, we need to return the latest prog. No functional
change in this patch.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:34 +01:00
Daniel Borkmann
1acbc6ed01 bpf: add bpf_patch_insn_single helper
Move the functionality to patch instructions out of the verifier
code and into the core as the new bpf_patch_insn_single() helper
will be needed later on for blinding as well. No changes in
functionality.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:34 +01:00
Alexei Starovoitov
76e4c70544 bpf: fix refcnt overflow
On a system with >32Gbyte of phyiscal memory and infinite RLIMIT_MEMLOCK,
the malicious application may overflow 32-bit bpf program refcnt.
It's also possible to overflow map refcnt on 1Tb system.
Impose 32k hard limit which means that the same bpf program or
map cannot be shared by more than 32k processes.

Fixes: 1be7f75d16 ("bpf: enable non-root eBPF programs")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:34 +01:00
Arnaldo Carvalho de Melo
7d27ed19d4 perf core: Allow setting up max frame stack depth via sysctl
The default remains 127, which is good for most cases, and not even hit
most of the time, but then for some cases, as reported by Brendan, 1024+
deep frames are appearing on the radar for things like groovy, ruby.

And in some workloads putting a _lower_ cap on this may make sense. One
that is per event still needs to be put in place tho.

The new file is:

  # cat /proc/sys/kernel/perf_event_max_stack
  127

Chaging it:

  # echo 256 > /proc/sys/kernel/perf_event_max_stack
  # cat /proc/sys/kernel/perf_event_max_stack
  256

But as soon as there is some event using callchains we get:

  # echo 512 > /proc/sys/kernel/perf_event_max_stack
  -bash: echo: write error: Device or resource busy
  #

Because we only allocate the callchain percpu data structures when there
is a user, which allows for changing the max easily, its just a matter
of having no callchain users at that point.

Reported-and-Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: David Ahern <dsahern@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Milian Wolff <milian.wolff@kdab.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Zefan Li <lizefan@huawei.com>
Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
Change-Id: Ic34ecdb4cc1e61257a2926062aa23c960dbd3b8f
2022-03-04 20:16:33 +01:00
Alexei Starovoitov
073a9fd134 perf: generalize perf_callchain
. avoid walking the stack when there is no room left in the buffer
. generalize get_perf_callchain() to be called from bpf helper

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:33 +01:00
Daniel Borkmann
990a68b5f8 bpf: add event output helper for notifications/sampling/logging
This patch adds a new helper for cls/act programs that can push events
to user space applications. For networking, this can be f.e. for sampling,
debugging, logging purposes or pushing of arbitrary wake-up events. The
idea is similar to a43eec3042 ("bpf: introduce bpf_perf_event_output()
helper") and 39111695b1 ("samples: bpf: add bpf_perf_event_output example").

The eBPF program utilizes a perf event array map that user space populates
with fds from perf_event_open(), the eBPF program calls into the helper
f.e. as skb_event_output(skb, &my_map, BPF_F_CURRENT_CPU, raw, sizeof(raw))
so that the raw data is pushed into the fd f.e. at the map index of the
current CPU.

User space can poll/mmap/etc on this and has a data channel for receiving
events that can be post-processed. The nice thing is that since the eBPF
program and user space application making use of it are tightly coupled,
they can define their own arbitrary raw data format and what/when they
want to push.

While f.e. packet headers could be one part of the meta data that is being
pushed, this is not a substitute for things like packet sockets as whole
packet is not being pushed and push is only done in a single direction.
Intention is more of a generically usable, efficient event pipe to applications.
Workflow is that tc can pin the map and applications can attach themselves
e.g. after cls/act setup to one or multiple map slots, demuxing is done by
the eBPF program.

Adding this facility is with minimal effort, it reuses the helper
introduced in a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
and we get its functionality for free by overloading its BPF_FUNC_ identifier
for cls/act programs, ctx is currently unused, but will be made use of in
future. Example will be added to iproute2's BPF example files.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:33 +01:00
Hannes Frederic Sowa
ff88d78d69 tun: use socket locks for sk_{attach,detatch}_filter
This reverts commit 5a5abb1fa3b05dd ("tun, bpf: fix suspicious RCU usage
in tun_{attach, detach}_filter") and replaces it to use lock_sock around
sk_{attach,detach}_filter. The checks inside filter.c are updated with
lockdep_sock_is_held to check for proper socket locks.

It keeps the code cleaner by ensuring that only one lock governs the
socket filter instead of two independent locks.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:32 +01:00
Daniel Borkmann
0c4eb94a68 bpf, verifier: add ARG_PTR_TO_RAW_STACK type
When passing buffers from eBPF stack space into a helper function, we have
ARG_PTR_TO_STACK argument type for helpers available. The verifier makes sure
that such buffers are initialized, within boundaries, etc.

However, the downside with this is that we have a couple of helper functions
such as bpf_skb_load_bytes() that fill out the passed buffer in the expected
success case anyway, so zero initializing them prior to the helper call is
unneeded/wasted instructions in the eBPF program that can be avoided.

Therefore, add a new helper function argument type called ARG_PTR_TO_RAW_STACK.
The idea is to skip the STACK_MISC check in check_stack_boundary() and color
the related stack slots as STACK_MISC after we checked all call arguments.

Helper functions using ARG_PTR_TO_RAW_STACK must make sure that every path of
the helper function will fill the provided buffer area, so that we cannot leak
any uninitialized stack memory. This f.e. means that error paths need to
memset() the buffers, but the expected fast-path doesn't have to do this
anymore.

Since there's no such helper needing more than at most one ARG_PTR_TO_RAW_STACK
argument, we can keep it simple and don't need to check for multiple areas.
Should in future such a use-case really appear, we have check_raw_mode() that
will make sure we implement support for it first.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:31 +01:00
Alexei Starovoitov
037c241aeb bpf: sanitize bpf tracepoint access
during bpf program loading remember the last byte of ctx access
and at the time of attaching the program to tracepoint check that
the program doesn't access bytes beyond defined in tracepoint fields

This also disallows access to __dynamic_array fields, but can be
relaxed in the future.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:31 +01:00
Alexei Starovoitov
9dd4308175 bpf: support bpf_get_stackid() and bpf_perf_event_output() in tracepoint programs
needs two wrapper functions to fetch 'struct pt_regs *' to convert
tracepoint bpf context into kprobe bpf context to reuse existing
helper functions

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:31 +01:00
Alexei Starovoitov
ba222c849b perf: split perf_trace_buf_prepare into alloc and update parts
split allows to move expensive update of 'struct trace_entry' to later phase.
Repurpose unused 1st argument of perf_tp_event() to indicate event type.

While splitting use temp variable 'rctx' instead of '*rctx' to avoid
unnecessary loads done by the compiler due to -fno-strict-aliasing

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:31 +01:00
Alexei Starovoitov
ba55e683a6 bpf: convert stackmap to pre-allocation
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.

The call_rcu is no longer feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when number of actual
stacks is significantly less than user requested max_entries.
Since elements are no longer freed into slub, we can push elements into
freelist immediately and let them be recycled.
However the very unlikley race between user space map_lookup() and
program-side recycling is possible:
     cpu0                          cpu1
     ----                          ----
user does lookup(stackidX)
starts copying ips into buffer
                                   delete(stackidX)
                                   calls bpf_get_stackid()
				   which recyles the element and
                                   overwrites with new stack trace

To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
to preserve consistent stack trace delivery to user space.
Now we can move memset(,0) of left-over element value from critical
path of bpf_get_stackid() into slow-path of user space lookup.
Also disallow lookup() from bpf program, since it's useless and
program shouldn't be messing with collected stack trace.

Note that similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.

Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:30 +01:00
Alexei Starovoitov
7a5597d3fe bpf: pre-allocate hash map elements
If kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following dead lock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.

The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.

Since it's impossible to tell whether kprobe is triggered in a safe
location from kmalloc point of view, use pre-allocation by default
and introduce new BPF_F_NO_PREALLOC flag.

While testing of per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.

Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is very common pattern used
in many of iovisor/bcc/tools, so there is additional benefit of
pre-allocation, since such use cases are must faster.

Since all hash map elements are now pre-allocated we can remove
atomic increment of htab->count and save few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.

Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida	2.1M
pcpu_ida nolock	2.3M
bt		2.4M
kmalloc		1.8M
hlist+spinlock	2.3M
pcpu_freelist	2.6M

4 cpu:
pcpu_ida	1.5M
pcpu_ida nolock	1.8M
bt w/smp_align	1.7M
bt no/smp_align	1.1M
kmalloc		0.7M
hlist+spinlock	0.2M
pcpu_freelist	2.0M

8 cpu:
pcpu_ida	0.7M
bt w/smp_align	0.8M
kmalloc		0.4M
pcpu_freelist	1.5M

32 cpu:
kmalloc		0.13M
pcpu_freelist	0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simlified and customized
for bpf use case. bt w/smp_align is using cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks continuously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist

hlist+spinlock is the simplest free list with single spinlock.
As expeceted it has very bad scaling in SMP.

kmalloc is existing implementation which is still available via
BPF_F_NO_PREALLOC flag. It's significantly slower in single cpu and
in 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but saves memory, so in cases where map->max_entries can be large
and number of map update/delete per second is low, it may make
sense to use it.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:30 +01:00
Alexei Starovoitov
0eaa91fd05 bpf: prevent kprobe+bpf deadlocks
if kprobe is placed within update or delete hash map helpers
that hold bucket spin lock and triggered bpf program is trying to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending existing recursion prevention mechanism.

Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursive check and ok as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:30 +01:00
Daniel Borkmann
ad56a61f37 bpf: add new arg_type that allows for 0 sized stack buffer
Currently, when we pass a buffer from the eBPF stack into a helper
function, the function proto indicates argument types as ARG_PTR_TO_STACK
and ARG_CONST_STACK_SIZE pair. If R<X> contains the former, then R<X+1>
must be of the latter type. Then, verifier checks whether the buffer
points into eBPF stack, is initialized, etc. The verifier also guarantees
that the constant value passed in R<X+1> is greater than 0, so helper
functions don't need to test for it and can always assume a non-NULL
initialized buffer as well as non-0 buffer size.

This patch adds a new argument types ARG_CONST_STACK_SIZE_OR_ZERO that
allows to also pass NULL as R<X> and 0 as R<X+1> into the helper function.
Such helper functions, of course, need to be able to handle these cases
internally then. Verifier guarantees that either R<X> == NULL && R<X+1> == 0
or R<X> != NULL && R<X+1> != 0 (like the case of ARG_CONST_STACK_SIZE), any
other combinations are not possible to load.

I went through various options of extending the verifier, and introducing
the type ARG_CONST_STACK_SIZE_OR_ZERO seems to have most minimal changes
needed to the verifier.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:30 +01:00
Alexei Starovoitov
5ccd789bbf bpf: introduce BPF_MAP_TYPE_STACK_TRACE
add new map type to store stack traces and corresponding helper
bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
@ctx: struct pt_regs*
@map: pointer to stack_trace map
@flags: bits 0-7 - numer of stack frames to skip
        bit 8 - collect user stack instead of kernel
        bit 9 - compare stacks by hash only
        bit 10 - if two different stacks hash into the same stackid
                 discard old
        other bits - reserved
Return: >= 0 stackid on success or negative error

stackid is a 32-bit integer handle that can be further combined with
other data (including other stackid) and used as a key into maps.

Userspace will access stackmap using standard lookup/delete syscall commands to
retrieve full stack trace for given stackid.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:29 +01:00
Craig Gallek
a383eadaa3 soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF
Expose socket options for setting a classic or extended BPF program
for use when selecting sockets in an SO_REUSEPORT group.  These options
can be used on the first socket to belong to a group before bind or
on any socket in the group after bind.

This change includes refactoring of the existing sk_filter code to
allow reuse of the existing BPF filter validation checks.

Signed-off-by: Craig Gallek <kraig@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
Change-Id: I4433dfd5d865bb69e8d3cab55cc68325829e8b49
2022-03-04 20:16:29 +01:00
Alexei Starovoitov
ea5434e7f5 bpf: add lookup/update support for per-cpu hash and array maps
The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all-cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.

Example of single counter aggregation in user space:
  unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
  long values[nr_cpus];
  long value = 0;

  bpf_lookup_elem(fd, key, values);
  for (i = 0; i < nr_cpus; i++)
    value += values[i];

The user space must provide round_up(value_size, 8) * nr_cpus
array to get/set values, since kernel will use 'long' copy
of per-cpu values to try to copy good counters atomically.
It's a best-effort, since bpf programs and user space are racing
to access the same memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:28 +01:00
Alexei Starovoitov
6f0dcf8e1d bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map
Primary use case is a histogram array of latency
where bpf program computes the latency of block requests or other
events and stores histogram of latency into array of 64 elements.
All cpus are constantly running, so normal increment is not accurate,
bpf_xadd causes cache ping-pong and this per-cpu approach allows
fastest collision-free counters.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:28 +01:00
Alexei Starovoitov
2fb3319a95 perf/bpf: Convert perf_event_array to use struct file
Robustify refcounting.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:28 +01:00
Daniel Borkmann
3ebb4bf0e8 bpf: fix clearing on persistent program array maps
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.

Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.

Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.

Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.

Joint work with Alexei.

Fixes: b2197755b2 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:27 +01:00
Anay Wadhera
f2e27d4a68 Revert "bpf: fix clearing on persistent program array maps"
This reverts commit c9da161c65.

Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:27 +01:00
Anay Wadhera
e456abf58c Revert "bpf: fix refcnt overflow"
This reverts commit 3899251bdb.

Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:27 +01:00
Anay Wadhera
1e2614fecc Revert "bpf: add bpf_patch_insn_single helper"
This reverts commit 087a92287d.

Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:26 +01:00
Anay Wadhera
4b012c46e4 Revert "bpf: prevent out-of-bounds speculation"
This reverts commit 9a7fad4c0e.

Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:25 +01:00
Anay Wadhera
e361c0effb Revert "bpf: avoid false sharing of map refcount with max_entries"
This reverts commit 96d9b2338b.

Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:25 +01:00
Anay Wadhera
22c9fac00b Revert "bpf: generally move prog destruction to RCU deferral"
This reverts commit e25dc63aa3.

Signed-off-by: Chatur27 <jasonbright2709@gmail.com>
2022-03-04 20:16:24 +01:00
Michael Bestas
438071e031 Merge remote-tracking branch 'common/android-4.4-p' into android-msm-wahoo-4.4
* common/android-4.4-p:
  Linux 4.4.302
  Input: i8042 - Fix misplaced backport of "add ASUS Zenbook Flip to noselftest list"
  KVM: x86: Fix misplaced backport of "work around leak of uninitialized stack contents"
  Revert "tc358743: fix register i2c_rd/wr function fix"
  Revert "drm/radeon/ci: disable mclk switching for high refresh rates (v2)"
  Bluetooth: MGMT: Fix misplaced BT_HS check
  ipv4: tcp: send zero IPID in SYNACK messages
  ipv4: raw: lock the socket in raw_bind()
  hwmon: (lm90) Reduce maximum conversion rate for G781
  drm/msm: Fix wrong size calculation
  net-procfs: show net devices bound packet types
  ipv4: avoid using shared IP generator for connected sockets
  net: fix information leakage in /proc/net/ptype
  ipv6_tunnel: Rate limit warning messages
  scsi: bnx2fc: Flush destroy_work queue before calling bnx2fc_interface_put()
  USB: core: Fix hang in usb_kill_urb by adding memory barriers
  usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge
  tty: Add support for Brainboxes UC cards.
  tty: n_gsm: fix SW flow control encoding/handling
  serial: stm32: fix software flow control transfer
  PM: wakeup: simplify the output logic of pm_show_wakelocks()
  udf: Fix NULL ptr deref when converting from inline format
  udf: Restore i_lenAlloc when inode expansion fails
  scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices
  s390/hypfs: include z/VM guests with access control group set
  Bluetooth: refactor malicious adv data check
  can: bcm: fix UAF of bcm op
  Linux 4.4.301
  drm/i915: Flush TLBs before releasing backing store
  Linux 4.4.300
  lib82596: Fix IRQ check in sni_82596_probe
  bcmgenet: add WOL IRQ check
  net_sched: restore "mpu xxx" handling
  dmaengine: at_xdmac: Fix at_xdmac_lld struct definition
  dmaengine: at_xdmac: Fix lld view setting
  dmaengine: at_xdmac: Print debug message after realeasing the lock
  dmaengine: at_xdmac: Don't start transactions at tx_submit level
  netns: add schedule point in ops_exit_list()
  net: axienet: fix number of TX ring slots for available check
  net: axienet: Wait for PhyRstCmplt after core reset
  af_unix: annote lockless accesses to unix_tot_inflight & gc_in_progress
  parisc: pdc_stable: Fix memory leak in pdcs_register_pathentries
  net/fsl: xgmac_mdio: Fix incorrect iounmap when removing module
  powerpc/fsl/dts: Enable WA for erratum A-009885 on fman3l MDIO buses
  ext4: don't use the orphan list when migrating an inode
  ext4: Fix BUG_ON in ext4_bread when write quota data
  ext4: set csum seed in tmp inode while migrating to extents
  ubifs: Error path in ubifs_remount_rw() seems to wrongly free write buffers
  power: bq25890: Enable continuous conversion for ADC at charging
  scsi: sr: Don't use GFP_DMA
  MIPS: Octeon: Fix build errors using clang
  i2c: designware-pci: Fix to change data types of hcnt and lcnt parameters
  ALSA: seq: Set upper limit of processed events
  w1: Misuse of get_user()/put_user() reported by sparse
  i2c: mpc: Correct I2C reset procedure
  powerpc/smp: Move setup_profiling_timer() under CONFIG_PROFILING
  i2c: i801: Don't silently correct invalid transfer size
  powerpc/btext: add missing of_node_put
  powerpc/cell: add missing of_node_put
  powerpc/powernv: add missing of_node_put
  powerpc/6xx: add missing of_node_put
  parisc: Avoid calling faulthandler_disabled() twice
  serial: core: Keep mctrl register state and cached copy in sync
  serial: pl010: Drop CR register reset on set_termios
  dm space map common: add bounds check to sm_ll_lookup_bitmap()
  dm btree: add a defensive bounds check to insert_at()
  net: mdio: Demote probed message to debug print
  btrfs: remove BUG_ON(!eie) in find_parent_nodes
  btrfs: remove BUG_ON() in find_parent_nodes()
  ACPICA: Executer: Fix the REFCLASS_REFOF case in acpi_ex_opcode_1A_0T_1R()
  ACPICA: Utilities: Avoid deleting the same object twice in a row
  um: registers: Rename function names to avoid conflicts and build problems
  ath9k: Fix out-of-bound memcpy in ath9k_hif_usb_rx_stream
  usb: hub: Add delay for SuperSpeed hub resume to let links transit to U0
  media: saa7146: hexium_gemini: Fix a NULL pointer dereference in hexium_attach()
  media: igorplugusb: receiver overflow should be reported
  net: bonding: debug: avoid printing debug logs when bond is not notifying peers
  iwlwifi: mvm: synchronize with FW after multicast commands
  media: m920x: don't use stack on USB reads
  media: saa7146: hexium_orion: Fix a NULL pointer dereference in hexium_attach()
  floppy: Add max size check for user space request
  mwifiex: Fix skb_over_panic in mwifiex_usb_recv()
  HSI: core: Fix return freed object in hsi_new_client
  media: b2c2: Add missing check in flexcop_pci_isr:
  usb: gadget: f_fs: Use stream_open() for endpoint files
  ar5523: Fix null-ptr-deref with unexpected WDCMSG_TARGET_START reply
  fs: dlm: filter user dlm messages for kernel locks
  Bluetooth: Fix debugfs entry leak in hci_register_dev()
  RDMA/cxgb4: Set queue pair state when being queried
  mips: bcm63xx: add support for clk_set_parent()
  mips: lantiq: add support for clk_set_parent()
  misc: lattice-ecp3-config: Fix task hung when firmware load failed
  ASoC: samsung: idma: Check of ioremap return value
  dmaengine: pxa/mmp: stop referencing config->slave_id
  RDMA/core: Let ib_find_gid() continue search even after empty entry
  char/mwave: Adjust io port register size
  ALSA: oss: fix compile error when OSS_DEBUG is enabled
  powerpc/prom_init: Fix improper check of prom_getprop()
  ALSA: hda: Add missing rwsem around snd_ctl_remove() calls
  ALSA: PCM: Add missing rwsem around snd_ctl_remove() calls
  ALSA: jack: Add missing rwsem around snd_ctl_remove() calls
  ext4: avoid trim error on fs with small groups
  net: mcs7830: handle usb read errors properly
  pcmcia: fix setting of kthread task states
  can: xilinx_can: xcan_probe(): check for error irq
  can: softing: softing_startstop(): fix set but not used variable warning
  spi: spi-meson-spifc: Add missing pm_runtime_disable() in meson_spifc_probe
  ppp: ensure minimum packet size in ppp_write()
  pcmcia: rsrc_nonstatic: Fix a NULL pointer dereference in nonstatic_find_mem_region()
  pcmcia: rsrc_nonstatic: Fix a NULL pointer dereference in __nonstatic_find_io_region()
  usb: ftdi-elan: fix memory leak on device disconnect
  media: msi001: fix possible null-ptr-deref in msi001_probe()
  media: saa7146: mxb: Fix a NULL pointer dereference in mxb_attach()
  media: dib8000: Fix a memleak in dib8000_init()
  floppy: Fix hang in watchdog when disk is ejected
  serial: amba-pl011: do not request memory region twice
  drm/amdgpu: Fix a NULL pointer dereference in amdgpu_connector_lcd_native_mode()
  arm64: dts: qcom: msm8916: fix MMC controller aliases
  netfilter: bridge: add support for pppoe filtering
  tty: serial: atmel: Call dma_async_issue_pending()
  tty: serial: atmel: Check return code of dmaengine_submit()
  crypto: qce - fix uaf on qce_ahash_register_one
  Bluetooth: stop proccessing malicious adv data
  Bluetooth: cmtp: fix possible panic when cmtp_init_sockets() fails
  PCI: Add function 1 DMA alias quirk for Marvell 88SE9125 SATA controller
  can: softing_cs: softingcs_probe(): fix memleak on registration failure
  media: stk1160: fix control-message timeouts
  media: pvrusb2: fix control-message timeouts
  media: dib0700: fix undefined behavior in tuner shutdown
  media: em28xx: fix control-message timeouts
  media: mceusb: fix control-message timeouts
  rtc: cmos: take rtc_lock while reading from CMOS
  nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind()
  HID: uhid: Fix worker destroying device without any protection
  rtlwifi: rtl8192cu: Fix WARNING when calling local_irq_restore() with interrupts enabled
  media: uvcvideo: fix division by zero at stream start
  drm/i915: Avoid bitwise vs logical OR warning in snb_wm_latency_quirk()
  can: gs_usb: gs_can_start_xmit(): zero-initialize hf->{flags,reserved}
  can: gs_usb: fix use of uninitialized variable, detach device on reception of invalid USB data
  mfd: intel-lpss: Fix too early PM enablement in the ACPI ->probe()
  USB: Fix "slab-out-of-bounds Write" bug in usb_hcd_poll_rh_status
  USB: core: Fix bug in resuming hub's handling of wakeup requests
  Bluetooth: bfusb: fix division by zero in send path
  Linux 4.4.299
  power: reset: ltc2952: Fix use of floating point literals
  mISDN: change function names to avoid conflicts
  net: udp: fix alignment problem in udp4_seq_show()
  ip6_vti: initialize __ip6_tnl_parm struct in vti6_siocdevprivate
  scsi: libiscsi: Fix UAF in iscsi_conn_get_param()/iscsi_conn_teardown()
  phonet: refcount leak in pep_sock_accep
  rndis_host: support Hytera digital radios
  xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate
  sch_qfq: prevent shift-out-of-bounds in qfq_init_qdisc
  i40e: Fix incorrect netdev's real number of RX/TX queues
  mac80211: initialize variable have_higher_than_11mbit
  ieee802154: atusb: fix uninit value in atusb_set_extended_addr
  Bluetooth: btusb: Apply QCA Rome patches for some ATH3012 models
  bpf, test: fix ld_abs + vlan push/pop stress test
  Linux 4.4.298
  net: fix use-after-free in tw_timer_handler
  Input: spaceball - fix parsing of movement data packets
  Input: appletouch - initialize work before device registration
  scsi: vmw_pvscsi: Set residual data length conditionally
  usb: gadget: f_fs: Clear ffs_eventfd in ffs_data_clear.
  xhci: Fresco FL1100 controller should not have BROKEN_MSI quirk set.
  uapi: fix linux/nfc.h userspace compilation errors
  nfc: uapi: use kernel size_t to fix user-space builds
  selinux: initialize proto variable in selinux_ip_postroute_compat()
  recordmcount.pl: fix typo in s390 mcount regex
  platform/x86: apple-gmux: use resource_size() with res
  Linux 4.4.297
  phonet/pep: refuse to enable an unbound pipe
  hamradio: improve the incomplete fix to avoid NPD
  hamradio: defer ax25 kfree after unregister_netdev
  ax25: NPD bug when detaching AX25 device
  xen/blkfront: fix bug in backported patch
  ARM: 9169/1: entry: fix Thumb2 bug in iWMMXt exception handling
  ALSA: drivers: opl3: Fix incorrect use of vp->state
  ALSA: jack: Check the return value of kstrdup()
  hwmon: (lm90) Fix usage of CONFIG2 register in detect function
  drivers: net: smc911x: Check for error irq
  bonding: fix ad_actor_system option setting to default
  qlcnic: potential dereference null pointer of rx_queue->page_ring
  IB/qib: Fix memory leak in qib_user_sdma_queue_pkts()
  HID: holtek: fix mouse probing
  can: kvaser_usb: get CAN clock frequency from device
  net: usb: lan78xx: add Allied Telesis AT29M2-AF

 Conflicts:
	drivers/usb/gadget/function/f_fs.c

Change-Id: I54140777477cbab1b4c6b7d77558e92ca2b30e96
2022-02-09 19:41:39 +02:00
Greg Kroah-Hartman
875c0cc811 Merge 4.4.302 into android-4.4-p
Changes in 4.4.302
	can: bcm: fix UAF of bcm op
	Bluetooth: refactor malicious adv data check
	s390/hypfs: include z/VM guests with access control group set
	scsi: zfcp: Fix failed recovery on gone remote port with non-NPIV FCP devices
	udf: Restore i_lenAlloc when inode expansion fails
	udf: Fix NULL ptr deref when converting from inline format
	PM: wakeup: simplify the output logic of pm_show_wakelocks()
	serial: stm32: fix software flow control transfer
	tty: n_gsm: fix SW flow control encoding/handling
	tty: Add support for Brainboxes UC cards.
	usb-storage: Add unusual-devs entry for VL817 USB-SATA bridge
	USB: core: Fix hang in usb_kill_urb by adding memory barriers
	scsi: bnx2fc: Flush destroy_work queue before calling bnx2fc_interface_put()
	ipv6_tunnel: Rate limit warning messages
	net: fix information leakage in /proc/net/ptype
	ipv4: avoid using shared IP generator for connected sockets
	net-procfs: show net devices bound packet types
	drm/msm: Fix wrong size calculation
	hwmon: (lm90) Reduce maximum conversion rate for G781
	ipv4: raw: lock the socket in raw_bind()
	ipv4: tcp: send zero IPID in SYNACK messages
	Bluetooth: MGMT: Fix misplaced BT_HS check
	Revert "drm/radeon/ci: disable mclk switching for high refresh rates (v2)"
	Revert "tc358743: fix register i2c_rd/wr function fix"
	KVM: x86: Fix misplaced backport of "work around leak of uninitialized stack contents"
	Input: i8042 - Fix misplaced backport of "add ASUS Zenbook Flip to noselftest list"
	Linux 4.4.302

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I5191d3cb4df0fa8de60170d2fedf4a3c51380fdf
2022-02-03 10:00:04 +01:00
Congyu Liu
8f88c78d24 net: fix information leakage in /proc/net/ptype
commit 47934e06b65637c88a762d9c98329ae6e3238888 upstream.

In one net namespace, after creating a packet socket without binding
it to a device, users in other net namespaces can observe the new
`packet_type` added by this packet socket by reading `/proc/net/ptype`
file. This is minor information leakage as packet socket is
namespace aware.

Add a net pointer in `packet_type` to keep the net namespace of
of corresponding packet socket. In `ptype_seq_show`, this net pointer
must be checked when it is not NULL.

Fixes: 2feb27dbe0 ("[NETNS]: Minor information leak via /proc/net/ptype file.")
Signed-off-by: Congyu Liu <liu3101@purdue.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-02-03 09:27:53 +01:00
Michael Bestas
ec66dd5675 Merge remote-tracking branch 'common/android-4.4-p' into android-msm-wahoo-4.4
* common/android-4.4-p:
  Linux 4.4.296
  xen/netback: don't queue unlimited number of packages
  xen/console: harden hvc_xen against event channel storms
  xen/netfront: harden netfront against event channel storms
  xen/blkfront: harden blkfront against event channel storms
  Input: touchscreen - avoid bitwise vs logical OR warning
  ARM: 8805/2: remove unneeded naked function usage
  net: lan78xx: Avoid unnecessary self assignment
  net: systemport: Add global locking for descriptor lifecycle
  timekeeping: Really make sure wall_to_monotonic isn't positive
  USB: serial: option: add Telit FN990 compositions
  PCI/MSI: Clear PCI_MSIX_FLAGS_MASKALL on error
  USB: gadget: bRequestType is a bitfield, not a enum
  igbvf: fix double free in `igbvf_probe`
  soc/tegra: fuse: Fix bitwise vs. logical OR warning
  nfsd: fix use-after-free due to delegation race
  dm btree remove: fix use after free in rebalance_children()
  recordmcount.pl: look for jgnop instruction as well as bcrl on s390
  mac80211: send ADDBA requests using the tid/queue of the aggregation session
  hwmon: (dell-smm) Fix warning on /proc/i8k creation error
  net: netlink: af_netlink: Prevent empty skb by adding a check on len.
  i2c: rk3x: Handle a spurious start completion interrupt flag
  parisc/agp: Annotate parisc agp init functions with __init
  nfc: fix segfault in nfc_genl_dump_devices_done
  FROMGIT: USB: gadget: bRequestType is a bitfield, not a enum
  Linux 4.4.295
  irqchip: nvic: Fix offset for Interrupt Priority Offsets
  irqchip/irq-gic-v3-its.c: Force synchronisation when issuing INVALL
  iio: accel: kxcjk-1013: Fix possible memory leak in probe and remove
  iio: itg3200: Call iio_trigger_notify_done() on error
  iio: ltr501: Don't return error code in trigger handler
  iio: mma8452: Fix trigger reference couting
  iio: stk3310: Don't return error code in interrupt handler
  usb: core: config: fix validation of wMaxPacketValue entries
  USB: gadget: zero allocate endpoint 0 buffers
  USB: gadget: detect too-big endpoint 0 requests
  net/qla3xxx: fix an error code in ql_adapter_up()
  net, neigh: clear whole pneigh_entry at alloc time
  net: fec: only clear interrupt of handling queue in fec_enet_rx_queue()
  net: altera: set a couple error code in probe()
  net: cdc_ncm: Allow for dwNtbOutMaxSize to be unset or zero
  block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2)
  tracefs: Set all files to the same group ownership as the mount option
  signalfd: use wake_up_pollfree()
  binder: use wake_up_pollfree()
  wait: add wake_up_pollfree()
  libata: add horkage for ASMedia 1092
  can: pch_can: pch_can_rx_normal: fix use after free
  tracefs: Have new files inherit the ownership of their parent
  ALSA: pcm: oss: Handle missing errors in snd_pcm_oss_change_params*()
  ALSA: pcm: oss: Limit the period size to 16MB
  ALSA: pcm: oss: Fix negative period/buffer sizes
  ALSA: ctl: Fix copy of updated id with element read/write
  mm: bdi: initialize bdi_min_ratio when bdi is unregistered
  nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done
  can: sja1000: fix use after free in ems_pcmcia_add_card()
  HID: check for valid USB device for many HID drivers
  HID: wacom: fix problems when device is not a valid USB device
  HID: add USB_HID dependancy on some USB HID drivers
  HID: add USB_HID dependancy to hid-chicony
  HID: add USB_HID dependancy to hid-prodikeys
  HID: add hid_is_usb() function to make it simpler for USB detection
  HID: introduce hid_is_using_ll_driver
  UPSTREAM: USB: gadget: zero allocate endpoint 0 buffers
  UPSTREAM: USB: gadget: detect too-big endpoint 0 requests
  Linux 4.4.294
  serial: pl011: Add ACPI SBSA UART match id
  tty: serial: msm_serial: Deactivate RX DMA for polling support
  vgacon: Propagate console boot parameters before calling `vc_resize'
  parisc: Fix "make install" on newer debian releases
  siphash: use _unaligned version by default
  net: qlogic: qlcnic: Fix a NULL pointer dereference in qlcnic_83xx_add_rings()
  natsemi: xtensa: fix section mismatch warnings
  fget: check that the fd still exists after getting a ref to it
  fs: add fget_many() and fput_many()
  sata_fsl: fix warning in remove_proc_entry when rmmod sata_fsl
  sata_fsl: fix UAF in sata_fsl_port_stop when rmmod sata_fsl
  kprobes: Limit max data_size of the kretprobe instances
  net: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock()
  net: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound
  scsi: iscsi: Unblock session then wake up error handler
  s390/setup: avoid using memblock_enforce_memory_limit
  platform/x86: thinkpad_acpi: Fix WWAN device disabled issue after S3 deep
  net: return correct error code
  hugetlb: take PMD sharing into account when flushing tlb/caches
  tty: hvc: replace BUG_ON() with negative return value
  xen/netfront: don't trust the backend response data blindly
  xen/netfront: disentangle tx_skb_freelist
  xen/netfront: don't read data from request on the ring page
  xen/netfront: read response from backend only once
  xen/blkfront: don't trust the backend response data blindly
  xen/blkfront: don't take local copy of a request from the ring page
  xen/blkfront: read response from backend only once
  xen: sync include/xen/interface/io/ring.h with Xen's newest version
  shm: extend forced shm destroy to support objects from several IPC nses
  fuse: release pipe buf after last use
  fuse: fix page stealing
  NFC: add NCI_UNREG flag to eliminate the race
  proc/vmcore: fix clearing user buffer by properly using clear_user()
  hugetlbfs: flush TLBs correctly after huge_pmd_unshare
  tracing: Check pid filtering when creating events
  tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows
  scsi: mpt3sas: Fix kernel panic during drive powercycle test
  ARM: socfpga: Fix crash with CONFIG_FORTIRY_SOURCE
  NFSv42: Don't fail clone() unless the OP_CLONE operation failed
  net: ieee802154: handle iftypes as u32
  ASoC: topology: Add missing rwsem around snd_ctl_remove() calls
  ARM: dts: BCM5301X: Add interrupt properties to GPIO node
  xen: detect uninitialized xenbus in xenbus_init
  xen: don't continue xenstore initialization in case of errors
  staging: rtl8192e: Fix use after free in _rtl92e_pci_disconnect()
  ALSA: ctxfi: Fix out-of-range access
  binder: fix test regression due to sender_euid change
  usb: hub: Fix locking issues with address0_mutex
  usb: hub: Fix usb enumeration issue due to address0 race
  USB: serial: option: add Fibocom FM101-GL variants
  USB: serial: option: add Telit LE910S1 0x9200 composition
  staging: ion: Prevent incorrect reference counting behavour

 Conflicts:
	fs/file_table.c

Change-Id: Ie66d41bc083d3d53fc89fba73b6e57bcc18e1c4a
2021-12-27 01:17:29 +02:00
Michael Bestas
19f08171e3 Merge remote-tracking branch 'common/android-4.4-p' into android-msm-wahoo-4.4
* common/android-4.4-p:
  Linux 4.4.293
  usb: max-3421: Use driver data instead of maintaining a list of bound devices
  ASoC: DAPM: Cover regression by kctl change notification fix
  batman-adv: Avoid WARN_ON timing related checks
  batman-adv: Don't always reallocate the fragmentation skb head
  batman-adv: Reserve needed_*room for fragments
  batman-adv: Consider fragmentation for needed_headroom
  batman-adv: set .owner to THIS_MODULE
  batman-adv: mcast: fix duplicate mcast packets from BLA backbone to mesh
  batman-adv: mcast: fix duplicate mcast packets in BLA backbone from mesh
  batman-adv: mcast: fix duplicate mcast packets in BLA backbone from LAN
  batman-adv: Prevent duplicated softif_vlan entry
  batman-adv: Fix multicast TT issues with bogus ROAM flags
  batman-adv: Keep fragments equally sized
  drm/amdgpu: fix set scaling mode Full/Full aspect/Center not works on vga and dvi connectors
  drm/udl: fix control-message timeout
  cfg80211: call cfg80211_stop_ap when switch from P2P_GO type
  parisc/sticon: fix reverse colors
  btrfs: fix memory ordering between normal and ordered work functions
  mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
  hexagon: export raw I/O routines for modules
  tun: fix bonding active backup with arp monitoring
  NFC: reorder the logic in nfc_{un,}register_device
  NFC: reorganize the functions in nci_request
  platform/x86: hp_accel: Fix an error handling path in 'lis3lv02d_probe()'
  mips: bcm63xx: add support for clk_get_parent()
  net: bnx2x: fix variable dereferenced before check
  sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()
  mips: BCM63XX: ensure that CPU_SUPPORTS_32BIT_KERNEL is set
  sh: define __BIG_ENDIAN for math-emu
  sh: fix kconfig unmet dependency warning for FRAME_POINTER
  maple: fix wrong return value of maple_bus_init().
  sh: check return code of request_irq
  powerpc/dcr: Use cmplwi instead of 3-argument cmpli
  ALSA: gus: fix null pointer dereference on pointer block
  powerpc/5200: dts: fix memory node unit name
  scsi: target: Fix alua_tg_pt_gps_count tracking
  scsi: target: Fix ordered tag handling
  MIPS: sni: Fix the build
  tty: tty_buffer: Fix the softlockup issue in flush_to_ldisc
  usb: host: ohci-tmio: check return value after calling platform_get_resource()
  ARM: dts: omap: fix gpmc,mux-add-data type
  scsi: advansys: Fix kernel pointer leak
  usb: musb: tusb6010: check return value after calling platform_get_resource()
  scsi: lpfc: Fix list_add() corruption in lpfc_drain_txq()
  net: batman-adv: fix error handling
  PCI/MSI: Destroy sysfs before freeing entries
  parisc/entry: fix trace test in syscall exit path
  PCI: Add PCI_EXP_DEVCTL_PAYLOAD_* macros
  mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
  ARM: 9156/1: drop cc-option fallbacks for architecture selection
  USB: chipidea: fix interrupt deadlock
  vsock: prevent unnecessary refcnt inc for nonblocking connect
  nfc: pn533: Fix double free when pn533_fill_fragment_skbs() fails
  llc: fix out-of-bound array index in llc_sk_dev_hash()
  bonding: Fix a use-after-free problem when bond_sysfs_slave_add() failed
  net: davinci_emac: Fix interrupt pacing disable
  xen-pciback: Fix return in pm_ctrl_init()
  scsi: qla2xxx: Turn off target reset during issue_lip
  watchdog: f71808e_wdt: fix inaccurate report in WDIOC_GETTIMEOUT
  m68k: set a default value for MEMORY_RESERVE
  netfilter: nfnetlink_queue: fix OOB when mac header was cleared
  dmaengine: at_xdmac: fix AT_XDMAC_CC_PERID() macro
  RDMA/mlx4: Return missed an error if device doesn't support steering
  scsi: csiostor: Uninitialized data in csio_ln_vnp_read_cbfn()
  power: supply: rt5033_battery: Change voltage values to µV
  usb: gadget: hid: fix error code in do_config()
  serial: 8250_dw: Drop wrong use of ACPI_PTR()
  video: fbdev: chipsfb: use memset_io() instead of memset()
  memory: fsl_ifc: fix leak of irq and nand_irq in fsl_ifc_ctrl_probe
  JFS: fix memleak in jfs_mount
  scsi: dc395: Fix error case unwinding
  ARM: s3c: irq-s3c24xx: Fix return value check for s3c24xx_init_intc()
  crypto: pcrypt - Delay write to padata->info
  libertas: Fix possible memory leak in probe and disconnect
  libertas_tf: Fix possible memory leak in probe and disconnect
  smackfs: use netlbl_cfg_cipsov4_del() for deleting cipso_v4_doi
  mwifiex: Send DELBA requests according to spec
  platform/x86: thinkpad_acpi: Fix bitwise vs. logical warning
  net: stream: don't purge sk_error_queue in sk_stream_kill_queues()
  drm/msm: uninitialized variable in msm_gem_import()
  memstick: jmb38x_ms: use appropriate free function in jmb38x_ms_alloc_host()
  memstick: avoid out-of-range warning
  b43: fix a lower bounds test
  b43legacy: fix a lower bounds test
  crypto: qat - detect PFVF collision after ACK
  ath9k: Fix potential interrupt storm on queue reset
  cpuidle: Fix kobject memory leaks in error paths
  media: si470x: Avoid card name truncation
  media: dvb-usb: fix ununit-value in az6027_rc_query
  parisc/kgdb: add kgdb_roundup() to make kgdb work with idle polling
  parisc: fix warning in flush_tlb_all
  ARM: 9136/1: ARMv7-M uses BE-8, not BE-32
  ARM: clang: Do not rely on lr register for stacktrace
  smackfs: use __GFP_NOFAIL for smk_cipso_doi()
  iwlwifi: mvm: disable RX-diversity in powersave
  PM: hibernate: Get block device exclusively in swsusp_check()
  mwl8k: Fix use-after-free in mwl8k_fw_state_machine()
  lib/xz: Validate the value before assigning it to an enum variable
  lib/xz: Avoid overlapping memcpy() with invalid input with in-place decompression
  memstick: r592: Fix a UAF bug when removing the driver
  ACPI: battery: Accept charges over the design capacity as full
  ath: dfs_pattern_detector: Fix possible null-pointer dereference in channel_detector_create()
  tracefs: Have tracefs directories not set OTH permission bits by default
  media: usb: dvd-usb: fix uninit-value bug in dibusb_read_eeprom_byte()
  ACPICA: Avoid evaluating methods too early during system resume
  ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK
  media: mceusb: return without resubmitting URB in case of -EPROTO error.
  media: s5p-mfc: fix possible null-pointer dereference in s5p_mfc_probe()
  media: uvcvideo: Set capability in s_param
  media: netup_unidvb: handle interrupt properly according to the firmware
  media: mt9p031: Fix corrupted frame after restarting stream
  x86: Increase exception stack sizes
  smackfs: Fix use-after-free in netlbl_catmap_walk()
  MIPS: lantiq: dma: reset correct number of channel
  MIPS: lantiq: dma: add small delay after reset
  platform/x86: wmi: do not fail if disabling fails
  Bluetooth: fix use-after-free error in lock_sock_nested()
  Bluetooth: sco: Fix lock_sock() blockage by memcpy_from_msg()
  USB: iowarrior: fix control-message timeouts
  USB: serial: keyspan: fix memleak on probe errors
  iio: dac: ad5446: Fix ad5622_write() return value
  quota: correct error number in free_dqentry()
  quota: check block number when reading the block in quota file
  ALSA: mixer: fix deadlock in snd_mixer_oss_set_volume
  ALSA: mixer: oss: Fix racy access to slots
  power: supply: max17042_battery: use VFSOC for capacity when no rsns
  power: supply: max17042_battery: Prevent int underflow in set_soc_threshold
  signal: Remove the bogus sigkill_pending in ptrace_stop
  mwifiex: Read a PCI register after writing the TX ring write pointer
  wcn36xx: Fix HT40 capability for 2Ghz band
  PCI: Mark Atheros QCA6174 to avoid bus reset
  ath6kl: fix control-message timeout
  ath6kl: fix division by zero in send path
  mwifiex: fix division by zero in fw download path
  EDAC/sb_edac: Fix top-of-high-memory value for Broadwell/Haswell
  hwmon: (pmbus/lm25066) Add offset coefficients
  btrfs: fix lost error handling when replaying directory deletes
  vmxnet3: do not stop tx queues after netif_device_detach()
  spi: spl022: fix Microwire full duplex mode
  xen/netfront: stop tx queues during live migration
  mmc: winbond: don't build on M68K
  hyperv/vmbus: include linux/bitops.h
  x86/irq: Ensure PI wakeup handler is unregistered before module unload
  ALSA: timer: Unconditionally unlink slave instances, too
  ALSA: timer: Fix use-after-free problem
  ALSA: synth: missing check for possible NULL after the call to kstrdup
  ALSA: line6: fix control and interrupt message timeouts
  ALSA: 6fire: fix control and bulk message timeouts
  ALSA: ua101: fix division by zero at probe
  media: ite-cir: IR receiver stop working after receive overflow
  parisc: Fix ptrace check on syscall return
  mmc: dw_mmc: Dont wait for DRTO on Write RSP error
  ocfs2: fix data corruption on truncate
  libata: fix read log timeout value
  Input: i8042 - Add quirk for Fujitsu Lifebook T725
  Input: elantench - fix misreporting trackpoint coordinates
  xhci: Fix USB 3.1 enumeration issues by increasing roothub power-on-good delay
  binder: use cred instead of task for selinux checks
  binder: use euid from cred instead of using task
  FROMGIT: binder: fix test regression due to sender_euid change
  BACKPORT: binder: use cred instead of task for selinux checks
  BACKPORT: binder: use euid from cred instead of using task
  BACKPORT: ip_gre: add validation for csum_start
  Linux 4.4.292
  rsi: fix control-message timeout
  staging: rtl8192u: fix control-message timeouts
  staging: r8712u: fix control-message timeout
  comedi: vmk80xx: fix bulk and interrupt message timeouts
  comedi: vmk80xx: fix bulk-buffer overflow
  comedi: vmk80xx: fix transfer-buffer overflows
  staging: comedi: drivers: replace le16_to_cpu() with usb_endpoint_maxp()
  comedi: ni_usb6501: fix NULL-deref in command paths
  comedi: dt9812: fix DMA buffers on stack
  isofs: Fix out of bound access for corrupted isofs image
  usb: hso: fix error handling code of hso_create_net_device
  printk/console: Allow to disable console output by using console="" or console=null
  usb-storage: Add compatibility quirk flags for iODD 2531/2541
  usb: gadget: Mark USB_FSL_QE broken on 64-bit
  IB/qib: Protect from buffer overflow in struct qib_user_sdma_pkt fields
  IB/qib: Use struct_size() helper
  net: hso: register netdev later to avoid a race condition
  ARM: 9120/1: Revert "amba: make use of -1 IRQs warn"
  scsi: core: Put LLD module refcnt after SCSI device is released

 Conflicts:
	net/bluetooth/l2cap_sock.c

Change-Id: I066f7145b7245f9c95e1c78f84f0871a9825150f
2021-12-27 01:16:43 +02:00
Greg Kroah-Hartman
f9fe8dd7fb Merge 4.4.295 into android-4.4-p
Changes in 4.4.295
	HID: introduce hid_is_using_ll_driver
	HID: add hid_is_usb() function to make it simpler for USB detection
	HID: add USB_HID dependancy to hid-prodikeys
	HID: add USB_HID dependancy to hid-chicony
	HID: add USB_HID dependancy on some USB HID drivers
	HID: wacom: fix problems when device is not a valid USB device
	HID: check for valid USB device for many HID drivers
	can: sja1000: fix use after free in ems_pcmcia_add_card()
	nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done
	mm: bdi: initialize bdi_min_ratio when bdi is unregistered
	ALSA: ctl: Fix copy of updated id with element read/write
	ALSA: pcm: oss: Fix negative period/buffer sizes
	ALSA: pcm: oss: Limit the period size to 16MB
	ALSA: pcm: oss: Handle missing errors in snd_pcm_oss_change_params*()
	tracefs: Have new files inherit the ownership of their parent
	can: pch_can: pch_can_rx_normal: fix use after free
	libata: add horkage for ASMedia 1092
	wait: add wake_up_pollfree()
	binder: use wake_up_pollfree()
	signalfd: use wake_up_pollfree()
	tracefs: Set all files to the same group ownership as the mount option
	block: fix ioprio_get(IOPRIO_WHO_PGRP) vs setuid(2)
	net: cdc_ncm: Allow for dwNtbOutMaxSize to be unset or zero
	net: altera: set a couple error code in probe()
	net: fec: only clear interrupt of handling queue in fec_enet_rx_queue()
	net, neigh: clear whole pneigh_entry at alloc time
	net/qla3xxx: fix an error code in ql_adapter_up()
	USB: gadget: detect too-big endpoint 0 requests
	USB: gadget: zero allocate endpoint 0 buffers
	usb: core: config: fix validation of wMaxPacketValue entries
	iio: stk3310: Don't return error code in interrupt handler
	iio: mma8452: Fix trigger reference couting
	iio: ltr501: Don't return error code in trigger handler
	iio: itg3200: Call iio_trigger_notify_done() on error
	iio: accel: kxcjk-1013: Fix possible memory leak in probe and remove
	irqchip/irq-gic-v3-its.c: Force synchronisation when issuing INVALL
	irqchip: nvic: Fix offset for Interrupt Priority Offsets
	Linux 4.4.295

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4810f7e2ebd538739fb0d62f9eade0921a9badfb
2021-12-14 10:35:51 +01:00
Eric Biggers
d0ceebaae0 wait: add wake_up_pollfree()
commit 42288cb44c4b5fff7653bc392b583a2b8bd6a8c0 upstream.

Several ->poll() implementations are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case.  This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution.  This solution is for the queue to be cleared
before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.

However, that has a bug: wake_up_poll() calls __wake_up() with
nr_exclusive=1.  Therefore, if there are multiple "exclusive" waiters,
and the wakeup function for the first one returns a positive value, only
that one will be called.  That's *not* what's needed for POLLFREE;
POLLFREE is special in that it really needs to wake up everyone.

Considering the three non-blocking poll systems:

- io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.

- aio poll is unaffected, since it doesn't support exclusive waits.
  However, that's fragile, as someone could add this feature later.

- epoll doesn't appear to be broken by this, since its wakeup function
  returns 0 when it sees POLLFREE.  But this is fragile.

Although there is a workaround (see epoll), it's better to define a
function which always sends POLLFREE to all waiters.  Add such a
function.  Also make it verify that the queue really becomes empty after
all waiters have been woken up.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-12-14 10:03:49 +01:00