This commit defers reporting of RCU-preempt quiescent states at
rcu_read_unlock_special() time when any of interrupts, softirq, or
preemption are disabled. These deferred quiescent states are reported
at a later RCU_SOFTIRQ, context switch, idle entry, or CPU-hotplug
offline operation. Of course, if another RCU read-side critical
section has started in the meantime, the reporting of the quiescent
state will be further deferred.
This also means that disabling preemption, interrupts, and/or
softirqs will act as an RCU-preempt read-side critical section.
This is enforced by checking preempt_count() as needed.
Some special cases must be handled on an ad-hoc basis, for example,
context switch is a quiescent state even though both the scheduler and
do_exit() disable preemption. In these cases, additional calls to
rcu_preempt_deferred_qs() override the preemption disabling. Similar
logic overrides disabled interrupts in rcu_preempt_check_callbacks()
because in this case the quiescent state happened just before the
corresponding scheduling-clock interrupt.
In theory, this change lifts a long-standing restriction that required
that if interrupts were disabled across a call to rcu_read_unlock()
that the matching rcu_read_lock() also be contained within that
interrupts-disabled region of code. Because the reporting of the
corresponding RCU-preempt quiescent state is now deferred until
after interrupts have been enabled, it is no longer possible for this
situation to result in deadlocks involving the scheduler's runqueue and
priority-inheritance locks. This may allow some code simplification that
might reduce interrupt latency a bit. Unfortunately, in practice this
would also defer deboosting a low-priority task that had been subjected
to RCU priority boosting, so real-time-response considerations might
well force this restriction to remain in place.
Because RCU-preempt grace periods are now blocked not only by RCU
read-side critical sections, but also by disabling of interrupts,
preemption, and softirqs, it will be possible to eliminate RCU-bh and
RCU-sched in favor of RCU-preempt in CONFIG_PREEMPT=y kernels. This may
require some additional plumbing to provide the network denial-of-service
guarantees that have been traditionally provided by RCU-bh. Once these
are in place, CONFIG_PREEMPT=n kernels will be able to fold RCU-bh
into RCU-sched. This would mean that all kernels would have but
one flavor of RCU, which would open the door to significant code
cleanup.
Moving to a single flavor of RCU would also have the beneficial effect
of reducing the NOCB kthreads by at least a factor of two.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: Apply rcu_read_unlock_special() preempt_count() feedback
from Joel Fernandes. ]
[ paulmck: Adjust rcu_eqs_enter() call to rcu_preempt_deferred_qs() in
response to bug reports from kbuild test robot. ]
[ paulmck: Fix bug located by kbuild test robot involving recursion
via rcu_preempt_deferred_qs(). ]
Change-Id: If76ec913be9db64e7d1e70f408407229f5291af2
Expose kernel's BTF under the name vmlinux to be more uniform with using
kernel module names as file names in the future.
Fixes: 341dfcf8d78e ("btf: expose BTF info through sysfs")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Change-Id: I0c478e34bd67044a37b166b60c1de07025c3ec90
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Make .BTF section allocated and expose its contents through sysfs.
/sys/kernel/btf directory is created to contain all the BTFs present
inside kernel. Currently there is only kernel's main BTF, represented as
/sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
each module will expose its BTF as /sys/kernel/btf/<module-name> file.
Current approach relies on a few pieces coming together:
1. pahole is used to take almost final vmlinux image (modulo .BTF and
kallsyms) and generate .BTF section by converting DWARF info into
BTF. This section is not allocated and not mapped to any segment,
though, so is not yet accessible from inside kernel at runtime.
2. objcopy dumps .BTF contents into binary file and subsequently
convert binary file into linkable object file with automatically
generated symbols _binary__btf_kernel_bin_start and
_binary__btf_kernel_bin_end, pointing to start and end, respectively,
of BTF raw data.
3. final vmlinux image is generated by linking this object file (and
kallsyms, if necessary). sysfs_btf.c then creates
/sys/kernel/btf/kernel file and exposes embedded BTF contents through
it. This allows, e.g., libbpf and bpftool access BTF info at
well-known location, without resorting to searching for vmlinux image
on disk (location of which is not standardized and vmlinux image
might not be even available in some scenarios, e.g., inside qemu
during testing).
Alternative approach using .incbin assembler directive to embed BTF
contents directly was attempted but didn't work, because sysfs_proc.o is
not re-compiled during link-vmlinux.sh stage. This is required, though,
to update embedded BTF data (initially empty data is embedded, then
pahole generates BTF info and we need to regenerate sysfs_btf.o with
updated contents, but it's too late at that point).
If BTF couldn't be generated due to missing or too old pahole,
sysfs_btf.c handles that gracefully by detecting that
_binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
/sys/kernel/btf at all.
v2->v3:
- added Documentation/ABI/testing/sysfs-kernel-btf (Greg K-H);
- created proper kobject (btf_kobj) for btf directory (Greg K-H);
- undo v2 change of reusing vmlinux, as it causes extra kallsyms pass
due to initially missing __binary__btf_kernel_bin_{start/end} symbols;
v1->v2:
- allow kallsyms stage to re-use vmlinux generated by gen_btf();
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Change-Id: Ife57356dd14b9ae0d2e3801eea7c602c6245952f
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Currently, AF_XDP only supports a fixed frame-size memory scheme where
each frame is referenced via an index (idx). A user passes the frame
index to the kernel, and the kernel acts upon the data. Some NICs,
however, do not have a fixed frame-size model, instead they have a
model where a memory window is passed to the hardware and multiple
frames are filled into that window (referred to as the "type-writer"
model).
By changing the descriptor format from the current frame index
addressing scheme, AF_XDP can in the future be extended to support
these kinds of NICs.
In the index-based model, an idx refers to a frame of size
frame_size. Addressing a frame in the UMEM is done by offseting the
UMEM starting address by a global offset, idx * frame_size + offset.
Communicating via the fill- and completion-rings are done by means of
idx.
In this commit, the idx is removed in favor of an address (addr),
which is a relative address ranging over the UMEM. To convert an
idx-based address to the new addr is simply: addr = idx * frame_size +
offset.
We also stop referring to the UMEM "frame" as a frame. Instead it is
simply called a chunk.
To transfer ownership of a chunk to the kernel, the addr of the chunk
is passed in the fill-ring. Note, that the kernel will mask addr to
make it chunk aligned, so there is no need for userspace to do
that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or
3000 to the fill-ring will refer to the same chunk.
On the completion-ring, the addr will match that of the Tx descriptor,
passed to the kernel.
Changing the descriptor format to use chunks/addr will allow for
future changes to move to a type-writer based model, where multiple
frames can reside in one chunk. In this model passing one single chunk
into the fill-ring, would potentially result in multiple Rx
descriptors.
This commit changes the uapi of AF_XDP sockets, and updates the
documentation.
Change-Id: I375aff25e625bddb164b6047cd60c4deb35ba557
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This is a sample application for AF_XDP sockets. The application
supports three different modes of operation: rxdrop, txonly and l2fwd.
To show-case a simple round-robin load-balancing between a set of
sockets in an xskmap, set the RR_LB compile time define option to 1 in
"xdpsock.h".
v2: The entries variable was calculated twice in {umem,xq}_nb_avail.
Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
Change-Id: I39c4d5fb7c5d4448cc44206520c3d8aa3c13da4e
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
A lot of ->destroy_inode() instances end with call_rcu() of a callback
that does RCU-delayed part of freeing. Introduce a new method for
doing just that, with saner signature.
Rules:
->destroy_inode ->free_inode
f g immediate call of f(),
RCU-delayed call of g()
f NULL immediate call of f(),
no RCU-delayed calls
NULL g RCU-delayed call of g()
NULL NULL RCU-delayed default freeing
IOW, NULL ->free_inode gives the same behaviour as now.
Note that NULL, NULL is equivalent to NULL, free_inode_nonrcu; we could
mandate the latter form, but that would have very little benefit beyond
making rules a bit more symmetric. It would break backwards compatibility,
require extra boilerplate and expected semantics for (NULL, NULL) pair
would have no use whatsoever...
Change-Id: I828dd519b381a1f25013ec3d1eae0b6835eb9c26
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The new eBPF instruction class JMP32 uses the reserved class number 0x6.
Kernel BPF ISA documentation updated accordingly.
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Change-Id: Idfa9c7ef537b67b06ffea33db60785f754f9bfea
Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
dereference_symbol_descriptor() invokes appropriate ARCH specific
function descriptor dereference callbacks:
- dereference_kernel_function_descriptor() if the pointer is a
kernel symbol;
- dereference_module_function_descriptor() if the pointer is a
module symbol.
This is the last step needed to make '%pS/%ps' smart enough to
handle function descriptor dereference on affected ARCHs and
to retire '%pF/%pf'.
To refresh it:
Some architectures (ia64, ppc64, parisc64) use an indirect pointer
for C function pointers - the function pointer points to a function
descriptor and we need to dereference it to get the actual function
pointer.
Function descriptors live in .opd elf section and all affected
ARCHs (ia64, ppc64, parisc64) handle it properly for kernel and
modules. So we, technically, can decide if the dereference is
needed by simply looking at the pointer: if it belongs to .opd
section then we need to dereference it.
The kernel and modules have their own .opd sections, obviously,
that's why we need to split dereference_function_descriptor()
and use separate kernel and module dereference arch callbacks.
Link: http://lkml.kernel.org/r/20171206043649.GB15885@jagdpanzerIV
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: James Bottomley <jejb@parisc-linux.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Change-Id: I102222c3b164340bd683e29fe61cc963eb5106eb
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Tony Luck <tony.luck@intel.com> #ia64
Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc
Tested-by: Helge Deller <deller@gmx.de> #parisc64
Signed-off-by: Petr Mladek <pmladek@suse.com>
From now on, I'll start using my @kernel.org as my development e-mail.
As such, let's remove the entries that point to the old
mchehab@s-opensource.com at MAINTAINERS file.
For the files written with a copyright with mchehab@s-opensource,
let's keep Samsung on their names, using mchehab+samsung@kernel.org,
in order to keep pointing to my employer, with sponsors the work.
For the files written before I join Samsung (on July, 4 2013),
let's just use mchehab@kernel.org.
For bug reports, we can simply point to just kernel.org, as
this will reach my mchehab+samsung inbox anyway.
[Linux4: Only keep RC related bits]
Change-Id: I336d95e919b1566af32b7dac66e2da301503d0ea
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Brian Warner <brian.warner@samsung.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Since the kernel now modifies the timeout, make it possible to retrieve
the current value.
Change-Id: If925b7d4bc77cce2b1d8200251aac1db124b4d9c
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
On the raspberry pi, we might have two lirc devices; one for sending and
one for receiving. This change makes it much more apparent which one
is which.
Change-Id: I8c8a566886f184575b47cec5ec4e2d91dc3c7dff
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Since commit a60d64b15c20 ("media: lirc: lirc interface should not be
a raw decoder"), the message in the documentation is incorrect as the
module name is rc_core, not lirc_dev. Since the message is not useful,
just make the message debug and remove it from the documentation.
Change-Id: I945a6a2f99e27e8a62e62a3e8f5547b73de93c0a
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Since commit 9b6192589be7 ("media: lirc: implement scancode sending"),
and commit de142c324106 ("media: lirc: implement reading scancode")
the lirc features ioctl for raw IR devices advertises two modes for
sending and receiving.
The lirc daemon now fails to detect a raw IR device, both for transmit
and receive.
To fix this, do not advertise the scancode mode in the lirc features
for raw IR devices (however do keep it for scancode devices). The mode
can still be used via the LIRC_SET_{REC,SEND}_MODE ioctl.
Change-Id: Ibdfe2d4ef9f561380a0e27e705df8cd79faa81bc
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
The semantics for lirc IR transmit with raw IR is that the write call
should block until the IR is transmitted. Some drivers have no idea
when this actually is (e.g. mceusb), so there is a wait.
This is useful for userspace, as it might want to send a IR button press,
a gap of a predefined number of milliseconds, and then send a repeat
message.
It turns out that for transmitting scancodes this feature is even more
useful, as user space has no idea how long the IR is. So, maintain
the existing semantics for IR scancode transmit.
Change-Id: Ib8070358522cacc6afcd1516757599f2550a1bbb
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
This is part of the uAPI. Add it to the documentation again,
and fix cross-references.
Change-Id: I47aee8a3c0d080db6245b03c12299f0e191123ac
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Acked-by: Sean Young <sean@mess.org>
Lirc supports a new mode which requires documentation.
Change-Id: Icc626fba59fdfc2adb7d544aa77ad91d7314d2e6
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
rc-core has replaced the lirc kapi many years ago, and now with the last
driver ported to rc-core, we can finally remove it.
Note this has no effect on userspace.
All future IR drivers should use the rc-core api.
Change-Id: Ie00c3ecbc293b3420ba8112768ef5a706b840af3
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
This is a duplicate of rcdev->driver_name.
Change-Id: I185f4dc3df62e03e2a759b34b29d5e840a33be68
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
LIRCCODE is a lirc mode where a driver produces driver-dependent
codes for receive and transmit. No driver uses this any more. The
LIRC_GET_LENGTH ioctl was used for this mode only.
Change-Id: Ib3e2deb561c9e3fbd499e14a82beb0799da457a4
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Some operators prefer IPv6 path selection to use a standard 5-tuple
hash rather than just an L3 hash with the flow the label. To that end
add support to IPv6 for multipath hash policy similar to bf4e0a3db9
("net: ipv4: add support for ECMP hash policy choice"). The default
is still L3 which covers source and destination addresses along with
flow label and IPv6 protocol.
Change-Id: I7a664a2fbec460203df6aa467f2c478763235146
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 8200 (IPv6) defines Hop-by-Hop options and Destination options
extension headers. Both of these carry a list of TLVs which is
only limited by the maximum length of the extension header (2048
bytes). By the spec a host must process all the TLVs in these
options, however these could be used as a fairly obvious
denial of service attack. I think this could in fact be
a significant DOS vector on the Internet, one mitigating
factor might be that many FWs drop all packets with EH (and
obviously this is only IPv6) so an Internet wide attack might not
be so effective (yet!).
By my calculation, the worse case packet with TLVs in a standard
1500 byte MTU packet that would be processed by the stack contains
1282 invidual TLVs (including pad TLVS) or 724 two byte TLVs. I
wrote a quick test program that floods a whole bunch of these
packets to a host and sure enough there is substantial time spent
in ip6_parse_tlv. These packets contain nothing but unknown TLVS
(that are ignored), TLV padding, and bogus UDP header with zero
payload length.
25.38% [kernel] [k] __fib6_clean_all
21.63% [kernel] [k] ip6_parse_tlv
4.21% [kernel] [k] __local_bh_enable_ip
2.18% [kernel] [k] ip6_pol_route.isra.39
1.98% [kernel] [k] fib6_walk_continue
1.88% [kernel] [k] _raw_write_lock_bh
1.65% [kernel] [k] dst_release
This patch adds configurable limits to Destination and Hop-by-Hop
options. There are three limits that may be set:
- Limit the number of options in a Hop-by-Hop or Destination options
extension header.
- Limit the byte length of a Hop-by-Hop or Destination options
extension header.
- Disallow unrecognized options in a Hop-by-Hop or Destination
options extension header.
The limits are set in corresponding sysctls:
ipv6.sysctl.max_dst_opts_cnt
ipv6.sysctl.max_hbh_opts_cnt
ipv6.sysctl.max_dst_opts_len
ipv6.sysctl.max_hbh_opts_len
If a max_*_opts_cnt is less than zero then unknown TLVs are disallowed.
The number of known TLVs that are allowed is the absolute value of
this number.
If a limit is exceeded when processing an extension header the packet is
dropped.
Default values are set to 8 for options counts, and set to INT_MAX
for maximum length. Note the choice to limit options to 8 is an
arbitrary guess (roughly based on the fact that the stack supports
three HBH options and just one destination option).
These limits have being proposed in draft-ietf-6man-rfc6434-bis.
Tested (by Martin Lau)
I tested out 1 thread (i.e. one raw_udp process).
I changed the net.ipv6.max_dst_(opts|hbh)_number between 8 to 2048.
With sysctls setting to 2048, the softirq% is packed to 100%.
With 8, the softirq% is almost unnoticable from mpstat.
v2;
- Code and documention cleanup.
- Change references of RFC2460 to be RFC8200.
- Add reference to RFC6434-bis where the limits will be in standard.
Change-Id: I0642f24d68966d73b9247973faf3473f4944b3bd
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Prior to this patch, active Fast Open is paused on a specific
destination IP address if the previous connections to the
IP address have experienced recurring timeouts . But recent
experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla
browsers indicate the isssue is often caused by broken middle-boxes
sitting close to the client. Therefore it is much better user
experience if Fast Open is disabled out-right globally to avoid
experiencing further timeouts on connections toward other
destinations.
This patch changes the destination-IP disablement to global
disablement if a connection experiencing recurring timeouts
or aborts due to timeout. Repeated incidents would still
exponentially increase the pause time, starting from an hour.
This is extremely conservative but an unfortunate compromise to
minimize bad experience due to broken middle-boxes.
Reported-by: Dragana Damjanovic <ddamjanovic@mozilla.com>
Reported-by: Patrick McManus <mcmanus@ducksong.com>
Change-Id: Ia2698f86b82005d73a0744356d9787395ffb8376
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a writeup on how to use the XFRM device offload API, and
mention this new file in the index.
Change-Id: I2b148393010d6487c45c9d1ff555a89f22a8acca
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
commit e71e2ace5721a8b921dca18b045069e7bb411277 upstream.
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
This patch (of 2):
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
[1] https://source.android.com/devices/tech/debug/tagged-pointers
[2] tools/testing/selftests/vm/userfaultfd.c
Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
Fixes: 63f0c6037965 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
Change-Id: I2e764cbb1867cd8018d177d449a05cf4d0d51576
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alistair Delva <adelva@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Martin <Dave.Martin@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mitch Phillips <mitchp@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: William McVicker <willmcvicker@google.com>
Cc: <stable@vger.kernel.org> [5.4]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Userfaultfd can be misued to make it easier to exploit existing
use-after-free (and similar) bugs that might otherwise only make a
short window or race condition available. By using userfaultfd to
stall a kernel thread, a malicious program can keep some state that it
wrote, stable for an extended period, which it can then access using an
existing exploit. While it doesn't cause the exploit itself, and while
it's not the only thing that can stall a kernel thread when accessing a
memory location, it's one of the few that never needs privilege.
We can add a flag, allowing userfaultfd to be restricted, so that in
general it won't be useable by arbitrary user programs, but in
environments that require userfaultfd it can be turned back on.
Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
whether userfaultfd is allowed by unprivileged users. When this is
set to zero, only privileged users (root user, or users with the
CAP_SYS_PTRACE capability) will be able to use the userfaultfd
syscalls.
Andrea said:
: The only difference between the bpf sysctl and the userfaultfd sysctl
: this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
: requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
: because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
: already if it's doing other kind of tracking on processes runtime, in
: addition of userfaultfd. In other words both syscalls works only for
: root, when the two sysctl are opt-in set to 1.
[dgilbert@redhat.com: changelog additions]
[akpm@linux-foundation.org: documentation tweak, per Mike]
Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.com
Change-Id: Ied2500a773b06ac1fdc378e61fd5403a270114a6
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Maya Gokhale <gokhale2@llnl.gov>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Martin Cracauer <cracauer@cons.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
Cc: Marty McFadden <mcfadden8@llnl.gov>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On some x86 tablets with a silead touchscreen the windows logo on the
front is a capacitive home button. Touching this button results in a touch
with bits 12-15 of the Y coordinates set, while normally only the lower 12
are used.
Detect this and report a KEY_LEFTMETA press when this happens. Note for
now we only respond to the Y coordinate bits 12-15 containing 0x01, on some
tablets *without* a capacative button I've noticed these bits containing
0x04 when crossing the edges of the screen.
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
(cherry picked from commit eca3be9b95ac7cf9442654a54962859d74f8e38a)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
SCTP GSO skbs have a gso_size of GSO_BY_FRAGS, so any sort of
unconditionally mangling of that will result in nonsense value
and would corrupt the skb later on.
Therefore, i) add two helpers skb_increase_gso_size() and
skb_decrease_gso_size() that would throw a one time warning and
bail out for such skbs and ii) refuse and return early with an
error in those BPF helpers that are affected. We do need to bail
out as early as possible from there before any changes on the
skb have been performed.
Fixes: 6578171a7f ("bpf: add bpf_skb_change_proto helper")
Co-authored-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
(cherry picked from commit d02f51cbcf12b09ab945873e35046045875eed9a)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Most of this is extracted from 90017accff ("sctp: Add GSO support"),
with some extra text about GSO_BY_FRAGS and the need to check for it.
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Daniel Axtens <dja@axtens.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a677088922831d94d292ca3891b148a8ba0b5fa1)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit e7862eda309ecfccc36bb5558d937ed3ace07f3f upstream.
The AMD Zen4 core supports a new feature called Automatic IBRS.
It is a "set-and-forget" feature that means that, like Intel's Enhanced IBRS,
h/w manages its IBRS mitigation resources automatically across CPL transitions.
The feature is advertised by CPUID_Fn80000021_EAX bit 8 and is enabled by
setting MSR C000_0080 (EFER) bit 21.
Enable Automatic IBRS by default if the CPU feature is present. It typically
provides greater performance over the incumbent generic retpolines mitigation.
Reuse the SPECTRE_V2_EIBRS spectre_v2_mitigation enum. AMD Automatic IBRS and
Intel Enhanced IBRS have similar enablement. Add NO_EIBRS_PBRSB to
cpu_vuln_whitelist, since AMD Automatic IBRS isn't affected by PBRSB-eIBRS.
The kernel command line option spectre_v2=eibrs is used to select AMD Automatic
IBRS, if available.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Sean Christopherson <seanjc@google.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/20230124163319.2277355-8-kim.phillips@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a7268b3424863578814d99a1dd603f6bb5b9f977)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit 06cb31cc761823ef444ba4e1df11347342a6e745 upstream.
commit 7c693f54c873691 ("x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS")
adds the "ibrs " option in
Documentation/admin-guide/kernel-parameters.txt but omits it to
Documentation/admin-guide/hw-vuln/spectre.rst, add it.
Signed-off-by: Lin Yujun <linyujun809@huawei.com>
Link: https://lore.kernel.org/r/20220830123614.23007-1-linyujun809@huawei.com
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3e4f86cfda46ef6320f385b80496d3f65d5ed63d)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit 98bfa34462fb6af70b845d28561072f80bacdb9b ]
Fix little inconsistencies in Documentation: make case and spacing
match surrounding text.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Stable-dep-of: 4848229494a3 ("irqchip/jcore-aic: Fix missing allocation of IRQ descriptors")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit a564e23f0f99759f453dbefcb9160dec6d99df96 ]
md is the last driver using the legacy media_changed method. Switch
it over to (not so) new ->clear_events approach, which also removes the
need for the ->revalidate_disk method.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[axboe: remove unused 'bdops' variable in disk_clear_events()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Stable-dep-of: 9674f54e41ff ("md: Don't clear MD_CLOSING when the raid is about to stop")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit dc51c01a3d5a796e18520a186f56e13f8e70749f)
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
[ Upstream commit 5b3fbd61b9d1f4ed2db95aaf03f9adae0373784d ]
The Documentation/ABI/testing/sysfs-class-net-statistics documentation
is pointing to the wrong path for the interface. Documentation is
pointing to /sys/class/<iface>, instead of /sys/class/net/<iface>.
Fix it by adding the `net/` directory before the interface.
Fixes: 6044f97006 ("net: sysfs: document /sys/class/net/statistics/*")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit e7928873d9ac5a6194f0ffc56549d4262af7e568)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit e528afb72a481977456bb18345d4e7f6b85fa7b1 ]
Sync the ABI description with the interface statistics that are currently
available through sysfs.
CC: Jarod Wilson <jarod@redhat.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: linux-doc@vger.kernel.org
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 5b3fbd61b9d1 ("net: sysfs: Fix /sys/class/net/<iface> path for statistics")
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit b908fdcb6bbc07a3314afb386415b616fa01732f)
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
[ Upstream commit ae3f4b44641dfff969604735a0dcbf931f383285 ]
The documentation is pointing to the wrong path for the interface.
Documentation is pointing to /sys/class/<iface>, instead of
/sys/class/net/<iface>.
Fix it by adding the `net/` directory before the interface.
Fixes: 1a02ef76ac ("net: sysfs: add documentation entries for /sys/class/<iface>/queues")
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240131102150.728960-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
(cherry picked from commit 3dc7b3ffd5c539124ee8fc42a32a91b5df13717d)
[vegard: fix trivial conflict]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
commit 4fee0915e649bd0cea56dece6d96f8f4643df33c upstream.
Because the linux-distros group forces reporters to release information
about reported bugs, and they impose arbitrary deadlines in having those
bugs fixed despite not actually being kernel developers, the kernel
security team recommends not interacting with them at all as this just
causes confusion and the early-release of reported security problems.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/2023063020-throat-pantyhose-f110@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 553a5c03e90a6087e88f8ff878335ef0621536fb upstream
The Gather Data Sampling (GDS) vulnerability allows malicious software
to infer stale data previously stored in vector registers. This may
include sensitive data such as cryptographic keys. GDS is mitigated in
microcode, and systems with up-to-date microcode are protected by
default. However, any affected system that is running with older
microcode will still be vulnerable to GDS attacks.
Since the gather instructions used by the attacker are part of the
AVX2 and AVX512 extensions, disabling these extensions prevents gather
instructions from being executed, thereby mitigating the system from
GDS. Disabling AVX2 is sufficient, but we don't have the granularity
to do this. The XCR0[2] disables AVX, with no option to just disable
AVX2.
Add a kernel parameter gather_data_sampling=force that will enable the
microcode mitigation if available, otherwise it will disable AVX on
affected systems.
This option will be ignored if cmdline mitigations=off.
This is a *big* hammer. It is known to break buggy userspace that
uses incomplete, buggy AVX enumeration. Unfortunately, such userspace
does exist in the wild:
https://www.mail-archive.com/bug-coreutils@gnu.org/msg33046.html
[ dhansen: add some more ominous warnings about disabling AVX ]
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8974eb588283b7d44a7c91fa09fcbaf380339f3a upstream
Gather Data Sampling (GDS) is a hardware vulnerability which allows
unprivileged speculative access to data which was previously stored in
vector registers.
Intel processors that support AVX2 and AVX512 have gather instructions
that fetch non-contiguous data elements from memory. On vulnerable
hardware, when a gather instruction is transiently executed and
encounters a fault, stale data from architectural or internal vector
registers may get transiently stored to the destination vector
register allowing an attacker to infer the stale data using typical
side channel techniques like cache timing attacks.
This mitigation is different from many earlier ones for two reasons.
First, it is enabled by default and a bit must be set to *DISABLE* it.
This is the opposite of normal mitigation polarity. This means GDS can
be mitigated simply by updating microcode and leaving the new control
bit alone.
Second, GDS has a "lock" bit. This lock bit is there because the
mitigation affects the hardware security features KeyLocker and SGX.
It needs to be enabled and *STAY* enabled for these features to be
mitigated against GDS.
The mitigation is enabled in the microcode by default. Disable it by
setting gather_data_sampling=off or by disabling all mitigations with
mitigations=off. The mitigation status can be checked by reading:
/sys/devices/system/cpu/vulnerabilities/gather_data_sampling
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1202cdd665315c525b5237e96e0bedc76d7e754f upstream.
DECnet is an obsolete network protocol that receives more attention
from kernel janitors than users. It belongs in computer protocol
history museum not in Linux kernel.
It has been "Orphaned" in kernel since 2010. The iproute2 support
for DECnet was dropped in 5.0 release. The documentation link on
Sourceforge says it is abandoned there as well.
Leave the UAPI alone to keep userspace programs compiling.
This means that there is still an empty neighbour table
for AF_DECNET.
The table of /proc/sys/net entries was updated to match
current directories and reformatted to be alphabetical.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Ahern <dsahern@kernel.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>