diff --git a/Documentation/ABI/testing/sysfs-kernel-btf b/Documentation/ABI/testing/sysfs-kernel-btf new file mode 100644 index 000000000000..2c9744b2cd59 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-btf @@ -0,0 +1,17 @@ +What: /sys/kernel/btf +Date: Aug 2019 +KernelVersion: 5.5 +Contact: bpf@vger.kernel.org +Description: + Contains BTF type information and related data for kernel and + kernel modules. + +What: /sys/kernel/btf/vmlinux +Date: Aug 2019 +KernelVersion: 5.5 +Contact: bpf@vger.kernel.org +Description: + Read-only binary attribute exposing kernel's own BTF type + information with description of all internal kernel types. See + Documentation/bpf/btf.rst for detailed description of format + itself. diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 62e847bcdcdd..8ccbce4e2f76 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -2393,30 +2393,9 @@ when invoked from a CPU-hotplug notifier.

RCU depends on the scheduler, and the scheduler uses RCU to protect some of its data structures. -This means the scheduler is forbidden from acquiring -the runqueue locks and the priority-inheritance locks -in the middle of an outermost RCU read-side critical section unless either -(1) it releases them before exiting that same -RCU read-side critical section, or -(2) interrupts are disabled across -that entire RCU read-side critical section. -This same prohibition also applies (recursively!) to any lock that is acquired -while holding any lock to which this prohibition applies. -Adhering to this rule prevents preemptible RCU from invoking -rcu_read_unlock_special() while either runqueue or -priority-inheritance locks are held, thus avoiding deadlock. - -

-Prior to v4.4, it was only necessary to disable preemption across -RCU read-side critical sections that acquired scheduler locks. -In v4.4, expedited grace periods started using IPIs, and these -IPIs could force a rcu_read_unlock() to take the slowpath. -Therefore, this expedited-grace-period change required disabling of -interrupts, not just preemption. - -

-For RCU's part, the preemptible-RCU rcu_read_unlock() -implementation must be written carefully to avoid similar deadlocks. +The preemptible-RCU rcu_read_unlock() +implementation must therefore be written carefully to avoid deadlocks +involving the scheduler's runqueue and priority-inheritance locks. In particular, rcu_read_unlock() must tolerate an interrupt where the interrupt handler invokes both rcu_read_lock() and rcu_read_unlock(). @@ -2425,7 +2404,7 @@ negative nesting levels to avoid destructive recursion via interrupt handler's use of RCU.

-This pair of mutual scheduler-RCU requirements came as a +This scheduler-RCU requirement came as a complete surprise.

@@ -2436,9 +2415,28 @@ when running context-switch-heavy workloads when built with CONFIG_NO_HZ_FULL=y did come as a surprise [PDF]. RCU has made good progress towards meeting this requirement, even -for context-switch-have CONFIG_NO_HZ_FULL=y workloads, +for context-switch-heavy CONFIG_NO_HZ_FULL=y workloads, but there is room for further improvement. +

+In the past, it was forbidden to disable interrupts across an +rcu_read_unlock() unless that interrupt-disabled region +of code also included the matching rcu_read_lock(). +Violating this restriction could result in deadlocks involving the +scheduler's runqueue and priority-inheritance spinlocks. +This restriction was lifted when interrupt-disabled calls to +rcu_read_unlock() started deferring the reporting of +the resulting RCU-preempt quiescent state until the end of that +interrupts-disabled region. +This deferred reporting means that the scheduler's runqueue and +priority-inheritance locks cannot be held while reporting an RCU-preempt +quiescent state, which lifts the earlier restriction, at least from +a deadlock perspective. +Unfortunately, real-time systems using RCU priority boosting may +need this restriction to remain in effect because deferred +quiescent-state reporting also defers deboosting, which in turn +degrades real-time latencies. +

Tracing and RCU

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index a39e9311d5ae..ce4bcfd12b8e 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -119,6 +119,7 @@ set: yes --------------------------- super_operations --------------------------- prototypes: struct inode *(*alloc_inode)(struct super_block *sb); + void (*free_inode)(struct inode *); void (*destroy_inode)(struct inode *); void (*dirty_inode) (struct inode *, int flags); int (*write_inode) (struct inode *, struct writeback_control *wbc); @@ -140,6 +141,7 @@ locking rules: All may block [not true, see below] s_umount alloc_inode: +free_inode: called from RCU callback destroy_inode: dirty_inode: write_inode: diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index c757c1c3cb81..bc34d7dd253e 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -607,9 +607,38 @@ in your dentry operations instead. to specify the fields and sync type requested by statx. Filesystems not supporting any statx-specific features may ignore the new arguments. -- -[mandatory] +[recommended] + ->lookup() instances doing an equivalent of + if (IS_ERR(inode)) + return ERR_CAST(inode); + return d_splice_alias(inode, dentry); + don't need to bother with the check - d_splice_alias() will do the + right thing when given ERR_PTR(...) as inode. Moreover, passing NULL + inode to d_splice_alias() will also do the right thing (equivalent of + d_add(dentry, NULL); return NULL;), so that kind of special cases + also doesn't need a separate treatment. +-- +[strongly recommended] + take the RCU-delayed parts of ->destroy_inode() into a new method - + ->free_inode(). If ->destroy_inode() becomes empty - all the better, + just get rid of it. Synchronous work (e.g. the stuff that can't + be done from an RCU callback, or any WARN_ON() where we want the + stack trace) *might* be movable to ->evict_inode(); however, + that goes only for the things that are not needed to balance something + done by ->alloc_inode(). IOW, if it's cleaning up the stuff that + might have accumulated over the life of in-core inode, ->evict_inode() + might be a fit. - [should've been added in 2016] stale comment in finish_open() - nonwithstanding, failure exits in ->atomic_open() instances should - *NOT* fput() the file, no matter what. Everything is handled by the - caller. + Rules for inode destruction: + * if ->destroy_inode() is non-NULL, it gets called + * if ->free_inode() is non-NULL, it gets scheduled by call_rcu() + * combination of NULL ->destroy_inode and NULL ->free_inode is + treated as NULL/free_inode_nonrcu, to preserve the compatibility. + + Note that the callback (be it via ->free_inode() or explicit call_rcu() + in ->destroy_inode()) is *NOT* ordered wrt superblock destruction; + as the matter of fact, the superblock and all associated structures + might be already gone. The filesystem driver is guaranteed to be still + there, but that's it. Freeing memory in the callback is fine; doing + more than that is possible, but requires a lot of care and is best + avoided. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index df2f752ab0b5..b9baeea45d9e 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -251,7 +251,6 @@ Table 1-2: Contents of the status files (as of 4.8) VmExe size of text segment VmLib size of shared library code VmPTE size of page table entries - VmPMD size of second level page tables VmSwap amount of swap used by anonymous private data (shmem swap usage is not included) HugetlbPages size of hugetlb memory portions diff --git a/Documentation/media/kapi/rc-core.rst b/Documentation/media/kapi/rc-core.rst index a45895886257..41c2256dbf6a 100644 --- a/Documentation/media/kapi/rc-core.rst +++ b/Documentation/media/kapi/rc-core.rst @@ -7,8 +7,3 @@ Remote Controller core .. kernel-doc:: include/media/rc-core.h .. kernel-doc:: include/media/rc-map.h - -LIRC -~~~~ - -.. kernel-doc:: include/media/lirc_dev.h diff --git a/Documentation/media/lirc.h.rst.exceptions b/Documentation/media/lirc.h.rst.exceptions index c130617a9986..c6e3a35d2c4e 100644 --- a/Documentation/media/lirc.h.rst.exceptions +++ b/Documentation/media/lirc.h.rst.exceptions @@ -28,6 +28,36 @@ ignore define LIRC_CAN_SEND_MASK ignore define LIRC_CAN_REC_MASK ignore define LIRC_CAN_SET_REC_DUTY_CYCLE +# Obsolete ioctls + +ignore ioctl LIRC_GET_LENGTH + +# rc protocols + +ignore symbol RC_PROTO_UNKNOWN +ignore symbol RC_PROTO_OTHER +ignore symbol RC_PROTO_RC5 +ignore symbol RC_PROTO_RC5X_20 +ignore symbol RC_PROTO_RC5_SZ +ignore symbol RC_PROTO_JVC +ignore symbol RC_PROTO_SONY12 +ignore symbol RC_PROTO_SONY15 +ignore symbol RC_PROTO_SONY20 +ignore symbol RC_PROTO_NEC +ignore symbol RC_PROTO_NECX +ignore symbol RC_PROTO_NEC32 +ignore symbol RC_PROTO_SANYO +ignore symbol RC_PROTO_MCIR2_KBD +ignore symbol RC_PROTO_MCIR2_MSE +ignore symbol RC_PROTO_RC6_0 +ignore symbol RC_PROTO_RC6_6A_20 +ignore symbol RC_PROTO_RC6_6A_24 +ignore symbol RC_PROTO_RC6_6A_32 +ignore symbol RC_PROTO_RC6_MCE +ignore symbol RC_PROTO_SHARP +ignore symbol RC_PROTO_XMP +ignore symbol RC_PROTO_CEC + # Undocumented macros ignore define PULSE_BIT @@ -40,3 +70,4 @@ ignore define LIRC_VALUE_MASK ignore define LIRC_MODE2_MASK ignore define LIRC_MODE_RAW +ignore define LIRC_MODE_LIRCCODE diff --git a/Documentation/media/uapi/rc/keytable.c.rst b/Documentation/media/uapi/rc/keytable.c.rst index e6ce1e3f5a78..217237f93b37 100644 --- a/Documentation/media/uapi/rc/keytable.c.rst +++ b/Documentation/media/uapi/rc/keytable.c.rst @@ -7,7 +7,7 @@ file: uapi/v4l/keytable.c /* keytable.c - This program allows checking/replacing keys at IR - Copyright (C) 2006-2009 Mauro Carvalho Chehab + Copyright (C) 2006-2009 Mauro Carvalho Chehab This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by diff --git a/Documentation/media/uapi/rc/lirc-dev-intro.rst b/Documentation/media/uapi/rc/lirc-dev-intro.rst index d1936eeb9ce0..11516c8bff62 100644 --- a/Documentation/media/uapi/rc/lirc-dev-intro.rst +++ b/Documentation/media/uapi/rc/lirc-dev-intro.rst @@ -6,19 +6,19 @@ Introduction ************ -The LIRC device interface is a bi-directional interface for transporting -raw IR data between userspace and kernelspace. Fundamentally, it is just -a chardev (/dev/lircX, for X = 0, 1, 2, ...), with a number of standard -struct file_operations defined on it. With respect to transporting raw -IR data to and fro, the essential fops are read, write and ioctl. +LIRC stands for Linux Infrared Remote Control. The LIRC device interface is +a bi-directional interface for transporting raw IR and decoded scancodes +data between userspace and kernelspace. Fundamentally, it is just a chardev +(/dev/lircX, for X = 0, 1, 2, ...), with a number of standard struct +file_operations defined on it. With respect to transporting raw IR and +decoded scancodes to and fro, the essential fops are read, write and ioctl. Example dmesg output upon a driver registering w/LIRC: .. code-block:: none $ dmesg |grep lirc_dev - lirc_dev: IR Remote Control driver registered, major 248 - rc rc0: lirc_dev: driver ir-lirc-codec (mceusb) registered at minor = 0 + rc rc0: lirc_dev: driver mceusb registered at minor = 0, raw IR receiver, raw IR transmitter What you should see for a chardev: @@ -36,6 +36,43 @@ LIRC modes LIRC supports some modes of receiving and sending IR codes, as shown on the following table. +.. _lirc-mode-scancode: +.. _lirc-scancode-flag-toggle: +.. _lirc-scancode-flag-repeat: + +``LIRC_MODE_SCANCODE`` + + This mode is for both sending and receiving IR. + + For transmitting (aka sending), create a ``struct lirc_scancode`` with + the desired scancode set in the ``scancode`` member, :c:type:`rc_proto` + set the IR protocol, and all other members set to 0. Write this struct to + the lirc device. + + For receiving, you read ``struct lirc_scancode`` from the lirc device, + with ``scancode`` set to the received scancode and the IR protocol + :c:type:`rc_proto`. If the scancode maps to a valid key code, this is set + in the ``keycode`` field, else it is set to ``KEY_RESERVED``. + + The ``flags`` can have ``LIRC_SCANCODE_FLAG_TOGGLE`` set if the toggle + bit is set in protocols that support it (e.g. rc-5 and rc-6), or + ``LIRC_SCANCODE_FLAG_REPEAT`` for when a repeat is received for protocols + that support it (e.g. nec). + + In the Sanyo and NEC protocol, if you hold a button on remote, rather than + repeating the entire scancode, the remote sends a shorter message with + no scancode, which just means button is held, a "repeat". When this is + received, the ``LIRC_SCANCODE_FLAG_REPEAT`` is set and the scancode and + keycode is repeated. + + With nec, there is no way to distinguish "button hold" from "repeatedly + pressing the same button". The rc-5 and rc-6 protocols have a toggle bit. + When a button is released and pressed again, the toggle bit is inverted. + If the toggle bit is set, the ``LIRC_SCANCODE_FLAG_TOGGLE`` is set. + + The ``timestamp`` field is filled with the time nanoseconds + (in ``CLOCK_MONOTONIC``) when the scancode was decoded. + .. _lirc-mode-mode2: ``LIRC_MODE_MODE2`` @@ -72,21 +109,6 @@ on the following table. this packet will be sent, with the number of microseconds with no IR. -.. _lirc-mode-lirccode: - -``LIRC_MODE_LIRCCODE`` - - This mode can be used for IR receive and send. - - The IR signal is decoded internally by the receiver, or encoded by the - transmitter. The LIRC interface represents the scancode as byte string, - which might not be a u32, it can be any length. The value is entirely - driver dependent. This mode is used by some older lirc drivers. - - The length of each code depends on the driver, which can be retrieved - with :ref:`lirc_get_length`. This length is used both - for transmitting and receiving IR. - .. _lirc-mode-pulse: ``LIRC_MODE_PULSE`` @@ -99,3 +121,13 @@ on the following table. of entries. This mode is used only for IR send. + + +************************** +Remote Controller protocol +************************** + +An enum :c:type:`rc_proto` in the :ref:`lirc_header` lists all the +supported IR protocols: + +.. kernel-doc:: include/uapi/linux/lirc.h diff --git a/Documentation/media/uapi/rc/lirc-func.rst b/Documentation/media/uapi/rc/lirc-func.rst index 9b5a772ec96c..9656423a3f28 100644 --- a/Documentation/media/uapi/rc/lirc-func.rst +++ b/Documentation/media/uapi/rc/lirc-func.rst @@ -17,8 +17,8 @@ LIRC Function Reference lirc-get-rec-resolution lirc-set-send-duty-cycle lirc-get-timeout + lirc-get-rec-timeout lirc-set-rec-timeout - lirc-get-length lirc-set-rec-carrier lirc-set-rec-carrier-range lirc-set-send-carrier diff --git a/Documentation/media/uapi/rc/lirc-get-features.rst b/Documentation/media/uapi/rc/lirc-get-features.rst index 64f89a4f9d9c..889a8807037b 100644 --- a/Documentation/media/uapi/rc/lirc-get-features.rst +++ b/Documentation/media/uapi/rc/lirc-get-features.rst @@ -55,15 +55,24 @@ LIRC features ``LIRC_CAN_REC_MODE2`` - The driver is capable of receiving using - :ref:`LIRC_MODE_MODE2 `. + This is raw IR driver for receiving. This means that + :ref:`LIRC_MODE_MODE2 ` is used. This also implies + that :ref:`LIRC_MODE_SCANCODE ` is also supported, + as long as the kernel is recent enough. Use the + :ref:`lirc_set_rec_mode` to switch modes. .. _LIRC-CAN-REC-LIRCCODE: ``LIRC_CAN_REC_LIRCCODE`` - The driver is capable of receiving using - :ref:`LIRC_MODE_LIRCCODE `. + Unused. Kept just to avoid breaking uAPI. + +.. _LIRC-CAN-REC-SCANCODE: + +``LIRC_CAN_REC_SCANCODE`` + + This is a scancode driver for receiving. This means that + :ref:`LIRC_MODE_SCANCODE ` is used. .. _LIRC-CAN-SET-SEND-CARRIER: @@ -157,7 +166,10 @@ LIRC features ``LIRC_CAN_SEND_PULSE`` The driver supports sending (also called as IR blasting or IR TX) using - :ref:`LIRC_MODE_PULSE `. + :ref:`LIRC_MODE_PULSE `. This implies that + :ref:`LIRC_MODE_SCANCODE ` is also supported for + transmit, as long as the kernel is recent enough. Use the + :ref:`lirc_set_send_mode` to switch modes. .. _LIRC-CAN-SEND-MODE2: @@ -170,8 +182,7 @@ LIRC features ``LIRC_CAN_SEND_LIRCCODE`` - The driver supports sending (also called as IR blasting or IR TX) using - :ref:`LIRC_MODE_LIRCCODE `. + Unused. Kept just to avoid breaking uAPI. Return Value diff --git a/Documentation/media/uapi/rc/lirc-get-length.rst b/Documentation/media/uapi/rc/lirc-get-length.rst deleted file mode 100644 index 3990af5de0e9..000000000000 --- a/Documentation/media/uapi/rc/lirc-get-length.rst +++ /dev/null @@ -1,44 +0,0 @@ -.. -*- coding: utf-8; mode: rst -*- - -.. _lirc_get_length: - -********************* -ioctl LIRC_GET_LENGTH -********************* - -Name -==== - -LIRC_GET_LENGTH - Retrieves the code length in bits. - -Synopsis -======== - -.. c:function:: int ioctl( int fd, LIRC_GET_LENGTH, __u32 *length ) - :name: LIRC_GET_LENGTH - -Arguments -========= - -``fd`` - File descriptor returned by open(). - -``length`` - length, in bits - - -Description -=========== - -Retrieves the code length in bits (only for -:ref:`LIRC_MODE_LIRCCODE `). -Reads on the device must be done in blocks matching the bit count. -The bit could should be rounded up so that it matches full bytes. - - -Return Value -============ - -On success 0 is returned, on error -1 and the ``errno`` variable is set -appropriately. The generic error codes are described at the -:ref:`Generic Error Codes ` chapter. diff --git a/Documentation/media/uapi/rc/lirc-get-rec-mode.rst b/Documentation/media/uapi/rc/lirc-get-rec-mode.rst index a4eb6c0a26e9..34919feaf392 100644 --- a/Documentation/media/uapi/rc/lirc-get-rec-mode.rst +++ b/Documentation/media/uapi/rc/lirc-get-rec-mode.rst @@ -34,9 +34,8 @@ Description =========== Get/set supported receive modes. Only :ref:`LIRC_MODE_MODE2 ` -and :ref:`LIRC_MODE_LIRCCODE ` are supported for IR -receive. Use :ref:`lirc_get_features` to find out which modes the driver -supports. +and :ref:`LIRC_MODE_SCANCODE ` are supported. +Use :ref:`lirc_get_features` to find out which modes the driver supports. Return Value ============ diff --git a/Documentation/media/uapi/rc/lirc-get-send-mode.rst b/Documentation/media/uapi/rc/lirc-get-send-mode.rst index a169b234290e..e39383f08e21 100644 --- a/Documentation/media/uapi/rc/lirc-get-send-mode.rst +++ b/Documentation/media/uapi/rc/lirc-get-send-mode.rst @@ -37,7 +37,7 @@ Description Get/set current transmit mode. Only :ref:`LIRC_MODE_PULSE ` and -:ref:`LIRC_MODE_LIRCCODE ` is supported by for IR send, +:ref:`LIRC_MODE_SCANCODE ` are supported by for IR send, depending on the driver. Use :ref:`lirc_get_features` to find out which modes the driver supports. diff --git a/Documentation/media/uapi/rc/lirc-read.rst b/Documentation/media/uapi/rc/lirc-read.rst index ff14a69104e5..c024aaffb8ad 100644 --- a/Documentation/media/uapi/rc/lirc-read.rst +++ b/Documentation/media/uapi/rc/lirc-read.rst @@ -45,13 +45,20 @@ descriptor ``fd`` into the buffer starting at ``buf``. If ``count`` is zero, is greater than ``SSIZE_MAX``, the result is unspecified. The exact format of the data depends on what :ref:`lirc_modes` a driver -uses. Use :ref:`lirc_get_features` to get the supported mode. +uses. Use :ref:`lirc_get_features` to get the supported mode, and use +:ref:`lirc_set_rec_mode` set the current active mode. -The generally preferred mode for receive is -:ref:`LIRC_MODE_MODE2 `, -in which packets containing an int value describing an IR signal are +The mode :ref:`LIRC_MODE_MODE2 ` is for raw IR, +in which packets containing an unsigned int value describing an IR signal are read from the chardev. +Alternatively, :ref:`LIRC_MODE_SCANCODE ` can be available, +in this mode scancodes which are either decoded by software decoders, or +by hardware decoders. The :c:type:`rc_proto` member is set to the +protocol used for transmission, and ``scancode`` to the decoded scancode, +and the ``keycode`` set to the keycode or ``KEY_RESERVED``. + + Return Value ============ diff --git a/Documentation/media/uapi/rc/lirc-set-rec-timeout.rst b/Documentation/media/uapi/rc/lirc-set-rec-timeout.rst index b3e16bbdbc90..a833a6a4c25a 100644 --- a/Documentation/media/uapi/rc/lirc-set-rec-timeout.rst +++ b/Documentation/media/uapi/rc/lirc-set-rec-timeout.rst @@ -1,19 +1,23 @@ .. -*- coding: utf-8; mode: rst -*- .. _lirc_set_rec_timeout: +.. _lirc_get_rec_timeout: -************************** -ioctl LIRC_SET_REC_TIMEOUT -************************** +*************************************************** +ioctl LIRC_GET_REC_TIMEOUT and LIRC_SET_REC_TIMEOUT +*************************************************** Name ==== -LIRC_SET_REC_TIMEOUT - sets the integer value for IR inactivity timeout. +LIRC_GET_REC_TIMEOUT/LIRC_SET_REC_TIMEOUT - Get/set the integer value for IR inactivity timeout. Synopsis ======== +.. c:function:: int ioctl( int fd, LIRC_GET_REC_TIMEOUT, __u32 *timeout ) + :name: LIRC_GET_REC_TIMEOUT + .. c:function:: int ioctl( int fd, LIRC_SET_REC_TIMEOUT, __u32 *timeout ) :name: LIRC_SET_REC_TIMEOUT @@ -30,7 +34,7 @@ Arguments Description =========== -Sets the integer value for IR inactivity timeout. +Get and set the integer value for IR inactivity timeout. If supported by the hardware, setting it to 0 disables all hardware timeouts and data should be reported as soon as possible. If the exact value diff --git a/Documentation/media/uapi/rc/lirc-write.rst b/Documentation/media/uapi/rc/lirc-write.rst index 2aad0fef4a5b..d4566b0a2015 100644 --- a/Documentation/media/uapi/rc/lirc-write.rst +++ b/Documentation/media/uapi/rc/lirc-write.rst @@ -42,21 +42,32 @@ Description referenced by the file descriptor ``fd`` from the buffer starting at ``buf``. -The exact format of the data depends on what mode a driver uses, use -:ref:`lirc_get_features` to get the supported mode. +The exact format of the data depends on what mode a driver is in, use +:ref:`lirc_get_features` to get the supported modes and use +:ref:`lirc_set_send_mode` set the mode. When in :ref:`LIRC_MODE_PULSE ` mode, the data written to the chardev is a pulse/space sequence of integer values. Pulses and spaces are only marked implicitly by their position. The data must start and end with a pulse, therefore, the data must always include an uneven number of -samples. The write function must block until the data has been transmitted +samples. The write function blocks until the data has been transmitted by the hardware. If more data is provided than the hardware can send, the driver returns ``EINVAL``. +When in :ref:`LIRC_MODE_SCANCODE ` mode, one +``struct lirc_scancode`` must be written to the chardev at a time, else +``EINVAL`` is returned. Set the desired scancode in the ``scancode`` member, +and the protocol in the :c:type:`rc_proto`: member. All other members must be +set to 0, else ``EINVAL`` is returned. If there is no protocol encoder +for the protocol or the scancode is not valid for the specified protocol, +``EINVAL`` is returned. The write function blocks until the scancode +is transmitted by the hardware. + + Return Value ============ -On success, the number of bytes read is returned. It is not an error if +On success, the number of bytes written is returned. It is not an error if this number is smaller than the number of bytes requested, or the amount of data required for one frame. On error, -1 is returned, and the ``errno`` variable is set appropriately. The generic error codes are described at the diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst new file mode 100644 index 000000000000..ff929cfab4f4 --- /dev/null +++ b/Documentation/networking/af_xdp.rst @@ -0,0 +1,312 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====== +AF_XDP +====== + +Overview +======== + +AF_XDP is an address family that is optimized for high performance +packet processing. + +This document assumes that the reader is familiar with BPF and XDP. If +not, the Cilium project has an excellent reference guide at +http://cilium.readthedocs.io/en/latest/bpf/. + +Using the XDP_REDIRECT action from an XDP program, the program can +redirect ingress frames to other XDP enabled netdevs, using the +bpf_redirect_map() function. AF_XDP sockets enable the possibility for +XDP programs to redirect frames to a memory buffer in a user-space +application. + +An AF_XDP socket (XSK) is created with the normal socket() +syscall. Associated with each XSK are two rings: the RX ring and the +TX ring. A socket can receive packets on the RX ring and it can send +packets on the TX ring. These rings are registered and sized with the +setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory +to have at least one of these rings for each socket. An RX or TX +descriptor ring points to a data buffer in a memory area called a +UMEM. RX and TX can share the same UMEM so that a packet does not have +to be copied between RX and TX. Moreover, if a packet needs to be kept +for a while due to a possible retransmit, the descriptor that points +to that packet can be changed to point to another and reused right +away. This again avoids copying data. + +The UMEM consists of a number of equally sized chunks. A descriptor in +one of the rings references a frame by referencing its addr. The addr +is simply an offset within the entire UMEM region. The user space +allocates memory for this UMEM using whatever means it feels is most +appropriate (malloc, mmap, huge pages, etc). This memory area is then +registered with the kernel using the new setsockopt XDP_UMEM_REG. The +UMEM also has two rings: the FILL ring and the COMPLETION ring. The +fill ring is used by the application to send down addr for the kernel +to fill in with RX packet data. References to these frames will then +appear in the RX ring once each packet has been received. The +completion ring, on the other hand, contains frame addr that the +kernel has transmitted completely and can now be used again by user +space, for either TX or RX. Thus, the frame addrs appearing in the +completion ring are addrs that were previously transmitted using the +TX ring. In summary, the RX and FILL rings are used for the RX path +and the TX and COMPLETION rings are used for the TX path. + +The socket is then finally bound with a bind() call to a device and a +specific queue id on that device, and it is not until bind is +completed that traffic starts to flow. + +The UMEM can be shared between processes, if desired. If a process +wants to do this, it simply skips the registration of the UMEM and its +corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind +call and submits the XSK of the process it would like to share UMEM +with as well as its own newly created XSK socket. The new process will +then receive frame addr references in its own RX ring that point to +this shared UMEM. Note that since the ring structures are +single-consumer / single-producer (for performance reasons), the new +process has to create its own socket with associated RX and TX rings, +since it cannot share this with the other process. This is also the +reason that there is only one set of FILL and COMPLETION rings per +UMEM. It is the responsibility of a single process to handle the UMEM. + +How is then packets distributed from an XDP program to the XSKs? There +is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The +user-space application can place an XSK at an arbitrary place in this +map. The XDP program can then redirect a packet to a specific index in +this map and at this point XDP validates that the XSK in that map was +indeed bound to that device and ring number. If not, the packet is +dropped. If the map is empty at that index, the packet is also +dropped. This also means that it is currently mandatory to have an XDP +program loaded (and one XSK in the XSKMAP) to be able to get any +traffic to user space through the XSK. + +AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the +driver does not have support for XDP, or XDP_SKB is explicitly chosen +when loading the XDP program, XDP_SKB mode is employed that uses SKBs +together with the generic XDP support and copies out the data to user +space. A fallback mode that works for any network device. On the other +hand, if the driver has support for XDP, it will be used by the AF_XDP +code to provide better performance, but there is still a copy of the +data into user space. + +Concepts +======== + +In order to use an AF_XDP socket, a number of associated objects need +to be setup. + +Jonathan Corbet has also written an excellent article on LWN, +"Accelerating networking with AF_XDP". It can be found at +https://lwn.net/Articles/750845/. + +UMEM +---- + +UMEM is a region of virtual contiguous memory, divided into +equal-sized frames. An UMEM is associated to a netdev and a specific +queue id of that netdev. It is created and configured (chunk size, +headroom, start address and size) by using the XDP_UMEM_REG setsockopt +system call. A UMEM is bound to a netdev and queue id, via the bind() +system call. + +An AF_XDP is socket linked to a single UMEM, but one UMEM can have +multiple AF_XDP sockets. To share an UMEM created via one socket A, +the next socket B can do this by setting the XDP_SHARED_UMEM flag in +struct sockaddr_xdp member sxdp_flags, and passing the file descriptor +of A to struct sockaddr_xdp member sxdp_shared_umem_fd. + +The UMEM has two single-producer/single-consumer rings, that are used +to transfer ownership of UMEM frames between the kernel and the +user-space application. + +Rings +----- + +There are a four different kind of rings: Fill, Completion, RX and +TX. All rings are single-producer/single-consumer, so the user-space +application need explicit synchronization of multiple +processes/threads are reading/writing to them. + +The UMEM uses two rings: Fill and Completion. Each socket associated +with the UMEM must have an RX queue, TX queue or both. Say, that there +is a setup with four sockets (all doing TX and RX). Then there will be +one Fill ring, one Completion ring, four TX rings and four RX rings. + +The rings are head(producer)/tail(consumer) based rings. A producer +writes the data ring at the index pointed out by struct xdp_ring +producer member, and increasing the producer index. A consumer reads +the data ring at the index pointed out by struct xdp_ring consumer +member, and increasing the consumer index. + +The rings are configured and created via the _RING setsockopt system +calls and mmapped to user-space using the appropriate offset to mmap() +(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and +XDP_UMEM_PGOFF_COMPLETION_RING). + +The size of the rings need to be of size power of two. + +UMEM Fill Ring +~~~~~~~~~~~~~~ + +The Fill ring is used to transfer ownership of UMEM frames from +user-space to kernel-space. The UMEM addrs are passed in the ring. As +an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has +16 chunks and can pass addrs between 0 and 64k. + +Frames passed to the kernel are used for the ingress path (RX rings). + +The user application produces UMEM addrs to this ring. Note that the +kernel will mask the incoming addr. E.g. for a chunk size of 2k, the +log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050 +and 3000 refers to the same chunk. + + +UMEM Completetion Ring +~~~~~~~~~~~~~~~~~~~~~~ + +The Completion Ring is used transfer ownership of UMEM frames from +kernel-space to user-space. Just like the Fill ring, UMEM indicies are +used. + +Frames passed from the kernel to user-space are frames that has been +sent (TX ring) and can be used by user-space again. + +The user application consumes UMEM addrs from this ring. + + +RX Ring +~~~~~~~ + +The RX ring is the receiving side of a socket. Each entry in the ring +is a struct xdp_desc descriptor. The descriptor contains UMEM offset +(addr) and the length of the data (len). + +If no frames have been passed to kernel via the Fill ring, no +descriptors will (or can) appear on the RX ring. + +The user application consumes struct xdp_desc descriptors from this +ring. + +TX Ring +~~~~~~~ + +The TX ring is used to send frames. The struct xdp_desc descriptor is +filled (index, length and offset) and passed into the ring. + +To start the transfer a sendmsg() system call is required. This might +be relaxed in the future. + +The user application produces struct xdp_desc descriptors to this +ring. + +XSKMAP / BPF_MAP_TYPE_XSKMAP +---------------------------- + +On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that +is used in conjunction with bpf_redirect_map() to pass the ingress +frame to a socket. + +The user application inserts the socket into the map, via the bpf() +system call. + +Note that if an XDP program tries to redirect to a socket that does +not match the queue configuration and netdev, the frame will be +dropped. E.g. an AF_XDP socket is bound to netdev eth0 and +queue 17. Only the XDP program executing for eth0 and queue 17 will +successfully pass data to the socket. Please refer to the sample +application (samples/bpf/) in for an example. + +Usage +===== + +In order to use AF_XDP sockets there are two parts needed. The +user-space application and the XDP program. For a complete setup and +usage example, please refer to the sample application. The user-space +side is xdpsock_user.c and the XDP side xdpsock_kern.c. + +Naive ring dequeue and enqueue could look like this:: + + // struct xdp_rxtx_ring { + // __u32 *producer; + // __u32 *consumer; + // struct xdp_desc *desc; + // }; + + // struct xdp_umem_ring { + // __u32 *producer; + // __u32 *consumer; + // __u64 *desc; + // }; + + // typedef struct xdp_rxtx_ring RING; + // typedef struct xdp_umem_ring RING; + + // typedef struct xdp_desc RING_TYPE; + // typedef __u64 RING_TYPE; + + int dequeue_one(RING *ring, RING_TYPE *item) + { + __u32 entries = *ring->producer - *ring->consumer; + + if (entries == 0) + return -1; + + // read-barrier! + + *item = ring->desc[*ring->consumer & (RING_SIZE - 1)]; + (*ring->consumer)++; + return 0; + } + + int enqueue_one(RING *ring, const RING_TYPE *item) + { + u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer); + + if (free_entries == 0) + return -1; + + ring->desc[*ring->producer & (RING_SIZE - 1)] = *item; + + // write-barrier! + + (*ring->producer)++; + return 0; + } + + +For a more optimized version, please refer to the sample application. + +Sample application +================== + +There is a xdpsock benchmarking/test application included that +demonstrates how to use AF_XDP sockets with both private and shared +UMEMs. Say that you would like your UDP traffic from port 4242 to end +up in queue 16, that we will enable AF_XDP on. Here, we use ethtool +for this:: + + ethtool -N p3p2 rx-flow-hash udp4 fn + ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ + action 16 + +Running the rxdrop benchmark in XDP_DRV mode can then be done +using:: + + samples/bpf/xdpsock -i p3p2 -q 16 -r -N + +For XDP_SKB mode, use the switch "-S" instead of "-N" and all options +can be displayed with "-h", as usual. + +Credits +======= + +- Björn Töpel (AF_XDP core) +- Magnus Karlsson (AF_XDP core) +- Alexander Duyck +- Alexei Starovoitov +- Daniel Borkmann +- Jesper Dangaard Brouer +- John Fastabend +- Jonathan Corbet (LWN coverage) +- Michael S. Tsirkin +- Qi Z Zhang +- Willem de Bruijn + diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index 87814859cfc2..44293164244c 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -857,7 +857,7 @@ Three LSB bits store instruction class which is one of: BPF_STX 0x03 BPF_STX 0x03 BPF_ALU 0x04 BPF_ALU 0x04 BPF_JMP 0x05 BPF_JMP 0x05 - BPF_RET 0x06 [ class 6 unused, for future if needed ] + BPF_RET 0x06 BPF_JMP32 0x06 BPF_MISC 0x07 BPF_ALU64 0x07 When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... @@ -894,9 +894,9 @@ If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ BPF_END 0xd0 /* eBPF only: endianness conversion */ -If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of: +If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of: - BPF_JA 0x00 + BPF_JA 0x00 /* BPF_JMP only */ BPF_JEQ 0x10 BPF_JGT 0x20 BPF_JGE 0x30 @@ -904,8 +904,8 @@ If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of: BPF_JNE 0x50 /* eBPF only: jump != */ BPF_JSGT 0x60 /* eBPF only: signed '>' */ BPF_JSGE 0x70 /* eBPF only: signed '>=' */ - BPF_CALL 0x80 /* eBPF only: function call */ - BPF_EXIT 0x90 /* eBPF only: function return */ + BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ + BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ BPF_JSLT 0xc0 /* eBPF only: signed '<' */ @@ -928,8 +928,9 @@ Classic BPF wastes the whole BPF_RET class to represent a single 'ret' operation. Classic BPF_RET | BPF_K means copy imm32 into return register and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT in eBPF means function exit only. The eBPF program needs to store return -value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently -unused and reserved for future use. +value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as +BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide +operands for the comparisons instead. For load and store instructions the 8-bit 'code' field is divided as: diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 66e620866245..94c43ffffaef 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -6,6 +6,7 @@ Contents: .. toctree:: :maxdepth: 2 + af_xdp batman-adv kapi z8530book diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index b6f70207146f..c2ba72a09d43 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -612,6 +612,7 @@ tcp_fastopen_blackhole_timeout_sec - INTEGER This time period will grow exponentially when more blackhole issues get detected right after Fastopen is re-enabled and will reset to initial value when the blackhole issue goes away. + 0 to disable the blackhole detection. By default, it is set to 1hr. tcp_fwmark_accept - BOOLEAN @@ -1389,6 +1390,13 @@ flowlabel_reflect - BOOLEAN FALSE: disabled Default: FALSE +fib_multipath_hash_policy - INTEGER + Controls which hash policy to use for multipath routes. + Default: 0 (Layer 3) + Possible values: + 0 - Layer 3 (source and destination addresses plus flow label) + 1 - Layer 4 (standard 5-tuple) + anycast_src_echo_reply - BOOLEAN Controls the use of anycast addresses as source addresses for ICMPv6 echo reply @@ -1412,6 +1420,30 @@ mld_qrv - INTEGER Default: 2 (as specified by RFC3810 9.1) Minimum: 1 (as specified by RFC6636 4.5) +max_dst_opts_cnt - INTEGER + Maximum number of non-padding TLVs allowed in a Destination + options extension header. If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + Default: 8 + +max_hbh_opts_cnt - INTEGER + Maximum number of non-padding TLVs allowed in a Hop-by-Hop + options extension header. If this value is less than zero + then unknown options are disallowed and the number of known + TLVs allowed is the absolute value of this number. + Default: 8 + +max dst_opts_len - INTEGER + Maximum length allowed for a Destination options extension + header. + Default: INT_MAX (unlimited) + +max hbh_opts_len - INTEGER + Maximum length allowed for a Hop-by-Hop options extension + header. + Default: INT_MAX (unlimited) + IPv6 Fragmentation: ip6frag_high_thresh - INTEGER diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt index 25c8ad37d86c..ed4aa7f023df 100644 --- a/Documentation/printk-formats.txt +++ b/Documentation/printk-formats.txt @@ -61,41 +61,31 @@ Symbols/Function Pointers :: + %pS versatile_init+0x0/0x110 + %ps versatile_init %pF versatile_init+0x0/0x110 %pf versatile_init - %pS versatile_init+0x0/0x110 %pSR versatile_init+0x9/0x110 (with __builtin_extract_return_addr() translation) - %ps versatile_init %pB prev_fn_of_versatile_init+0x88/0x88 -The ``F`` and ``f`` specifiers are for printing function pointers, -for example, f->func, &gettimeofday. They have the same result as -``S`` and ``s`` specifiers. But they do an extra conversion on -ia64, ppc64 and parisc64 architectures where the function pointers -are actually function descriptors. +The ``S`` and ``s`` specifiers are used for printing a pointer in symbolic +format. They result in the symbol name with (``S``) or without (``s``) +offsets. If KALLSYMS are disabled then the symbol address is printed instead. -The ``S`` and ``s`` specifiers can be used for printing symbols -from direct addresses, for example, __builtin_return_address(0), -(void *)regs->ip. They result in the symbol name with (``S``) or -without (``s``) offsets. If KALLSYMS are disabled then the symbol -address is printed instead. +Note, that the ``F`` and ``f`` specifiers are identical to ``S`` (``s``) +and thus deprecated. We have ``F`` and ``f`` because on ia64, ppc64 and +parisc64 function pointers are indirect and, in fact, are function +descriptors, which require additional dereferencing before we can lookup +the symbol. As of now, ``S`` and ``s`` perform dereferencing on those +platforms (when needed), so ``F`` and ``f`` exist for compatibility +reasons only. The ``B`` specifier results in the symbol name with offsets and should be used when printing stack backtraces. The specifier takes into consideration the effect of compiler optimisations which may occur when tail-call``s are used and marked with the noreturn GCC attribute. -Examples:: - - printk("Going to call: %pF\n", gettimeofday); - printk("Going to call: %pF\n", p->func); - printk("%s: called from %pS\n", __func__, (void *)_RET_IP_); - printk("%s: called from %pS\n", __func__, - (void *)__builtin_return_address(0)); - printk("Faulted at %pS\n", (void *)regs->ip); - printk(" %s%pB\n", (reliable ? "" : "? "), (void *)*stack); - Kernel Pointers =============== diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 946748424871..33c6df730a6b 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -641,7 +641,7 @@ oom_dump_tasks Enables a system-wide task dump (excluding kernel threads) to be produced when the kernel performs an OOM-killing and includes such information as -pid, uid, tgid, vm size, rss, nr_ptes, nr_pmds, swapents, oom_score_adj +pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj score, and name. This is helpful to determine why the OOM killer was invoked, to identify the rogue task that caused it, and to determine why the OOM killer chose the task it did to kill. diff --git a/MAINTAINERS b/MAINTAINERS index bf8526bd9ed9..a17554af111f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2498,7 +2498,6 @@ F: Documentation/devicetree/bindings/sound/axentia,* F: sound/soc/atmel/tse850-pcm5142.c AZ6007 DVB DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -3040,7 +3039,6 @@ F: include/linux/btrfs* F: include/uapi/linux/btrfs* BTTV VIDEO4LINUX DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -3763,7 +3761,6 @@ S: Maintained F: drivers/media/dvb-frontends/cx24120* CX88 VIDEO4LINUX DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -4910,7 +4907,6 @@ F: drivers/edac/thunderx_edac* EDAC-CORE M: Borislav Petkov -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-edac@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git for-next @@ -4939,7 +4935,6 @@ S: Maintained F: drivers/edac/fsl_ddr_edac.* EDAC-GHES -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-edac@vger.kernel.org S: Maintained @@ -4956,21 +4951,18 @@ S: Maintained F: drivers/edac/i5000_edac.c EDAC-I5400 -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-edac@vger.kernel.org S: Maintained F: drivers/edac/i5400_edac.c EDAC-I7300 -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-edac@vger.kernel.org S: Maintained F: drivers/edac/i7300_edac.c EDAC-I7CORE -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-edac@vger.kernel.org S: Maintained @@ -5020,7 +5012,6 @@ S: Maintained F: drivers/edac/r82600_edac.c EDAC-SBRIDGE -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-edac@vger.kernel.org S: Maintained @@ -5073,7 +5064,6 @@ S: Maintained F: drivers/net/ethernet/ibm/ehea/ EM28XX VIDEO4LINUX DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -6758,6 +6748,13 @@ M: James Hogan S: Maintained F: drivers/media/rc/img-ir/ +IMON SOUNDGRAPH USB IR RECEIVER +M: Sean Young +L: linux-media@vger.kernel.org +S: Maintained +F: drivers/media/rc/imon_raw.c +F: drivers/media/rc/imon.c + IMS TWINTURBO FRAMEBUFFER DRIVER L: linux-fbdev@vger.kernel.org S: Orphan @@ -8594,7 +8591,6 @@ S: Maintained F: drivers/media/dvb-frontends/stv6111* MEDIA INPUT INFRASTRUCTURE (V4L/DVB) -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab P: LinuxTV.org Project L: linux-media@vger.kernel.org @@ -11766,7 +11762,6 @@ S: Odd Fixes F: drivers/media/i2c/saa6588* SAA7134 VIDEO4LINUX DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -12254,7 +12249,6 @@ S: Maintained F: drivers/media/radio/si4713/radio-usb-si4713.c SIANO DVB DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -13126,7 +13120,6 @@ S: Maintained F: drivers/media/i2c/tda9840* TEA5761 TUNER DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -13135,7 +13128,6 @@ S: Odd fixes F: drivers/media/tuners/tea5761.* TEA5767 TUNER DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -13542,7 +13534,6 @@ F: Documentation/networking/tlan.txt F: drivers/net/ethernet/ti/tlan.* TM6000 VIDEO4LINUX DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -14708,7 +14699,6 @@ S: Maintained F: arch/x86/entry/vdso/ XC2028/3028 TUNER DRIVER -M: Mauro Carvalho Chehab M: Mauro Carvalho Chehab L: linux-media@vger.kernel.org W: https://linuxtv.org @@ -14716,6 +14706,14 @@ T: git git://linuxtv.org/media_tree.git S: Maintained F: drivers/media/tuners/tuner-xc2028.* +XDP SOCKETS (AF_XDP) +M: Björn Töpel +M: Magnus Karlsson +L: netdev@vger.kernel.org +S: Maintained +F: kernel/bpf/xskmap.c +F: net/xdp/ + XEN BLOCK SUBSYSTEM M: Konrad Rzeszutek Wilk M: Roger Pau Monné diff --git a/Makefile b/Makefile index 6512f8b47dbe..935009b5736f 100644 --- a/Makefile +++ b/Makefile @@ -408,6 +408,7 @@ OBJCOPY = $(CROSS_COMPILE)objcopy OBJDUMP = $(CROSS_COMPILE)objdump READELF = $(CROSS_COMPILE)readelf endif +PAHOLE = pahole AWK = awk GENKSYMS = scripts/genksyms/genksyms INSTALLKERNEL := installkernel @@ -472,7 +473,7 @@ GCC_PLUGINS_CFLAGS := CLANG_FLAGS := export ARCH SRCARCH CONFIG_SHELL HOSTCC HOSTCFLAGS CROSS_COMPILE LD CC -export CPP AR NM STRIP OBJCOPY OBJDUMP READELF HOSTLDFLAGS HOST_LOADLIBES +export CPP AR NM STRIP OBJCOPY OBJDUMP PAHOLE READELF HOSTLDFLAGS HOST_LOADLIBES export MAKE AWK GENKSYMS INSTALLKERNEL PERL PYTHON UTS_MACHINE export HOSTCXX HOSTCXXFLAGS LDFLAGS_MODULE CHECK CHECKFLAGS diff --git a/arch/Kconfig b/arch/Kconfig index 60b8d811afc9..bb31eb5d863a 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -199,6 +199,9 @@ config HAVE_OPTPROBES config HAVE_KPROBES_ON_FTRACE bool +config HAVE_FUNCTION_ERROR_INJECTION + bool + config HAVE_NMI bool @@ -240,6 +243,10 @@ config ARCH_HAS_SET_MEMORY config ARCH_HAS_CPU_FINALIZE_INIT bool +# Select if arch has all set_direct_map_invalid/default() functions +config ARCH_HAS_SET_DIRECT_MAP + bool + # Select if arch init_task initializer is different to init/init_task.c config ARCH_INIT_TASK bool diff --git a/arch/alpha/include/asm/atomic.h b/arch/alpha/include/asm/atomic.h index 85867d3cea64..1850abfff37b 100644 --- a/arch/alpha/include/asm/atomic.h +++ b/arch/alpha/include/asm/atomic.h @@ -193,7 +193,7 @@ ATOMIC_OPS(xor, xor) #define atomic_xchg(v, new) (xchg(&((v)->counter), new)) /** - * __atomic_add_unless - add unless the number is a given value + * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -201,7 +201,7 @@ ATOMIC_OPS(xor, xor) * Atomically adds @a to @v, so long as it was not @u. * Returns the old value of @v. */ -static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u) +static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, new, old; smp_mb(); diff --git a/arch/arc/include/asm/atomic.h b/arch/arc/include/asm/atomic.h index c98b59ac0612..2873e07dcb21 100644 --- a/arch/arc/include/asm/atomic.h +++ b/arch/arc/include/asm/atomic.h @@ -309,7 +309,7 @@ ATOMIC_OPS(xor, ^=, CTOP_INST_AXOR_DI_R2_R2_R3) #undef ATOMIC_OP /** - * __atomic_add_unless - add unless the number is a given value + * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -317,7 +317,7 @@ ATOMIC_OPS(xor, ^=, CTOP_INST_AXOR_DI_R2_R2_R3) * Atomically adds @a to @v, so long as it was not @u. * Returns the old value of @v */ -#define __atomic_add_unless(v, a, u) \ +#define atomic_fetch_add_unless(v, a, u) \ ({ \ int c, old; \ \ diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h index 66d0e215a773..9d56d0727c9b 100644 --- a/arch/arm/include/asm/atomic.h +++ b/arch/arm/include/asm/atomic.h @@ -130,7 +130,7 @@ static inline int atomic_cmpxchg_relaxed(atomic_t *ptr, int old, int new) } #define atomic_cmpxchg_relaxed atomic_cmpxchg_relaxed -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int oldval, newval; unsigned long tmp; @@ -215,7 +215,7 @@ static inline int atomic_cmpxchg(atomic_t *v, int old, int new) return ret; } -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; diff --git a/arch/arm/mach-omap2/Makefile b/arch/arm/mach-omap2/Makefile index 38f1748a4500..fdc8abb3d524 100644 --- a/arch/arm/mach-omap2/Makefile +++ b/arch/arm/mach-omap2/Makefile @@ -78,7 +78,6 @@ endif omap-4-5-pm-common = omap-mpuss-lowpower.o obj-$(CONFIG_ARCH_OMAP4) += $(omap-4-5-pm-common) obj-$(CONFIG_SOC_OMAP5) += $(omap-4-5-pm-common) -obj-$(CONFIG_OMAP_PM_NOOP) += omap-pm-noop.o ifeq ($(CONFIG_PM),y) obj-$(CONFIG_ARCH_OMAP2) += pm24xx.o diff --git a/arch/arm/mach-omap2/display.c b/arch/arm/mach-omap2/display.c index b01b7515b6cc..46cc20a72361 100644 --- a/arch/arm/mach-omap2/display.c +++ b/arch/arm/mach-omap2/display.c @@ -32,7 +32,6 @@ #include #include "omap_hwmod.h" #include "omap_device.h" -#include "omap-pm.h" #include "common.h" #include "soc.h" @@ -131,11 +130,6 @@ static void omap_dsi_disable_pads(int dsi_id, unsigned lane_mask) omap4_dsi_mux_pads(dsi_id, 0); } -static int omap_dss_set_min_bus_tput(struct device *dev, unsigned long tput) -{ - return omap_pm_set_min_bus_tput(dev, OCP_INITIATOR_AGENT, tput); -} - static enum omapdss_version __init omap_display_get_version(void) { if (cpu_is_omap24xx()) @@ -174,7 +168,6 @@ static int __init omapdss_init_fbdev(void) static struct omap_dss_board_info board_data = { .dsi_enable_pads = omap_dsi_enable_pads, .dsi_disable_pads = omap_dsi_disable_pads, - .set_min_bus_tput = omap_dss_set_min_bus_tput, }; struct device_node *node; int r; diff --git a/arch/arm/mach-omap2/hsmmc.c b/arch/arm/mach-omap2/hsmmc.c index 6d28aa20a7d3..eb3aece1f8f7 100644 --- a/arch/arm/mach-omap2/hsmmc.c +++ b/arch/arm/mach-omap2/hsmmc.c @@ -20,7 +20,6 @@ #include "soc.h" #include "omap_device.h" -#include "omap-pm.h" #include "hsmmc.h" #include "control.h" diff --git a/arch/arm/mach-omap2/i2c.c b/arch/arm/mach-omap2/i2c.c index 91a21c3923b2..37ff25ee3d89 100644 --- a/arch/arm/mach-omap2/i2c.c +++ b/arch/arm/mach-omap2/i2c.c @@ -22,7 +22,6 @@ #include "soc.h" #include "omap_hwmod.h" #include "omap_device.h" -#include "omap-pm.h" #include "prm.h" #include "common.h" diff --git a/arch/arm/mach-omap2/io.c b/arch/arm/mach-omap2/io.c index cb5d7314cf99..5f358930684a 100644 --- a/arch/arm/mach-omap2/io.c +++ b/arch/arm/mach-omap2/io.c @@ -37,7 +37,6 @@ #include "clock.h" #include "clock2xxx.h" #include "clock3xxx.h" -#include "omap-pm.h" #include "sdrc.h" #include "control.h" #include "serial.h" @@ -421,8 +420,6 @@ static void __init __maybe_unused omap_hwmod_init_postsetup(void) postsetup_state = _HWMOD_STATE_ENABLED; #endif omap_hwmod_for_each(_set_hwmod_postsetup_state, &postsetup_state); - - omap_pm_if_early_init(); } static void __init __maybe_unused omap_common_late_init(void) diff --git a/arch/arm/mach-omap2/omap-pm-noop.c b/arch/arm/mach-omap2/omap-pm-noop.c deleted file mode 100644 index 4ead077ea4e7..000000000000 --- a/arch/arm/mach-omap2/omap-pm-noop.c +++ /dev/null @@ -1,176 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -/* - * omap-pm-noop.c - OMAP power management interface - dummy version - * - * This code implements the OMAP power management interface to - * drivers, CPUIdle, CPUFreq, and DSP Bridge. It is strictly for - * debug/demonstration use, as it does nothing but printk() whenever a - * function is called (when DEBUG is defined, below) - * - * Copyright (C) 2008-2009 Texas Instruments, Inc. - * Copyright (C) 2008-2009 Nokia Corporation - * Paul Walmsley - * - * Interface developed by (in alphabetical order): - * Karthik Dasu, Tony Lindgren, Rajendra Nayak, Sakari Poussa, Veeramanikandan - * Raju, Anand Sawant, Igor Stoppa, Paul Walmsley, Richard Woodruff - */ - -#undef DEBUG - -#include -#include -#include -#include - -#include "omap_device.h" -#include "omap-pm.h" - -static bool off_mode_enabled; -static int dummy_context_loss_counter; - -/* - * Device-driver-originated constraints (via board-*.c files) - */ - -int omap_pm_set_max_mpu_wakeup_lat(struct device *dev, long t) -{ - if (!dev || t < -1) { - WARN(1, "OMAP PM: %s: invalid parameter(s)", __func__); - return -EINVAL; - } - - if (t == -1) - pr_debug("OMAP PM: remove max MPU wakeup latency constraint: dev %s\n", - dev_name(dev)); - else - pr_debug("OMAP PM: add max MPU wakeup latency constraint: dev %s, t = %ld usec\n", - dev_name(dev), t); - - /* - * For current Linux, this needs to map the MPU to a - * powerdomain, then go through the list of current max lat - * constraints on the MPU and find the smallest. If - * the latency constraint has changed, the code should - * recompute the state to enter for the next powerdomain - * state. - * - * TI CDP code can call constraint_set here. - */ - - return 0; -} - -int omap_pm_set_min_bus_tput(struct device *dev, u8 agent_id, unsigned long r) -{ - if (!dev || (agent_id != OCP_INITIATOR_AGENT && - agent_id != OCP_TARGET_AGENT)) { - WARN(1, "OMAP PM: %s: invalid parameter(s)", __func__); - return -EINVAL; - } - - if (r == 0) - pr_debug("OMAP PM: remove min bus tput constraint: dev %s for agent_id %d\n", - dev_name(dev), agent_id); - else - pr_debug("OMAP PM: add min bus tput constraint: dev %s for agent_id %d: rate %ld KiB\n", - dev_name(dev), agent_id, r); - - /* - * This code should model the interconnect and compute the - * required clock frequency, convert that to a VDD2 OPP ID, then - * set the VDD2 OPP appropriately. - * - * TI CDP code can call constraint_set here on the VDD2 OPP. - */ - - return 0; -} - -/* - * DSP Bridge-specific constraints - */ - - -/** - * omap_pm_enable_off_mode - notify OMAP PM that off-mode is enabled - * - * Intended for use only by OMAP PM core code to notify this layer - * that off mode has been enabled. - */ -void omap_pm_enable_off_mode(void) -{ - off_mode_enabled = true; -} - -/** - * omap_pm_disable_off_mode - notify OMAP PM that off-mode is disabled - * - * Intended for use only by OMAP PM core code to notify this layer - * that off mode has been disabled. - */ -void omap_pm_disable_off_mode(void) -{ - off_mode_enabled = false; -} - -/* - * Device context loss tracking - */ - -#ifdef CONFIG_ARCH_OMAP2PLUS - -int omap_pm_get_dev_context_loss_count(struct device *dev) -{ - struct platform_device *pdev = to_platform_device(dev); - int count; - - if (WARN_ON(!dev)) - return -ENODEV; - - if (dev->pm_domain == &omap_device_pm_domain) { - count = omap_device_get_context_loss_count(pdev); - } else { - WARN_ONCE(off_mode_enabled, "omap_pm: using dummy context loss counter; device %s should be converted to omap_device", - dev_name(dev)); - - count = dummy_context_loss_counter; - - if (off_mode_enabled) { - count++; - /* - * Context loss count has to be a non-negative value. - * Clear the sign bit to get a value range from 0 to - * INT_MAX. - */ - count &= INT_MAX; - dummy_context_loss_counter = count; - } - } - - pr_debug("OMAP PM: context loss count for dev %s = %d\n", - dev_name(dev), count); - - return count; -} - -#else - -int omap_pm_get_dev_context_loss_count(struct device *dev) -{ - return dummy_context_loss_counter; -} - -#endif - -/* Should be called before clk framework init */ -int __init omap_pm_if_early_init(void) -{ - return 0; -} - -/* Must be called after clock framework is initialized */ -int __init omap_pm_if_init(void) -{ - return 0; -} diff --git a/arch/arm/mach-omap2/omap-pm.h b/arch/arm/mach-omap2/omap-pm.h deleted file mode 100644 index 5ba5df47f91b..000000000000 --- a/arch/arm/mach-omap2/omap-pm.h +++ /dev/null @@ -1,161 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* - * omap-pm.h - OMAP power management interface - * - * Copyright (C) 2008-2010 Texas Instruments, Inc. - * Copyright (C) 2008-2010 Nokia Corporation - * Paul Walmsley - * - * Interface developed by (in alphabetical order): Karthik Dasu, Jouni - * Högander, Tony Lindgren, Rajendra Nayak, Sakari Poussa, - * Veeramanikandan Raju, Anand Sawant, Igor Stoppa, Paul Walmsley, - * Richard Woodruff - */ - -#ifndef ASM_ARM_ARCH_OMAP_OMAP_PM_H -#define ASM_ARM_ARCH_OMAP_OMAP_PM_H - -#include -#include -#include -#include - -/* - * agent_id values for use with omap_pm_set_min_bus_tput(): - * - * OCP_INITIATOR_AGENT is only valid for devices that can act as - * initiators -- it represents the device's L3 interconnect - * connection. OCP_TARGET_AGENT represents the device's L4 - * interconnect connection. - */ -#define OCP_TARGET_AGENT 1 -#define OCP_INITIATOR_AGENT 2 - -/** - * omap_pm_if_early_init - OMAP PM init code called before clock fw init - * @mpu_opp_table: array ptr to struct omap_opp for MPU - * @dsp_opp_table: array ptr to struct omap_opp for DSP - * @l3_opp_table : array ptr to struct omap_opp for CORE - * - * Initialize anything that must be configured before the clock - * framework starts. The "_if_" is to avoid name collisions with the - * PM idle-loop code. - */ -int __init omap_pm_if_early_init(void); - -/** - * omap_pm_if_init - OMAP PM init code called after clock fw init - * - * The main initialization code. OPP tables are passed in here. The - * "_if_" is to avoid name collisions with the PM idle-loop code. - */ -int __init omap_pm_if_init(void); - -/* - * Device-driver-originated constraints (via board-*.c files, platform_data) - */ - - -/** - * omap_pm_set_max_mpu_wakeup_lat - set the maximum MPU wakeup latency - * @dev: struct device * requesting the constraint - * @t: maximum MPU wakeup latency in microseconds - * - * Request that the maximum interrupt latency for the MPU to be no - * greater than @t microseconds. "Interrupt latency" in this case is - * defined as the elapsed time from the occurrence of a hardware or - * timer interrupt to the time when the device driver's interrupt - * service routine has been entered by the MPU. - * - * It is intended that underlying PM code will use this information to - * determine what power state to put the MPU powerdomain into, and - * possibly the CORE powerdomain as well, since interrupt handling - * code currently runs from SDRAM. Advanced PM or board*.c code may - * also configure interrupt controller priorities, OCP bus priorities, - * CPU speed(s), etc. - * - * This function will not affect device wakeup latency, e.g., time - * elapsed from when a device driver enables a hardware device with - * clk_enable(), to when the device is ready for register access or - * other use. To control this device wakeup latency, use - * omap_pm_set_max_dev_wakeup_lat() - * - * Multiple calls to omap_pm_set_max_mpu_wakeup_lat() will replace the - * previous t value. To remove the latency target for the MPU, call - * with t = -1. - * - * XXX This constraint will be deprecated soon in favor of the more - * general omap_pm_set_max_dev_wakeup_lat() - * - * Returns -EINVAL for an invalid argument, -ERANGE if the constraint - * is not satisfiable, or 0 upon success. - */ -int omap_pm_set_max_mpu_wakeup_lat(struct device *dev, long t); - - -/** - * omap_pm_set_min_bus_tput - set minimum bus throughput needed by device - * @dev: struct device * requesting the constraint - * @tbus_id: interconnect to operate on (OCP_{INITIATOR,TARGET}_AGENT) - * @r: minimum throughput (in KiB/s) - * - * Request that the minimum data throughput on the OCP interconnect - * attached to device @dev interconnect agent @tbus_id be no less - * than @r KiB/s. - * - * It is expected that the OMAP PM or bus code will use this - * information to set the interconnect clock to run at the lowest - * possible speed that satisfies all current system users. The PM or - * bus code will adjust the estimate based on its model of the bus, so - * device driver authors should attempt to specify an accurate - * quantity for their device use case, and let the PM or bus code - * overestimate the numbers as necessary to handle request/response - * latency, other competing users on the system, etc. On OMAP2/3, if - * a driver requests a minimum L4 interconnect speed constraint, the - * code will also need to add an minimum L3 interconnect speed - * constraint, - * - * Multiple calls to omap_pm_set_min_bus_tput() will replace the - * previous rate value for this device. To remove the interconnect - * throughput restriction for this device, call with r = 0. - * - * Returns -EINVAL for an invalid argument, -ERANGE if the constraint - * is not satisfiable, or 0 upon success. - */ -int omap_pm_set_min_bus_tput(struct device *dev, u8 agent_id, unsigned long r); - - -/* - * CPUFreq-originated constraint - * - * In the future, this should be handled by custom OPP clocktype - * functions. - */ - - -/* - * Device context loss tracking - */ - -/** - * omap_pm_get_dev_context_loss_count - return count of times dev has lost ctx - * @dev: struct device * - * - * This function returns the number of times that the device @dev has - * lost its internal context. This generally occurs on a powerdomain - * transition to OFF. Drivers use this as an optimization to avoid restoring - * context if the device hasn't lost it. To use, drivers should initially - * call this in their context save functions and store the result. Early in - * the driver's context restore function, the driver should call this function - * again, and compare the result to the stored counter. If they differ, the - * driver must restore device context. If the number of context losses - * exceeds the maximum positive integer, the function will wrap to 0 and - * continue counting. Returns the number of context losses for this device, - * or negative value upon error. - */ -int omap_pm_get_dev_context_loss_count(struct device *dev); - -void omap_pm_enable_off_mode(void); -void omap_pm_disable_off_mode(void); - -#endif diff --git a/arch/arm/mach-omap2/pdata-quirks.c b/arch/arm/mach-omap2/pdata-quirks.c index 2477f6086de4..70d280c72b66 100644 --- a/arch/arm/mach-omap2/pdata-quirks.c +++ b/arch/arm/mach-omap2/pdata-quirks.c @@ -25,7 +25,6 @@ #include #include #include -#include #include #include @@ -33,7 +32,6 @@ #include "common-board-devices.h" #include "control.h" #include "omap_device.h" -#include "omap-pm.h" #include "omap-secure.h" #include "soc.h" #include "hsmmc.h" @@ -411,18 +409,6 @@ static struct pwm_omap_dmtimer_pdata pwm_dmtimer_pdata = { }; #endif -static struct ir_rx51_platform_data __maybe_unused rx51_ir_data = { - .set_max_mpu_wakeup_lat = omap_pm_set_max_mpu_wakeup_lat, -}; - -static struct platform_device __maybe_unused rx51_ir_device = { - .name = "ir_rx51", - .id = -1, - .dev = { - .platform_data = &rx51_ir_data, - }, -}; - #if IS_ENABLED(CONFIG_SND_OMAP_SOC_MCBSP) static struct omap_mcbsp_platform_data mcbsp_pdata; static void __init omap3_mcbsp_init(void) diff --git a/arch/arm/mach-omap2/pm-debug.c b/arch/arm/mach-omap2/pm-debug.c index 5c46ea6756d7..acb698d5780f 100644 --- a/arch/arm/mach-omap2/pm-debug.c +++ b/arch/arm/mach-omap2/pm-debug.c @@ -31,7 +31,6 @@ #include "clock.h" #include "powerdomain.h" #include "clockdomain.h" -#include "omap-pm.h" #include "soc.h" #include "cm2xxx_3xxx.h" @@ -240,10 +239,6 @@ static int option_set(void *data, u64 val) *option = val; if (option == &enable_off_mode) { - if (val) - omap_pm_enable_off_mode(); - else - omap_pm_disable_off_mode(); if (cpu_is_omap34xx()) omap3_pm_off_mode_enable(val); } diff --git a/arch/arm/mach-omap2/pm.c b/arch/arm/mach-omap2/pm.c index 6f68576e5695..b98c46d7f112 100644 --- a/arch/arm/mach-omap2/pm.c +++ b/arch/arm/mach-omap2/pm.c @@ -16,11 +16,11 @@ #include #include #include +#include #include #include -#include "omap-pm.h" #include "omap_device.h" #include "common.h" @@ -230,14 +230,6 @@ static void __init omap4_init_voltages(void) omap2_set_init_voltage("iva", "dpll_iva_m5x2_ck", "iva"); } -static int __init omap2_common_pm_init(void) -{ - omap_pm_if_init(); - - return 0; -} -omap_postcore_initcall(omap2_common_pm_init); - int __init omap2_common_pm_late_init(void) { /* Init the voltage layer */ diff --git a/arch/arm/mach-omap2/timer.c b/arch/arm/mach-omap2/timer.c index c421d12b3203..11032df74daa 100644 --- a/arch/arm/mach-omap2/timer.c +++ b/arch/arm/mach-omap2/timer.c @@ -50,7 +50,6 @@ #include "omap_device.h" #include #include -#include "omap-pm.h" #include "soc.h" #include "common.h" diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c index 6a1e9b44be99..006421cc9257 100644 --- a/arch/arm/mm/pgd.c +++ b/arch/arm/mm/pgd.c @@ -142,7 +142,7 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd_base) pte = pmd_pgtable(*pmd); pmd_clear(pmd); pte_free(mm, pte); - atomic_long_dec(&mm->nr_ptes); + mm_dec_nr_ptes(mm); no_pmd: pud_clear(pud); pmd_free(mm, pmd); diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c index 68aa2f6d9f83..009d8e97d70a 100644 --- a/arch/arm/net/bpf_jit_32.c +++ b/arch/arm/net/bpf_jit_32.c @@ -1545,6 +1545,9 @@ exit: } break; } + /* speculation barrier */ + case BPF_ST | BPF_NOSPEC: + break; /* ST: *(size *)(dst + off) = imm */ case BPF_ST | BPF_MEM | BPF_W: case BPF_ST | BPF_MEM | BPF_H: diff --git a/arch/arm/plat-omap/Kconfig b/arch/arm/plat-omap/Kconfig index 7276afee30b3..4609e1c8a1cd 100644 --- a/arch/arm/plat-omap/Kconfig +++ b/arch/arm/plat-omap/Kconfig @@ -121,16 +121,6 @@ config OMAP_SERIAL_WAKE to data on the serial RX line. This allows you to wake the system from serial console. -choice - prompt "OMAP PM layer selection" - depends on ARCH_OMAP - default OMAP_PM_NOOP - -config OMAP_PM_NOOP - bool "No-op/debug PM layer" - -endchoice - endmenu endif diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h index c0235e0ff849..264d20339f74 100644 --- a/arch/arm64/include/asm/atomic.h +++ b/arch/arm64/include/asm/atomic.h @@ -125,7 +125,7 @@ #define atomic_dec_and_test(v) (atomic_dec_return(v) == 0) #define atomic_sub_and_test(i, v) (atomic_sub_return((i), (v)) == 0) #define atomic_add_negative(i, v) (atomic_add_return((i), (v)) < 0) -#define __atomic_add_unless(v, a, u) ___atomic_add_unless(v, a, u,) +#define atomic_fetch_add_unless(v, a, u) ___atomic_add_unless(v, a, u,) #define atomic_andnot atomic_andnot /* diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 7b1b649da4b9..93586c84fc07 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -99,6 +99,20 @@ static inline void emit_a64_mov_i64(const int reg, const u64 val, } } +static inline void emit_addr_mov_i64(const int reg, const u64 val, + struct jit_ctx *ctx) +{ + u64 tmp = val; + int shift = 0; + + emit(A64_MOVZ(1, reg, tmp & 0xffff, shift), ctx); + for (;shift < 48;) { + tmp >>= 16; + shift += 16; + emit(A64_MOVK(1, reg, tmp & 0xffff, shift), ctx); + } +} + static inline void emit_a64_mov_i(const int is64, const int reg, const s32 val, struct jit_ctx *ctx) { @@ -606,7 +620,10 @@ emit_cond_jmp: const u8 r0 = bpf2a64[BPF_REG_0]; const u64 func = (u64)__bpf_call_base + imm; - emit_a64_mov_i64(tmp, func, ctx); + if (ctx->prog->is_func) + emit_addr_mov_i64(tmp, func, ctx); + else + emit_a64_mov_i64(tmp, func, ctx); emit(A64_BLR(tmp), ctx); emit(A64_MOV(1, r0, A64_R(0)), ctx); break; @@ -661,6 +678,19 @@ emit_cond_jmp: } break; + /* speculation barrier */ + case BPF_ST | BPF_NOSPEC: + /* + * Nothing required here. + * + * In case of arm64, we rely on the firmware mitigation of + * Speculative Store Bypass as controlled via the ssbd kernel + * parameter. Whenever the mitigation is enabled, it works + * for all of the kernel code with no need to provide any + * additional instructions. + */ + break; + /* ST: *(size *)(dst + off) = imm */ case BPF_ST | BPF_MEM | BPF_W: case BPF_ST | BPF_MEM | BPF_H: @@ -847,11 +877,19 @@ static inline void bpf_flush_icache(void *start, void *end) flush_icache_range((unsigned long)start, (unsigned long)end); } +struct arm64_jit_data { + struct bpf_binary_header *header; + u8 *image; + struct jit_ctx ctx; +}; + struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) { struct bpf_prog *tmp, *orig_prog = prog; struct bpf_binary_header *header; + struct arm64_jit_data *jit_data; bool tmp_blinded = false; + bool extra_pass = false; struct jit_ctx ctx; int image_size; u8 *image_ptr; @@ -870,13 +908,30 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) prog = tmp; } + jit_data = prog->aux->jit_data; + if (!jit_data) { + jit_data = kzalloc(sizeof(*jit_data), GFP_KERNEL); + if (!jit_data) { + prog = orig_prog; + goto out; + } + prog->aux->jit_data = jit_data; + } + if (jit_data->ctx.offset) { + ctx = jit_data->ctx; + image_ptr = jit_data->image; + header = jit_data->header; + extra_pass = true; + image_size = sizeof(u32) * ctx.idx; + goto skip_init_ctx; + } memset(&ctx, 0, sizeof(ctx)); ctx.prog = prog; ctx.offset = kcalloc(prog->len, sizeof(int), GFP_KERNEL); if (ctx.offset == NULL) { prog = orig_prog; - goto out; + goto out_off; } /* 1. Initial fake pass to compute ctx->idx. */ @@ -907,6 +962,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) /* 2. Now, the actual pass. */ ctx.image = (__le32 *)image_ptr; +skip_init_ctx: ctx.idx = 0; build_prologue(&ctx); @@ -932,7 +988,21 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) bpf_flush_icache(header, ctx.image + ctx.idx); - bpf_jit_binary_lock_ro(header); + if (!prog->is_func || extra_pass) { + if (extra_pass && ctx.idx != jit_data->ctx.idx) { + pr_err_once("multi-func JIT bug %d != %d\n", + ctx.idx, jit_data->ctx.idx); + bpf_jit_binary_free(header); + prog->bpf_func = NULL; + prog->jited = 0; + goto out_off; + } + bpf_jit_binary_lock_ro(header); + } else { + jit_data->ctx = ctx; + jit_data->image = image_ptr; + jit_data->header = header; + } prog->bpf_func = (void *)ctx.image; prog->jited = 1; prog->jited_len = image_size; @@ -940,8 +1010,12 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) uh_call(UH_APP_RKP, RKP_BFP_LOAD, (u64)header, (u64)(header->pages * PAGE_SIZE), RKP_BPF_JIT_LOAD, 0); #endif + if (!prog->is_func || extra_pass) { out_off: - kfree(ctx.offset); + kfree(ctx.offset); + kfree(jit_data); + prog->aux->jit_data = NULL; + } out: if (tmp_blinded) bpf_jit_prog_release_other(prog, prog == orig_prog ? diff --git a/arch/frv/include/asm/atomic.h b/arch/frv/include/asm/atomic.h index e93c9494503a..a4d17c0272b3 100644 --- a/arch/frv/include/asm/atomic.h +++ b/arch/frv/include/asm/atomic.h @@ -146,7 +146,7 @@ static inline void atomic64_dec(atomic64_t *v) #define atomic64_cmpxchg(v, old, new) (__cmpxchg_64(old, new, &(v)->counter)) #define atomic64_xchg(v, new) (__xchg_64(new, &(v)->counter)) -static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u) +static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/arch/h8300/include/asm/atomic.h b/arch/h8300/include/asm/atomic.h index 941e7554e886..4465cfc30a3a 100644 --- a/arch/h8300/include/asm/atomic.h +++ b/arch/h8300/include/asm/atomic.h @@ -94,7 +94,7 @@ static inline int atomic_cmpxchg(atomic_t *v, int old, int new) return ret; } -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int ret; h8300flags flags; diff --git a/arch/hexagon/include/asm/atomic.h b/arch/hexagon/include/asm/atomic.h index d4e283b4f335..afb210546675 100644 --- a/arch/hexagon/include/asm/atomic.h +++ b/arch/hexagon/include/asm/atomic.h @@ -164,7 +164,7 @@ ATOMIC_OPS(xor) #undef ATOMIC_OP /** - * __atomic_add_unless - add unless the number is a given value + * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer to value * @a: amount to add * @u: unless value is equal to u @@ -173,7 +173,7 @@ ATOMIC_OPS(xor) * */ -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int __oldval; register int tmp; diff --git a/arch/ia64/include/asm/atomic.h b/arch/ia64/include/asm/atomic.h index 28e02c99be6d..5daf6c0fe3bb 100644 --- a/arch/ia64/include/asm/atomic.h +++ b/arch/ia64/include/asm/atomic.h @@ -237,7 +237,7 @@ ATOMIC64_FETCH_OP(xor, ^) (cmpxchg(&((v)->counter), old, new)) #define atomic64_xchg(v, new) (xchg(&((v)->counter), new)) -static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u) +static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/arch/m32r/include/asm/atomic.h b/arch/m32r/include/asm/atomic.h index 8bf67e55ff54..123fb4ca942a 100644 --- a/arch/m32r/include/asm/atomic.h +++ b/arch/m32r/include/asm/atomic.h @@ -249,7 +249,7 @@ static __inline__ int atomic_dec_return(atomic_t *v) #define atomic_xchg(v, new) (xchg(&((v)->counter), new)) /** - * __atomic_add_unless - add unless the number is a given value + * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -257,7 +257,7 @@ static __inline__ int atomic_dec_return(atomic_t *v) * Atomically adds @a to @v, so long as it was not @u. * Returns the old value of @v. */ -static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u) +static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/arch/m68k/include/asm/atomic.h b/arch/m68k/include/asm/atomic.h index e993e2860ee1..8022d9ea1213 100644 --- a/arch/m68k/include/asm/atomic.h +++ b/arch/m68k/include/asm/atomic.h @@ -211,7 +211,7 @@ static inline int atomic_add_negative(int i, atomic_t *v) return c != 0; } -static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u) +static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/arch/metag/include/asm/atomic_lnkget.h b/arch/metag/include/asm/atomic_lnkget.h index 17e8c61c946d..4346e3831f6a 100644 --- a/arch/metag/include/asm/atomic_lnkget.h +++ b/arch/metag/include/asm/atomic_lnkget.h @@ -154,7 +154,7 @@ static inline int atomic_xchg(atomic_t *v, int new) return old; } -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int result, temp; diff --git a/arch/metag/include/asm/atomic_lock1.h b/arch/metag/include/asm/atomic_lock1.h index 2ce8fa3a79c2..a0c8ff162fbb 100644 --- a/arch/metag/include/asm/atomic_lock1.h +++ b/arch/metag/include/asm/atomic_lock1.h @@ -122,7 +122,7 @@ static inline int atomic_cmpxchg(atomic_t *v, int old, int new) #define atomic_xchg(v, new) (xchg(&((v)->counter), new)) -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int ret; unsigned long flags; diff --git a/arch/mips/include/asm/atomic.h b/arch/mips/include/asm/atomic.h index 0ab176bdb8e8..02fc1553cf9b 100644 --- a/arch/mips/include/asm/atomic.h +++ b/arch/mips/include/asm/atomic.h @@ -275,7 +275,7 @@ static __inline__ int atomic_sub_if_positive(int i, atomic_t * v) #define atomic_xchg(v, new) (xchg(&((v)->counter), (new))) /** - * __atomic_add_unless - add unless the number is a given value + * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -283,7 +283,7 @@ static __inline__ int atomic_sub_if_positive(int i, atomic_t * v) * Atomically adds @a to @v, so long as it was not @u. * Returns the old value of @v. */ -static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u) +static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c index 57a7a9d68475..9e01a4fbc41d 100644 --- a/arch/mips/net/ebpf_jit.c +++ b/arch/mips/net/ebpf_jit.c @@ -1433,6 +1433,9 @@ ld_skb_common: } break; + case BPF_ST | BPF_NOSPEC: /* speculation barrier */ + break; + case BPF_ST | BPF_B | BPF_MEM: case BPF_ST | BPF_H | BPF_MEM: case BPF_ST | BPF_W | BPF_MEM: @@ -1871,7 +1874,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) unsigned int image_size; u8 *image_ptr; - if (!bpf_jit_enable || !cpu_has_mips64r2) + if (!prog->jit_requested || !cpu_has_mips64r2) return prog; tmp = bpf_jit_blind_constants(prog); diff --git a/arch/mn10300/include/asm/atomic.h b/arch/mn10300/include/asm/atomic.h index 36389efd45e8..6c42d6a0a907 100644 --- a/arch/mn10300/include/asm/atomic.h +++ b/arch/mn10300/include/asm/atomic.h @@ -144,7 +144,7 @@ static inline void atomic_dec(atomic_t *v) #define atomic_dec_and_test(v) (atomic_sub_return(1, (v)) == 0) #define atomic_inc_and_test(v) (atomic_add_return(1, (v)) == 0) -#define __atomic_add_unless(v, a, u) \ +#define atomic_fetch_add_unless(v, a, u) \ ({ \ int c, old; \ c = atomic_read(v); \ diff --git a/arch/openrisc/include/asm/atomic.h b/arch/openrisc/include/asm/atomic.h index 146e1660f00e..b589fac39b92 100644 --- a/arch/openrisc/include/asm/atomic.h +++ b/arch/openrisc/include/asm/atomic.h @@ -100,7 +100,7 @@ ATOMIC_OP(xor) * * This is often used through atomic_inc_not_zero() */ -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int old, tmp; @@ -119,7 +119,7 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u) return old; } -#define __atomic_add_unless __atomic_add_unless +#define atomic_fetch_add_unless atomic_fetch_add_unless #include diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h index 614bcc7673f5..3c48acb14035 100644 --- a/arch/parisc/include/asm/atomic.h +++ b/arch/parisc/include/asm/atomic.h @@ -78,7 +78,7 @@ static __inline__ int atomic_read(const atomic_t *v) #define atomic_xchg(v, new) (xchg(&((v)->counter), new)) /** - * __atomic_add_unless - add unless the number is a given value + * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -86,7 +86,7 @@ static __inline__ int atomic_read(const atomic_t *v) * Atomically adds @a to @v, so long as it was not @u. * Returns the old value of @v. */ -static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u) +static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h index 682b3e6a1e21..1483261080a1 100644 --- a/arch/powerpc/include/asm/atomic.h +++ b/arch/powerpc/include/asm/atomic.h @@ -218,7 +218,7 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v) #define atomic_xchg_relaxed(v, new) xchg_relaxed(&((v)->counter), (new)) /** - * __atomic_add_unless - add unless the number is a given value + * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -226,13 +226,13 @@ static __inline__ int atomic_dec_return_relaxed(atomic_t *v) * Atomically adds @a to @v, so long as it was not @u. * Returns the old value of @v. */ -static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u) +static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int t; __asm__ __volatile__ ( PPC_ATOMIC_ENTRY_BARRIER -"1: lwarx %0,0,%1 # __atomic_add_unless\n\ +"1: lwarx %0,0,%1 # atomic_fetch_add_unless\n\ cmpw 0,%0,%3 \n\ beq 2f \n\ add %0,%2,%0 \n" @@ -538,7 +538,7 @@ static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u) __asm__ __volatile__ ( PPC_ATOMIC_ENTRY_BARRIER -"1: ldarx %0,0,%1 # __atomic_add_unless\n\ +"1: ldarx %0,0,%1 # atomic_fetch_add_unless\n\ cmpd 0,%0,%3 \n\ beq 2f \n\ add %0,%2,%0 \n" diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S index 7346875ac61d..ca3d8e121663 100644 --- a/arch/powerpc/kernel/vmlinux.lds.S +++ b/arch/powerpc/kernel/vmlinux.lds.S @@ -332,12 +332,6 @@ SECTIONS *(.branch_lt) } -#ifdef CONFIG_DEBUG_INFO_BTF - .BTF : AT(ADDR(.BTF) - LOAD_OFFSET) { - *(.BTF) - } -#endif - .opd : AT(ADDR(.opd) - LOAD_OFFSET) { *(.opd) } diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index ad08e13138de..7551f7435f06 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -437,6 +437,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd, pud = pud_offset(pgd, start); pgd_clear(pgd); pud_free_tlb(tlb, pud, start); + mm_dec_nr_puds(tlb->mm); } /* diff --git a/arch/powerpc/net/Makefile b/arch/powerpc/net/Makefile index 02d369ca6a53..809f019d3cba 100644 --- a/arch/powerpc/net/Makefile +++ b/arch/powerpc/net/Makefile @@ -3,7 +3,7 @@ # Arch-specific network modules # ifeq ($(CONFIG_PPC64),y) -obj-$(CONFIG_BPF_JIT) += bpf_jit_asm64.o bpf_jit_comp64.o +obj-$(CONFIG_BPF_JIT) += bpf_jit_comp64.o else obj-$(CONFIG_BPF_JIT) += bpf_jit_asm.o bpf_jit_comp.o endif diff --git a/arch/powerpc/net/bpf_jit64.h b/arch/powerpc/net/bpf_jit64.h index bb944b6018d7..d2353270223e 100644 --- a/arch/powerpc/net/bpf_jit64.h +++ b/arch/powerpc/net/bpf_jit64.h @@ -20,7 +20,7 @@ * with our redzone usage. * * [ prev sp ] <------------- - * [ nv gpr save area ] 8*8 | + * [ nv gpr save area ] 6*8 | * [ tail_call_cnt ] 8 | * [ local_tmp_var ] 8 | * fp (r31) --> [ ebpf stack space ] 512 | @@ -28,8 +28,8 @@ * sp (r1) ---> [ stack pointer ] -------------- */ -/* for gpr non volatile registers BPG_REG_6 to 10, plus skb cache registers */ -#define BPF_PPC_STACK_SAVE (8*8) +/* for gpr non volatile registers BPG_REG_6 to 10 */ +#define BPF_PPC_STACK_SAVE (6*8) /* for bpf JIT code internal usage */ #define BPF_PPC_STACK_LOCALS 16 /* Ensure this is quadword aligned */ @@ -39,10 +39,8 @@ #ifndef __ASSEMBLY__ /* BPF register usage */ -#define SKB_HLEN_REG (MAX_BPF_JIT_REG + 0) -#define SKB_DATA_REG (MAX_BPF_JIT_REG + 1) -#define TMP_REG_1 (MAX_BPF_JIT_REG + 2) -#define TMP_REG_2 (MAX_BPF_JIT_REG + 3) +#define TMP_REG_1 (MAX_BPF_JIT_REG + 0) +#define TMP_REG_2 (MAX_BPF_JIT_REG + 1) /* BPF to ppc register mappings */ static const int b2p[] = { @@ -63,28 +61,12 @@ static const int b2p[] = { [BPF_REG_FP] = 31, /* eBPF jit internal registers */ [BPF_REG_AX] = 2, - [SKB_HLEN_REG] = 25, - [SKB_DATA_REG] = 26, [TMP_REG_1] = 9, [TMP_REG_2] = 10 }; -/* PPC NVR range -- update this if we ever use NVRs below r24 */ -#define BPF_PPC_NVR_MIN 24 - -/* Assembly helpers */ -#define DECLARE_LOAD_FUNC(func) u64 func(u64 r3, u64 r4); \ - u64 func##_negative_offset(u64 r3, u64 r4); \ - u64 func##_positive_offset(u64 r3, u64 r4); - -DECLARE_LOAD_FUNC(sk_load_word); -DECLARE_LOAD_FUNC(sk_load_half); -DECLARE_LOAD_FUNC(sk_load_byte); - -#define CHOOSE_LOAD_FUNC(imm, func) \ - (imm < 0 ? \ - (imm >= SKF_LL_OFF ? func##_negative_offset : func) : \ - func##_positive_offset) +/* PPC NVR range -- update this if we ever use NVRs below r27 */ +#define BPF_PPC_NVR_MIN 27 /* * WARNING: These can use TMP_REG_2 if the offset is not at word boundary, @@ -108,15 +90,14 @@ DECLARE_LOAD_FUNC(sk_load_byte); #define SEEN_FUNC 0x1000 /* might call external helpers */ #define SEEN_STACK 0x2000 /* uses BPF stack */ -#define SEEN_SKB 0x4000 /* uses sk_buff */ -#define SEEN_TAILCALL 0x8000 /* uses tail calls */ +#define SEEN_TAILCALL 0x4000 /* uses tail calls */ struct codegen_context { /* * This is used to track register usage as well * as calls to external helpers. * - register usage is tracked with corresponding - * bits (r3-r10 and r25-r31) + * bits (r3-r10 and r27-r31) * - rest of the bits can be used to track other * things -- for now, we use bits 16 to 23 * encoded in SEEN_* macros above diff --git a/arch/powerpc/net/bpf_jit_asm64.S b/arch/powerpc/net/bpf_jit_asm64.S deleted file mode 100644 index 7e4c51430b84..000000000000 --- a/arch/powerpc/net/bpf_jit_asm64.S +++ /dev/null @@ -1,180 +0,0 @@ -/* - * bpf_jit_asm64.S: Packet/header access helper functions - * for PPC64 BPF compiler. - * - * Copyright 2016, Naveen N. Rao - * IBM Corporation - * - * Based on bpf_jit_asm.S by Matt Evans - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; version 2 - * of the License. - */ - -#include -#include -#include "bpf_jit64.h" - -/* - * All of these routines are called directly from generated code, - * with the below register usage: - * r27 skb pointer (ctx) - * r25 skb header length - * r26 skb->data pointer - * r4 offset - * - * Result is passed back in: - * r8 data read in host endian format (accumulator) - * - * r9 is used as a temporary register - */ - -#define r_skb r27 -#define r_hlen r25 -#define r_data r26 -#define r_off r4 -#define r_val r8 -#define r_tmp r9 - -_GLOBAL_TOC(sk_load_word) - cmpdi r_off, 0 - blt bpf_slow_path_word_neg - b sk_load_word_positive_offset - -_GLOBAL_TOC(sk_load_word_positive_offset) - /* Are we accessing past headlen? */ - subi r_tmp, r_hlen, 4 - cmpd r_tmp, r_off - blt bpf_slow_path_word - /* Nope, just hitting the header. cr0 here is eq or gt! */ - LWZX_BE r_val, r_data, r_off - blr /* Return success, cr0 != LT */ - -_GLOBAL_TOC(sk_load_half) - cmpdi r_off, 0 - blt bpf_slow_path_half_neg - b sk_load_half_positive_offset - -_GLOBAL_TOC(sk_load_half_positive_offset) - subi r_tmp, r_hlen, 2 - cmpd r_tmp, r_off - blt bpf_slow_path_half - LHZX_BE r_val, r_data, r_off - blr - -_GLOBAL_TOC(sk_load_byte) - cmpdi r_off, 0 - blt bpf_slow_path_byte_neg - b sk_load_byte_positive_offset - -_GLOBAL_TOC(sk_load_byte_positive_offset) - cmpd r_hlen, r_off - ble bpf_slow_path_byte - lbzx r_val, r_data, r_off - blr - -/* - * Call out to skb_copy_bits: - * Allocate a new stack frame here to remain ABI-compliant in - * stashing LR. - */ -#define bpf_slow_path_common(SIZE) \ - mflr r0; \ - std r0, PPC_LR_STKOFF(r1); \ - stdu r1, -(STACK_FRAME_MIN_SIZE + BPF_PPC_STACK_LOCALS)(r1); \ - mr r3, r_skb; \ - /* r4 = r_off as passed */ \ - addi r5, r1, STACK_FRAME_MIN_SIZE; \ - li r6, SIZE; \ - bl skb_copy_bits; \ - nop; \ - /* save r5 */ \ - addi r5, r1, STACK_FRAME_MIN_SIZE; \ - /* r3 = 0 on success */ \ - addi r1, r1, STACK_FRAME_MIN_SIZE + BPF_PPC_STACK_LOCALS; \ - ld r0, PPC_LR_STKOFF(r1); \ - mtlr r0; \ - cmpdi r3, 0; \ - blt bpf_error; /* cr0 = LT */ - -bpf_slow_path_word: - bpf_slow_path_common(4) - /* Data value is on stack, and cr0 != LT */ - LWZX_BE r_val, 0, r5 - blr - -bpf_slow_path_half: - bpf_slow_path_common(2) - LHZX_BE r_val, 0, r5 - blr - -bpf_slow_path_byte: - bpf_slow_path_common(1) - lbzx r_val, 0, r5 - blr - -/* - * Call out to bpf_internal_load_pointer_neg_helper - */ -#define sk_negative_common(SIZE) \ - mflr r0; \ - std r0, PPC_LR_STKOFF(r1); \ - stdu r1, -STACK_FRAME_MIN_SIZE(r1); \ - mr r3, r_skb; \ - /* r4 = r_off, as passed */ \ - li r5, SIZE; \ - bl bpf_internal_load_pointer_neg_helper; \ - nop; \ - addi r1, r1, STACK_FRAME_MIN_SIZE; \ - ld r0, PPC_LR_STKOFF(r1); \ - mtlr r0; \ - /* R3 != 0 on success */ \ - cmpldi r3, 0; \ - beq bpf_error_slow; /* cr0 = EQ */ - -bpf_slow_path_word_neg: - lis r_tmp, -32 /* SKF_LL_OFF */ - cmpd r_off, r_tmp /* addr < SKF_* */ - blt bpf_error /* cr0 = LT */ - b sk_load_word_negative_offset - -_GLOBAL_TOC(sk_load_word_negative_offset) - sk_negative_common(4) - LWZX_BE r_val, 0, r3 - blr - -bpf_slow_path_half_neg: - lis r_tmp, -32 /* SKF_LL_OFF */ - cmpd r_off, r_tmp /* addr < SKF_* */ - blt bpf_error /* cr0 = LT */ - b sk_load_half_negative_offset - -_GLOBAL_TOC(sk_load_half_negative_offset) - sk_negative_common(2) - LHZX_BE r_val, 0, r3 - blr - -bpf_slow_path_byte_neg: - lis r_tmp, -32 /* SKF_LL_OFF */ - cmpd r_off, r_tmp /* addr < SKF_* */ - blt bpf_error /* cr0 = LT */ - b sk_load_byte_negative_offset - -_GLOBAL_TOC(sk_load_byte_negative_offset) - sk_negative_common(1) - lbzx r_val, 0, r3 - blr - -bpf_error_slow: - /* fabricate a cr0 = lt */ - li r_tmp, -1 - cmpdi r_tmp, 0 -bpf_error: - /* - * Entered with cr0 = lt - * Generated code will 'blt epilogue', returning 0. - */ - li r_val, 0 - blr diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c index c504d5bc7d43..1d213bdb393c 100644 --- a/arch/powerpc/net/bpf_jit_comp64.c +++ b/arch/powerpc/net/bpf_jit_comp64.c @@ -59,7 +59,7 @@ static inline bool bpf_has_stack_frame(struct codegen_context *ctx) * [ prev sp ] <------------- * [ ... ] | * sp (r1) ---> [ stack pointer ] -------------- - * [ nv gpr save area ] 8*8 + * [ nv gpr save area ] 6*8 * [ tail_call_cnt ] 8 * [ local_tmp_var ] 8 * [ unused red zone ] 208 bytes protected @@ -87,21 +87,6 @@ static int bpf_jit_stack_offsetof(struct codegen_context *ctx, int reg) BUG(); } -static void bpf_jit_emit_skb_loads(u32 *image, struct codegen_context *ctx) -{ - /* - * Load skb->len and skb->data_len - * r3 points to skb - */ - PPC_LWZ(b2p[SKB_HLEN_REG], 3, offsetof(struct sk_buff, len)); - PPC_LWZ(b2p[TMP_REG_1], 3, offsetof(struct sk_buff, data_len)); - /* header_len = len - data_len */ - PPC_SUB(b2p[SKB_HLEN_REG], b2p[SKB_HLEN_REG], b2p[TMP_REG_1]); - - /* skb->data pointer */ - PPC_BPF_LL(b2p[SKB_DATA_REG], 3, offsetof(struct sk_buff, data)); -} - static void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx) { int i; @@ -144,18 +129,6 @@ static void bpf_jit_build_prologue(u32 *image, struct codegen_context *ctx) if (bpf_is_seen_register(ctx, i)) PPC_BPF_STL(b2p[i], 1, bpf_jit_stack_offsetof(ctx, b2p[i])); - /* - * Save additional non-volatile regs if we cache skb - * Also, setup skb data - */ - if (ctx->seen & SEEN_SKB) { - PPC_BPF_STL(b2p[SKB_HLEN_REG], 1, - bpf_jit_stack_offsetof(ctx, b2p[SKB_HLEN_REG])); - PPC_BPF_STL(b2p[SKB_DATA_REG], 1, - bpf_jit_stack_offsetof(ctx, b2p[SKB_DATA_REG])); - bpf_jit_emit_skb_loads(image, ctx); - } - /* Setup frame pointer to point to the bpf stack area */ if (bpf_is_seen_register(ctx, BPF_REG_FP)) PPC_ADDI(b2p[BPF_REG_FP], 1, @@ -171,14 +144,6 @@ static void bpf_jit_emit_common_epilogue(u32 *image, struct codegen_context *ctx if (bpf_is_seen_register(ctx, i)) PPC_BPF_LL(b2p[i], 1, bpf_jit_stack_offsetof(ctx, b2p[i])); - /* Restore non-volatile registers used for skb cache */ - if (ctx->seen & SEEN_SKB) { - PPC_BPF_LL(b2p[SKB_HLEN_REG], 1, - bpf_jit_stack_offsetof(ctx, b2p[SKB_HLEN_REG])); - PPC_BPF_LL(b2p[SKB_DATA_REG], 1, - bpf_jit_stack_offsetof(ctx, b2p[SKB_DATA_REG])); - } - /* Tear down our stack frame */ if (bpf_has_stack_frame(ctx)) { PPC_ADDI(1, 1, BPF_PPC_STACKFRAME); @@ -199,7 +164,33 @@ static void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx) PPC_BLR(); } -static void bpf_jit_emit_func_call(u32 *image, struct codegen_context *ctx, u64 func) +static void bpf_jit_emit_func_call_hlp(u32 *image, struct codegen_context *ctx, + u64 func) +{ +#ifdef PPC64_ELF_ABI_v1 + /* func points to the function descriptor */ + PPC_LI64(b2p[TMP_REG_2], func); + /* Load actual entry point from function descriptor */ + PPC_BPF_LL(b2p[TMP_REG_1], b2p[TMP_REG_2], 0); + /* ... and move it to LR */ + PPC_MTLR(b2p[TMP_REG_1]); + /* + * Load TOC from function descriptor at offset 8. + * We can clobber r2 since we get called through a + * function pointer (so caller will save/restore r2) + * and since we don't use a TOC ourself. + */ + PPC_BPF_LL(2, b2p[TMP_REG_2], 8); +#else + /* We can clobber r12 */ + PPC_FUNC_ADDR(12, func); + PPC_MTLR(12); +#endif + PPC_BLRL(); +} + +static void bpf_jit_emit_func_call_rel(u32 *image, struct codegen_context *ctx, + u64 func) { unsigned int i, ctx_idx = ctx->idx; @@ -304,7 +295,7 @@ static int bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32 o /* Assemble the body code between the prologue & epilogue */ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, struct codegen_context *ctx, - u32 *addrs) + u32 *addrs, bool extra_pass) { const struct bpf_insn *insn = fp->insnsi; int flen = fp->len; @@ -319,8 +310,9 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image, u32 src_reg = b2p[insn[i].src_reg]; s16 off = insn[i].off; s32 imm = insn[i].imm; + bool func_addr_fixed; + u64 func_addr; u64 imm64; - u8 *func; u32 true_cond; u32 tmp_idx; @@ -652,6 +644,12 @@ emit_clear: } break; + /* + * BPF_ST NOSPEC (speculation barrier) + */ + case BPF_ST | BPF_NOSPEC: + break; + /* * BPF_ST(X) */ @@ -762,29 +760,22 @@ emit_clear: break; /* - * Call kernel helper + * Call kernel helper or bpf function */ case BPF_JMP | BPF_CALL: ctx->seen |= SEEN_FUNC; - func = (u8 *) __bpf_call_base + imm; - /* Save skb pointer if we need to re-cache skb data */ - if ((ctx->seen & SEEN_SKB) && - bpf_helper_changes_pkt_data(func)) - PPC_BPF_STL(3, 1, bpf_jit_stack_local(ctx)); - - bpf_jit_emit_func_call(image, ctx, (u64)func); + ret = bpf_jit_get_func_addr(fp, &insn[i], extra_pass, + &func_addr, &func_addr_fixed); + if (ret < 0) + return ret; + if (func_addr_fixed) + bpf_jit_emit_func_call_hlp(image, ctx, func_addr); + else + bpf_jit_emit_func_call_rel(image, ctx, func_addr); /* move return value from r3 to BPF_REG_0 */ PPC_MR(b2p[BPF_REG_0], 3); - - /* refresh skb cache */ - if ((ctx->seen & SEEN_SKB) && - bpf_helper_changes_pkt_data(func)) { - /* reload skb pointer to r3 */ - PPC_BPF_LL(3, 1, bpf_jit_stack_local(ctx)); - bpf_jit_emit_skb_loads(image, ctx); - } break; /* @@ -901,65 +892,6 @@ cond_branch: PPC_BCC(true_cond, addrs[i + 1 + off]); break; - /* - * Loads from packet header/data - * Assume 32-bit input value in imm and X (src_reg) - */ - - /* Absolute loads */ - case BPF_LD | BPF_W | BPF_ABS: - func = (u8 *)CHOOSE_LOAD_FUNC(imm, sk_load_word); - goto common_load_abs; - case BPF_LD | BPF_H | BPF_ABS: - func = (u8 *)CHOOSE_LOAD_FUNC(imm, sk_load_half); - goto common_load_abs; - case BPF_LD | BPF_B | BPF_ABS: - func = (u8 *)CHOOSE_LOAD_FUNC(imm, sk_load_byte); -common_load_abs: - /* - * Load from [imm] - * Load into r4, which can just be passed onto - * skb load helpers as the second parameter - */ - PPC_LI32(4, imm); - goto common_load; - - /* Indirect loads */ - case BPF_LD | BPF_W | BPF_IND: - func = (u8 *)sk_load_word; - goto common_load_ind; - case BPF_LD | BPF_H | BPF_IND: - func = (u8 *)sk_load_half; - goto common_load_ind; - case BPF_LD | BPF_B | BPF_IND: - func = (u8 *)sk_load_byte; -common_load_ind: - /* - * Load from [src_reg + imm] - * Treat src_reg as a 32-bit value - */ - PPC_EXTSW(4, src_reg); - if (imm) { - if (imm >= -32768 && imm < 32768) - PPC_ADDI(4, 4, IMM_L(imm)); - else { - PPC_LI32(b2p[TMP_REG_1], imm); - PPC_ADD(4, 4, b2p[TMP_REG_1]); - } - } - -common_load: - ctx->seen |= SEEN_SKB; - ctx->seen |= SEEN_FUNC; - bpf_jit_emit_func_call(image, ctx, (u64)func); - - /* - * Helper returns 'lt' condition on error, and an - * appropriate return value in BPF_REG_0 - */ - PPC_BCC(COND_LT, exit_addr); - break; - /* * Tail call */ @@ -988,6 +920,14 @@ common_load: return 0; } +struct powerpc64_jit_data { + struct bpf_binary_header *header; + u32 *addrs; + u8 *image; + u32 proglen; + struct codegen_context ctx; +}; + struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) { u32 proglen; @@ -995,6 +935,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) u8 *image = NULL; u32 *code_base; u32 *addrs; + struct powerpc64_jit_data *jit_data; struct codegen_context cgctx; int pass; int flen; @@ -1002,8 +943,9 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) struct bpf_prog *org_fp = fp; struct bpf_prog *tmp_fp; bool bpf_blinded = false; + bool extra_pass = false; - if (!bpf_jit_enable) + if (!fp->jit_requested) return org_fp; tmp_fp = bpf_jit_blind_constants(org_fp); @@ -1015,20 +957,41 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) fp = tmp_fp; } + jit_data = fp->aux->jit_data; + if (!jit_data) { + jit_data = kzalloc(sizeof(*jit_data), GFP_KERNEL); + if (!jit_data) { + fp = org_fp; + goto out; + } + fp->aux->jit_data = jit_data; + } + flen = fp->len; + addrs = jit_data->addrs; + if (addrs) { + cgctx = jit_data->ctx; + image = jit_data->image; + bpf_hdr = jit_data->header; + proglen = jit_data->proglen; + alloclen = proglen + FUNCTION_DESCR_SIZE; + extra_pass = true; + goto skip_init_ctx; + } + addrs = kzalloc((flen+1) * sizeof(*addrs), GFP_KERNEL); if (addrs == NULL) { fp = org_fp; - goto out; + goto out_addrs; } memset(&cgctx, 0, sizeof(struct codegen_context)); /* Scouting faux-generate pass 0 */ - if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) { + if (bpf_jit_build_body(fp, 0, &cgctx, addrs, false)) { /* We hit something illegal or unsupported. */ fp = org_fp; - goto out; + goto out_addrs; } /* @@ -1046,9 +1009,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) bpf_jit_fill_ill_insns); if (!bpf_hdr) { fp = org_fp; - goto out; + goto out_addrs; } +skip_init_ctx: code_base = (u32 *)(image + FUNCTION_DESCR_SIZE); /* Code generation passes 1-2 */ @@ -1056,7 +1020,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) /* Now build the prologue, body code & epilogue for real. */ cgctx.idx = 0; bpf_jit_build_prologue(code_base, &cgctx); - bpf_jit_build_body(fp, code_base, &cgctx, addrs); + bpf_jit_build_body(fp, code_base, &cgctx, addrs, extra_pass); bpf_jit_build_epilogue(code_base, &cgctx); if (bpf_jit_enable > 1) @@ -1082,10 +1046,20 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp) fp->jited_len = alloclen; bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE)); + if (!fp->is_func || extra_pass) { +out_addrs: + kfree(addrs); + kfree(jit_data); + fp->aux->jit_data = NULL; + } else { + jit_data->addrs = addrs; + jit_data->ctx = cgctx; + jit_data->proglen = proglen; + jit_data->image = image; + jit_data->header = bpf_hdr; + } out: - kfree(addrs); - if (bpf_blinded) bpf_jit_prog_release_other(fp, fp == org_fp ? tmp_fp : org_fp); diff --git a/arch/s390/include/asm/atomic.h b/arch/s390/include/asm/atomic.h index 4b55532f15c4..c2858cdd8c29 100644 --- a/arch/s390/include/asm/atomic.h +++ b/arch/s390/include/asm/atomic.h @@ -90,7 +90,7 @@ static inline int atomic_cmpxchg(atomic_t *v, int old, int new) return __atomic_cmpxchg(&v->counter, old, new); } -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h index a6cc744ff5fb..3ffc56124725 100644 --- a/arch/s390/include/asm/mmu_context.h +++ b/arch/s390/include/asm/mmu_context.h @@ -44,6 +44,8 @@ static inline int init_new_context(struct task_struct *tsk, mm->context.asce_limit = STACK_TOP_MAX; mm->context.asce = __pa(mm->pgd) | _ASCE_TABLE_LENGTH | _ASCE_USER_BITS | _ASCE_TYPE_REGION3; + /* pgd_alloc() did not account this pud */ + mm_inc_nr_puds(mm); break; case -PAGE_SIZE: /* forked 5-level task, set new asce with new_mm->pgd */ @@ -59,7 +61,7 @@ static inline int init_new_context(struct task_struct *tsk, /* forked 2-level compat task, set new asce with new mm->pgd */ mm->context.asce = __pa(mm->pgd) | _ASCE_TABLE_LENGTH | _ASCE_USER_BITS | _ASCE_TYPE_SEGMENT; - /* pgd_alloc() did not increase mm->nr_pmds */ + /* pgd_alloc() did not account this pmd */ mm_inc_nr_pmds(mm); } crst_table_init((unsigned long *) mm->pgd, pgd_entry_type(mm)); diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c index 6b1003fdd05d..fbe672551993 100644 --- a/arch/s390/net/bpf_jit_comp.c +++ b/arch/s390/net/bpf_jit_comp.c @@ -931,6 +931,11 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, struct bpf_prog *fp, int i break; } break; + /* + * BPF_NOSPEC (speculation barrier) + */ + case BPF_ST | BPF_NOSPEC: + break; /* * BPF_ST(X) */ diff --git a/arch/sh/include/asm/atomic.h b/arch/sh/include/asm/atomic.h index 0fd0099f43cc..ef45931ebac5 100644 --- a/arch/sh/include/asm/atomic.h +++ b/arch/sh/include/asm/atomic.h @@ -46,7 +46,7 @@ #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n))) /** - * __atomic_add_unless - add unless the number is a given value + * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -54,7 +54,7 @@ * Atomically adds @a to @v, so long as it was not @u. * Returns the old value of @v. */ -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/arch/sparc/include/asm/atomic_32.h b/arch/sparc/include/asm/atomic_32.h index 0c3b3b4a9963..a139319a3d1d 100644 --- a/arch/sparc/include/asm/atomic_32.h +++ b/arch/sparc/include/asm/atomic_32.h @@ -27,7 +27,7 @@ int atomic_fetch_or(int, atomic_t *); int atomic_fetch_xor(int, atomic_t *); int atomic_cmpxchg(atomic_t *, int, int); int atomic_xchg(atomic_t *, int); -int __atomic_add_unless(atomic_t *, int, int); +int atomic_fetch_add_unless(atomic_t *, int, int); void atomic_set(atomic_t *, int); #define atomic_set_release(v, i) atomic_set((v), (i)) diff --git a/arch/sparc/include/asm/atomic_64.h b/arch/sparc/include/asm/atomic_64.h index 28db058d471b..f416fd3d2708 100644 --- a/arch/sparc/include/asm/atomic_64.h +++ b/arch/sparc/include/asm/atomic_64.h @@ -89,7 +89,7 @@ static inline int atomic_xchg(atomic_t *v, int new) return xchg(&v->counter, new); } -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/arch/sparc/lib/atomic32.c b/arch/sparc/lib/atomic32.c index 465a901a0ada..281fa634bb1a 100644 --- a/arch/sparc/lib/atomic32.c +++ b/arch/sparc/lib/atomic32.c @@ -95,7 +95,7 @@ int atomic_cmpxchg(atomic_t *v, int old, int new) } EXPORT_SYMBOL(atomic_cmpxchg); -int __atomic_add_unless(atomic_t *v, int a, int u) +int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int ret; unsigned long flags; @@ -107,7 +107,7 @@ int __atomic_add_unless(atomic_t *v, int a, int u) spin_unlock_irqrestore(ATOMIC_HASH(v), flags); return ret; } -EXPORT_SYMBOL(__atomic_add_unless); +EXPORT_SYMBOL(atomic_fetch_add_unless); /* Atomic operations are already serializing */ void atomic_set(atomic_t *v, int i) diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c index c572e573c540..3a2886fcc10b 100644 --- a/arch/sparc/mm/hugetlbpage.c +++ b/arch/sparc/mm/hugetlbpage.c @@ -397,7 +397,7 @@ static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, pmd_clear(pmd); pte_free_tlb(tlb, token, addr); - atomic_long_dec(&tlb->mm->nr_ptes); + mm_dec_nr_ptes(tlb->mm); } static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud, @@ -472,6 +472,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd, pud = pud_offset(pgd, start); pgd_clear(pgd); pud_free_tlb(tlb, pud, start); + mm_dec_nr_puds(tlb->mm); } void hugetlb_free_pgd_range(struct mmu_gather *tlb, diff --git a/arch/sparc/net/bpf_jit_comp_64.c b/arch/sparc/net/bpf_jit_comp_64.c index 85ae4b0d5fbc..1b2971e2c6ff 100644 --- a/arch/sparc/net/bpf_jit_comp_64.c +++ b/arch/sparc/net/bpf_jit_comp_64.c @@ -1317,6 +1317,9 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx) emit(opcode | RS1(src) | rs2 | RD(dst), ctx); break; } + /* speculation barrier */ + case BPF_ST | BPF_NOSPEC: + break; /* ST: *(size *)(dst + off) = imm */ case BPF_ST | BPF_MEM | BPF_W: case BPF_ST | BPF_MEM | BPF_H: diff --git a/arch/tile/include/asm/atomic_32.h b/arch/tile/include/asm/atomic_32.h index 53a423e7cb92..75759d13bf81 100644 --- a/arch/tile/include/asm/atomic_32.h +++ b/arch/tile/include/asm/atomic_32.h @@ -72,7 +72,7 @@ static inline int atomic_add_return(int i, atomic_t *v) } /** - * __atomic_add_unless - add unless the number is already a given value + * atomic_fetch_add_unless - add unless the number is already a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -80,7 +80,7 @@ static inline int atomic_add_return(int i, atomic_t *v) * Atomically adds @a to @v, so long as @v was not already @u. * Returns the old value of @v. */ -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { smp_mb(); /* barrier for proper semantics */ return _atomic_xchg_add_unless(&v->counter, a, u); diff --git a/arch/tile/include/asm/atomic_64.h b/arch/tile/include/asm/atomic_64.h index 4cefa0c9fd81..b72cfe4d29b0 100644 --- a/arch/tile/include/asm/atomic_64.h +++ b/arch/tile/include/asm/atomic_64.h @@ -97,7 +97,7 @@ static inline void atomic_xor(int i, atomic_t *v) } while (guess != oldval); } -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int guess, oldval = v->counter; do { diff --git a/arch/unicore32/mm/pgd.c b/arch/unicore32/mm/pgd.c index c572a28c76c9..a830a300aaa1 100644 --- a/arch/unicore32/mm/pgd.c +++ b/arch/unicore32/mm/pgd.c @@ -97,7 +97,7 @@ void free_pgd_slow(struct mm_struct *mm, pgd_t *pgd) pte = pmd_pgtable(*pmd); pmd_clear(pmd); pte_free(mm, pte); - atomic_long_dec(&mm->nr_ptes); + mm_dec_nr_ptes(mm); pmd_free(mm, pmd); mm_dec_nr_pmds(mm); free: diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index bdcd2eef9201..35b39fc581d1 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -60,6 +60,7 @@ config X86 select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 select ARCH_HAS_SET_MEMORY select ARCH_HAS_SG_CHAIN + select ARCH_HAS_SET_DIRECT_MAP select ARCH_HAS_STRICT_KERNEL_RWX select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_UBSAN_SANITIZE_ALL @@ -155,6 +156,7 @@ config X86 select HAVE_KERNEL_XZ select HAVE_KPROBES select HAVE_KPROBES_ON_FTRACE + select HAVE_FUNCTION_ERROR_INJECTION select HAVE_KRETPROBES select HAVE_KVM select HAVE_LIVEPATCH if X86_64 diff --git a/arch/x86/include/asm/atomic.h b/arch/x86/include/asm/atomic.h index 46ca345cb674..2ecac924f4d9 100644 --- a/arch/x86/include/asm/atomic.h +++ b/arch/x86/include/asm/atomic.h @@ -254,7 +254,7 @@ static inline int arch_atomic_fetch_xor(int i, atomic_t *v) } /** - * __arch_atomic_add_unless - add unless the number is already a given value + * arch_atomic_fetch_add_unless - add unless the number is already a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -262,7 +262,7 @@ static inline int arch_atomic_fetch_xor(int i, atomic_t *v) * Atomically adds @a to @v, so long as @v was not already @u. * Returns the old value of @v. */ -static __always_inline int __arch_atomic_add_unless(atomic_t *v, int a, int u) +static __always_inline int arch_atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c = arch_atomic_read(v); diff --git a/arch/x86/include/asm/error-injection.h b/arch/x86/include/asm/error-injection.h new file mode 100644 index 000000000000..47b7a1296245 --- /dev/null +++ b/arch/x86/include/asm/error-injection.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_ERROR_INJECTION_H +#define _ASM_ERROR_INJECTION_H + +#include +#include +#include +#include + +asmlinkage void just_return_func(void); +void override_function_with_return(struct pt_regs *regs); + +#endif /* _ASM_ERROR_INJECTION_H */ diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h index 6cf65437b5e5..c6c3b1f4306a 100644 --- a/arch/x86/include/asm/kprobes.h +++ b/arch/x86/include/asm/kprobes.h @@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size; void arch_remove_kprobe(struct kprobe *p); asmlinkage void kretprobe_trampoline(void); +#ifdef CONFIG_KPROBES_ON_FTRACE +extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs); +#endif + /* Architecture specific copy of original instruction*/ struct arch_specific_insn { /* copy of the original instruction */ diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 8603d127f73c..ee696efec99f 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -109,6 +109,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs) return regs->ax; } +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc) +{ + regs->ax = rc; +} + /* * user_mode(regs) determines whether a register set came from user * mode. On x86_32, this is true if V8086 mode was enabled OR if the diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h index bd090367236c..9abd49698129 100644 --- a/arch/x86/include/asm/set_memory.h +++ b/arch/x86/include/asm/set_memory.h @@ -84,6 +84,9 @@ int set_pages_nx(struct page *page, int numpages); int set_pages_ro(struct page *page, int numpages); int set_pages_rw(struct page *page, int numpages); +int set_direct_map_invalid_noflush(struct page *page); +int set_direct_map_default_noflush(struct page *page); + extern int kernel_set_to_readonly; void set_kernel_text_rw(void); void set_kernel_text_ro(void); diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c index bcfee4f69b0e..53deb0b23078 100644 --- a/arch/x86/kernel/kprobes/ftrace.c +++ b/arch/x86/kernel/kprobes/ftrace.c @@ -102,3 +102,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p) p->ainsn.boostable = false; return 0; } + +asmlinkage void override_func(void); +asm( + ".type override_func, @function\n" + "override_func:\n" + " ret\n" + ".size override_func, .-override_func\n" +); + +void arch_ftrace_kprobe_override_function(struct pt_regs *regs) +{ + regs->ip = (unsigned long)&override_func; +} +NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e41fd810cfa5..6f5250680405 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1610,7 +1610,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr) raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags); offset = kvm_compute_tsc_offset(vcpu, data); - ns = ktime_get_boot_ns(); + ns = ktime_get_boottime_ns(); elapsed = ns - kvm->arch.last_tsc_nsec; if (vcpu->arch.virtual_tsc_khz) { @@ -1927,7 +1927,7 @@ u64 get_kvmclock_ns(struct kvm *kvm) spin_lock(&ka->pvclock_gtod_sync_lock); if (!ka->use_master_clock) { spin_unlock(&ka->pvclock_gtod_sync_lock); - return ktime_get_boot_ns() + ka->kvmclock_offset; + return ktime_get_boottime_ns() + ka->kvmclock_offset; } hv_clock.tsc_timestamp = ka->master_cycle_now; @@ -1943,7 +1943,7 @@ u64 get_kvmclock_ns(struct kvm *kvm) &hv_clock.tsc_to_system_mul); ret = __pvclock_read_cycles(&hv_clock, rdtsc()); } else - ret = ktime_get_boot_ns() + ka->kvmclock_offset; + ret = ktime_get_boottime_ns() + ka->kvmclock_offset; put_cpu(); @@ -2042,7 +2042,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v) } if (!use_master_clock) { host_tsc = rdtsc(); - kernel_ns = ktime_get_boot_ns(); + kernel_ns = ktime_get_boottime_ns(); } tsc_timestamp = kvm_read_l1_tsc(v, host_tsc); @@ -8172,7 +8172,7 @@ int kvm_arch_hardware_enable(void) * before any KVM threads can be running. Unfortunately, we can't * bring the TSCs fully up to date with real time, as we aren't yet far * enough into CPU bringup that we know how much real time has actually - * elapsed; our helper function, ktime_get_boot_ns() will be using boot + * elapsed; our helper function, ktime_get_boottime_ns() will be using boot * variables that haven't been updated yet. * * So we simply find the maximum observed TSC above, then record the @@ -8411,7 +8411,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) mutex_init(&kvm->arch.hyperv.hv_lock); spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock); - kvm->arch.kvmclock_offset = -ktime_get_boot_ns(); + kvm->arch.kvmclock_offset = -ktime_get_boottime_ns(); pvclock_update_vm_gtod_copy(kvm); INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn); diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile index 60b410ff31e8..55f70309be9c 100644 --- a/arch/x86/lib/Makefile +++ b/arch/x86/lib/Makefile @@ -38,6 +38,7 @@ lib-y += memcpy_$(BITS).o lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o +lib-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o lib-$(CONFIG_RETPOLINE) += retpoline.o obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o diff --git a/arch/x86/lib/error-inject.c b/arch/x86/lib/error-inject.c new file mode 100644 index 000000000000..7b881d03d0dd --- /dev/null +++ b/arch/x86/lib/error-inject.c @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include + +asmlinkage void just_return_func(void); + +asm( + ".type just_return_func, @function\n" + "just_return_func:\n" + " ret\n" + ".size just_return_func, .-just_return_func\n" +); + +void override_function_with_return(struct pt_regs *regs) +{ + regs->ip = (unsigned long)&just_return_func; +} +NOKPROBE_SYMBOL(override_function_with_return); diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c index eaee1a7ed0b5..da9b4c91b04b 100644 --- a/arch/x86/mm/pageattr.c +++ b/arch/x86/mm/pageattr.c @@ -1976,8 +1976,6 @@ int set_pages_rw(struct page *page, int numpages) return set_memory_rw(addr, numpages); } -#ifdef CONFIG_DEBUG_PAGEALLOC - static int __set_pages_p(struct page *page, int numpages) { unsigned long tempaddr = (unsigned long) page_address(page); @@ -2016,6 +2014,17 @@ static int __set_pages_np(struct page *page, int numpages) return __change_page_attr_set_clr(&cpa, 0); } +int set_direct_map_invalid_noflush(struct page *page) +{ + return __set_pages_np(page, 1); +} + +int set_direct_map_default_noflush(struct page *page) +{ + return __set_pages_p(page, 1); +} + +#ifdef CONFIG_DEBUG_PAGEALLOC void __kernel_map_pages(struct page *page, int numpages, int enable) { if (PageHighMem(page)) @@ -2049,7 +2058,6 @@ void __kernel_map_pages(struct page *page, int numpages, int enable) } #ifdef CONFIG_HIBERNATION - bool kernel_page_present(struct page *page) { unsigned int level; diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 84e5c5bdaa74..e5b62984483e 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -114,13 +114,12 @@ static inline void pgd_list_del(pgd_t *pgd) static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm) { - BUILD_BUG_ON(sizeof(virt_to_page(pgd)->index) < sizeof(mm)); - virt_to_page(pgd)->index = (pgoff_t)mm; + virt_to_page(pgd)->pt_mm = mm; } struct mm_struct *pgd_page_get_mm(struct page *page) { - return (struct mm_struct *)page->index; + return page->pt_mm; } static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd) diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c index 639f56dc626a..73faaf6e6852 100644 --- a/arch/x86/mm/pti.c +++ b/arch/x86/mm/pti.c @@ -1,15 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2017 Intel Corporation. All rights reserved. * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * * This code is based in part on work published here: * * https://github.com/IAIK/KAISER diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 94b4d9d869d5..e9f20eafafbc 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -750,6 +750,13 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, } break; + /* speculation barrier */ + case BPF_ST | BPF_NOSPEC: + if (boot_cpu_has(X86_FEATURE_XMM2)) + /* Emit 'lfence' */ + EMIT3(0x0F, 0xAE, 0xE8); + break; + /* ST: *(u8*)(dst_reg + off) = imm */ case BPF_ST | BPF_MEM | BPF_B: if (is_ereg(dst_reg)) @@ -1197,8 +1204,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) addrs[i] = proglen; } ctx.cleanup_addr = proglen; - skip_init_addrs: + /* JITed image shrinks with every pass and the loop iterates * until the image stops shrinking. Very large bpf programs * may converge on the last pass. In such case do one more diff --git a/arch/xtensa/include/asm/atomic.h b/arch/xtensa/include/asm/atomic.h index e7a23f2a519a..4188e56c06c9 100644 --- a/arch/xtensa/include/asm/atomic.h +++ b/arch/xtensa/include/asm/atomic.h @@ -275,7 +275,7 @@ ATOMIC_OPS(xor) #define atomic_xchg(v, new) (xchg(&((v)->counter), new)) /** - * __atomic_add_unless - add unless the number is a given value + * atomic_fetch_add_unless - add unless the number is a given value * @v: pointer of type atomic_t * @a: the amount to add to v... * @u: ...unless v is equal to u. @@ -283,7 +283,7 @@ ATOMIC_OPS(xor) * Atomically adds @a to @v, so long as it was not @u. * Returns the old value of @v. */ -static __inline__ int __atomic_add_unless(atomic_t *v, int a, int u) +static __inline__ int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/crypto/lrw.c b/crypto/lrw.c index 1b73fec817cf..79b5bc97f5c8 100644 --- a/crypto/lrw.c +++ b/crypto/lrw.c @@ -189,7 +189,7 @@ static int post_crypt(struct skcipher_request *req) if (rctx->dst != sg) { rctx->dst[0] = *sg; sg_unmark_end(rctx->dst); - scatterwalk_crypto_chain(rctx->dst, sg_next(sg), 0, 2); + scatterwalk_crypto_chain(rctx->dst, sg_next(sg), 2); } rctx->dst[0].length -= offset - sg->offset; rctx->dst[0].offset = offset; @@ -266,7 +266,7 @@ static int pre_crypt(struct skcipher_request *req) if (rctx->src != sg) { rctx->src[0] = *sg; sg_unmark_end(rctx->src); - scatterwalk_crypto_chain(rctx->src, sg_next(sg), 0, 2); + scatterwalk_crypto_chain(rctx->src, sg_next(sg), 2); } rctx->src[0].length -= offset - sg->offset; rctx->src[0].offset = offset; diff --git a/crypto/scatterwalk.c b/crypto/scatterwalk.c index c16c94f88733..d0b92c1cd6e9 100644 --- a/crypto/scatterwalk.c +++ b/crypto/scatterwalk.c @@ -91,7 +91,7 @@ struct scatterlist *scatterwalk_ffwd(struct scatterlist dst[2], sg_init_table(dst, 2); sg_set_page(dst, sg_page(src), src->length - len, src->offset + len); - scatterwalk_crypto_chain(dst, sg_next(src), 0, 2); + scatterwalk_crypto_chain(dst, sg_next(src), 2); return dst; } diff --git a/crypto/xts.c b/crypto/xts.c index f5fba941d6f6..123552522b80 100644 --- a/crypto/xts.c +++ b/crypto/xts.c @@ -138,7 +138,7 @@ static int post_crypt(struct skcipher_request *req) if (rctx->dst != sg) { rctx->dst[0] = *sg; sg_unmark_end(rctx->dst); - scatterwalk_crypto_chain(rctx->dst, sg_next(sg), 0, 2); + scatterwalk_crypto_chain(rctx->dst, sg_next(sg), 2); } rctx->dst[0].length -= offset - sg->offset; rctx->dst[0].offset = offset; @@ -204,7 +204,7 @@ static int pre_crypt(struct skcipher_request *req) if (rctx->src != sg) { rctx->src[0] = *sg; sg_unmark_end(rctx->src); - scatterwalk_crypto_chain(rctx->src, sg_next(sg), 0, 2); + scatterwalk_crypto_chain(rctx->src, sg_next(sg), 2); } rctx->src[0].length -= offset - sg->offset; rctx->src[0].offset = offset; diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 41e1b135cf97..ab368bdee05a 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c index feeb95d574fa..1516f357a18d 100644 --- a/drivers/acpi/nfit/mce.c +++ b/drivers/acpi/nfit/mce.c @@ -1,16 +1,8 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * NFIT - Machine Check Handler * * Copyright(c) 2013-2016 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/acpi/nfit/nfit.h b/drivers/acpi/nfit/nfit.h index 54292db61262..9deeb15efd91 100644 --- a/drivers/acpi/nfit/nfit.h +++ b/drivers/acpi/nfit/nfit.h @@ -1,16 +1,8 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * NVDIMM Firmware Interface Table - NFIT * * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef __NFIT_H__ #define __NFIT_H__ diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 859d227504f7..45928e4da2b2 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -60,7 +60,7 @@ static int atomic_inc_return_safe(atomic_t *v) { unsigned int counter; - counter = (unsigned int)__atomic_add_unless(v, 1, 0); + counter = (unsigned int)atomic_fetch_add_unless(v, 1, 0); if (counter <= (unsigned int)INT_MAX) return (int)counter; diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h index b6fc4f04636d..ea569dc0ba94 100644 --- a/drivers/dax/dax-private.h +++ b/drivers/dax/dax-private.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright(c) 2016 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef __DAX_PRIVATE_H__ #define __DAX_PRIVATE_H__ diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 6c179c2a9ff9..1baa8ed6bb7b 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2017 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/firmware/efi/efi-pstore.c b/drivers/firmware/efi/efi-pstore.c index 4277147f7140..0f7d97917197 100644 --- a/drivers/firmware/efi/efi-pstore.c +++ b/drivers/firmware/efi/efi-pstore.c @@ -28,10 +28,9 @@ static int efi_pstore_close(struct pstore_info *psi) return 0; } -static inline u64 generic_id(unsigned long timestamp, - unsigned int part, int count) +static inline u64 generic_id(u64 timestamp, unsigned int part, int count) { - return ((u64) timestamp * 100 + part) * 1000 + count; + return (timestamp * 100 + part) * 1000 + count; } static int efi_pstore_read_func(struct efivar_entry *entry, @@ -42,7 +41,8 @@ static int efi_pstore_read_func(struct efivar_entry *entry, int i; int cnt; unsigned int part; - unsigned long time, size; + unsigned long size; + u64 time; if (efi_guidcmp(entry->var.VendorGuid, vendor)) return 0; @@ -50,7 +50,7 @@ static int efi_pstore_read_func(struct efivar_entry *entry, for (i = 0; i < DUMP_NAME_LEN; i++) name[i] = entry->var.VariableName[i]; - if (sscanf(name, "dump-type%u-%u-%d-%lu-%c", + if (sscanf(name, "dump-type%u-%u-%d-%llu-%c", &record->type, &part, &cnt, &time, &data_type) == 5) { record->id = generic_id(time, part, cnt); record->part = part; @@ -62,7 +62,7 @@ static int efi_pstore_read_func(struct efivar_entry *entry, else record->compressed = false; record->ecc_notice_size = 0; - } else if (sscanf(name, "dump-type%u-%u-%d-%lu", + } else if (sscanf(name, "dump-type%u-%u-%d-%llu", &record->type, &part, &cnt, &time) == 4) { record->id = generic_id(time, part, cnt); record->part = part; @@ -71,7 +71,7 @@ static int efi_pstore_read_func(struct efivar_entry *entry, record->time.tv_nsec = 0; record->compressed = false; record->ecc_notice_size = 0; - } else if (sscanf(name, "dump-type%u-%u-%lu", + } else if (sscanf(name, "dump-type%u-%u-%llu", &record->type, &part, &time) == 3) { /* * Check if an old format, @@ -250,9 +250,10 @@ static int efi_pstore_write(struct pstore_record *record) /* Since we copy the entire length of name, make sure it is wiped. */ memset(name, 0, sizeof(name)); - snprintf(name, sizeof(name), "dump-type%u-%u-%d-%lu-%c", + snprintf(name, sizeof(name), "dump-type%u-%u-%d-%lld-%c", record->type, record->part, record->count, - record->time.tv_sec, record->compressed ? 'C' : 'D'); + (long long)record->time.tv_sec, + record->compressed ? 'C' : 'D'); for (i = 0; i < DUMP_NAME_LEN; i++) efi_name[i] = name[i]; @@ -326,15 +327,15 @@ static int efi_pstore_erase(struct pstore_record *record) char name[DUMP_NAME_LEN]; int ret; - snprintf(name, sizeof(name), "dump-type%u-%u-%d-%lu", + snprintf(name, sizeof(name), "dump-type%u-%u-%d-%lld", record->type, record->part, record->count, - record->time.tv_sec); + (long long)record->time.tv_sec); ret = efi_pstore_erase_name(name); if (ret != -ENOENT) return ret; - snprintf(name, sizeof(name), "dump-type%u-%u-%lu", - record->type, record->part, record->time.tv_sec); + snprintf(name, sizeof(name), "dump-type%u-%u-%lld", + record->type, record->part, (long long)record->time.tv_sec); ret = efi_pstore_erase_name(name); return ret; diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index 8a05efa7edf0..e0d6f0d598dc 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -713,7 +713,6 @@ static int kfd_ioctl_get_clock_counters(struct file *filep, { struct kfd_ioctl_get_clock_counters_args *args = data; struct kfd_dev *dev; - struct timespec64 time; dev = kfd_device_by_id(args->gpu_id); if (dev) @@ -725,11 +724,8 @@ static int kfd_ioctl_get_clock_counters(struct file *filep, args->gpu_clock_counter = 0; /* No access to rdtsc. Using raw monotonic time */ - getrawmonotonic64(&time); - args->cpu_clock_counter = (uint64_t)timespec64_to_ns(&time); - - get_monotonic_boottime64(&time); - args->system_clock_counter = (uint64_t)timespec64_to_ns(&time); + args->cpu_clock_counter = ktime_get_raw_ns(); + args->system_clock_counter = ktime_get_boottime_ns(); /* Since the counter is in nano-seconds we use 1GHz frequency */ args->system_clock_freq = 1000000000; diff --git a/drivers/gpu/drm/msm/sde/sde_trace.h b/drivers/gpu/drm/msm/sde/sde_trace.h index 61807d572272..dff0d27ac127 100644 --- a/drivers/gpu/drm/msm/sde/sde_trace.h +++ b/drivers/gpu/drm/msm/sde/sde_trace.h @@ -166,7 +166,7 @@ TRACE_EVENT(tracing_mark_write, #define SDE_TRACE_EVTLOG_SIZE 15 TRACE_EVENT(sde_evtlog, - TP_PROTO(const char *tag, u32 tag_id, u32 cnt, u32 data[]), + TP_PROTO(const char *tag, u32 tag_id, u32 cnt, u32 *data), TP_ARGS(tag, tag_id, cnt, data), TP_STRUCT__entry( __field(int, pid) diff --git a/drivers/iio/humidity/dht11.c b/drivers/iio/humidity/dht11.c index 2a22ad920333..43e502fcf199 100644 --- a/drivers/iio/humidity/dht11.c +++ b/drivers/iio/humidity/dht11.c @@ -158,8 +158,8 @@ static int dht11_decode(struct dht11 *dht11, int offset) return -EIO; } - dht11->timestamp = ktime_get_boot_ns(); - if (hum_int < 20) { /* DHT22 */ + dht11->timestamp = ktime_get_boottime_ns(); + if (hum_int < 4) { /* DHT22: 100000 = (3*256+232)*100 */ dht11->temperature = (((temp_int & 0x7f) << 8) + temp_dec) * ((temp_int & 0x80) ? -100 : 100); dht11->humidity = ((hum_int << 8) + hum_dec) * 100; @@ -186,7 +186,7 @@ static irqreturn_t dht11_handle_irq(int irq, void *data) /* TODO: Consider making the handler safe for IRQ sharing */ if (dht11->num_edges < DHT11_EDGES_PER_READ && dht11->num_edges >= 0) { - dht11->edges[dht11->num_edges].ts = ktime_get_boot_ns(); + dht11->edges[dht11->num_edges].ts = ktime_get_boottime_ns(); dht11->edges[dht11->num_edges++].value = gpio_get_value(dht11->gpio); @@ -205,7 +205,7 @@ static int dht11_read_raw(struct iio_dev *iio_dev, int ret, timeres, offset; mutex_lock(&dht11->lock); - if (dht11->timestamp + DHT11_DATA_VALID_TIME < ktime_get_boot_ns()) { + if (dht11->timestamp + DHT11_DATA_VALID_TIME < ktime_get_boottime_ns()) { timeres = ktime_get_resolution_ns(); dev_dbg(dht11->dev, "current timeresolution: %dns\n", timeres); if (timeres > DHT11_MIN_TIMERES) { @@ -332,7 +332,7 @@ static int dht11_probe(struct platform_device *pdev) return -EINVAL; } - dht11->timestamp = ktime_get_boot_ns() - DHT11_DATA_VALID_TIME - 1; + dht11->timestamp = ktime_get_boottime_ns() - DHT11_DATA_VALID_TIME - 1; dht11->num_edges = -1; platform_set_drvdata(pdev, iio); diff --git a/drivers/iio/industrialio-core.c b/drivers/iio/industrialio-core.c index 12d73ebcadfa..298afc096a24 100644 --- a/drivers/iio/industrialio-core.c +++ b/drivers/iio/industrialio-core.c @@ -207,35 +207,27 @@ static int iio_device_set_clock(struct iio_dev *indio_dev, clockid_t clock_id) */ s64 iio_get_time_ns(const struct iio_dev *indio_dev) { - struct timespec tp; + struct timespec64 tp; switch (iio_device_get_clock(indio_dev)) { case CLOCK_REALTIME: - ktime_get_real_ts(&tp); - break; + return ktime_get_real_ns(); case CLOCK_MONOTONIC: - ktime_get_ts(&tp); - break; + return ktime_get_ns(); case CLOCK_MONOTONIC_RAW: - getrawmonotonic(&tp); - break; + return ktime_get_raw_ns(); case CLOCK_REALTIME_COARSE: - tp = current_kernel_time(); - break; + return ktime_to_ns(ktime_get_coarse_real()); case CLOCK_MONOTONIC_COARSE: - tp = get_monotonic_coarse(); - break; + ktime_get_coarse_ts64(&tp); + return timespec64_to_ns(&tp); case CLOCK_BOOTTIME: - get_monotonic_boottime(&tp); - break; + return ktime_get_boottime_ns(); case CLOCK_TAI: - timekeeping_clocktai(&tp); - break; + return ktime_get_clocktai_ns(); default: BUG(); } - - return timespec_to_ns(&tp); } EXPORT_SYMBOL(iio_get_time_ns); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index dd00530675d0..f5aa1d65ff4d 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1330,7 +1330,7 @@ static bool validate_ipv6_net_dev(struct net_device *net_dev, IPV6_ADDR_LINKLOCAL; struct rt6_info *rt = rt6_lookup(dev_net(net_dev), &dst_addr->sin6_addr, &src_addr->sin6_addr, net_dev->ifindex, - strict); + NULL, strict); bool ret; if (!rt) diff --git a/drivers/infiniband/core/rdma_core.c b/drivers/infiniband/core/rdma_core.c index 1984d6cee3e0..39bc14f68d70 100644 --- a/drivers/infiniband/core/rdma_core.c +++ b/drivers/infiniband/core/rdma_core.c @@ -121,7 +121,7 @@ static int uverbs_try_lock_object(struct ib_uobject *uobj, bool exclusive) * this lock. */ if (!exclusive) - return __atomic_add_unless(&uobj->usecnt, 1, -1) == -1 ? + return atomic_fetch_add_unless(&uobj->usecnt, 1, -1) == -1 ? -EBUSY : 0; /* lock is either WRITE or DESTROY - should be exclusive */ diff --git a/drivers/infiniband/hw/mlx4/alias_GUID.c b/drivers/infiniband/hw/mlx4/alias_GUID.c index f2d975c2659d..620a4075bcb4 100644 --- a/drivers/infiniband/hw/mlx4/alias_GUID.c +++ b/drivers/infiniband/hw/mlx4/alias_GUID.c @@ -310,7 +310,7 @@ static void aliasguid_query_handler(int status, if (status) { pr_debug("(port: %d) failed: status = %d\n", cb_ctx->port, status); - rec->time_to_run = ktime_get_boot_ns() + 1 * NSEC_PER_SEC; + rec->time_to_run = ktime_get_boottime_ns() + 1 * NSEC_PER_SEC; goto out; } @@ -416,7 +416,7 @@ next_entry: be64_to_cpu((__force __be64)rec->guid_indexes), be64_to_cpu((__force __be64)applied_guid_indexes), be64_to_cpu((__force __be64)declined_guid_indexes)); - rec->time_to_run = ktime_get_boot_ns() + + rec->time_to_run = ktime_get_boottime_ns() + resched_delay_sec * NSEC_PER_SEC; } else { rec->status = MLX4_GUID_INFO_STATUS_SET; @@ -709,7 +709,7 @@ static int get_low_record_time_index(struct mlx4_ib_dev *dev, u8 port, } } if (resched_delay_sec) { - u64 curr_time = ktime_get_boot_ns(); + u64 curr_time = ktime_get_boottime_ns(); *resched_delay_sec = (low_record_time < curr_time) ? 0 : div_u64((low_record_time - curr_time), NSEC_PER_SEC); diff --git a/drivers/media/cec/cec-adap.c b/drivers/media/cec/cec-adap.c index 2765ca732b95..852b184ccaac 100644 --- a/drivers/media/cec/cec-adap.c +++ b/drivers/media/cec/cec-adap.c @@ -1790,9 +1790,6 @@ static int cec_receive_notify(struct cec_adapter *adap, struct cec_msg *msg, int la_idx = cec_log_addr2idx(adap, dest_laddr); bool from_unregistered = init_laddr == 0xf; struct cec_msg tx_cec_msg = { }; -#ifdef CONFIG_MEDIA_CEC_RC - int scancode; -#endif dprintk(2, "%s: %*ph\n", __func__, msg->len, msg->msg); @@ -1888,9 +1885,11 @@ static int cec_receive_notify(struct cec_adapter *adap, struct cec_msg *msg, */ case 0x60: if (msg->len == 2) - scancode = msg->msg[2]; + rc_keydown(adap->rc, RC_PROTO_CEC, + msg->msg[2], 0); else - scancode = msg->msg[2] << 8 | msg->msg[3]; + rc_keydown(adap->rc, RC_PROTO_CEC, + msg->msg[2] << 8 | msg->msg[3], 0); break; /* * Other function messages that are not handled. @@ -1903,54 +1902,11 @@ static int cec_receive_notify(struct cec_adapter *adap, struct cec_msg *msg, */ case 0x56: case 0x57: case 0x67: case 0x68: case 0x69: case 0x6a: - scancode = -1; break; default: - scancode = msg->msg[2]; + rc_keydown(adap->rc, RC_PROTO_CEC, msg->msg[2], 0); break; } - - /* Was repeating, but keypress timed out */ - if (adap->rc_repeating && !adap->rc->keypressed) { - adap->rc_repeating = false; - adap->rc_last_scancode = -1; - } - /* Different keypress from last time, ends repeat mode */ - if (adap->rc_last_scancode != scancode) { - rc_keyup(adap->rc); - adap->rc_repeating = false; - } - /* We can't handle this scancode */ - if (scancode < 0) { - adap->rc_last_scancode = scancode; - break; - } - - /* Send key press */ - rc_keydown(adap->rc, RC_PROTO_CEC, scancode, 0); - - /* When in repeating mode, we're done */ - if (adap->rc_repeating) - break; - - /* - * We are not repeating, but the new scancode is - * the same as the last one, and this second key press is - * within 550 ms (the 'Follower Safety Timeout') from the - * previous key press, so we now enable the repeating mode. - */ - if (adap->rc_last_scancode == scancode && - msg->rx_ts - adap->rc_last_keypress < 550 * NSEC_PER_MSEC) { - adap->rc_repeating = true; - break; - } - /* - * Not in repeating mode, so avoid triggering repeat mode - * by calling keyup. - */ - rc_keyup(adap->rc); - adap->rc_last_scancode = scancode; - adap->rc_last_keypress = msg->rx_ts; #endif break; @@ -1960,8 +1916,6 @@ static int cec_receive_notify(struct cec_adapter *adap, struct cec_msg *msg, break; #ifdef CONFIG_MEDIA_CEC_RC rc_keyup(adap->rc); - adap->rc_repeating = false; - adap->rc_last_scancode = -1; #endif break; diff --git a/drivers/media/cec/cec-core.c b/drivers/media/cec/cec-core.c index 648136e552d5..c0f8afaf3396 100644 --- a/drivers/media/cec/cec-core.c +++ b/drivers/media/cec/cec-core.c @@ -277,11 +277,9 @@ struct cec_adapter *cec_allocate_adapter(const struct cec_adap_ops *ops, adap->rc->input_id.version = 1; adap->rc->driver_name = CEC_NAME; adap->rc->allowed_protocols = RC_PROTO_BIT_CEC; - adap->rc->enabled_protocols = RC_PROTO_BIT_CEC; adap->rc->priv = adap; adap->rc->map_name = RC_MAP_CEC; - adap->rc->timeout = MS_TO_NS(100); - adap->rc_last_scancode = -1; + adap->rc->timeout = MS_TO_NS(550); #endif return adap; } @@ -313,17 +311,6 @@ int cec_register_adapter(struct cec_adapter *adap, adap->rc = NULL; return res; } - /* - * The REP_DELAY for CEC is really the time between the initial - * 'User Control Pressed' message and the second. The first - * keypress is always seen as non-repeating, the second - * (provided it has the same UI Command) will start the 'Press - * and Hold' (aka repeat) behavior. By setting REP_DELAY to the - * same value as REP_PERIOD the expected CEC behavior is - * reproduced. - */ - adap->rc->input_dev->rep[REP_DELAY] = - adap->rc->input_dev->rep[REP_PERIOD]; } #endif diff --git a/drivers/media/dvb-frontends/sp887x.c b/drivers/media/dvb-frontends/sp887x.c index 18352bd3c8c6..8b9618dcaa63 100644 --- a/drivers/media/dvb-frontends/sp887x.c +++ b/drivers/media/dvb-frontends/sp887x.c @@ -57,7 +57,7 @@ static int sp887x_writereg (struct sp887x_state* state, u16 reg, u16 data) int ret; if ((ret = i2c_transfer(state->i2c, &msg, 1)) != 1) { - /** + /* * in case of soft reset we ignore ACK errors... */ if (!(reg == 0xf1a && data == 0x000 && @@ -130,7 +130,7 @@ static void sp887x_setup_agc (struct sp887x_state* state) #define BLOCKSIZE 30 #define FW_SIZE 0x4000 -/** +/* * load firmware and setup MPEG interface... */ static int sp887x_initial_setup (struct dvb_frontend* fe, const struct firmware *fw) @@ -279,7 +279,7 @@ static int configure_reg0xc05(struct dtv_frontend_properties *p, u16 *reg0xc05) return 0; } -/** +/* * estimates division of two 24bit numbers, * derived from the ves1820/stv0299 driver code */ diff --git a/drivers/media/dvb-frontends/tua6100.c b/drivers/media/dvb-frontends/tua6100.c index 5541ffa4d2f3..e09b07ab6c1c 100644 --- a/drivers/media/dvb-frontends/tua6100.c +++ b/drivers/media/dvb-frontends/tua6100.c @@ -1,4 +1,4 @@ -/** +/* * Driver for Infineon tua6100 pll. * * (c) 2006 Andrew de Quincey diff --git a/drivers/media/dvb-frontends/zl10036.c b/drivers/media/dvb-frontends/zl10036.c index 6cd4fb8bc271..231c51acf22f 100644 --- a/drivers/media/dvb-frontends/zl10036.c +++ b/drivers/media/dvb-frontends/zl10036.c @@ -1,4 +1,4 @@ -/** +/* * Driver for Zarlink zl10036 DVB-S silicon tuner * * Copyright (C) 2006 Tino Reichardt @@ -157,7 +157,7 @@ static int zl10036_sleep(struct dvb_frontend *fe) return ret; } -/** +/* * register map of the ZL10036/ZL10038 * * reg[default] content @@ -219,7 +219,7 @@ static int zl10036_set_bandwidth(struct zl10036_state *state, u32 fbw) if (fbw <= 28820) { br = _BR_MAXIMUM; } else { - /** + /* * f(bw)=34,6MHz f(xtal)=10.111MHz * br = (10111/34600) * 63 * 1/K = 14; */ @@ -315,7 +315,7 @@ static int zl10036_set_params(struct dvb_frontend *fe) || (frequency > fe->ops.info.frequency_max)) return -EINVAL; - /** + /* * alpha = 1.35 for dvb-s * fBW = (alpha*symbolrate)/(2*0.8) * 1.35 / (2*0.8) = 27 / 32 diff --git a/drivers/media/i2c/ir-kbd-i2c.c b/drivers/media/i2c/ir-kbd-i2c.c index a374e2a0ac3d..8b5f7d0435e4 100644 --- a/drivers/media/i2c/ir-kbd-i2c.c +++ b/drivers/media/i2c/ir-kbd-i2c.c @@ -460,7 +460,6 @@ static int ir_probe(struct i2c_client *client, const struct i2c_device_id *id) */ rc->map_name = ir->ir_codes; rc->allowed_protocols = rc_proto; - rc->enabled_protocols = rc_proto; if (!rc->driver_name) rc->driver_name = MODULE_NAME; diff --git a/drivers/media/i2c/ov5647.c b/drivers/media/i2c/ov5647.c index 95ce90fdb876..210aa822399c 100644 --- a/drivers/media/i2c/ov5647.c +++ b/drivers/media/i2c/ov5647.c @@ -407,8 +407,8 @@ static int ov5647_sensor_set_register(struct v4l2_subdev *sd, } #endif -/** - * @short Subdev core operations registration +/* + * Subdev core operations registration */ static const struct v4l2_subdev_core_ops ov5647_subdev_core_ops = { .s_power = ov5647_sensor_power, diff --git a/drivers/media/pci/solo6x10/solo6x10-enc.c b/drivers/media/pci/solo6x10/solo6x10-enc.c index d28211bb9674..58d6b5131dd0 100644 --- a/drivers/media/pci/solo6x10/solo6x10-enc.c +++ b/drivers/media/pci/solo6x10/solo6x10-enc.c @@ -175,7 +175,7 @@ out: return 0; } -/** +/* * Set channel Quality Profile (0-3). */ void solo_s_jpeg_qp(struct solo_dev *solo_dev, unsigned int ch, diff --git a/drivers/media/platform/rcar_fdp1.c b/drivers/media/platform/rcar_fdp1.c index 5965e34e36cc..f37a6bfb0609 100644 --- a/drivers/media/platform/rcar_fdp1.c +++ b/drivers/media/platform/rcar_fdp1.c @@ -1134,7 +1134,7 @@ static int fdp1_device_process(struct fdp1_ctx *ctx) * mem2mem callbacks */ -/** +/* * job_ready() - check whether an instance is ready to be scheduled to run */ static int fdp1_m2m_job_ready(void *priv) diff --git a/drivers/media/platform/sh_veu.c b/drivers/media/platform/sh_veu.c index a4f593220ef0..80c1b80f6586 100644 --- a/drivers/media/platform/sh_veu.c +++ b/drivers/media/platform/sh_veu.c @@ -267,7 +267,7 @@ static void sh_veu_process(struct sh_veu_dev *veu, sh_veu_reg_write(veu, VEU_EIER, 1); /* enable interrupt in VEU */ } -/** +/* * sh_veu_device_run() - prepares and starts the device * * This will be called by the framework when it decides to schedule a particular diff --git a/drivers/media/platform/sti/hva/hva-h264.c b/drivers/media/platform/sti/hva/hva-h264.c index e6f247a983c7..d69c58211107 100644 --- a/drivers/media/platform/sti/hva/hva-h264.c +++ b/drivers/media/platform/sti/hva/hva-h264.c @@ -134,7 +134,7 @@ enum hva_h264_sei_payload_type { SEI_FRAME_PACKING_ARRANGEMENT = 45 }; -/** +/* * stereo Video Info struct */ struct hva_h264_stereo_video_sei { @@ -146,7 +146,9 @@ struct hva_h264_stereo_video_sei { u8 right_view_self_contained_flag; }; -/** +/* + * struct hva_h264_td + * * @frame_width: width in pixels of the buffer containing the input frame * @frame_height: height in pixels of the buffer containing the input frame * @frame_num: the parameter to be written in the slice header @@ -352,7 +354,9 @@ struct hva_h264_td { u32 addr_brc_in_out_parameter; }; -/** +/* + * struct hva_h264_slice_po + * * @ slice_size: slice size * @ slice_start_time: start time * @ slice_stop_time: stop time @@ -365,7 +369,9 @@ struct hva_h264_slice_po { u32 slice_num; }; -/** +/* + * struct hva_h264_po + * * @ bitstream_size: bitstream size * @ dct_bitstream_size: dtc bitstream size * @ stuffing_bits: number of stuffing bits inserted by the encoder @@ -391,7 +397,9 @@ struct hva_h264_task { struct hva_h264_po po; }; -/** +/* + * struct hva_h264_ctx + * * @seq_info: sequence information buffer * @ref_frame: reference frame buffer * @rec_frame: reconstructed frame buffer diff --git a/drivers/media/platform/ti-vpe/vpe.c b/drivers/media/platform/ti-vpe/vpe.c index bbd8bb611915..74f03f9c7901 100644 --- a/drivers/media/platform/ti-vpe/vpe.c +++ b/drivers/media/platform/ti-vpe/vpe.c @@ -931,7 +931,7 @@ static struct vpe_ctx *file2ctx(struct file *file) * mem2mem callbacks */ -/** +/* * job_ready() - check whether an instance is ready to be scheduled to run */ static int job_ready(void *priv) diff --git a/drivers/media/platform/vim2m.c b/drivers/media/platform/vim2m.c index b01fba020d5f..0592f40b23f3 100644 --- a/drivers/media/platform/vim2m.c +++ b/drivers/media/platform/vim2m.c @@ -343,7 +343,7 @@ static void schedule_irq(struct vim2m_dev *dev, int msec_timeout) * mem2mem callbacks */ -/** +/* * job_ready() - check whether an instance is ready to be scheduled to run */ static int job_ready(void *priv) diff --git a/drivers/media/rc/Kconfig b/drivers/media/rc/Kconfig index 6af8fcae1600..30f05cbedebd 100644 --- a/drivers/media/rc/Kconfig +++ b/drivers/media/rc/Kconfig @@ -2,7 +2,6 @@ menuconfig RC_CORE tristate "Remote Controller support" depends on INPUT - default y ---help--- Enable support for Remote Controllers on Linux. This is needed in order to support several video capture adapters, @@ -17,39 +16,37 @@ menuconfig RC_CORE if RC_CORE source "drivers/media/rc/keymaps/Kconfig" -menuconfig RC_DECODERS - bool "Remote controller decoders" +config LIRC + bool "LIRC user interface" + depends on RC_CORE + ---help--- + Enable this option to enable the Linux Infrared Remote + Control user interface (e.g. /dev/lirc*). This interface + passes raw IR to and from userspace, which is needed for + IR transmitting (aka "blasting") and for the lirc daemon. + +config BPF_LIRC_MODE2 + bool "Support for eBPF programs attached to lirc devices" + depends on BPF_SYSCALL + depends on RC_CORE=y + depends on LIRC + help + Allow attaching eBPF programs to a lirc device using the bpf(2) + syscall command BPF_PROG_ATTACH. This is supported for raw IR + receivers. + + These eBPF programs can be used to decode IR into scancodes, for + IR protocols not supported by the kernel decoders. + +menuconfig RC_DECODERS + bool "Remote controller decoders" depends on RC_CORE - default y if RC_DECODERS -config LIRC - tristate "LIRC interface driver" - depends on RC_CORE - - ---help--- - Enable this option to build the Linux Infrared Remote - Control (LIRC) core device interface driver. The LIRC - interface passes raw IR to and from userspace, where the - LIRC daemon handles protocol decoding for IR reception and - encoding for IR transmitting (aka "blasting"). - -config IR_LIRC_CODEC - tristate "Enable IR to LIRC bridge" - depends on RC_CORE - depends on LIRC - default y - - ---help--- - Enable this option to pass raw IR to and from userspace via - the LIRC interface. - - config IR_NEC_DECODER tristate "Enable IR raw decoder for the NEC protocol" depends on RC_CORE select BITREVERSE - default y ---help--- Enable this option if you have IR with NEC protocol, and @@ -59,7 +56,6 @@ config IR_RC5_DECODER tristate "Enable IR raw decoder for the RC-5 protocol" depends on RC_CORE select BITREVERSE - default y ---help--- Enable this option if you have IR with RC-5 protocol, and @@ -69,7 +65,6 @@ config IR_RC6_DECODER tristate "Enable IR raw decoder for the RC6 protocol" depends on RC_CORE select BITREVERSE - default y ---help--- Enable this option if you have an infrared remote control which @@ -79,7 +74,6 @@ config IR_JVC_DECODER tristate "Enable IR raw decoder for the JVC protocol" depends on RC_CORE select BITREVERSE - default y ---help--- Enable this option if you have an infrared remote control which @@ -89,7 +83,6 @@ config IR_SONY_DECODER tristate "Enable IR raw decoder for the Sony protocol" depends on RC_CORE select BITREVERSE - default y ---help--- Enable this option if you have an infrared remote control which @@ -98,7 +91,6 @@ config IR_SONY_DECODER config IR_SANYO_DECODER tristate "Enable IR raw decoder for the Sanyo protocol" depends on RC_CORE - default y ---help--- Enable this option if you have an infrared remote control which @@ -108,7 +100,6 @@ config IR_SANYO_DECODER config IR_SHARP_DECODER tristate "Enable IR raw decoder for the Sharp protocol" depends on RC_CORE - default y ---help--- Enable this option if you have an infrared remote control which @@ -119,7 +110,6 @@ config IR_MCE_KBD_DECODER tristate "Enable IR raw decoder for the MCE keyboard/mouse protocol" depends on RC_CORE select BITREVERSE - default y ---help--- Enable this option if you have a Microsoft Remote Keyboard for @@ -130,11 +120,19 @@ config IR_XMP_DECODER tristate "Enable IR raw decoder for the XMP protocol" depends on RC_CORE select BITREVERSE - default y ---help--- Enable this option if you have IR with XMP protocol, and if the IR is decoded in software + +config IR_IMON_DECODER + tristate "Enable IR raw decoder for the iMON protocol" + depends on RC_CORE + ---help--- + Enable this option if you have iMON PAD or Antec Veris infrared + remote control and you would like to use it with a raw IR + receiver, or if you wish to use an encoder to transmit this IR. + endif #RC_DECODERS menuconfig RC_DEVICES @@ -164,7 +162,7 @@ config RC_ATI_REMOTE config IR_ENE tristate "ENE eHome Receiver/Transceiver (pnp id: ENE0100/ENE02xxx)" - depends on PNP + depends on PNP || COMPILE_TEST depends on RC_CORE ---help--- Say Y here to enable support for integrated infrared receiver @@ -179,6 +177,7 @@ config IR_ENE config IR_HIX5HD2 tristate "Hisilicon hix5hd2 IR remote control" depends on RC_CORE + depends on OF || COMPILE_TEST help Say Y here if you want to use hisilicon hix5hd2 remote control. To compile this driver as a module, choose M here: the module will be @@ -198,6 +197,18 @@ config IR_IMON To compile this driver as a module, choose M here: the module will be called imon. +config IR_IMON_RAW + tristate "SoundGraph iMON Receiver (early raw IR models)" + depends on USB_ARCH_HAS_HCD + depends on RC_CORE + select USB + ---help--- + Say Y here if you want to use a SoundGraph iMON IR Receiver, + early raw models. + + To compile this driver as a module, choose M here: the + module will be called imon_raw. + config IR_MCEUSB tristate "Windows Media Center Ed. eHome Infrared Transceiver" depends on USB_ARCH_HAS_HCD @@ -212,7 +223,7 @@ config IR_MCEUSB config IR_ITE_CIR tristate "ITE Tech Inc. IT8712/IT8512 Consumer Infrared Transceiver" - depends on PNP + depends on PNP || COMPILE_TEST depends on RC_CORE ---help--- Say Y here to enable support for integrated infrared receivers @@ -225,7 +236,7 @@ config IR_ITE_CIR config IR_FINTEK tristate "Fintek Consumer Infrared Transceiver" - depends on PNP + depends on PNP || COMPILE_TEST depends on RC_CORE ---help--- Say Y here to enable support for integrated infrared receiver @@ -259,7 +270,7 @@ config IR_MTK config IR_NUVOTON tristate "Nuvoton w836x7hg Consumer Infrared Transceiver" - depends on PNP + depends on PNP || COMPILE_TEST depends on RC_CORE ---help--- Say Y here to enable support for integrated infrared receiver @@ -286,6 +297,7 @@ config IR_REDRAT3 config IR_SPI tristate "SPI connected IR LED" depends on SPI && LIRC + depends on OF || COMPILE_TEST ---help--- Say Y if you want to use an IR LED connected through SPI bus. @@ -306,7 +318,7 @@ config IR_STREAMZAP config IR_WINBOND_CIR tristate "Winbond IR remote control" - depends on X86 && PNP + depends on (X86 && PNP) || COMPILE_TEST depends on RC_CORE select NEW_LEDS select LEDS_CLASS @@ -393,6 +405,7 @@ config RC_LOOPBACK config IR_GPIO_CIR tristate "GPIO IR remote control" depends on RC_CORE + depends on (OF && GPIOLIB) || COMPILE_TEST ---help--- Say Y if you want to use GPIO based IR Receiver. @@ -403,6 +416,7 @@ config IR_GPIO_TX tristate "GPIO IR Bit Banging Transmitter" depends on RC_CORE depends on LIRC + depends on (OF && GPIOLIB) || COMPILE_TEST ---help--- Say Y if you want to a GPIO based IR transmitter. This is a bit banging driver. @@ -415,6 +429,7 @@ config IR_PWM_TX depends on RC_CORE depends on LIRC depends on PWM + depends on OF || COMPILE_TEST ---help--- Say Y if you want to use a PWM based IR transmitter. This is more power efficient than the bit banging gpio driver. @@ -455,20 +470,29 @@ config IR_SERIAL config IR_SERIAL_TRANSMITTER bool "Serial Port Transmitter" - default y depends on IR_SERIAL ---help--- Serial Port Transmitter support config IR_SIR - tristate "Built-in SIR IrDA port" - depends on RC_CORE - ---help--- + tristate "Built-in SIR IrDA port" + depends on RC_CORE + ---help--- Say Y if you want to use a IrDA SIR port Transceivers. To compile this driver as a module, choose M here: the module will be called sir-ir. +config IR_TANGO + tristate "Sigma Designs SMP86xx IR decoder" + depends on RC_CORE + depends on ARCH_TANGO || COMPILE_TEST + ---help--- + Adds support for the HW IR decoder embedded on Sigma Designs + Tango-based systems (SMP86xx, SMP87xx). + The HW decoder supports NEC, RC-5, RC-6 IR protocols. + When compiled as a module, look for tango-ir. + config IR_ZX tristate "ZTE ZX IR remote control" depends on RC_CORE diff --git a/drivers/media/rc/Makefile b/drivers/media/rc/Makefile index 82ab74ed934c..2e4045d366a5 100644 --- a/drivers/media/rc/Makefile +++ b/drivers/media/rc/Makefile @@ -1,10 +1,11 @@ # SPDX-License-Identifier: GPL-2.0 -rc-core-objs := rc-main.o rc-ir-raw.o obj-y += keymaps/ obj-$(CONFIG_RC_CORE) += rc-core.o -obj-$(CONFIG_LIRC) += lirc_dev.o +rc-core-y := rc-main.o rc-ir-raw.o +rc-core-$(CONFIG_LIRC) += lirc_dev.o +rc-core-$(CONFIG_BPF_LIRC_MODE2) += bpf-lirc.o obj-$(CONFIG_IR_NEC_DECODER) += ir-nec-decoder.o obj-$(CONFIG_IR_RC5_DECODER) += ir-rc5-decoder.o obj-$(CONFIG_IR_RC6_DECODER) += ir-rc6-decoder.o @@ -13,13 +14,14 @@ obj-$(CONFIG_IR_SONY_DECODER) += ir-sony-decoder.o obj-$(CONFIG_IR_SANYO_DECODER) += ir-sanyo-decoder.o obj-$(CONFIG_IR_SHARP_DECODER) += ir-sharp-decoder.o obj-$(CONFIG_IR_MCE_KBD_DECODER) += ir-mce_kbd-decoder.o -obj-$(CONFIG_IR_LIRC_CODEC) += ir-lirc-codec.o obj-$(CONFIG_IR_XMP_DECODER) += ir-xmp-decoder.o +obj-$(CONFIG_IR_IMON_DECODER) += ir-imon-decoder.o # stand-alone IR receivers/transmitters obj-$(CONFIG_RC_ATI_REMOTE) += ati_remote.o obj-$(CONFIG_IR_HIX5HD2) += ir-hix5hd2.o obj-$(CONFIG_IR_IMON) += imon.o +obj-$(CONFIG_IR_IMON_RAW) += imon_raw.o obj-$(CONFIG_IR_ITE_CIR) += ite-cir.o obj-$(CONFIG_IR_MCEUSB) += mceusb.o obj-$(CONFIG_IR_FINTEK) += fintek-cir.o @@ -45,4 +47,5 @@ obj-$(CONFIG_IR_SERIAL) += serial_ir.o obj-$(CONFIG_IR_SIR) += sir_ir.o obj-$(CONFIG_IR_MTK) += mtk-cir.o obj-$(CONFIG_IR_ZX) += zx-irdec.o -obj-$(CONFIG_IR_MSM_GENI) += msm-geni-ir.o \ No newline at end of file +obj-$(CONFIG_IR_MSM_GENI) += msm-geni-ir.o +obj-$(CONFIG_IR_TANGO) += tango-ir.o diff --git a/drivers/media/rc/ati_remote.c b/drivers/media/rc/ati_remote.c index 8e3af398a6c4..01c82da8e9aa 100644 --- a/drivers/media/rc/ati_remote.c +++ b/drivers/media/rc/ati_remote.c @@ -198,7 +198,7 @@ static const struct ati_receiver_type type_firefly = { .default_keymap = RC_MAP_SNAPSTREAM_FIREFLY }; -static struct usb_device_id ati_remote_table[] = { +static const struct usb_device_id ati_remote_table[] = { { USB_DEVICE(ATI_REMOTE_VENDOR_ID, LOLA_REMOTE_PRODUCT_ID), .driver_info = (unsigned long)&type_ati diff --git a/drivers/media/rc/bpf-lirc.c b/drivers/media/rc/bpf-lirc.c new file mode 100644 index 000000000000..a3fc3812eaee --- /dev/null +++ b/drivers/media/rc/bpf-lirc.c @@ -0,0 +1,315 @@ +// SPDX-License-Identifier: GPL-2.0 +// bpf-lirc.c - handles bpf +// +// Copyright (C) 2018 Sean Young + +#include +#include +#include +#include "rc-core-priv.h" + +/* + * BPF interface for raw IR + */ +const struct bpf_prog_ops lirc_mode2_prog_ops = { +}; + +BPF_CALL_1(bpf_rc_repeat, u32*, sample) +{ + struct ir_raw_event_ctrl *ctrl; + + ctrl = container_of(sample, struct ir_raw_event_ctrl, bpf_sample); + + rc_repeat(ctrl->dev); + + return 0; +} + +static const struct bpf_func_proto rc_repeat_proto = { + .func = bpf_rc_repeat, + .gpl_only = true, /* rc_repeat is EXPORT_SYMBOL_GPL */ + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + +/* + * Currently rc-core does not support 64-bit scancodes, but there are many + * known protocols with more than 32 bits. So, define the interface as u64 + * as a future-proof. + */ +BPF_CALL_4(bpf_rc_keydown, u32*, sample, u32, protocol, u64, scancode, + u32, toggle) +{ + struct ir_raw_event_ctrl *ctrl; + + ctrl = container_of(sample, struct ir_raw_event_ctrl, bpf_sample); + + rc_keydown(ctrl->dev, protocol, scancode, toggle != 0); + + return 0; +} + +static const struct bpf_func_proto rc_keydown_proto = { + .func = bpf_rc_keydown, + .gpl_only = true, /* rc_keydown is EXPORT_SYMBOL_GPL */ + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_ANYTHING, +}; + +static const struct bpf_func_proto * +lirc_mode2_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_rc_repeat: + return &rc_repeat_proto; + case BPF_FUNC_rc_keydown: + return &rc_keydown_proto; + case BPF_FUNC_map_lookup_elem: + return &bpf_map_lookup_elem_proto; + case BPF_FUNC_map_update_elem: + return &bpf_map_update_elem_proto; + case BPF_FUNC_map_delete_elem: + return &bpf_map_delete_elem_proto; + case BPF_FUNC_map_push_elem: + return &bpf_map_push_elem_proto; + case BPF_FUNC_map_pop_elem: + return &bpf_map_pop_elem_proto; + case BPF_FUNC_map_peek_elem: + return &bpf_map_peek_elem_proto; + case BPF_FUNC_ktime_get_ns: + return &bpf_ktime_get_ns_proto; + case BPF_FUNC_ktime_get_boot_ns: + return &bpf_ktime_get_boot_ns_proto; + case BPF_FUNC_tail_call: + return &bpf_tail_call_proto; + case BPF_FUNC_get_prandom_u32: + return &bpf_get_prandom_u32_proto; + case BPF_FUNC_trace_printk: + if (capable(CAP_SYS_ADMIN)) + return bpf_get_trace_printk_proto(); + /* fall through */ + default: + return NULL; + } +} + +static bool lirc_mode2_is_valid_access(int off, int size, + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + /* We have one field of u32 */ + return type == BPF_READ && off == 0 && size == sizeof(u32); +} + +const struct bpf_verifier_ops lirc_mode2_verifier_ops = { + .get_func_proto = lirc_mode2_func_proto, + .is_valid_access = lirc_mode2_is_valid_access +}; + +#define BPF_MAX_PROGS 64 + +static int lirc_bpf_attach(struct rc_dev *rcdev, struct bpf_prog *prog) +{ + struct bpf_prog_array __rcu *old_array; + struct bpf_prog_array *new_array; + struct ir_raw_event_ctrl *raw; + int ret; + + if (rcdev->driver_type != RC_DRIVER_IR_RAW) + return -EINVAL; + + ret = mutex_lock_interruptible(&ir_raw_handler_lock); + if (ret) + return ret; + + raw = rcdev->raw; + if (!raw) { + ret = -ENODEV; + goto unlock; + } + + if (raw->progs && bpf_prog_array_length(raw->progs) >= BPF_MAX_PROGS) { + ret = -E2BIG; + goto unlock; + } + + old_array = raw->progs; + ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array); + if (ret < 0) + goto unlock; + + rcu_assign_pointer(raw->progs, new_array); + bpf_prog_array_free(old_array); + +unlock: + mutex_unlock(&ir_raw_handler_lock); + return ret; +} + +static int lirc_bpf_detach(struct rc_dev *rcdev, struct bpf_prog *prog) +{ + struct bpf_prog_array __rcu *old_array; + struct bpf_prog_array *new_array; + struct ir_raw_event_ctrl *raw; + int ret; + + if (rcdev->driver_type != RC_DRIVER_IR_RAW) + return -EINVAL; + + ret = mutex_lock_interruptible(&ir_raw_handler_lock); + if (ret) + return ret; + + raw = rcdev->raw; + if (!raw) { + ret = -ENODEV; + goto unlock; + } + + old_array = raw->progs; + ret = bpf_prog_array_copy(old_array, prog, NULL, &new_array); + /* + * Do not use bpf_prog_array_delete_safe() as we would end up + * with a dummy entry in the array, and the we would free the + * dummy in lirc_bpf_free() + */ + if (ret) + goto unlock; + + rcu_assign_pointer(raw->progs, new_array); + bpf_prog_array_free(old_array); + bpf_prog_put(prog); +unlock: + mutex_unlock(&ir_raw_handler_lock); + return ret; +} + +void lirc_bpf_run(struct rc_dev *rcdev, u32 sample) +{ + struct ir_raw_event_ctrl *raw = rcdev->raw; + + raw->bpf_sample = sample; + + if (raw->progs) + BPF_PROG_RUN_ARRAY(raw->progs, &raw->bpf_sample, BPF_PROG_RUN); +} + +/* + * This should be called once the rc thread has been stopped, so there can be + * no concurrent bpf execution. + */ +void lirc_bpf_free(struct rc_dev *rcdev) +{ + struct bpf_prog_array_item *item; + + if (!rcdev->raw->progs) + return; + + item = rcu_dereference(rcdev->raw->progs)->items; + while (item->prog) { + bpf_prog_put(item->prog); + item++; + } + + bpf_prog_array_free(rcdev->raw->progs); +} + +int lirc_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog) +{ + struct rc_dev *rcdev; + int ret; + + if (attr->attach_flags) + return -EINVAL; + + rcdev = rc_dev_get_from_fd(attr->target_fd); + if (IS_ERR(rcdev)) + return PTR_ERR(rcdev); + + ret = lirc_bpf_attach(rcdev, prog); + + put_device(&rcdev->dev); + + return ret; +} + +int lirc_prog_detach(const union bpf_attr *attr) +{ + struct bpf_prog *prog; + struct rc_dev *rcdev; + int ret; + + if (attr->attach_flags) + return -EINVAL; + + prog = bpf_prog_get_type(attr->attach_bpf_fd, + BPF_PROG_TYPE_LIRC_MODE2); + if (IS_ERR(prog)) + return PTR_ERR(prog); + + rcdev = rc_dev_get_from_fd(attr->target_fd); + if (IS_ERR(rcdev)) { + bpf_prog_put(prog); + return PTR_ERR(rcdev); + } + + ret = lirc_bpf_detach(rcdev, prog); + + bpf_prog_put(prog); + put_device(&rcdev->dev); + + return ret; +} + +int lirc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) +{ + __u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids); + struct bpf_prog_array __rcu *progs; + struct rc_dev *rcdev; + u32 cnt, flags = 0; + int ret; + + if (attr->query.query_flags) + return -EINVAL; + + rcdev = rc_dev_get_from_fd(attr->query.target_fd); + if (IS_ERR(rcdev)) + return PTR_ERR(rcdev); + + if (rcdev->driver_type != RC_DRIVER_IR_RAW) { + ret = -EINVAL; + goto put; + } + + ret = mutex_lock_interruptible(&ir_raw_handler_lock); + if (ret) + goto put; + + progs = rcdev->raw->progs; + cnt = progs ? bpf_prog_array_length(progs) : 0; + + if (copy_to_user(&uattr->query.prog_cnt, &cnt, sizeof(cnt))) { + ret = -EFAULT; + goto unlock; + } + + if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags))) { + ret = -EFAULT; + goto unlock; + } + + if (attr->query.prog_cnt != 0 && prog_ids && cnt) + ret = bpf_prog_array_copy_to_user(progs, prog_ids, + attr->query.prog_cnt); + +unlock: + mutex_unlock(&ir_raw_handler_lock); +put: + put_device(&rcdev->dev); + + return ret; +} diff --git a/drivers/media/rc/ene_ir.c b/drivers/media/rc/ene_ir.c index 4761b2a72d8e..8cf2a5c0575a 100644 --- a/drivers/media/rc/ene_ir.c +++ b/drivers/media/rc/ene_ir.c @@ -670,9 +670,9 @@ exit: } /* timer to simulate tx done interrupt */ -static void ene_tx_irqsim(unsigned long data) +static void ene_tx_irqsim(struct timer_list *t) { - struct ene_device *dev = (struct ene_device *)data; + struct ene_device *dev = from_timer(dev, t, tx_sim_timer); unsigned long flags; spin_lock_irqsave(&dev->hw_lock, flags); @@ -1045,8 +1045,7 @@ static int ene_probe(struct pnp_dev *pnp_dev, const struct pnp_device_id *id) if (!dev->hw_learning_and_tx_capable && txsim) { dev->hw_learning_and_tx_capable = true; - setup_timer(&dev->tx_sim_timer, ene_tx_irqsim, - (long unsigned int)dev); + timer_setup(&dev->tx_sim_timer, ene_tx_irqsim, 0); pr_warn("Simulation of TX activated\n"); } diff --git a/drivers/media/rc/gpio-ir-recv.c b/drivers/media/rc/gpio-ir-recv.c index 7248b3662285..ed5cfde4d9e7 100644 --- a/drivers/media/rc/gpio-ir-recv.c +++ b/drivers/media/rc/gpio-ir-recv.c @@ -14,119 +14,64 @@ #include #include #include -#include +#include #include #include #include #include #include #include -#include -#define GPIO_IR_DRIVER_NAME "gpio-rc-recv" #define GPIO_IR_DEVICE_NAME "gpio_ir_recv" struct gpio_rc_dev { struct rc_dev *rcdev; - int gpio_nr; - bool active_low; + struct gpio_desc *gpiod; + int irq; }; -#ifdef CONFIG_OF -/* - * Translate OpenFirmware node properties into platform_data - */ -static int gpio_ir_recv_get_devtree_pdata(struct device *dev, - struct gpio_ir_recv_platform_data *pdata) -{ - struct device_node *np = dev->of_node; - enum of_gpio_flags flags; - int gpio; - - gpio = of_get_gpio_flags(np, 0, &flags); - if (gpio < 0) { - if (gpio != -EPROBE_DEFER) - dev_err(dev, "Failed to get gpio flags (%d)\n", gpio); - return gpio; - } - - pdata->gpio_nr = gpio; - pdata->active_low = (flags & OF_GPIO_ACTIVE_LOW); - /* probe() takes care of map_name == NULL or allowed_protos == 0 */ - pdata->map_name = of_get_property(np, "linux,rc-map-name", NULL); - pdata->allowed_protos = 0; - - return 0; -} - -static const struct of_device_id gpio_ir_recv_of_match[] = { - { .compatible = "gpio-ir-receiver", }, - { }, -}; -MODULE_DEVICE_TABLE(of, gpio_ir_recv_of_match); - -#else /* !CONFIG_OF */ - -#define gpio_ir_recv_get_devtree_pdata(dev, pdata) (-ENOSYS) - -#endif - static irqreturn_t gpio_ir_recv_irq(int irq, void *dev_id) { + int val; struct gpio_rc_dev *gpio_dev = dev_id; - int gval; - int rc = 0; - gval = gpio_get_value(gpio_dev->gpio_nr); + val = gpiod_get_value(gpio_dev->gpiod); + if (val >= 0) + ir_raw_event_store_edge(gpio_dev->rcdev, val == 1); - if (gval < 0) - goto err_get_value; - - if (gpio_dev->active_low) - gval = !gval; - - rc = ir_raw_event_store_edge(gpio_dev->rcdev, gval == 1); - if (rc < 0) - goto err_get_value; - -err_get_value: return IRQ_HANDLED; } static int gpio_ir_recv_probe(struct platform_device *pdev) { + struct device *dev = &pdev->dev; + struct device_node *np = dev->of_node; struct gpio_rc_dev *gpio_dev; struct rc_dev *rcdev; - const struct gpio_ir_recv_platform_data *pdata = - pdev->dev.platform_data; int rc; - if (pdev->dev.of_node) { - struct gpio_ir_recv_platform_data *dtpdata = - devm_kzalloc(&pdev->dev, sizeof(*dtpdata), GFP_KERNEL); - if (!dtpdata) - return -ENOMEM; - rc = gpio_ir_recv_get_devtree_pdata(&pdev->dev, dtpdata); - if (rc) - return rc; - pdata = dtpdata; - } + if (!np) + return -ENODEV; - if (!pdata) - return -EINVAL; - - if (pdata->gpio_nr < 0) - return -EINVAL; - - gpio_dev = kzalloc(sizeof(struct gpio_rc_dev), GFP_KERNEL); + gpio_dev = devm_kzalloc(dev, sizeof(*gpio_dev), GFP_KERNEL); if (!gpio_dev) return -ENOMEM; - rcdev = rc_allocate_device(RC_DRIVER_IR_RAW); - if (!rcdev) { - rc = -ENOMEM; - goto err_allocate_device; + gpio_dev->gpiod = devm_gpiod_get(dev, NULL, GPIOD_IN); + if (IS_ERR(gpio_dev->gpiod)) { + rc = PTR_ERR(gpio_dev->gpiod); + /* Just try again if this happens */ + if (rc != -EPROBE_DEFER) + dev_err(dev, "error getting gpio (%d)\n", rc); + return rc; } + gpio_dev->irq = gpiod_to_irq(gpio_dev->gpiod); + if (gpio_dev->irq < 0) + return gpio_dev->irq; + + rcdev = devm_rc_allocate_device(dev, RC_DRIVER_IR_RAW); + if (!rcdev) + return -ENOMEM; rcdev->priv = gpio_dev; rcdev->device_name = GPIO_IR_DEVICE_NAME; @@ -135,92 +80,54 @@ static int gpio_ir_recv_probe(struct platform_device *pdev) rcdev->input_id.vendor = 0x0001; rcdev->input_id.product = 0x0001; rcdev->input_id.version = 0x0100; - rcdev->dev.parent = &pdev->dev; - rcdev->driver_name = GPIO_IR_DRIVER_NAME; + rcdev->dev.parent = dev; + rcdev->driver_name = KBUILD_MODNAME; rcdev->min_timeout = 1; rcdev->timeout = IR_DEFAULT_TIMEOUT; rcdev->max_timeout = 10 * IR_DEFAULT_TIMEOUT; - if (pdata->allowed_protos) - rcdev->allowed_protocols = pdata->allowed_protos; - else - rcdev->allowed_protocols = RC_PROTO_BIT_ALL_IR_DECODER; - rcdev->map_name = pdata->map_name ?: RC_MAP_EMPTY; + rcdev->allowed_protocols = RC_PROTO_BIT_ALL_IR_DECODER; + rcdev->map_name = of_get_property(np, "linux,rc-map-name", NULL); + if (!rcdev->map_name) + rcdev->map_name = RC_MAP_EMPTY; gpio_dev->rcdev = rcdev; - gpio_dev->gpio_nr = pdata->gpio_nr; - gpio_dev->active_low = pdata->active_low; + if (of_property_read_bool(np, "wakeup-source")) + device_init_wakeup(dev, true); - rc = gpio_request(pdata->gpio_nr, "gpio-ir-recv"); - if (rc < 0) - goto err_gpio_request; - rc = gpio_direction_input(pdata->gpio_nr); - if (rc < 0) - goto err_gpio_direction_input; - - rc = rc_register_device(rcdev); + rc = devm_rc_register_device(dev, rcdev); if (rc < 0) { - dev_err(&pdev->dev, "failed to register rc device\n"); - goto err_register_rc_device; + dev_err(dev, "failed to register rc device (%d)\n", rc); + return rc; } platform_set_drvdata(pdev, gpio_dev); - rc = request_any_context_irq(gpio_to_irq(pdata->gpio_nr), - gpio_ir_recv_irq, - IRQF_TRIGGER_FALLING | IRQF_TRIGGER_RISING, - "gpio-ir-recv-irq", gpio_dev); - if (rc < 0) - goto err_request_irq; - - return 0; - -err_request_irq: - rc_unregister_device(rcdev); - rcdev = NULL; -err_register_rc_device: -err_gpio_direction_input: - gpio_free(pdata->gpio_nr); -err_gpio_request: - rc_free_device(rcdev); -err_allocate_device: - kfree(gpio_dev); - return rc; -} - -static int gpio_ir_recv_remove(struct platform_device *pdev) -{ - struct gpio_rc_dev *gpio_dev = platform_get_drvdata(pdev); - - free_irq(gpio_to_irq(gpio_dev->gpio_nr), gpio_dev); - rc_unregister_device(gpio_dev->rcdev); - gpio_free(gpio_dev->gpio_nr); - kfree(gpio_dev); - return 0; + return devm_request_irq(dev, gpio_dev->irq, gpio_ir_recv_irq, + IRQF_TRIGGER_FALLING | IRQF_TRIGGER_RISING, + "gpio-ir-recv-irq", gpio_dev); } #ifdef CONFIG_PM static int gpio_ir_recv_suspend(struct device *dev) { - struct platform_device *pdev = to_platform_device(dev); - struct gpio_rc_dev *gpio_dev = platform_get_drvdata(pdev); + struct gpio_rc_dev *gpio_dev = dev_get_drvdata(dev); if (device_may_wakeup(dev)) - enable_irq_wake(gpio_to_irq(gpio_dev->gpio_nr)); + enable_irq_wake(gpio_dev->irq); else - disable_irq(gpio_to_irq(gpio_dev->gpio_nr)); + disable_irq(gpio_dev->irq); return 0; } static int gpio_ir_recv_resume(struct device *dev) { - struct platform_device *pdev = to_platform_device(dev); - struct gpio_rc_dev *gpio_dev = platform_get_drvdata(pdev); + struct gpio_rc_dev *gpio_dev = dev_get_drvdata(dev); if (device_may_wakeup(dev)) - disable_irq_wake(gpio_to_irq(gpio_dev->gpio_nr)); + disable_irq_wake(gpio_dev->irq); else - enable_irq(gpio_to_irq(gpio_dev->gpio_nr)); + enable_irq(gpio_dev->irq); return 0; } @@ -231,11 +138,16 @@ static const struct dev_pm_ops gpio_ir_recv_pm_ops = { }; #endif +static const struct of_device_id gpio_ir_recv_of_match[] = { + { .compatible = "gpio-ir-receiver", }, + { }, +}; +MODULE_DEVICE_TABLE(of, gpio_ir_recv_of_match); + static struct platform_driver gpio_ir_recv_driver = { .probe = gpio_ir_recv_probe, - .remove = gpio_ir_recv_remove, .driver = { - .name = GPIO_IR_DRIVER_NAME, + .name = KBUILD_MODNAME, .of_match_table = of_match_ptr(gpio_ir_recv_of_match), #ifdef CONFIG_PM .pm = &gpio_ir_recv_pm_ops, diff --git a/drivers/media/rc/igorplugusb.c b/drivers/media/rc/igorplugusb.c index 2a0325d1f9de..98a13532a596 100644 --- a/drivers/media/rc/igorplugusb.c +++ b/drivers/media/rc/igorplugusb.c @@ -139,9 +139,9 @@ static void igorplugusb_cmd(struct igorplugusb *ir, int cmd) dev_err(ir->dev, "submit urb failed: %d", ret); } -static void igorplugusb_timer(unsigned long data) +static void igorplugusb_timer(struct timer_list *t) { - struct igorplugusb *ir = (struct igorplugusb *)data; + struct igorplugusb *ir = from_timer(ir, t, timer); igorplugusb_cmd(ir, GET_INFRACODE); } @@ -176,7 +176,7 @@ static int igorplugusb_probe(struct usb_interface *intf, ir->dev = &intf->dev; - setup_timer(&ir->timer, igorplugusb_timer, (unsigned long)ir); + timer_setup(&ir->timer, igorplugusb_timer, 0); ir->request.bRequest = GET_INFRACODE; ir->request.bRequestType = USB_TYPE_VENDOR | USB_DIR_IN; @@ -247,7 +247,7 @@ static void igorplugusb_disconnect(struct usb_interface *intf) usb_free_urb(ir->urb); } -static struct usb_device_id igorplugusb_table[] = { +static const struct usb_device_id igorplugusb_table[] = { /* Igor Plug USB (Atmel's Manufact. ID) */ { USB_DEVICE(0x03eb, 0x0002) }, /* Fit PC2 Infrared Adapter */ diff --git a/drivers/media/rc/iguanair.c b/drivers/media/rc/iguanair.c index 03dbbfba71fc..1df9522c30fa 100644 --- a/drivers/media/rc/iguanair.c +++ b/drivers/media/rc/iguanair.c @@ -347,26 +347,23 @@ static int iguanair_set_tx_mask(struct rc_dev *dev, uint32_t mask) static int iguanair_tx(struct rc_dev *dev, unsigned *txbuf, unsigned count) { struct iguanair *ir = dev->priv; - uint8_t space; - unsigned i, size, periods, bytes; + unsigned int i, size, p, periods; int rc; mutex_lock(&ir->lock); /* convert from us to carrier periods */ - for (i = space = size = 0; i < count; i++) { + for (i = size = 0; i < count; i++) { periods = DIV_ROUND_CLOSEST(txbuf[i] * ir->carrier, 1000000); - bytes = DIV_ROUND_UP(periods, 127); - if (size + bytes > ir->bufsize) { - rc = -EINVAL; - goto out; - } while (periods) { - unsigned p = min(periods, 127u); - ir->packet->payload[size++] = p | space; + p = min(periods, 127u); + if (size >= ir->bufsize) { + rc = -EINVAL; + goto out; + } + ir->packet->payload[size++] = p | ((i & 1) ? 0x80 : 0); periods -= p; } - space ^= 0x80; } ir->packet->header.start = 0; diff --git a/drivers/media/rc/img-ir/Kconfig b/drivers/media/rc/img-ir/Kconfig index a896d3c83a1c..d2c6617d468e 100644 --- a/drivers/media/rc/img-ir/Kconfig +++ b/drivers/media/rc/img-ir/Kconfig @@ -1,7 +1,7 @@ config IR_IMG tristate "ImgTec IR Decoder" depends on RC_CORE - depends on METAG || MIPS || COMPILE_TEST + depends on MIPS || COMPILE_TEST select IR_IMG_HW if !IR_IMG_RAW help Say Y or M here if you want to use the ImgTec infrared decoder diff --git a/drivers/media/rc/img-ir/img-ir-core.c b/drivers/media/rc/img-ir/img-ir-core.c index 03fe080278df..bcbabeeab12a 100644 --- a/drivers/media/rc/img-ir/img-ir-core.c +++ b/drivers/media/rc/img-ir/img-ir-core.c @@ -92,10 +92,9 @@ static int img_ir_probe(struct platform_device *pdev) /* Private driver data */ priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL); - if (!priv) { - dev_err(&pdev->dev, "cannot allocate device data\n"); + if (!priv) return -ENOMEM; - } + platform_set_drvdata(pdev, priv); priv->dev = &pdev->dev; spin_lock_init(&priv->lock); diff --git a/drivers/media/rc/img-ir/img-ir-hw.c b/drivers/media/rc/img-ir/img-ir-hw.c index 82fdf4cc0824..ec4ded84cd17 100644 --- a/drivers/media/rc/img-ir/img-ir-hw.c +++ b/drivers/media/rc/img-ir/img-ir-hw.c @@ -339,7 +339,7 @@ static void img_ir_decoder_preprocess(struct img_ir_decoder *decoder) /** * img_ir_decoder_convert() - Generate internal timings in decoder. * @decoder: Decoder to be converted to internal timings. - * @timings: Timing register values. + * @reg_timings: Timing register values. * @clock_hz: IR clock rate in Hz. * * Fills out the repeat timings and timing register values for a specific clock @@ -867,9 +867,9 @@ static void img_ir_handle_data(struct img_ir_priv *priv, u32 len, u64 raw) } /* timer function to end waiting for repeat. */ -static void img_ir_end_timer(unsigned long arg) +static void img_ir_end_timer(struct timer_list *t) { - struct img_ir_priv *priv = (struct img_ir_priv *)arg; + struct img_ir_priv *priv = from_timer(priv, t, hw.end_timer); spin_lock_irq(&priv->lock); img_ir_end_repeat(priv); @@ -881,9 +881,9 @@ static void img_ir_end_timer(unsigned long arg) * cleared when invalid interrupts were generated due to a quirk in the * img-ir decoder. */ -static void img_ir_suspend_timer(unsigned long arg) +static void img_ir_suspend_timer(struct timer_list *t) { - struct img_ir_priv *priv = (struct img_ir_priv *)arg; + struct img_ir_priv *priv = from_timer(priv, t, hw.suspend_timer); spin_lock_irq(&priv->lock); /* @@ -1055,9 +1055,8 @@ int img_ir_probe_hw(struct img_ir_priv *priv) img_ir_probe_hw_caps(priv); /* Set up the end timer */ - setup_timer(&hw->end_timer, img_ir_end_timer, (unsigned long)priv); - setup_timer(&hw->suspend_timer, img_ir_suspend_timer, - (unsigned long)priv); + timer_setup(&hw->end_timer, img_ir_end_timer, 0); + timer_setup(&hw->suspend_timer, img_ir_suspend_timer, 0); /* Register a clock notifier */ if (!IS_ERR(priv->clk)) { diff --git a/drivers/media/rc/img-ir/img-ir-raw.c b/drivers/media/rc/img-ir/img-ir-raw.c index 64714efc1145..6e545680d3b6 100644 --- a/drivers/media/rc/img-ir/img-ir-raw.c +++ b/drivers/media/rc/img-ir/img-ir-raw.c @@ -67,9 +67,9 @@ void img_ir_isr_raw(struct img_ir_priv *priv, u32 irq_status) * order to be assured of the final space. If there are no edges for a certain * time we use this timer to emit a final sample to satisfy them. */ -static void img_ir_echo_timer(unsigned long arg) +static void img_ir_echo_timer(struct timer_list *t) { - struct img_ir_priv *priv = (struct img_ir_priv *)arg; + struct img_ir_priv *priv = from_timer(priv, t, raw.timer); spin_lock_irq(&priv->lock); @@ -107,7 +107,7 @@ int img_ir_probe_raw(struct img_ir_priv *priv) int error; /* Set up the echo timer */ - setup_timer(&raw->timer, img_ir_echo_timer, (unsigned long)priv); + timer_setup(&raw->timer, img_ir_echo_timer, 0); /* Allocate raw decoder */ raw->rdev = rdev = rc_allocate_device(RC_DRIVER_IR_RAW); diff --git a/drivers/media/rc/imon.c b/drivers/media/rc/imon.c index 875e00e4a8a0..c78e1a4a10ec 100644 --- a/drivers/media/rc/imon.c +++ b/drivers/media/rc/imon.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -37,7 +38,6 @@ #include #include -#include #include #define MOD_AUTHOR "Jarod Wilson " @@ -92,7 +92,6 @@ struct imon_usb_dev_descr { __u16 flags; #define IMON_NO_FLAGS 0 #define IMON_NEED_20MS_PKT_DELAY 1 -#define IMON_IR_RAW 2 struct imon_panel_key_table key_table[]; }; @@ -123,12 +122,6 @@ struct imon_context { unsigned char usb_tx_buf[8]; unsigned int send_packet_delay; - struct rx_data { - int count; /* length of 0 or 1 sequence */ - int prev_bit; /* logic level of sequence */ - int initial_space; /* initial space flag */ - } rx; - struct tx_t { unsigned char data_buf[35]; /* user data buffer */ struct completion finished; /* wait for write to finish */ @@ -331,10 +324,6 @@ static const struct imon_usb_dev_descr imon_DH102 = { } }; -static const struct imon_usb_dev_descr imon_ir_raw = { - .flags = IMON_IR_RAW, -}; - /* * USB Device ID for iMON USB Control Boards * @@ -346,7 +335,7 @@ static const struct imon_usb_dev_descr imon_ir_raw = { * devices use the SoundGraph vendor ID (0x15c2). This driver only supports * the ffdc and later devices, which do onboard decoding. */ -static struct usb_device_id imon_usb_id_table[] = { +static const struct usb_device_id imon_usb_id_table[] = { /* * Several devices with this same device ID, all use iMON_PAD.inf * SoundGraph iMON PAD (IR & VFD) @@ -418,18 +407,6 @@ static struct usb_device_id imon_usb_id_table[] = { /* device specifics unknown */ { USB_DEVICE(0x15c2, 0x0046), .driver_info = (unsigned long)&imon_default_table}, - /* TriGem iMON (IR only) -- TG_iMON.inf */ - { USB_DEVICE(0x0aa8, 0x8001), - .driver_info = (unsigned long)&imon_ir_raw}, - /* SoundGraph iMON (IR only) -- sg_imon.inf */ - { USB_DEVICE(0x04e8, 0xff30), - .driver_info = (unsigned long)&imon_ir_raw}, - /* SoundGraph iMON VFD (IR & VFD) -- iMON_VFD.inf */ - { USB_DEVICE(0x0aa8, 0xffda), - .driver_info = (unsigned long)&imon_ir_raw}, - /* SoundGraph iMON SS (IR & VFD) -- iMON_SS.inf */ - { USB_DEVICE(0x15c2, 0xffda), - .driver_info = (unsigned long)&imon_ir_raw}, {} }; @@ -492,7 +469,7 @@ static void free_imon_context(struct imon_context *ictx) dev_dbg(dev, "%s: iMON context freed\n", __func__); } -/** +/* * Called when the Display device (e.g. /dev/lcd0) * is opened by the application. */ @@ -542,7 +519,7 @@ exit: return retval; } -/** +/* * Called when the display device (e.g. /dev/lcd0) * is closed by the application. */ @@ -575,7 +552,7 @@ static int display_close(struct inode *inode, struct file *file) return retval; } -/** +/* * Sends a packet to the device -- this function must be called with * ictx->lock held, or its unlock/lock sequence while waiting for tx * to complete can/will lead to a deadlock. @@ -602,8 +579,7 @@ static int send_packet(struct imon_context *ictx) ictx->tx_urb->actual_length = 0; } else { /* fill request into kmalloc'ed space: */ - control_req = kmalloc(sizeof(struct usb_ctrlrequest), - GFP_KERNEL); + control_req = kmalloc(sizeof(*control_req), GFP_KERNEL); if (control_req == NULL) return -ENOMEM; @@ -664,7 +640,7 @@ static int send_packet(struct imon_context *ictx) return retval; } -/** +/* * Sends an associate packet to the iMON 2.4G. * * This might not be such a good idea, since it has an id collision with @@ -694,7 +670,7 @@ static int send_associate_24g(struct imon_context *ictx) return retval; } -/** +/* * Sends packets to setup and show clock on iMON display * * Arguments: year - last 2 digits of year, month - 1..12, @@ -781,7 +757,7 @@ static int send_set_imon_clock(struct imon_context *ictx, return retval; } -/** +/* * These are the sysfs functions to handle the association on the iMON 2.4G LT. */ static ssize_t show_associate_remote(struct device *d, @@ -823,7 +799,7 @@ static ssize_t store_associate_remote(struct device *d, return count; } -/** +/* * sysfs functions to control internal imon clock */ static ssize_t show_imon_clock(struct device *d, @@ -923,7 +899,7 @@ static const struct attribute_group imon_rf_attr_group = { .attrs = imon_rf_sysfs_entries }; -/** +/* * Writes data to the VFD. The iMON VFD is 2x16 characters * and requires data in 5 consecutive USB interrupt packets, * each packet but the last carrying 7 bytes. @@ -942,7 +918,7 @@ static ssize_t vfd_write(struct file *file, const char __user *buf, int seq; int retval = 0; struct imon_context *ictx; - const unsigned char vfd_packet6[] = { + static const unsigned char vfd_packet6[] = { 0x01, 0x00, 0x00, 0x00, 0x00, 0xFF, 0xFF }; ictx = file->private_data; @@ -1009,7 +985,7 @@ exit: return (!retval) ? n_bytes : retval; } -/** +/* * Writes data to the LCD. The iMON OEM LCD screen expects 8-byte * packets. We accept data as 16 hexadecimal digits, followed by a * newline (to make it easy to drive the device from a command-line @@ -1067,7 +1043,7 @@ exit: return (!retval) ? n_bytes : retval; } -/** +/* * Callback function for USB core API: transmit data */ static void usb_tx_callback(struct urb *urb) @@ -1088,12 +1064,12 @@ static void usb_tx_callback(struct urb *urb) complete(&ictx->tx.finished); } -/** +/* * report touchscreen input */ -static void imon_touch_display_timeout(unsigned long data) +static void imon_touch_display_timeout(struct timer_list *t) { - struct imon_context *ictx = (struct imon_context *)data; + struct imon_context *ictx = from_timer(ictx, t, ttimer); if (ictx->display_type != IMON_DISPLAY_TYPE_VGA) return; @@ -1104,7 +1080,7 @@ static void imon_touch_display_timeout(unsigned long data) input_sync(ictx->touch); } -/** +/* * iMON IR receivers support two different signal sets -- those used by * the iMON remotes, and those used by the Windows MCE remotes (which is * really just RC-6), but only one or the other at a time, as the signals @@ -1134,18 +1110,18 @@ static int imon_ir_change_protocol(struct rc_dev *rc, u64 *rc_proto) dev_dbg(dev, "Configuring IR receiver for MCE protocol\n"); ir_proto_packet[0] = 0x01; *rc_proto = RC_PROTO_BIT_RC6_MCE; - } else if (*rc_proto & RC_PROTO_BIT_OTHER) { + } else if (*rc_proto & RC_PROTO_BIT_IMON) { dev_dbg(dev, "Configuring IR receiver for iMON protocol\n"); if (!pad_stabilize) dev_dbg(dev, "PAD stabilize functionality disabled\n"); /* ir_proto_packet[0] = 0x00; // already the default */ - *rc_proto = RC_PROTO_BIT_OTHER; + *rc_proto = RC_PROTO_BIT_IMON; } else { dev_warn(dev, "Unsupported IR protocol specified, overriding to iMON IR protocol\n"); if (!pad_stabilize) dev_dbg(dev, "PAD stabilize functionality disabled\n"); /* ir_proto_packet[0] = 0x00; // already the default */ - *rc_proto = RC_PROTO_BIT_OTHER; + *rc_proto = RC_PROTO_BIT_IMON; } memcpy(ictx->usb_tx_buf, &ir_proto_packet, sizeof(ir_proto_packet)); @@ -1166,30 +1142,7 @@ out: return retval; } -static inline int tv2int(const struct timeval *a, const struct timeval *b) -{ - int usecs = 0; - int sec = 0; - - if (b->tv_usec > a->tv_usec) { - usecs = 1000000; - sec--; - } - - usecs += a->tv_usec - b->tv_usec; - - sec += a->tv_sec - b->tv_sec; - sec *= 1000; - usecs /= 1000; - sec += usecs; - - if (sec < 0) - sec = 1000; - - return sec; -} - -/** +/* * The directional pad behaves a bit differently, depending on whether this is * one of the older ffdc devices or a newer device. Newer devices appear to * have a higher resolution matrix for more precise mouse movement, but it @@ -1199,16 +1152,16 @@ static inline int tv2int(const struct timeval *a, const struct timeval *b) */ static int stabilize(int a, int b, u16 timeout, u16 threshold) { - struct timeval ct; - static struct timeval prev_time = {0, 0}; - static struct timeval hit_time = {0, 0}; + ktime_t ct; + static ktime_t prev_time; + static ktime_t hit_time; static int x, y, prev_result, hits; int result = 0; - int msec, msec_hit; + long msec, msec_hit; - do_gettimeofday(&ct); - msec = tv2int(&ct, &prev_time); - msec_hit = tv2int(&ct, &hit_time); + ct = ktime_get(); + msec = ktime_ms_delta(ct, prev_time); + msec_hit = ktime_ms_delta(ct, hit_time); if (msec > 100) { x = 0; @@ -1432,7 +1385,7 @@ static void imon_pad_to_keys(struct imon_context *ictx, unsigned char *buf) rel_x = buf[2]; rel_y = buf[3]; - if (ictx->rc_proto == RC_PROTO_BIT_OTHER && pad_stabilize) { + if (ictx->rc_proto == RC_PROTO_BIT_IMON && pad_stabilize) { if ((buf[1] == 0) && ((rel_x != 0) || (rel_y != 0))) { dir = stabilize((int)rel_x, (int)rel_y, timeout, threshold); @@ -1499,7 +1452,7 @@ static void imon_pad_to_keys(struct imon_context *ictx, unsigned char *buf) buf[0] = 0x01; buf[1] = buf[4] = buf[5] = buf[6] = buf[7] = 0; - if (ictx->rc_proto == RC_PROTO_BIT_OTHER && pad_stabilize) { + if (ictx->rc_proto == RC_PROTO_BIT_IMON && pad_stabilize) { dir = stabilize((int)rel_x, (int)rel_y, timeout, threshold); if (!dir) { @@ -1541,7 +1494,7 @@ static void imon_pad_to_keys(struct imon_context *ictx, unsigned char *buf) } } -/** +/* * figure out if these is a press or a release. We don't actually * care about repeats, as those will be auto-generated within the IR * subsystem for repeating scancodes. @@ -1590,94 +1543,11 @@ static int imon_parse_press_type(struct imon_context *ictx, return press_type; } -/** +/* * Process the incoming packet */ -/** - * Convert bit count to time duration (in us) and submit - * the value to lirc_dev. - */ -static void submit_data(struct imon_context *context) -{ - DEFINE_IR_RAW_EVENT(ev); - - ev.pulse = context->rx.prev_bit; - ev.duration = US_TO_NS(context->rx.count * BIT_DURATION); - ir_raw_event_store_with_filter(context->rdev, &ev); -} - -/** - * Process the incoming packet - */ -static void imon_incoming_ir_raw(struct imon_context *context, +static void imon_incoming_packet(struct imon_context *ictx, struct urb *urb, int intf) -{ - int len = urb->actual_length; - unsigned char *buf = urb->transfer_buffer; - struct device *dev = context->dev; - int octet, bit; - unsigned char mask; - - if (len != 8) { - dev_warn(dev, "imon %s: invalid incoming packet size (len = %d, intf%d)\n", - __func__, len, intf); - return; - } - - if (debug) - dev_info(dev, "raw packet: %*ph\n", len, buf); - /* - * Translate received data to pulse and space lengths. - * Received data is active low, i.e. pulses are 0 and - * spaces are 1. - * - * My original algorithm was essentially similar to - * Changwoo Ryu's with the exception that he switched - * the incoming bits to active high and also fed an - * initial space to LIRC at the start of a new sequence - * if the previous bit was a pulse. - * - * I've decided to adopt his algorithm. - */ - - if (buf[7] == 1 && context->rx.initial_space) { - /* LIRC requires a leading space */ - context->rx.prev_bit = 0; - context->rx.count = 4; - submit_data(context); - context->rx.count = 0; - } - - for (octet = 0; octet < 5; ++octet) { - mask = 0x80; - for (bit = 0; bit < 8; ++bit) { - int curr_bit = !(buf[octet] & mask); - - if (curr_bit != context->rx.prev_bit) { - if (context->rx.count) { - submit_data(context); - context->rx.count = 0; - } - context->rx.prev_bit = curr_bit; - } - ++context->rx.count; - mask >>= 1; - } - } - - if (buf[7] == 10) { - if (context->rx.count) { - submit_data(context); - context->rx.count = 0; - } - context->rx.initial_space = context->rx.prev_bit; - } - - ir_raw_event_handle(context->rdev); -} - -static void imon_incoming_scancode(struct imon_context *ictx, - struct urb *urb, int intf) { int len = urb->actual_length; unsigned char *buf = urb->transfer_buffer; @@ -1686,9 +1556,9 @@ static void imon_incoming_scancode(struct imon_context *ictx, u32 kc; u64 scancode; int press_type = 0; - int msec; - struct timeval t; - static struct timeval prev_time = { 0, 0 }; + long msec; + ktime_t t; + static ktime_t prev_time; u8 ktype; /* filter out junk data on the older 0xffdc imon devices */ @@ -1765,11 +1635,18 @@ static void imon_incoming_scancode(struct imon_context *ictx, if (press_type == 0) rc_keyup(ictx->rdev); else { - if (ictx->rc_proto == RC_PROTO_BIT_RC6_MCE || - ictx->rc_proto == RC_PROTO_BIT_OTHER) - rc_keydown(ictx->rdev, - ictx->rc_proto == RC_PROTO_BIT_RC6_MCE ? RC_PROTO_RC6_MCE : RC_PROTO_OTHER, - ictx->rc_scancode, ictx->rc_toggle); + enum rc_proto proto; + + if (ictx->rc_proto == RC_PROTO_BIT_RC6_MCE) + proto = RC_PROTO_RC6_MCE; + else if (ictx->rc_proto == RC_PROTO_BIT_IMON) + proto = RC_PROTO_IMON; + else + return; + + rc_keydown(ictx->rdev, proto, ictx->rc_scancode, + ictx->rc_toggle); + spin_lock_irqsave(&ictx->kc_lock, flags); ictx->last_keycode = ictx->kc; spin_unlock_irqrestore(&ictx->kc_lock, flags); @@ -1780,10 +1657,10 @@ static void imon_incoming_scancode(struct imon_context *ictx, /* Only panel type events left to process now */ spin_lock_irqsave(&ictx->kc_lock, flags); - do_gettimeofday(&t); + t = ktime_get(); /* KEY_MUTE repeats from knob need to be suppressed */ if (ictx->kc == KEY_MUTE && ictx->kc == ictx->last_keycode) { - msec = tv2int(&t, &prev_time); + msec = ktime_ms_delta(t, prev_time); if (msec < ictx->idev->rep[REP_DELAY]) { spin_unlock_irqrestore(&ictx->kc_lock, flags); return; @@ -1828,7 +1705,7 @@ not_input_data: } } -/** +/* * Callback function for USB core API: receive data */ static void usb_rx_callback_intf0(struct urb *urb) @@ -1859,10 +1736,7 @@ static void usb_rx_callback_intf0(struct urb *urb) break; case 0: - if (ictx->rdev->driver_type == RC_DRIVER_IR_RAW) - imon_incoming_ir_raw(ictx, urb, intfnum); - else - imon_incoming_scancode(ictx, urb, intfnum); + imon_incoming_packet(ictx, urb, intfnum); break; default: @@ -1903,10 +1777,7 @@ static void usb_rx_callback_intf1(struct urb *urb) break; case 0: - if (ictx->rdev->driver_type == RC_DRIVER_IR_RAW) - imon_incoming_ir_raw(ictx, urb, intfnum); - else - imon_incoming_scancode(ictx, urb, intfnum); + imon_incoming_packet(ictx, urb, intfnum); break; default: @@ -1932,7 +1803,7 @@ static void imon_get_ffdc_type(struct imon_context *ictx) { u8 ffdc_cfg_byte = ictx->usb_rx_buf[6]; u8 detected_display_type = IMON_DISPLAY_TYPE_NONE; - u64 allowed_protos = RC_PROTO_BIT_OTHER; + u64 allowed_protos = RC_PROTO_BIT_IMON; switch (ffdc_cfg_byte) { /* iMON Knob, no display, iMON IR + vol knob */ @@ -1953,6 +1824,7 @@ static void imon_get_ffdc_type(struct imon_context *ictx) break; /* iMON VFD, iMON IR */ case 0x24: + case 0x30: case 0x85: dev_info(ictx->dev, "0xffdc iMON VFD, iMON IR"); detected_display_type = IMON_DISPLAY_TYPE_VFD; @@ -1976,11 +1848,18 @@ static void imon_get_ffdc_type(struct imon_context *ictx) detected_display_type = IMON_DISPLAY_TYPE_LCD; allowed_protos = RC_PROTO_BIT_RC6_MCE; break; + /* no display, iMON IR */ + case 0x26: + dev_info(ictx->dev, "0xffdc iMON Inside, iMON IR"); + ictx->display_supported = false; + break; default: dev_info(ictx->dev, "Unknown 0xffdc device, defaulting to VFD and iMON IR"); detected_display_type = IMON_DISPLAY_TYPE_VFD; - /* We don't know which one it is, allow user to set the - * RC6 one from userspace if OTHER wasn't correct. */ + /* + * We don't know which one it is, allow user to set the + * RC6 one from userspace if IMON wasn't correct. + */ allowed_protos |= RC_PROTO_BIT_RC6_MCE; break; } @@ -2019,14 +1898,11 @@ static void imon_set_display_type(struct imon_context *ictx) case 0x0041: case 0x0042: case 0x0043: - case 0x8001: - case 0xff30: configured_display_type = IMON_DISPLAY_TYPE_NONE; ictx->display_supported = false; break; case 0x0036: case 0x0044: - case 0xffda: default: configured_display_type = IMON_DISPLAY_TYPE_VFD; break; @@ -2048,11 +1924,10 @@ static struct rc_dev *imon_init_rdev(struct imon_context *ictx) { struct rc_dev *rdev; int ret; - const unsigned char fp_packet[] = { 0x40, 0x00, 0x00, 0x00, - 0x00, 0x00, 0x00, 0x88 }; + static const unsigned char fp_packet[] = { + 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x88 }; - rdev = rc_allocate_device(ictx->dev_descr->flags & IMON_IR_RAW ? - RC_DRIVER_IR_RAW : RC_DRIVER_SCANCODE); + rdev = rc_allocate_device(RC_DRIVER_SCANCODE); if (!rdev) { dev_err(ictx->dev, "remote control dev allocation failed\n"); goto out; @@ -2070,12 +1945,8 @@ static struct rc_dev *imon_init_rdev(struct imon_context *ictx) rdev->dev.parent = ictx->dev; rdev->priv = ictx; - if (ictx->dev_descr->flags & IMON_IR_RAW) - rdev->allowed_protocols = RC_PROTO_BIT_ALL_IR_DECODER; - else - /* iMON PAD or MCE */ - rdev->allowed_protocols = RC_PROTO_BIT_OTHER | - RC_PROTO_BIT_RC6_MCE; + /* iMON PAD or MCE */ + rdev->allowed_protocols = RC_PROTO_BIT_IMON | RC_PROTO_BIT_RC6_MCE; rdev->change_protocol = imon_ir_change_protocol; rdev->driver_name = MOD_NAME; @@ -2093,8 +1964,7 @@ static struct rc_dev *imon_init_rdev(struct imon_context *ictx) imon_set_display_type(ictx); - if (ictx->rc_proto == RC_PROTO_BIT_RC6_MCE || - ictx->dev_descr->flags & IMON_IR_RAW) + if (ictx->rc_proto == RC_PROTO_BIT_RC6_MCE) rdev->map_name = RC_MAP_IMON_MCE; else rdev->map_name = RC_MAP_IMON_PAD; @@ -2311,11 +2181,10 @@ static struct imon_context *imon_init_intf0(struct usb_interface *intf, struct usb_host_interface *iface_desc; int ret = -ENOMEM; - ictx = kzalloc(sizeof(struct imon_context), GFP_KERNEL); - if (!ictx) { - dev_err(dev, "%s: kzalloc failed for context", __func__); + ictx = kzalloc(sizeof(*ictx), GFP_KERNEL); + if (!ictx) goto exit; - } + rx_urb = usb_alloc_urb(0, GFP_KERNEL); if (!rx_urb) goto rx_urb_alloc_failed; @@ -2414,8 +2283,7 @@ static struct imon_context *imon_init_intf1(struct usb_interface *intf, mutex_lock(&ictx->lock); if (ictx->display_type == IMON_DISPLAY_TYPE_VGA) { - setup_timer(&ictx->ttimer, imon_touch_display_timeout, - (unsigned long)ictx); + timer_setup(&ictx->ttimer, imon_touch_display_timeout, 0); } ictx->usbdev_intf1 = usb_get_dev(interface_to_usbdev(intf)); @@ -2489,7 +2357,7 @@ static void imon_init_display(struct imon_context *ictx, } -/** +/* * Callback function for USB core API: Probe */ static int imon_probe(struct usb_interface *interface, @@ -2587,7 +2455,7 @@ fail: return ret; } -/** +/* * Callback function for USB core API: disconnect */ static void imon_disconnect(struct usb_interface *interface) diff --git a/drivers/media/rc/imon_raw.c b/drivers/media/rc/imon_raw.c new file mode 100644 index 000000000000..32709f96de14 --- /dev/null +++ b/drivers/media/rc/imon_raw.c @@ -0,0 +1,199 @@ +// SPDX-License-Identifier: GPL-2.0+ +// +// Copyright (C) 2018 Sean Young + +#include +#include +#include +#include + +/* Each bit is 250us */ +#define BIT_DURATION 250000 + +struct imon { + struct device *dev; + struct urb *ir_urb; + struct rc_dev *rcdev; + u8 ir_buf[8]; + char phys[64]; +}; + +/* + * ffs/find_next_bit() searches in the wrong direction, so open-code our own. + */ +static inline int is_bit_set(const u8 *buf, int bit) +{ + return buf[bit / 8] & (0x80 >> (bit & 7)); +} + +static void imon_ir_data(struct imon *imon) +{ + DEFINE_IR_RAW_EVENT(rawir); + int offset = 0, size = 5 * 8; + int bit; + + dev_dbg(imon->dev, "data: %*ph", 8, imon->ir_buf); + + while (offset < size) { + bit = offset; + while (!is_bit_set(imon->ir_buf, bit) && bit < size) + bit++; + dev_dbg(imon->dev, "pulse: %d bits", bit - offset); + if (bit > offset) { + rawir.pulse = true; + rawir.duration = (bit - offset) * BIT_DURATION; + ir_raw_event_store_with_filter(imon->rcdev, &rawir); + } + + if (bit >= size) + break; + + offset = bit; + while (is_bit_set(imon->ir_buf, bit) && bit < size) + bit++; + dev_dbg(imon->dev, "space: %d bits", bit - offset); + + rawir.pulse = false; + rawir.duration = (bit - offset) * BIT_DURATION; + ir_raw_event_store_with_filter(imon->rcdev, &rawir); + + offset = bit; + } + + if (imon->ir_buf[7] == 0x0a) { + ir_raw_event_set_idle(imon->rcdev, true); + ir_raw_event_handle(imon->rcdev); + } +} + +static void imon_ir_rx(struct urb *urb) +{ + struct imon *imon = urb->context; + int ret; + + switch (urb->status) { + case 0: + if (imon->ir_buf[7] != 0xff) + imon_ir_data(imon); + break; + case -ECONNRESET: + case -ENOENT: + case -ESHUTDOWN: + usb_unlink_urb(urb); + return; + case -EPIPE: + default: + dev_dbg(imon->dev, "error: urb status = %d", urb->status); + break; + } + + ret = usb_submit_urb(urb, GFP_ATOMIC); + if (ret && ret != -ENODEV) + dev_warn(imon->dev, "failed to resubmit urb: %d", ret); +} + +static int imon_probe(struct usb_interface *intf, + const struct usb_device_id *id) +{ + struct usb_endpoint_descriptor *ir_ep = NULL; + struct usb_host_interface *idesc; + struct usb_device *udev; + struct rc_dev *rcdev; + struct imon *imon; + int i, ret; + + udev = interface_to_usbdev(intf); + idesc = intf->cur_altsetting; + + for (i = 0; i < idesc->desc.bNumEndpoints; i++) { + struct usb_endpoint_descriptor *ep = &idesc->endpoint[i].desc; + + if (usb_endpoint_is_int_in(ep)) { + ir_ep = ep; + break; + } + } + + if (!ir_ep) { + dev_err(&intf->dev, "IR endpoint missing"); + return -ENODEV; + } + + imon = devm_kmalloc(&intf->dev, sizeof(*imon), GFP_KERNEL); + if (!imon) + return -ENOMEM; + + imon->ir_urb = usb_alloc_urb(0, GFP_KERNEL); + if (!imon->ir_urb) + return -ENOMEM; + + imon->dev = &intf->dev; + usb_fill_int_urb(imon->ir_urb, udev, + usb_rcvintpipe(udev, ir_ep->bEndpointAddress), + imon->ir_buf, sizeof(imon->ir_buf), + imon_ir_rx, imon, ir_ep->bInterval); + + rcdev = devm_rc_allocate_device(&intf->dev, RC_DRIVER_IR_RAW); + if (!rcdev) { + ret = -ENOMEM; + goto free_urb; + } + + usb_make_path(udev, imon->phys, sizeof(imon->phys)); + + rcdev->device_name = "iMON Station"; + rcdev->driver_name = KBUILD_MODNAME; + rcdev->input_phys = imon->phys; + usb_to_input_id(udev, &rcdev->input_id); + rcdev->dev.parent = &intf->dev; + rcdev->allowed_protocols = RC_PROTO_BIT_ALL_IR_DECODER; + rcdev->map_name = RC_MAP_IMON_RSC; + rcdev->rx_resolution = BIT_DURATION; + rcdev->priv = imon; + + ret = devm_rc_register_device(&intf->dev, rcdev); + if (ret) + goto free_urb; + + imon->rcdev = rcdev; + + ret = usb_submit_urb(imon->ir_urb, GFP_KERNEL); + if (ret) + goto free_urb; + + usb_set_intfdata(intf, imon); + + return 0; + +free_urb: + usb_free_urb(imon->ir_urb); + return ret; +} + +static void imon_disconnect(struct usb_interface *intf) +{ + struct imon *imon = usb_get_intfdata(intf); + + usb_kill_urb(imon->ir_urb); + usb_free_urb(imon->ir_urb); +} + +static const struct usb_device_id imon_table[] = { + /* SoundGraph iMON (IR only) -- sg_imon.inf */ + { USB_DEVICE(0x04e8, 0xff30) }, + {} +}; + +static struct usb_driver imon_driver = { + .name = KBUILD_MODNAME, + .probe = imon_probe, + .disconnect = imon_disconnect, + .id_table = imon_table +}; + +module_usb_driver(imon_driver); + +MODULE_DESCRIPTION("Early raw iMON IR devices"); +MODULE_AUTHOR("Sean Young "); +MODULE_LICENSE("GPL"); +MODULE_DEVICE_TABLE(usb, imon_table); diff --git a/drivers/media/rc/ir-hix5hd2.c b/drivers/media/rc/ir-hix5hd2.c index 0ce11c41dfae..700ab4c563d0 100644 --- a/drivers/media/rc/ir-hix5hd2.c +++ b/drivers/media/rc/ir-hix5hd2.c @@ -71,9 +71,10 @@ struct hix5hd2_ir_priv { unsigned long rate; }; -static void hix5hd2_ir_enable(struct hix5hd2_ir_priv *dev, bool on) +static int hix5hd2_ir_enable(struct hix5hd2_ir_priv *dev, bool on) { u32 val; + int ret = 0; if (dev->regmap) { regmap_read(dev->regmap, IR_CLK, &val); @@ -87,10 +88,11 @@ static void hix5hd2_ir_enable(struct hix5hd2_ir_priv *dev, bool on) regmap_write(dev->regmap, IR_CLK, val); } else { if (on) - clk_prepare_enable(dev->clock); + ret = clk_prepare_enable(dev->clock); else clk_disable_unprepare(dev->clock); } + return ret; } static int hix5hd2_ir_config(struct hix5hd2_ir_priv *priv) @@ -127,9 +129,18 @@ static int hix5hd2_ir_config(struct hix5hd2_ir_priv *priv) static int hix5hd2_ir_open(struct rc_dev *rdev) { struct hix5hd2_ir_priv *priv = rdev->priv; + int ret; - hix5hd2_ir_enable(priv, true); - return hix5hd2_ir_config(priv); + ret = hix5hd2_ir_enable(priv, true); + if (ret) + return ret; + + ret = hix5hd2_ir_config(priv); + if (ret) { + hix5hd2_ir_enable(priv, false); + return ret; + } + return 0; } static void hix5hd2_ir_close(struct rc_dev *rdev) @@ -239,7 +250,9 @@ static int hix5hd2_ir_probe(struct platform_device *pdev) ret = PTR_ERR(priv->clock); goto err; } - clk_prepare_enable(priv->clock); + ret = clk_prepare_enable(priv->clock); + if (ret) + goto err; priv->rate = clk_get_rate(priv->clock); rdev->allowed_protocols = RC_PROTO_BIT_ALL_IR_DECODER; @@ -309,9 +322,17 @@ static int hix5hd2_ir_suspend(struct device *dev) static int hix5hd2_ir_resume(struct device *dev) { struct hix5hd2_ir_priv *priv = dev_get_drvdata(dev); + int ret; - hix5hd2_ir_enable(priv, true); - clk_prepare_enable(priv->clock); + ret = hix5hd2_ir_enable(priv, true); + if (ret) + return ret; + + ret = clk_prepare_enable(priv->clock); + if (ret) { + hix5hd2_ir_enable(priv, false); + return ret; + } writel_relaxed(0x01, priv->base + IR_ENABLE); writel_relaxed(0x00, priv->base + IR_INTM); diff --git a/drivers/media/rc/ir-imon-decoder.c b/drivers/media/rc/ir-imon-decoder.c new file mode 100644 index 000000000000..67c1b0c15aae --- /dev/null +++ b/drivers/media/rc/ir-imon-decoder.c @@ -0,0 +1,323 @@ +// SPDX-License-Identifier: GPL-2.0+ +// ir-imon-decoder.c - handle iMon protocol +// +// Copyright (C) 2018 by Sean Young + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include "rc-core-priv.h" + +#define IMON_UNIT 415662 /* ns */ +#define IMON_BITS 30 +#define IMON_CHKBITS (BIT(30) | BIT(25) | BIT(24) | BIT(22) | \ + BIT(21) | BIT(20) | BIT(19) | BIT(18) | \ + BIT(17) | BIT(16) | BIT(14) | BIT(13) | \ + BIT(12) | BIT(11) | BIT(10) | BIT(9)) + +/* + * This protocol has 30 bits. The format is one IMON_UNIT header pulse, + * followed by 30 bits. Each bit is one IMON_UNIT check field, and then + * one IMON_UNIT field with the actual bit (1=space, 0=pulse). + * The check field is always space for some bits, for others it is pulse if + * both the preceding and current bit are zero, else space. IMON_CHKBITS + * defines which bits are of type check. + * + * There is no way to distinguish an incomplete message from one where + * the lower bits are all set, iow. the last pulse is for the lowest + * bit which is 0. + */ +enum imon_state { + STATE_INACTIVE, + STATE_BIT_CHK, + STATE_BIT_START, + STATE_FINISHED, + STATE_ERROR, +}; + +static void ir_imon_decode_scancode(struct rc_dev *dev) +{ + struct imon_dec *imon = &dev->raw->imon; + + /* Keyboard/Mouse toggle */ + if (imon->bits == 0x299115b7) + imon->stick_keyboard = !imon->stick_keyboard; + + if ((imon->bits & 0xfc0000ff) == 0x680000b7) { + int rel_x, rel_y; + u8 buf; + + buf = imon->bits >> 16; + rel_x = (buf & 0x08) | (buf & 0x10) >> 2 | + (buf & 0x20) >> 4 | (buf & 0x40) >> 6; + if (imon->bits & 0x02000000) + rel_x |= ~0x0f; + buf = imon->bits >> 8; + rel_y = (buf & 0x08) | (buf & 0x10) >> 2 | + (buf & 0x20) >> 4 | (buf & 0x40) >> 6; + if (imon->bits & 0x01000000) + rel_y |= ~0x0f; + + if (rel_x && rel_y && imon->stick_keyboard) { + if (abs(rel_y) > abs(rel_x)) + imon->bits = rel_y > 0 ? + 0x289515b7 : /* KEY_DOWN */ + 0x2aa515b7; /* KEY_UP */ + else + imon->bits = rel_x > 0 ? + 0x2ba515b7 : /* KEY_RIGHT */ + 0x29a515b7; /* KEY_LEFT */ + } + + if (!imon->stick_keyboard) { + struct lirc_scancode lsc = { + .scancode = imon->bits, + .rc_proto = RC_PROTO_IMON, + }; + + ir_lirc_scancode_event(dev, &lsc); + + input_event(imon->idev, EV_MSC, MSC_SCAN, imon->bits); + + input_report_rel(imon->idev, REL_X, rel_x); + input_report_rel(imon->idev, REL_Y, rel_y); + + input_report_key(imon->idev, BTN_LEFT, + (imon->bits & 0x00010000) != 0); + input_report_key(imon->idev, BTN_RIGHT, + (imon->bits & 0x00040000) != 0); + input_sync(imon->idev); + return; + } + } + + rc_keydown(dev, RC_PROTO_IMON, imon->bits, 0); +} + +/** + * ir_imon_decode() - Decode one iMON pulse or space + * @dev: the struct rc_dev descriptor of the device + * @ev: the struct ir_raw_event descriptor of the pulse/space + * + * This function returns -EINVAL if the pulse violates the state machine + */ +static int ir_imon_decode(struct rc_dev *dev, struct ir_raw_event ev) +{ + struct imon_dec *data = &dev->raw->imon; + + if (!is_timing_event(ev)) { + if (ev.reset) + data->state = STATE_INACTIVE; + return 0; + } + + dev_dbg(&dev->dev, + "iMON decode started at state %d bitno %d (%uus %s)\n", + data->state, data->count, TO_US(ev.duration), + TO_STR(ev.pulse)); + + /* + * Since iMON protocol is a series of bits, if at any point + * we encounter an error, make sure that any remaining bits + * aren't parsed as a scancode made up of less bits. + * + * Note that if the stick is held, then the remote repeats + * the scancode with about 12ms between them. So, make sure + * we have at least 10ms of space after an error. That way, + * we're at a new scancode. + */ + if (data->state == STATE_ERROR) { + if (!ev.pulse && ev.duration > MS_TO_NS(10)) + data->state = STATE_INACTIVE; + return 0; + } + + for (;;) { + if (!geq_margin(ev.duration, IMON_UNIT, IMON_UNIT / 2)) + return 0; + + decrease_duration(&ev, IMON_UNIT); + + switch (data->state) { + case STATE_INACTIVE: + if (ev.pulse) { + data->state = STATE_BIT_CHK; + data->bits = 0; + data->count = IMON_BITS; + } + break; + case STATE_BIT_CHK: + if (IMON_CHKBITS & BIT(data->count)) + data->last_chk = ev.pulse; + else if (ev.pulse) + goto err_out; + data->state = STATE_BIT_START; + break; + case STATE_BIT_START: + data->bits <<= 1; + if (!ev.pulse) + data->bits |= 1; + + if (IMON_CHKBITS & BIT(data->count)) { + if (data->last_chk != !(data->bits & 3)) + goto err_out; + } + + if (!data->count--) + data->state = STATE_FINISHED; + else + data->state = STATE_BIT_CHK; + break; + case STATE_FINISHED: + if (ev.pulse) + goto err_out; + ir_imon_decode_scancode(dev); + data->state = STATE_INACTIVE; + break; + } + } + +err_out: + dev_dbg(&dev->dev, + "iMON decode failed at state %d bitno %d (%uus %s)\n", + data->state, data->count, TO_US(ev.duration), + TO_STR(ev.pulse)); + + data->state = STATE_ERROR; + + return -EINVAL; +} + +/** + * ir_imon_encode() - Encode a scancode as a stream of raw events + * + * @protocol: protocol to encode + * @scancode: scancode to encode + * @events: array of raw ir events to write into + * @max: maximum size of @events + * + * Returns: The number of events written. + * -ENOBUFS if there isn't enough space in the array to fit the + * encoding. In this case all @max events will have been written. + */ +static int ir_imon_encode(enum rc_proto protocol, u32 scancode, + struct ir_raw_event *events, unsigned int max) +{ + struct ir_raw_event *e = events; + int i, pulse; + + if (!max--) + return -ENOBUFS; + init_ir_raw_event_duration(e, 1, IMON_UNIT); + + for (i = IMON_BITS; i >= 0; i--) { + if (BIT(i) & IMON_CHKBITS) + pulse = !(scancode & (BIT(i) | BIT(i + 1))); + else + pulse = 0; + + if (pulse == e->pulse) { + e->duration += IMON_UNIT; + } else { + if (!max--) + return -ENOBUFS; + init_ir_raw_event_duration(++e, pulse, IMON_UNIT); + } + + pulse = !(scancode & BIT(i)); + + if (pulse == e->pulse) { + e->duration += IMON_UNIT; + } else { + if (!max--) + return -ENOBUFS; + init_ir_raw_event_duration(++e, pulse, IMON_UNIT); + } + } + + if (e->pulse) + e++; + + return e - events; +} + +static int ir_imon_register(struct rc_dev *dev) +{ + struct input_dev *idev; + struct imon_dec *imon = &dev->raw->imon; + int ret; + + idev = input_allocate_device(); + if (!idev) + return -ENOMEM; + + snprintf(imon->name, sizeof(imon->name), + "iMON PAD Stick (%s)", dev->device_name); + idev->name = imon->name; + idev->phys = dev->input_phys; + + /* Mouse bits */ + set_bit(EV_REL, idev->evbit); + set_bit(EV_KEY, idev->evbit); + set_bit(REL_X, idev->relbit); + set_bit(REL_Y, idev->relbit); + set_bit(BTN_LEFT, idev->keybit); + set_bit(BTN_RIGHT, idev->keybit); + + /* Report scancodes too */ + set_bit(EV_MSC, idev->evbit); + set_bit(MSC_SCAN, idev->mscbit); + + input_set_drvdata(idev, imon); + + ret = input_register_device(idev); + if (ret < 0) { + input_free_device(idev); + return -EIO; + } + + imon->idev = idev; + imon->stick_keyboard = false; + + return 0; +} + +static int ir_imon_unregister(struct rc_dev *dev) +{ + struct imon_dec *imon = &dev->raw->imon; + + input_unregister_device(imon->idev); + imon->idev = NULL; + + return 0; +} + +static struct ir_raw_handler imon_handler = { + .protocols = RC_PROTO_BIT_IMON, + .decode = ir_imon_decode, + .encode = ir_imon_encode, + .carrier = 38000, + .raw_register = ir_imon_register, + .raw_unregister = ir_imon_unregister, + .min_timeout = IMON_UNIT * IMON_BITS * 2, +}; + +static int __init ir_imon_decode_init(void) +{ + ir_raw_handler_register(&imon_handler); + + pr_info("IR iMON protocol handler initialized\n"); + return 0; +} + +static void __exit ir_imon_decode_exit(void) +{ + ir_raw_handler_unregister(&imon_handler); +} + +module_init(ir_imon_decode_init); +module_exit(ir_imon_decode_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Sean Young "); +MODULE_DESCRIPTION("iMON IR protocol decoder"); diff --git a/drivers/media/rc/ir-jvc-decoder.c b/drivers/media/rc/ir-jvc-decoder.c index e2bd68c42edf..5706cfe60027 100644 --- a/drivers/media/rc/ir-jvc-decoder.c +++ b/drivers/media/rc/ir-jvc-decoder.c @@ -39,7 +39,7 @@ enum jvc_state { /** * ir_jvc_decode() - Decode one JVC pulse or space * @dev: the struct rc_dev descriptor of the device - * @duration: the struct ir_raw_event descriptor of the pulse/space + * @ev: the struct ir_raw_event descriptor of the pulse/space * * This function returns -EINVAL if the pulse violates the state machine */ @@ -56,8 +56,8 @@ static int ir_jvc_decode(struct rc_dev *dev, struct ir_raw_event ev) if (!geq_margin(ev.duration, JVC_UNIT, JVC_UNIT / 2)) goto out; - IR_dprintk(2, "JVC decode started at state %d (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "JVC decode started at state %d (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); again: switch (data->state) { @@ -136,15 +136,15 @@ again: u32 scancode; scancode = (bitrev8((data->bits >> 8) & 0xff) << 8) | (bitrev8((data->bits >> 0) & 0xff) << 0); - IR_dprintk(1, "JVC scancode 0x%04x\n", scancode); + dev_dbg(&dev->dev, "JVC scancode 0x%04x\n", scancode); rc_keydown(dev, RC_PROTO_JVC, scancode, data->toggle); data->first = false; data->old_bits = data->bits; } else if (data->bits == data->old_bits) { - IR_dprintk(1, "JVC repeat\n"); + dev_dbg(&dev->dev, "JVC repeat\n"); rc_repeat(dev); } else { - IR_dprintk(1, "JVC invalid repeat msg\n"); + dev_dbg(&dev->dev, "JVC invalid repeat msg\n"); break; } @@ -164,8 +164,8 @@ again: } out: - IR_dprintk(1, "JVC decode failed at state %d (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "JVC decode failed at state %d (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); data->state = STATE_INACTIVE; return -EINVAL; } @@ -212,6 +212,8 @@ static struct ir_raw_handler jvc_handler = { .protocols = RC_PROTO_BIT_JVC, .decode = ir_jvc_decode, .encode = ir_jvc_encode, + .carrier = 38000, + .min_timeout = JVC_TRAILER_SPACE, }; static int __init ir_jvc_decode_init(void) diff --git a/drivers/media/rc/ir-lirc-codec.c b/drivers/media/rc/ir-lirc-codec.c deleted file mode 100644 index 4c8f456238bc..000000000000 --- a/drivers/media/rc/ir-lirc-codec.c +++ /dev/null @@ -1,452 +0,0 @@ -/* ir-lirc-codec.c - rc-core to classic lirc interface bridge - * - * Copyright (C) 2010 by Jarod Wilson - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation version 2 of the License. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - */ - -#include -#include -#include -#include -#include -#include -#include "rc-core-priv.h" - -#define LIRCBUF_SIZE 256 - -/** - * ir_lirc_decode() - Send raw IR data to lirc_dev to be relayed to the - * lircd userspace daemon for decoding. - * @input_dev: the struct rc_dev descriptor of the device - * @duration: the struct ir_raw_event descriptor of the pulse/space - * - * This function returns -EINVAL if the lirc interfaces aren't wired up. - */ -static int ir_lirc_decode(struct rc_dev *dev, struct ir_raw_event ev) -{ - struct lirc_codec *lirc = &dev->raw->lirc; - int sample; - - if (!dev->raw->lirc.drv || !dev->raw->lirc.drv->rbuf) - return -EINVAL; - - /* Packet start */ - if (ev.reset) { - /* Userspace expects a long space event before the start of - * the signal to use as a sync. This may be done with repeat - * packets and normal samples. But if a reset has been sent - * then we assume that a long time has passed, so we send a - * space with the maximum time value. */ - sample = LIRC_SPACE(LIRC_VALUE_MASK); - IR_dprintk(2, "delivering reset sync space to lirc_dev\n"); - - /* Carrier reports */ - } else if (ev.carrier_report) { - sample = LIRC_FREQUENCY(ev.carrier); - IR_dprintk(2, "carrier report (freq: %d)\n", sample); - - /* Packet end */ - } else if (ev.timeout) { - - if (lirc->gap) - return 0; - - lirc->gap_start = ktime_get(); - lirc->gap = true; - lirc->gap_duration = ev.duration; - - if (!lirc->send_timeout_reports) - return 0; - - sample = LIRC_TIMEOUT(ev.duration / 1000); - IR_dprintk(2, "timeout report (duration: %d)\n", sample); - - /* Normal sample */ - } else { - - if (lirc->gap) { - int gap_sample; - - lirc->gap_duration += ktime_to_ns(ktime_sub(ktime_get(), - lirc->gap_start)); - - /* Convert to ms and cap by LIRC_VALUE_MASK */ - do_div(lirc->gap_duration, 1000); - lirc->gap_duration = min(lirc->gap_duration, - (u64)LIRC_VALUE_MASK); - - gap_sample = LIRC_SPACE(lirc->gap_duration); - lirc_buffer_write(dev->raw->lirc.drv->rbuf, - (unsigned char *) &gap_sample); - lirc->gap = false; - } - - sample = ev.pulse ? LIRC_PULSE(ev.duration / 1000) : - LIRC_SPACE(ev.duration / 1000); - IR_dprintk(2, "delivering %uus %s to lirc_dev\n", - TO_US(ev.duration), TO_STR(ev.pulse)); - } - - lirc_buffer_write(dev->raw->lirc.drv->rbuf, - (unsigned char *) &sample); - wake_up(&dev->raw->lirc.drv->rbuf->wait_poll); - - return 0; -} - -static ssize_t ir_lirc_transmit_ir(struct file *file, const char __user *buf, - size_t n, loff_t *ppos) -{ - struct lirc_codec *lirc; - struct rc_dev *dev; - unsigned int *txbuf; /* buffer with values to transmit */ - ssize_t ret = -EINVAL; - size_t count; - ktime_t start; - s64 towait; - unsigned int duration = 0; /* signal duration in us */ - int i; - - start = ktime_get(); - - lirc = lirc_get_pdata(file); - if (!lirc) - return -EFAULT; - - if (n < sizeof(unsigned) || n % sizeof(unsigned)) - return -EINVAL; - - count = n / sizeof(unsigned); - if (count > LIRCBUF_SIZE || count % 2 == 0) - return -EINVAL; - - txbuf = memdup_user(buf, n); - if (IS_ERR(txbuf)) - return PTR_ERR(txbuf); - - dev = lirc->dev; - if (!dev) { - ret = -EFAULT; - goto out; - } - - if (!dev->tx_ir) { - ret = -EINVAL; - goto out; - } - - for (i = 0; i < count; i++) { - if (txbuf[i] > IR_MAX_DURATION / 1000 - duration || !txbuf[i]) { - ret = -EINVAL; - goto out; - } - - duration += txbuf[i]; - } - - ret = dev->tx_ir(dev, txbuf, count); - if (ret < 0) - goto out; - - for (duration = i = 0; i < ret; i++) - duration += txbuf[i]; - - ret *= sizeof(unsigned int); - - /* - * The lircd gap calculation expects the write function to - * wait for the actual IR signal to be transmitted before - * returning. - */ - towait = ktime_us_delta(ktime_add_us(start, duration), ktime_get()); - if (towait > 0) { - set_current_state(TASK_INTERRUPTIBLE); - schedule_timeout(usecs_to_jiffies(towait)); - } - -out: - kfree(txbuf); - return ret; -} - -static long ir_lirc_ioctl(struct file *filep, unsigned int cmd, - unsigned long arg) -{ - struct lirc_codec *lirc; - struct rc_dev *dev; - u32 __user *argp = (u32 __user *)(arg); - int ret = 0; - __u32 val = 0, tmp; - - lirc = lirc_get_pdata(filep); - if (!lirc) - return -EFAULT; - - dev = lirc->dev; - if (!dev) - return -EFAULT; - - if (_IOC_DIR(cmd) & _IOC_WRITE) { - ret = get_user(val, argp); - if (ret) - return ret; - } - - switch (cmd) { - - /* legacy support */ - case LIRC_GET_SEND_MODE: - if (!dev->tx_ir) - return -ENOTTY; - - val = LIRC_MODE_PULSE; - break; - - case LIRC_SET_SEND_MODE: - if (!dev->tx_ir) - return -ENOTTY; - - if (val != LIRC_MODE_PULSE) - return -EINVAL; - return 0; - - /* TX settings */ - case LIRC_SET_TRANSMITTER_MASK: - if (!dev->s_tx_mask) - return -ENOTTY; - - return dev->s_tx_mask(dev, val); - - case LIRC_SET_SEND_CARRIER: - if (!dev->s_tx_carrier) - return -ENOTTY; - - return dev->s_tx_carrier(dev, val); - - case LIRC_SET_SEND_DUTY_CYCLE: - if (!dev->s_tx_duty_cycle) - return -ENOTTY; - - if (val <= 0 || val >= 100) - return -EINVAL; - - return dev->s_tx_duty_cycle(dev, val); - - /* RX settings */ - case LIRC_SET_REC_CARRIER: - if (!dev->s_rx_carrier_range) - return -ENOTTY; - - if (val <= 0) - return -EINVAL; - - return dev->s_rx_carrier_range(dev, - dev->raw->lirc.carrier_low, - val); - - case LIRC_SET_REC_CARRIER_RANGE: - if (!dev->s_rx_carrier_range) - return -ENOTTY; - - if (val <= 0) - return -EINVAL; - - dev->raw->lirc.carrier_low = val; - return 0; - - case LIRC_GET_REC_RESOLUTION: - if (!dev->rx_resolution) - return -ENOTTY; - - val = dev->rx_resolution / 1000; - break; - - case LIRC_SET_WIDEBAND_RECEIVER: - if (!dev->s_learning_mode) - return -ENOTTY; - - return dev->s_learning_mode(dev, !!val); - - case LIRC_SET_MEASURE_CARRIER_MODE: - if (!dev->s_carrier_report) - return -ENOTTY; - - return dev->s_carrier_report(dev, !!val); - - /* Generic timeout support */ - case LIRC_GET_MIN_TIMEOUT: - if (!dev->max_timeout) - return -ENOTTY; - val = DIV_ROUND_UP(dev->min_timeout, 1000); - break; - - case LIRC_GET_MAX_TIMEOUT: - if (!dev->max_timeout) - return -ENOTTY; - val = dev->max_timeout / 1000; - break; - - case LIRC_SET_REC_TIMEOUT: - if (!dev->max_timeout) - return -ENOTTY; - - /* Check for multiply overflow */ - if (val > U32_MAX / 1000) - return -EINVAL; - - tmp = val * 1000; - - if (tmp < dev->min_timeout || tmp > dev->max_timeout) - return -EINVAL; - - if (dev->s_timeout) - ret = dev->s_timeout(dev, tmp); - if (!ret) - dev->timeout = tmp; - break; - - case LIRC_SET_REC_TIMEOUT_REPORTS: - if (!dev->timeout) - return -ENOTTY; - - lirc->send_timeout_reports = !!val; - break; - - default: - return lirc_dev_fop_ioctl(filep, cmd, arg); - } - - if (_IOC_DIR(cmd) & _IOC_READ) - ret = put_user(val, argp); - - return ret; -} - -static const struct file_operations lirc_fops = { - .owner = THIS_MODULE, - .write = ir_lirc_transmit_ir, - .unlocked_ioctl = ir_lirc_ioctl, -#ifdef CONFIG_COMPAT - .compat_ioctl = ir_lirc_ioctl, -#endif - .read = lirc_dev_fop_read, - .poll = lirc_dev_fop_poll, - .open = lirc_dev_fop_open, - .release = lirc_dev_fop_close, - .llseek = no_llseek, -}; - -static int ir_lirc_register(struct rc_dev *dev) -{ - struct lirc_driver *drv; - int rc = -ENOMEM; - unsigned long features = 0; - - drv = kzalloc(sizeof(struct lirc_driver), GFP_KERNEL); - if (!drv) - return rc; - - if (dev->driver_type != RC_DRIVER_IR_RAW_TX) { - features |= LIRC_CAN_REC_MODE2; - if (dev->rx_resolution) - features |= LIRC_CAN_GET_REC_RESOLUTION; - } - - if (dev->tx_ir) { - features |= LIRC_CAN_SEND_PULSE; - if (dev->s_tx_mask) - features |= LIRC_CAN_SET_TRANSMITTER_MASK; - if (dev->s_tx_carrier) - features |= LIRC_CAN_SET_SEND_CARRIER; - if (dev->s_tx_duty_cycle) - features |= LIRC_CAN_SET_SEND_DUTY_CYCLE; - } - - if (dev->s_rx_carrier_range) - features |= LIRC_CAN_SET_REC_CARRIER | - LIRC_CAN_SET_REC_CARRIER_RANGE; - - if (dev->s_learning_mode) - features |= LIRC_CAN_USE_WIDEBAND_RECEIVER; - - if (dev->s_carrier_report) - features |= LIRC_CAN_MEASURE_CARRIER; - - if (dev->max_timeout) - features |= LIRC_CAN_SET_REC_TIMEOUT; - - snprintf(drv->name, sizeof(drv->name), "ir-lirc-codec (%s)", - dev->driver_name); - drv->minor = -1; - drv->features = features; - drv->data = &dev->raw->lirc; - drv->rbuf = NULL; - drv->code_length = sizeof(struct ir_raw_event) * 8; - drv->chunk_size = sizeof(int); - drv->buffer_size = LIRCBUF_SIZE; - drv->fops = &lirc_fops; - drv->dev = &dev->dev; - drv->rdev = dev; - drv->owner = THIS_MODULE; - - drv->minor = lirc_register_driver(drv); - if (drv->minor < 0) { - rc = -ENODEV; - goto out; - } - - dev->raw->lirc.drv = drv; - dev->raw->lirc.dev = dev; - return 0; - -out: - kfree(drv); - return rc; -} - -static int ir_lirc_unregister(struct rc_dev *dev) -{ - struct lirc_codec *lirc = &dev->raw->lirc; - - lirc_unregister_driver(lirc->drv->minor); - kfree(lirc->drv); - lirc->drv = NULL; - - return 0; -} - -static struct ir_raw_handler lirc_handler = { - .protocols = 0, - .decode = ir_lirc_decode, - .raw_register = ir_lirc_register, - .raw_unregister = ir_lirc_unregister, -}; - -static int __init ir_lirc_codec_init(void) -{ - ir_raw_handler_register(&lirc_handler); - - printk(KERN_INFO "IR LIRC bridge handler initialized\n"); - return 0; -} - -static void __exit ir_lirc_codec_exit(void) -{ - ir_raw_handler_unregister(&lirc_handler); -} - -module_init(ir_lirc_codec_init); -module_exit(ir_lirc_codec_exit); - -MODULE_LICENSE("GPL"); -MODULE_AUTHOR("Jarod Wilson "); -MODULE_AUTHOR("Red Hat Inc. (http://www.redhat.com)"); -MODULE_DESCRIPTION("LIRC IR handler bridge"); diff --git a/drivers/media/rc/ir-mce_kbd-decoder.c b/drivers/media/rc/ir-mce_kbd-decoder.c index 2a1728edb3c6..64ea42927669 100644 --- a/drivers/media/rc/ir-mce_kbd-decoder.c +++ b/drivers/media/rc/ir-mce_kbd-decoder.c @@ -115,23 +115,29 @@ static unsigned char kbd_keycodes[256] = { KEY_RESERVED }; -static void mce_kbd_rx_timeout(unsigned long data) +static void mce_kbd_rx_timeout(struct timer_list *t) { - struct mce_kbd_dec *mce_kbd = (struct mce_kbd_dec *)data; - int i; + struct ir_raw_event_ctrl *raw = from_timer(raw, t, mce_kbd.rx_timeout); unsigned char maskcode; + unsigned long flags; + int i; - IR_dprintk(2, "timer callback clearing all keys\n"); + dev_dbg(&raw->dev->dev, "timer callback clearing all keys\n"); - for (i = 0; i < 7; i++) { - maskcode = kbd_keycodes[MCIR2_MASK_KEYS_START + i]; - input_report_key(mce_kbd->idev, maskcode, 0); + spin_lock_irqsave(&raw->mce_kbd.keylock, flags); + + if (time_is_before_eq_jiffies(raw->mce_kbd.rx_timeout.expires)) { + for (i = 0; i < 7; i++) { + maskcode = kbd_keycodes[MCIR2_MASK_KEYS_START + i]; + input_report_key(raw->mce_kbd.idev, maskcode, 0); + } + + for (i = 0; i < MCIR2_MASK_KEYS_START; i++) + input_report_key(raw->mce_kbd.idev, kbd_keycodes[i], 0); + + input_sync(raw->mce_kbd.idev); } - - for (i = 0; i < MCIR2_MASK_KEYS_START; i++) - input_report_key(mce_kbd->idev, kbd_keycodes[i], 0); - - input_sync(mce_kbd->idev); + spin_unlock_irqrestore(&raw->mce_kbd.keylock, flags); } static enum mce_kbd_mode mce_kbd_mode(struct mce_kbd_dec *data) @@ -146,16 +152,17 @@ static enum mce_kbd_mode mce_kbd_mode(struct mce_kbd_dec *data) } } -static void ir_mce_kbd_process_keyboard_data(struct input_dev *idev, - u32 scancode) +static void ir_mce_kbd_process_keyboard_data(struct rc_dev *dev, u32 scancode) { - u8 keydata = (scancode >> 8) & 0xff; + struct mce_kbd_dec *data = &dev->raw->mce_kbd; + u8 keydata1 = (scancode >> 8) & 0xff; + u8 keydata2 = (scancode >> 16) & 0xff; u8 shiftmask = scancode & 0xff; - unsigned char keycode, maskcode; + unsigned char maskcode; int i, keystate; - IR_dprintk(1, "keyboard: keydata = 0x%02x, shiftmask = 0x%02x\n", - keydata, shiftmask); + dev_dbg(&dev->dev, "keyboard: keydata2 = 0x%02x, keydata1 = 0x%02x, shiftmask = 0x%02x\n", + keydata2, keydata1, shiftmask); for (i = 0; i < 7; i++) { maskcode = kbd_keycodes[MCIR2_MASK_KEYS_START + i]; @@ -163,20 +170,23 @@ static void ir_mce_kbd_process_keyboard_data(struct input_dev *idev, keystate = 1; else keystate = 0; - input_report_key(idev, maskcode, keystate); + input_report_key(data->idev, maskcode, keystate); } - if (keydata) { - keycode = kbd_keycodes[keydata]; - input_report_key(idev, keycode, 1); - } else { + if (keydata1) + input_report_key(data->idev, kbd_keycodes[keydata1], 1); + if (keydata2) + input_report_key(data->idev, kbd_keycodes[keydata2], 1); + + if (!keydata1 && !keydata2) { for (i = 0; i < MCIR2_MASK_KEYS_START; i++) - input_report_key(idev, kbd_keycodes[i], 0); + input_report_key(data->idev, kbd_keycodes[i], 0); } } -static void ir_mce_kbd_process_mouse_data(struct input_dev *idev, u32 scancode) +static void ir_mce_kbd_process_mouse_data(struct rc_dev *dev, u32 scancode) { + struct mce_kbd_dec *data = &dev->raw->mce_kbd; /* raw mouse coordinates */ u8 xdata = (scancode >> 7) & 0x7f; u8 ydata = (scancode >> 14) & 0x7f; @@ -195,14 +205,14 @@ static void ir_mce_kbd_process_mouse_data(struct input_dev *idev, u32 scancode) else y = ydata; - IR_dprintk(1, "mouse: x = %d, y = %d, btns = %s%s\n", - x, y, left ? "L" : "", right ? "R" : ""); + dev_dbg(&dev->dev, "mouse: x = %d, y = %d, btns = %s%s\n", + x, y, left ? "L" : "", right ? "R" : ""); - input_report_rel(idev, REL_X, x); - input_report_rel(idev, REL_Y, y); + input_report_rel(data->idev, REL_X, x); + input_report_rel(data->idev, REL_Y, y); - input_report_key(idev, BTN_LEFT, left); - input_report_key(idev, BTN_RIGHT, right); + input_report_key(data->idev, BTN_LEFT, left); + input_report_key(data->idev, BTN_RIGHT, right); } /** @@ -217,6 +227,7 @@ static int ir_mce_kbd_decode(struct rc_dev *dev, struct ir_raw_event ev) struct mce_kbd_dec *data = &dev->raw->mce_kbd; u32 scancode; unsigned long delay; + struct lirc_scancode lsc = {}; if (!is_timing_event(ev)) { if (ev.reset) @@ -228,8 +239,8 @@ static int ir_mce_kbd_decode(struct rc_dev *dev, struct ir_raw_event ev) goto out; again: - IR_dprintk(2, "started at state %i (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "started at state %i (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); if (!geq_margin(ev.duration, MCIR2_UNIT, MCIR2_UNIT / 2)) return 0; @@ -263,9 +274,6 @@ again: return 0; case STATE_HEADER_BIT_END: - if (!is_transition(&ev, &dev->raw->prev_ev)) - break; - decrease_duration(&ev, MCIR2_BIT_END); if (data->count != MCIR2_HEADER_NBITS) { @@ -281,7 +289,7 @@ again: data->wanted_bits = MCIR2_MOUSE_NBITS; break; default: - IR_dprintk(1, "not keyboard or mouse data\n"); + dev_dbg(&dev->dev, "not keyboard or mouse data\n"); goto out; } @@ -302,9 +310,6 @@ again: return 0; case STATE_BODY_BIT_END: - if (!is_transition(&ev, &dev->raw->prev_ev)) - break; - if (data->count == data->wanted_bits) data->state = STATE_FINISHED; else @@ -319,27 +324,36 @@ again: switch (data->wanted_bits) { case MCIR2_KEYBOARD_NBITS: - scancode = data->body & 0xffff; - IR_dprintk(1, "keyboard data 0x%08x\n", data->body); - if (dev->timeout) - delay = usecs_to_jiffies(dev->timeout / 1000); - else - delay = msecs_to_jiffies(100); - mod_timer(&data->rx_timeout, jiffies + delay); + scancode = data->body & 0xffffff; + dev_dbg(&dev->dev, "keyboard data 0x%08x\n", + data->body); + spin_lock(&data->keylock); + if (scancode) { + delay = nsecs_to_jiffies(dev->timeout) + + msecs_to_jiffies(100); + mod_timer(&data->rx_timeout, jiffies + delay); + } else { + del_timer(&data->rx_timeout); + } /* Pass data to keyboard buffer parser */ - ir_mce_kbd_process_keyboard_data(data->idev, scancode); + ir_mce_kbd_process_keyboard_data(dev, scancode); + spin_unlock(&data->keylock); + lsc.rc_proto = RC_PROTO_MCIR2_KBD; break; case MCIR2_MOUSE_NBITS: scancode = data->body & 0x1fffff; - IR_dprintk(1, "mouse data 0x%06x\n", scancode); + dev_dbg(&dev->dev, "mouse data 0x%06x\n", scancode); /* Pass data to mouse buffer parser */ - ir_mce_kbd_process_mouse_data(data->idev, scancode); + ir_mce_kbd_process_mouse_data(dev, scancode); + lsc.rc_proto = RC_PROTO_MCIR2_MSE; break; default: - IR_dprintk(1, "not keyboard or mouse data\n"); + dev_dbg(&dev->dev, "not keyboard or mouse data\n"); goto out; } + lsc.scancode = scancode; + ir_lirc_scancode_event(dev, &lsc); data->state = STATE_INACTIVE; input_event(data->idev, EV_MSC, MSC_SCAN, scancode); input_sync(data->idev); @@ -347,10 +361,9 @@ again: } out: - IR_dprintk(1, "failed at state %i (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "failed at state %i (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); data->state = STATE_INACTIVE; - input_sync(data->idev); return -EINVAL; } @@ -360,9 +373,6 @@ static int ir_mce_kbd_register(struct rc_dev *dev) struct input_dev *idev; int i, ret; - if (dev->driver_type == RC_DRIVER_IR_RAW_TX) - return 0; - idev = input_allocate_device(); if (!idev) return -ENOMEM; @@ -391,8 +401,8 @@ static int ir_mce_kbd_register(struct rc_dev *dev) set_bit(EV_MSC, idev->evbit); set_bit(MSC_SCAN, idev->mscbit); - setup_timer(&mce_kbd->rx_timeout, mce_kbd_rx_timeout, - (unsigned long)mce_kbd); + timer_setup(&mce_kbd->rx_timeout, mce_kbd_rx_timeout, 0); + spin_lock_init(&mce_kbd->keylock); input_set_drvdata(idev, mce_kbd); @@ -418,9 +428,6 @@ static int ir_mce_kbd_unregister(struct rc_dev *dev) struct mce_kbd_dec *mce_kbd = &dev->raw->mce_kbd; struct input_dev *idev = mce_kbd->idev; - if (dev->driver_type == RC_DRIVER_IR_RAW_TX) - return 0; - del_timer_sync(&mce_kbd->rx_timeout); input_unregister_device(idev); @@ -428,7 +435,7 @@ static int ir_mce_kbd_unregister(struct rc_dev *dev) } static const struct ir_raw_timings_manchester ir_mce_kbd_timings = { - .leader = MCIR2_PREFIX_PULSE, + .leader_pulse = MCIR2_PREFIX_PULSE, .invert = 1, .clock = MCIR2_UNIT, .trailer_space = MCIR2_UNIT * 10, @@ -456,11 +463,11 @@ static int ir_mce_kbd_encode(enum rc_proto protocol, u32 scancode, if (protocol == RC_PROTO_MCIR2_KBD) { raw = scancode | ((u64)MCIR2_KEYBOARD_HEADER << MCIR2_KEYBOARD_NBITS); - len = MCIR2_KEYBOARD_NBITS + MCIR2_HEADER_NBITS + 1; + len = MCIR2_KEYBOARD_NBITS + MCIR2_HEADER_NBITS; } else { raw = scancode | ((u64)MCIR2_MOUSE_HEADER << MCIR2_MOUSE_NBITS); - len = MCIR2_MOUSE_NBITS + MCIR2_HEADER_NBITS + 1; + len = MCIR2_MOUSE_NBITS + MCIR2_HEADER_NBITS; } ret = ir_raw_gen_manchester(&e, max, &ir_mce_kbd_timings, len, raw); @@ -476,6 +483,8 @@ static struct ir_raw_handler mce_kbd_handler = { .encode = ir_mce_kbd_encode, .raw_register = ir_mce_kbd_register, .raw_unregister = ir_mce_kbd_unregister, + .carrier = 36000, + .min_timeout = MCIR2_MAX_LEN + MCIR2_UNIT / 2, }; static int __init ir_mce_kbd_decode_init(void) diff --git a/drivers/media/rc/ir-nec-decoder.c b/drivers/media/rc/ir-nec-decoder.c index a95d09acc22a..6a8973ae3684 100644 --- a/drivers/media/rc/ir-nec-decoder.c +++ b/drivers/media/rc/ir-nec-decoder.c @@ -1,16 +1,7 @@ -/* ir-nec-decoder.c - handle NEC IR Pulse/Space protocol - * - * Copyright (C) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation version 2 of the License. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - */ +// SPDX-License-Identifier: GPL-2.0 +// ir-nec-decoder.c - handle NEC IR Pulse/Space protocol +// +// Copyright (C) 2010 by Mauro Carvalho Chehab #include #include @@ -41,7 +32,7 @@ enum nec_state { /** * ir_nec_decode() - Decode one NEC pulse or space * @dev: the struct rc_dev descriptor of the device - * @duration: the struct ir_raw_event descriptor of the pulse/space + * @ev: the struct ir_raw_event descriptor of the pulse/space * * This function returns -EINVAL if the pulse violates the state machine */ @@ -58,8 +49,8 @@ static int ir_nec_decode(struct rc_dev *dev, struct ir_raw_event ev) return 0; } - IR_dprintk(2, "NEC decode started at state %d (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "NEC decode started at state %d (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); switch (data->state) { @@ -108,13 +99,11 @@ static int ir_nec_decode(struct rc_dev *dev, struct ir_raw_event ev) break; if (data->necx_repeat && data->count == NECX_REPEAT_BITS && - geq_margin(ev.duration, - NEC_TRAILER_SPACE, NEC_UNIT / 2)) { - IR_dprintk(1, "Repeat last key\n"); - rc_repeat(dev); - data->state = STATE_INACTIVE; - return 0; - + geq_margin(ev.duration, NEC_TRAILER_SPACE, NEC_UNIT / 2)) { + dev_dbg(&dev->dev, "Repeat last key\n"); + rc_repeat(dev); + data->state = STATE_INACTIVE; + return 0; } else if (data->count > NECX_REPEAT_BITS) data->necx_repeat = false; @@ -173,8 +162,8 @@ static int ir_nec_decode(struct rc_dev *dev, struct ir_raw_event ev) return 0; } - IR_dprintk(1, "NEC decode failed at count %d state %d (%uus %s)\n", - data->count, data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "NEC decode failed at count %d state %d (%uus %s)\n", + data->count, data->state, TO_US(ev.duration), TO_STR(ev.pulse)); data->state = STATE_INACTIVE; return -EINVAL; } @@ -183,7 +172,6 @@ static int ir_nec_decode(struct rc_dev *dev, struct ir_raw_event ev) * ir_nec_scancode_to_raw() - encode an NEC scancode ready for modulation. * @protocol: specific protocol to use * @scancode: a single NEC scancode. - * @raw: raw data to be modulated. */ static u32 ir_nec_scancode_to_raw(enum rc_proto protocol, u32 scancode) { @@ -264,6 +252,8 @@ static struct ir_raw_handler nec_handler = { RC_PROTO_BIT_NEC32, .decode = ir_nec_decode, .encode = ir_nec_encode, + .carrier = 38000, + .min_timeout = NEC_TRAILER_SPACE, }; static int __init ir_nec_decode_init(void) @@ -282,7 +272,7 @@ static void __exit ir_nec_decode_exit(void) module_init(ir_nec_decode_init); module_exit(ir_nec_decode_exit); -MODULE_LICENSE("GPL"); +MODULE_LICENSE("GPL v2"); MODULE_AUTHOR("Mauro Carvalho Chehab"); MODULE_AUTHOR("Red Hat Inc. (http://www.redhat.com)"); MODULE_DESCRIPTION("NEC IR protocol decoder"); diff --git a/drivers/media/rc/ir-rc5-decoder.c b/drivers/media/rc/ir-rc5-decoder.c index 1292f534de43..63624654a71e 100644 --- a/drivers/media/rc/ir-rc5-decoder.c +++ b/drivers/media/rc/ir-rc5-decoder.c @@ -1,17 +1,8 @@ -/* ir-rc5-decoder.c - decoder for RC5(x) and StreamZap protocols - * - * Copyright (C) 2010 by Mauro Carvalho Chehab - * Copyright (C) 2010 by Jarod Wilson - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation version 2 of the License. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - */ +// SPDX-License-Identifier: GPL-2.0 +// ir-rc5-decoder.c - decoder for RC5(x) and StreamZap protocols +// +// Copyright (C) 2010 by Mauro Carvalho Chehab +// Copyright (C) 2010 by Jarod Wilson /* * This decoder handles the 14 bit RC5 protocol, 15 bit "StreamZap" protocol @@ -63,8 +54,8 @@ static int ir_rc5_decode(struct rc_dev *dev, struct ir_raw_event ev) goto out; again: - IR_dprintk(2, "RC5(x/sz) decode started at state %i (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "RC5(x/sz) decode started at state %i (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); if (!geq_margin(ev.duration, RC5_UNIT, RC5_UNIT / 2)) return 0; @@ -97,9 +88,6 @@ again: return 0; case STATE_BIT_END: - if (!is_transition(&ev, &dev->raw->prev_ev)) - break; - if (data->count == CHECK_RC5X_NBITS) data->state = STATE_CHECK_RC5X; else @@ -166,8 +154,8 @@ again: } else break; - IR_dprintk(1, "RC5(x/sz) scancode 0x%06x (p: %u, t: %u)\n", - scancode, protocol, toggle); + dev_dbg(&dev->dev, "RC5(x/sz) scancode 0x%06x (p: %u, t: %u)\n", + scancode, protocol, toggle); rc_keydown(dev, protocol, scancode, toggle); data->state = STATE_INACTIVE; @@ -175,23 +163,21 @@ again: } out: - IR_dprintk(1, "RC5(x/sz) decode failed at state %i count %d (%uus %s)\n", - data->state, data->count, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "RC5(x/sz) decode failed at state %i count %d (%uus %s)\n", + data->state, data->count, TO_US(ev.duration), TO_STR(ev.pulse)); data->state = STATE_INACTIVE; return -EINVAL; } static const struct ir_raw_timings_manchester ir_rc5_timings = { - .leader = RC5_UNIT, - .pulse_space_start = 0, + .leader_pulse = RC5_UNIT, .clock = RC5_UNIT, .trailer_space = RC5_UNIT * 10, }; static const struct ir_raw_timings_manchester ir_rc5x_timings[2] = { { - .leader = RC5_UNIT, - .pulse_space_start = 0, + .leader_pulse = RC5_UNIT, .clock = RC5_UNIT, .trailer_space = RC5X_SPACE, }, @@ -202,8 +188,7 @@ static const struct ir_raw_timings_manchester ir_rc5x_timings[2] = { }; static const struct ir_raw_timings_manchester ir_rc5_sz_timings = { - .leader = RC5_UNIT, - .pulse_space_start = 0, + .leader_pulse = RC5_UNIT, .clock = RC5_UNIT, .trailer_space = RC5_UNIT * 10, }; @@ -237,9 +222,9 @@ static int ir_rc5_encode(enum rc_proto protocol, u32 scancode, /* encode data */ data = !commandx << 12 | system << 6 | command; - /* Modulate the data */ + /* First bit is encoded by leader_pulse */ ret = ir_raw_gen_manchester(&e, max, &ir_rc5_timings, - RC5_NBITS, data); + RC5_NBITS - 1, data); if (ret < 0) return ret; } else if (protocol == RC_PROTO_RC5X_20) { @@ -252,10 +237,11 @@ static int ir_rc5_encode(enum rc_proto protocol, u32 scancode, /* encode data */ data = commandx << 18 | system << 12 | command << 6 | xdata; - /* Modulate the data */ + /* First bit is encoded by leader_pulse */ pre_space_data = data >> (RC5X_NBITS - CHECK_RC5X_NBITS); ret = ir_raw_gen_manchester(&e, max, &ir_rc5x_timings[0], - CHECK_RC5X_NBITS, pre_space_data); + CHECK_RC5X_NBITS - 1, + pre_space_data); if (ret < 0) return ret; ret = ir_raw_gen_manchester(&e, max - (e - events), @@ -266,8 +252,10 @@ static int ir_rc5_encode(enum rc_proto protocol, u32 scancode, return ret; } else if (protocol == RC_PROTO_RC5_SZ) { /* RC5-SZ scancode is raw enough for Manchester as it is */ + /* First bit is encoded by leader_pulse */ ret = ir_raw_gen_manchester(&e, max, &ir_rc5_sz_timings, - RC5_SZ_NBITS, scancode & 0x2fff); + RC5_SZ_NBITS - 1, + scancode & 0x2fff); if (ret < 0) return ret; } else { @@ -282,6 +270,8 @@ static struct ir_raw_handler rc5_handler = { RC_PROTO_BIT_RC5_SZ, .decode = ir_rc5_decode, .encode = ir_rc5_encode, + .carrier = 36000, + .min_timeout = RC5_TRAILER, }; static int __init ir_rc5_decode_init(void) @@ -300,7 +290,7 @@ static void __exit ir_rc5_decode_exit(void) module_init(ir_rc5_decode_init); module_exit(ir_rc5_decode_exit); -MODULE_LICENSE("GPL"); +MODULE_LICENSE("GPL v2"); MODULE_AUTHOR("Mauro Carvalho Chehab and Jarod Wilson"); MODULE_AUTHOR("Red Hat Inc. (http://www.redhat.com)"); MODULE_DESCRIPTION("RC5(x/sz) IR protocol decoder"); diff --git a/drivers/media/rc/ir-rc6-decoder.c b/drivers/media/rc/ir-rc6-decoder.c index 90f7930444a1..d96aed1343e4 100644 --- a/drivers/media/rc/ir-rc6-decoder.c +++ b/drivers/media/rc/ir-rc6-decoder.c @@ -101,8 +101,8 @@ static int ir_rc6_decode(struct rc_dev *dev, struct ir_raw_event ev) goto out; again: - IR_dprintk(2, "RC6 decode started at state %i (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "RC6 decode started at state %i (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); if (!geq_margin(ev.duration, RC6_UNIT, RC6_UNIT / 2)) return 0; @@ -146,9 +146,6 @@ again: return 0; case STATE_HEADER_BIT_END: - if (!is_transition(&ev, &dev->raw->prev_ev)) - break; - if (data->count == RC6_HEADER_NBITS) data->state = STATE_TOGGLE_START; else @@ -166,12 +163,8 @@ again: return 0; case STATE_TOGGLE_END: - if (!is_transition(&ev, &dev->raw->prev_ev) || - !geq_margin(ev.duration, RC6_TOGGLE_END, RC6_UNIT / 2)) - break; - if (!(data->header & RC6_STARTBIT_MASK)) { - IR_dprintk(1, "RC6 invalid start bit\n"); + dev_dbg(&dev->dev, "RC6 invalid start bit\n"); break; } @@ -188,7 +181,7 @@ again: data->wanted_bits = RC6_6A_NBITS; break; default: - IR_dprintk(1, "RC6 unknown mode\n"); + dev_dbg(&dev->dev, "RC6 unknown mode\n"); goto out; } goto again; @@ -211,9 +204,6 @@ again: break; case STATE_BODY_BIT_END: - if (!is_transition(&ev, &dev->raw->prev_ev)) - break; - if (data->count == data->wanted_bits) data->state = STATE_FINISHED; else @@ -231,13 +221,13 @@ again: scancode = data->body; toggle = data->toggle; protocol = RC_PROTO_RC6_0; - IR_dprintk(1, "RC6(0) scancode 0x%04x (toggle: %u)\n", - scancode, toggle); + dev_dbg(&dev->dev, "RC6(0) scancode 0x%04x (toggle: %u)\n", + scancode, toggle); break; case RC6_MODE_6A: if (data->count > CHAR_BIT * sizeof data->body) { - IR_dprintk(1, "RC6 too many (%u) data bits\n", + dev_dbg(&dev->dev, "RC6 too many (%u) data bits\n", data->count); goto out; } @@ -267,15 +257,15 @@ again: } break; default: - IR_dprintk(1, "RC6(6A) unsupported length\n"); + dev_dbg(&dev->dev, "RC6(6A) unsupported length\n"); goto out; } - IR_dprintk(1, "RC6(6A) proto 0x%04x, scancode 0x%08x (toggle: %u)\n", - protocol, scancode, toggle); + dev_dbg(&dev->dev, "RC6(6A) proto 0x%04x, scancode 0x%08x (toggle: %u)\n", + protocol, scancode, toggle); break; default: - IR_dprintk(1, "RC6 unknown mode\n"); + dev_dbg(&dev->dev, "RC6 unknown mode\n"); goto out; } @@ -285,21 +275,16 @@ again: } out: - IR_dprintk(1, "RC6 decode failed at state %i (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "RC6 decode failed at state %i (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); data->state = STATE_INACTIVE; return -EINVAL; } static const struct ir_raw_timings_manchester ir_rc6_timings[4] = { { - .leader = RC6_PREFIX_PULSE, - .pulse_space_start = 0, - .clock = RC6_UNIT, - .invert = 1, - .trailer_space = RC6_PREFIX_SPACE, - }, - { + .leader_pulse = RC6_PREFIX_PULSE, + .leader_space = RC6_PREFIX_SPACE, .clock = RC6_UNIT, .invert = 1, }, @@ -334,27 +319,22 @@ static int ir_rc6_encode(enum rc_proto protocol, u32 scancode, struct ir_raw_event *e = events; if (protocol == RC_PROTO_RC6_0) { - /* Modulate the preamble */ - ret = ir_raw_gen_manchester(&e, max, &ir_rc6_timings[0], 0, 0); - if (ret < 0) - return ret; - /* Modulate the header (Start Bit & Mode-0) */ ret = ir_raw_gen_manchester(&e, max - (e - events), - &ir_rc6_timings[1], + &ir_rc6_timings[0], RC6_HEADER_NBITS, (1 << 3)); if (ret < 0) return ret; /* Modulate Trailer Bit */ ret = ir_raw_gen_manchester(&e, max - (e - events), - &ir_rc6_timings[2], 1, 0); + &ir_rc6_timings[1], 1, 0); if (ret < 0) return ret; /* Modulate rest of the data */ ret = ir_raw_gen_manchester(&e, max - (e - events), - &ir_rc6_timings[3], RC6_0_NBITS, + &ir_rc6_timings[2], RC6_0_NBITS, scancode); if (ret < 0) return ret; @@ -377,27 +357,22 @@ static int ir_rc6_encode(enum rc_proto protocol, u32 scancode, return -EINVAL; } - /* Modulate the preamble */ - ret = ir_raw_gen_manchester(&e, max, &ir_rc6_timings[0], 0, 0); - if (ret < 0) - return ret; - /* Modulate the header (Start Bit & Header-version 6 */ ret = ir_raw_gen_manchester(&e, max - (e - events), - &ir_rc6_timings[1], + &ir_rc6_timings[0], RC6_HEADER_NBITS, (1 << 3 | 6)); if (ret < 0) return ret; /* Modulate Trailer Bit */ ret = ir_raw_gen_manchester(&e, max - (e - events), - &ir_rc6_timings[2], 1, 0); + &ir_rc6_timings[1], 1, 0); if (ret < 0) return ret; /* Modulate rest of the data */ ret = ir_raw_gen_manchester(&e, max - (e - events), - &ir_rc6_timings[3], + &ir_rc6_timings[2], bits, scancode); if (ret < 0) @@ -413,6 +388,8 @@ static struct ir_raw_handler rc6_handler = { RC_PROTO_BIT_RC6_MCE, .decode = ir_rc6_decode, .encode = ir_rc6_encode, + .carrier = 36000, + .min_timeout = RC6_SUFFIX_SPACE, }; static int __init ir_rc6_decode_init(void) diff --git a/drivers/media/rc/ir-rx51.c b/drivers/media/rc/ir-rx51.c index 49265f02e772..8a93f7468622 100644 --- a/drivers/media/rc/ir-rx51.c +++ b/drivers/media/rc/ir-rx51.c @@ -22,7 +22,6 @@ #include #include -#include #define WBUF_LEN 256 @@ -31,7 +30,6 @@ struct ir_rx51 { struct pwm_device *pwm; struct hrtimer timer; struct device *dev; - struct ir_rx51_platform_data *pdata; wait_queue_head_t wqueue; unsigned int freq; /* carrier frequency */ @@ -130,10 +128,9 @@ static int ir_rx51_tx(struct rc_dev *dev, unsigned int *buffer, ir_rx51->wbuf[count] = -1; /* Insert termination mark */ /* - * Adjust latency requirements so the device doesn't go in too - * deep sleep states + * REVISIT: Adjust latency requirements so the device doesn't go in too + * deep sleep states with pm_qos_add_request(). */ - ir_rx51->pdata->set_max_mpu_wakeup_lat(ir_rx51->dev, 50); ir_rx51_on(ir_rx51); ir_rx51->wbuf_index = 1; @@ -146,8 +143,7 @@ static int ir_rx51_tx(struct rc_dev *dev, unsigned int *buffer, */ wait_event_interruptible(ir_rx51->wqueue, ir_rx51->wbuf_index < 0); - /* We can sleep again */ - ir_rx51->pdata->set_max_mpu_wakeup_lat(ir_rx51->dev, -1); + /* REVISIT: Remove pm_qos constraint, we can sleep again */ return count; } @@ -244,13 +240,6 @@ static int ir_rx51_probe(struct platform_device *dev) struct pwm_device *pwm; struct rc_dev *rcdev; - ir_rx51.pdata = dev->dev.platform_data; - - if (!ir_rx51.pdata) { - dev_err(&dev->dev, "Platform Data is missing\n"); - return -ENXIO; - } - pwm = pwm_get(&dev->dev, NULL); if (IS_ERR(pwm)) { int err = PTR_ERR(pwm); diff --git a/drivers/media/rc/ir-sanyo-decoder.c b/drivers/media/rc/ir-sanyo-decoder.c index 758c60956850..dd6ee1e339d6 100644 --- a/drivers/media/rc/ir-sanyo-decoder.c +++ b/drivers/media/rc/ir-sanyo-decoder.c @@ -1,25 +1,16 @@ -/* ir-sanyo-decoder.c - handle SANYO IR Pulse/Space protocol - * - * Copyright (C) 2011 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation version 2 of the License. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * This protocol uses the NEC protocol timings. However, data is formatted as: - * 13 bits Custom Code - * 13 bits NOT(Custom Code) - * 8 bits Key data - * 8 bits NOT(Key data) - * - * According with LIRC, this protocol is used on Sanyo, Aiwa and Chinon - * Information for this protocol is available at the Sanyo LC7461 datasheet. - */ +// SPDX-License-Identifier: GPL-2.0 +// ir-sanyo-decoder.c - handle SANYO IR Pulse/Space protocol +// +// Copyright (C) 2011 by Mauro Carvalho Chehab +// +// This protocol uses the NEC protocol timings. However, data is formatted as: +// 13 bits Custom Code +// 13 bits NOT(Custom Code) +// 8 bits Key data +// 8 bits NOT(Key data) +// +// According with LIRC, this protocol is used on Sanyo, Aiwa and Chinon +// Information for this protocol is available at the Sanyo LC7461 datasheet. #include #include @@ -48,7 +39,7 @@ enum sanyo_state { /** * ir_sanyo_decode() - Decode one SANYO pulse or space * @dev: the struct rc_dev descriptor of the device - * @duration: the struct ir_raw_event descriptor of the pulse/space + * @ev: the struct ir_raw_event descriptor of the pulse/space * * This function returns -EINVAL if the pulse violates the state machine */ @@ -61,14 +52,14 @@ static int ir_sanyo_decode(struct rc_dev *dev, struct ir_raw_event ev) if (!is_timing_event(ev)) { if (ev.reset) { - IR_dprintk(1, "SANYO event reset received. reset to state 0\n"); + dev_dbg(&dev->dev, "SANYO event reset received. reset to state 0\n"); data->state = STATE_INACTIVE; } return 0; } - IR_dprintk(2, "SANYO decode started at state %d (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "SANYO decode started at state %d (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); switch (data->state) { @@ -111,7 +102,7 @@ static int ir_sanyo_decode(struct rc_dev *dev, struct ir_raw_event ev) if (!data->count && geq_margin(ev.duration, SANYO_REPEAT_SPACE, SANYO_UNIT / 2)) { rc_repeat(dev); - IR_dprintk(1, "SANYO repeat last key\n"); + dev_dbg(&dev->dev, "SANYO repeat last key\n"); data->state = STATE_INACTIVE; return 0; } @@ -153,21 +144,21 @@ static int ir_sanyo_decode(struct rc_dev *dev, struct ir_raw_event ev) not_command = bitrev8((data->bits >> 0) & 0xff); if ((command ^ not_command) != 0xff) { - IR_dprintk(1, "SANYO checksum error: received 0x%08Lx\n", - data->bits); + dev_dbg(&dev->dev, "SANYO checksum error: received 0x%08llx\n", + data->bits); data->state = STATE_INACTIVE; return 0; } scancode = address << 8 | command; - IR_dprintk(1, "SANYO scancode: 0x%06x\n", scancode); + dev_dbg(&dev->dev, "SANYO scancode: 0x%06x\n", scancode); rc_keydown(dev, RC_PROTO_SANYO, scancode, 0); data->state = STATE_INACTIVE; return 0; } - IR_dprintk(1, "SANYO decode failed at count %d state %d (%uus %s)\n", - data->count, data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "SANYO decode failed at count %d state %d (%uus %s)\n", + data->count, data->state, TO_US(ev.duration), TO_STR(ev.pulse)); data->state = STATE_INACTIVE; return -EINVAL; } @@ -218,6 +209,8 @@ static struct ir_raw_handler sanyo_handler = { .protocols = RC_PROTO_BIT_SANYO, .decode = ir_sanyo_decode, .encode = ir_sanyo_encode, + .carrier = 38000, + .min_timeout = SANYO_TRAILER_SPACE, }; static int __init ir_sanyo_decode_init(void) @@ -236,7 +229,7 @@ static void __exit ir_sanyo_decode_exit(void) module_init(ir_sanyo_decode_init); module_exit(ir_sanyo_decode_exit); -MODULE_LICENSE("GPL"); +MODULE_LICENSE("GPL v2"); MODULE_AUTHOR("Mauro Carvalho Chehab"); MODULE_AUTHOR("Red Hat Inc. (http://www.redhat.com)"); MODULE_DESCRIPTION("SANYO IR protocol decoder"); diff --git a/drivers/media/rc/ir-sharp-decoder.c b/drivers/media/rc/ir-sharp-decoder.c index 61d313be5bbf..dbddf987df97 100644 --- a/drivers/media/rc/ir-sharp-decoder.c +++ b/drivers/media/rc/ir-sharp-decoder.c @@ -41,7 +41,7 @@ enum sharp_state { /** * ir_sharp_decode() - Decode one Sharp pulse or space * @dev: the struct rc_dev descriptor of the device - * @duration: the struct ir_raw_event descriptor of the pulse/space + * @ev: the struct ir_raw_event descriptor of the pulse/space * * This function returns -EINVAL if the pulse violates the state machine */ @@ -56,8 +56,8 @@ static int ir_sharp_decode(struct rc_dev *dev, struct ir_raw_event ev) return 0; } - IR_dprintk(2, "Sharp decode started at state %d (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "Sharp decode started at state %d (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); switch (data->state) { @@ -151,9 +151,9 @@ static int ir_sharp_decode(struct rc_dev *dev, struct ir_raw_event ev) msg = (data->bits >> 15) & 0x7fff; echo = data->bits & 0x7fff; if ((msg ^ echo) != 0x3ff) { - IR_dprintk(1, - "Sharp checksum error: received 0x%04x, 0x%04x\n", - msg, echo); + dev_dbg(&dev->dev, + "Sharp checksum error: received 0x%04x, 0x%04x\n", + msg, echo); break; } @@ -161,16 +161,15 @@ static int ir_sharp_decode(struct rc_dev *dev, struct ir_raw_event ev) command = bitrev8((msg >> 2) & 0xff); scancode = address << 8 | command; - IR_dprintk(1, "Sharp scancode 0x%04x\n", scancode); + dev_dbg(&dev->dev, "Sharp scancode 0x%04x\n", scancode); rc_keydown(dev, RC_PROTO_SHARP, scancode, 0); data->state = STATE_INACTIVE; return 0; } - IR_dprintk(1, "Sharp decode failed at count %d state %d (%uus %s)\n", - data->count, data->state, TO_US(ev.duration), - TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "Sharp decode failed at count %d state %d (%uus %s)\n", + data->count, data->state, TO_US(ev.duration), TO_STR(ev.pulse)); data->state = STATE_INACTIVE; return -EINVAL; } @@ -228,6 +227,8 @@ static struct ir_raw_handler sharp_handler = { .protocols = RC_PROTO_BIT_SHARP, .decode = ir_sharp_decode, .encode = ir_sharp_encode, + .carrier = 38000, + .min_timeout = SHARP_ECHO_SPACE + SHARP_ECHO_SPACE / 4, }; static int __init ir_sharp_decode_init(void) diff --git a/drivers/media/rc/ir-sony-decoder.c b/drivers/media/rc/ir-sony-decoder.c index a47ced763031..5065c081238d 100644 --- a/drivers/media/rc/ir-sony-decoder.c +++ b/drivers/media/rc/ir-sony-decoder.c @@ -55,8 +55,8 @@ static int ir_sony_decode(struct rc_dev *dev, struct ir_raw_event ev) if (!geq_margin(ev.duration, SONY_UNIT, SONY_UNIT / 2)) goto out; - IR_dprintk(2, "Sony decode started at state %d (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "Sony decode started at state %d (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); switch (data->state) { @@ -148,19 +148,21 @@ static int ir_sony_decode(struct rc_dev *dev, struct ir_raw_event ev) protocol = RC_PROTO_SONY20; break; default: - IR_dprintk(1, "Sony invalid bitcount %u\n", data->count); + dev_dbg(&dev->dev, "Sony invalid bitcount %u\n", + data->count); goto out; } scancode = device << 16 | subdevice << 8 | function; - IR_dprintk(1, "Sony(%u) scancode 0x%05x\n", data->count, scancode); + dev_dbg(&dev->dev, "Sony(%u) scancode 0x%05x\n", data->count, + scancode); rc_keydown(dev, protocol, scancode, 0); goto finish_state_machine; } out: - IR_dprintk(1, "Sony decode failed at state %d (%uus %s)\n", - data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "Sony decode failed at state %d (%uus %s)\n", + data->state, TO_US(ev.duration), TO_STR(ev.pulse)); data->state = STATE_INACTIVE; return -EINVAL; @@ -221,6 +223,8 @@ static struct ir_raw_handler sony_handler = { RC_PROTO_BIT_SONY20, .decode = ir_sony_decode, .encode = ir_sony_encode, + .carrier = 40000, + .min_timeout = SONY_TRAILER_SPACE, }; static int __init ir_sony_decode_init(void) diff --git a/drivers/media/rc/ir-spi.c b/drivers/media/rc/ir-spi.c index cbe585f95715..c58f2d38a458 100644 --- a/drivers/media/rc/ir-spi.c +++ b/drivers/media/rc/ir-spi.c @@ -1,13 +1,8 @@ -/* - * Copyright (c) 2016 Samsung Electronics Co., Ltd. - * Author: Andi Shyti - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * SPI driven IR LED device driver - */ +// SPDX-License-Identifier: GPL-2.0 +// SPI driven IR LED device driver +// +// Copyright (c) 2016 Samsung Electronics Co., Ltd. +// Copyright (c) Andi Shyti #include #include @@ -20,21 +15,11 @@ #define IR_SPI_DRIVER_NAME "ir-spi" -/* pulse value for different duty cycles */ -#define IR_SPI_PULSE_DC_50 0xff00 -#define IR_SPI_PULSE_DC_60 0xfc00 -#define IR_SPI_PULSE_DC_70 0xf800 -#define IR_SPI_PULSE_DC_75 0xf000 -#define IR_SPI_PULSE_DC_80 0xc000 -#define IR_SPI_PULSE_DC_90 0x8000 - #define IR_SPI_DEFAULT_FREQUENCY 38000 -#define IR_SPI_BIT_PER_WORD 8 #define IR_SPI_MAX_BUFSIZE 4096 struct ir_spi_data { u32 freq; - u8 duty_cycle; bool negated; u16 tx_buf[IR_SPI_MAX_BUFSIZE]; @@ -110,19 +95,9 @@ static int ir_spi_set_tx_carrier(struct rc_dev *dev, u32 carrier) static int ir_spi_set_duty_cycle(struct rc_dev *dev, u32 duty_cycle) { struct ir_spi_data *idata = dev->priv; + int bits = (duty_cycle * 15) / 100; - if (duty_cycle >= 90) - idata->pulse = IR_SPI_PULSE_DC_90; - else if (duty_cycle >= 80) - idata->pulse = IR_SPI_PULSE_DC_80; - else if (duty_cycle >= 75) - idata->pulse = IR_SPI_PULSE_DC_75; - else if (duty_cycle >= 70) - idata->pulse = IR_SPI_PULSE_DC_70; - else if (duty_cycle >= 60) - idata->pulse = IR_SPI_PULSE_DC_60; - else - idata->pulse = IR_SPI_PULSE_DC_50; + idata->pulse = GENMASK(bits, 0); if (idata->negated) { idata->pulse = ~idata->pulse; @@ -199,6 +174,6 @@ static struct spi_driver ir_spi_driver = { module_spi_driver(ir_spi_driver); -MODULE_AUTHOR("Andi Shyti "); +MODULE_AUTHOR("Andi Shyti "); MODULE_DESCRIPTION("SPI IR LED"); MODULE_LICENSE("GPL v2"); diff --git a/drivers/media/rc/ir-xmp-decoder.c b/drivers/media/rc/ir-xmp-decoder.c index 6f464be1c8d7..c965f51df1c1 100644 --- a/drivers/media/rc/ir-xmp-decoder.c +++ b/drivers/media/rc/ir-xmp-decoder.c @@ -35,7 +35,7 @@ enum xmp_state { /** * ir_xmp_decode() - Decode one XMP pulse or space * @dev: the struct rc_dev descriptor of the device - * @duration: the struct ir_raw_event descriptor of the pulse/space + * @ev: the struct ir_raw_event descriptor of the pulse/space * * This function returns -EINVAL if the pulse violates the state machine */ @@ -49,8 +49,8 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) return 0; } - IR_dprintk(2, "XMP decode started at state %d %d (%uus %s)\n", - data->state, data->count, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "XMP decode started at state %d %d (%uus %s)\n", + data->state, data->count, TO_US(ev.duration), TO_STR(ev.pulse)); switch (data->state) { @@ -85,7 +85,7 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) u32 scancode; if (data->count != 16) { - IR_dprintk(2, "received TRAILER period at index %d: %u\n", + dev_dbg(&dev->dev, "received TRAILER period at index %d: %u\n", data->count, ev.duration); data->state = STATE_INACTIVE; return -EINVAL; @@ -99,7 +99,8 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) */ divider = (n[3] - XMP_NIBBLE_PREFIX) / 15 - 2000; if (divider < 50) { - IR_dprintk(2, "divider to small %d.\n", divider); + dev_dbg(&dev->dev, "divider to small %d.\n", + divider); data->state = STATE_INACTIVE; return -EINVAL; } @@ -113,7 +114,7 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) n[12] + n[13] + n[14] + n[15]) % 16; if (sum1 != 15 || sum2 != 15) { - IR_dprintk(2, "checksum errors sum1=0x%X sum2=0x%X\n", + dev_dbg(&dev->dev, "checksum errors sum1=0x%X sum2=0x%X\n", sum1, sum2); data->state = STATE_INACTIVE; return -EINVAL; @@ -127,24 +128,24 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) obc1 = n[12] << 4 | n[13]; obc2 = n[14] << 4 | n[15]; if (subaddr != subaddr2) { - IR_dprintk(2, "subaddress nibbles mismatch 0x%02X != 0x%02X\n", + dev_dbg(&dev->dev, "subaddress nibbles mismatch 0x%02X != 0x%02X\n", subaddr, subaddr2); data->state = STATE_INACTIVE; return -EINVAL; } if (oem != 0x44) - IR_dprintk(1, "Warning: OEM nibbles 0x%02X. Expected 0x44\n", + dev_dbg(&dev->dev, "Warning: OEM nibbles 0x%02X. Expected 0x44\n", oem); scancode = addr << 24 | subaddr << 16 | obc1 << 8 | obc2; - IR_dprintk(1, "XMP scancode 0x%06x\n", scancode); + dev_dbg(&dev->dev, "XMP scancode 0x%06x\n", scancode); if (toggle == 0) { rc_keydown(dev, RC_PROTO_XMP, scancode, 0); } else { rc_repeat(dev); - IR_dprintk(1, "Repeat last key\n"); + dev_dbg(&dev->dev, "Repeat last key\n"); } data->state = STATE_INACTIVE; @@ -153,7 +154,7 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) } else if (geq_margin(ev.duration, XMP_HALFFRAME_SPACE, XMP_NIBBLE_PREFIX)) { /* Expect 8 or 16 nibble pulses. 16 in case of 'final' frame */ if (data->count == 16) { - IR_dprintk(2, "received half frame pulse at index %d. Probably a final frame key-up event: %u\n", + dev_dbg(&dev->dev, "received half frame pulse at index %d. Probably a final frame key-up event: %u\n", data->count, ev.duration); /* * TODO: for now go back to half frame position @@ -164,7 +165,7 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) } else if (data->count != 8) - IR_dprintk(2, "received half frame pulse at index %d: %u\n", + dev_dbg(&dev->dev, "received half frame pulse at index %d: %u\n", data->count, ev.duration); data->state = STATE_LEADER_PULSE; @@ -173,7 +174,7 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) } else if (geq_margin(ev.duration, XMP_NIBBLE_PREFIX, XMP_UNIT)) { /* store nibble raw data, decode after trailer */ if (data->count == 16) { - IR_dprintk(2, "to many pulses (%d) ignoring: %u\n", + dev_dbg(&dev->dev, "to many pulses (%d) ignoring: %u\n", data->count, ev.duration); data->state = STATE_INACTIVE; return -EINVAL; @@ -189,8 +190,8 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) break; } - IR_dprintk(1, "XMP decode failed at count %d state %d (%uus %s)\n", - data->count, data->state, TO_US(ev.duration), TO_STR(ev.pulse)); + dev_dbg(&dev->dev, "XMP decode failed at count %d state %d (%uus %s)\n", + data->count, data->state, TO_US(ev.duration), TO_STR(ev.pulse)); data->state = STATE_INACTIVE; return -EINVAL; } @@ -198,6 +199,7 @@ static int ir_xmp_decode(struct rc_dev *dev, struct ir_raw_event ev) static struct ir_raw_handler xmp_handler = { .protocols = RC_PROTO_BIT_XMP, .decode = ir_xmp_decode, + .min_timeout = XMP_TRAILER_SPACE, }; static int __init ir_xmp_decode_init(void) diff --git a/drivers/media/rc/ite-cir.c b/drivers/media/rc/ite-cir.c index 7bc38e805acb..679dd78382d7 100644 --- a/drivers/media/rc/ite-cir.c +++ b/drivers/media/rc/ite-cir.c @@ -1567,9 +1567,11 @@ static int ite_probe(struct pnp_dev *pdev, const struct pnp_device_id rdev->close = ite_close; rdev->s_idle = ite_s_idle; rdev->s_rx_carrier_range = ite_set_rx_carrier_range; - rdev->min_timeout = ITE_MIN_IDLE_TIMEOUT; - rdev->max_timeout = ITE_MAX_IDLE_TIMEOUT; - rdev->timeout = ITE_IDLE_TIMEOUT; + /* FIFO threshold is 17 bytes, so 17 * 8 samples minimum */ + rdev->min_timeout = 17 * 8 * ITE_BAUDRATE_DIVISOR * + itdev->params.sample_period; + rdev->timeout = IR_DEFAULT_TIMEOUT; + rdev->max_timeout = 10 * IR_DEFAULT_TIMEOUT; rdev->rx_resolution = ITE_BAUDRATE_DIVISOR * itdev->params.sample_period; rdev->tx_resolution = ITE_BAUDRATE_DIVISOR * diff --git a/drivers/media/rc/ite-cir.h b/drivers/media/rc/ite-cir.h index 0e8ebc880d1f..9cb24ac01350 100644 --- a/drivers/media/rc/ite-cir.h +++ b/drivers/media/rc/ite-cir.h @@ -154,13 +154,6 @@ struct ite_dev { /* default carrier freq for when demodulator is off (Hz) */ #define ITE_DEFAULT_CARRIER_FREQ 38000 -/* default idling timeout in ns (0.2 seconds) */ -#define ITE_IDLE_TIMEOUT 200000000UL - -/* limit timeout values */ -#define ITE_MIN_IDLE_TIMEOUT 100000000UL -#define ITE_MAX_IDLE_TIMEOUT 1000000000UL - /* convert bits to us */ #define ITE_BITS_TO_NS(bits, sample_period) \ ((u32) ((bits) * ITE_BAUDRATE_DIVISOR * sample_period)) diff --git a/drivers/media/rc/keymaps/Makefile b/drivers/media/rc/keymaps/Makefile index 2d0b26bf2051..d6b913a3032d 100644 --- a/drivers/media/rc/keymaps/Makefile +++ b/drivers/media/rc/keymaps/Makefile @@ -3,6 +3,7 @@ obj-$(CONFIG_RC_MAP) += rc-adstech-dvb-t-pci.o \ rc-alink-dtu-m.o \ rc-anysee.o \ rc-apac-viewcomp.o \ + rc-astrometa-t2hybrid.o \ rc-asus-pc39.o \ rc-asus-ps3-100.o \ rc-ati-tv-wonder-hd-600.o \ @@ -48,8 +49,11 @@ obj-$(CONFIG_RC_MAP) += rc-adstech-dvb-t-pci.o \ rc-geekbox.o \ rc-genius-tvgo-a11mce.o \ rc-gotview7135.o \ + rc-hisi-poplar.o \ + rc-hisi-tv-demo.o \ rc-imon-mce.o \ rc-imon-pad.o \ + rc-imon-rsc.o \ rc-iodata-bctv7e.o \ rc-it913x-v1.o \ rc-it913x-v2.o \ @@ -89,6 +93,7 @@ obj-$(CONFIG_RC_MAP) += rc-adstech-dvb-t-pci.o \ rc-reddo.o \ rc-snapstream-firefly.o \ rc-streamzap.o \ + rc-tango.o \ rc-tbs-nec.o \ rc-technisat-ts35.o \ rc-technisat-usb2.o \ diff --git a/drivers/media/rc/keymaps/rc-adstech-dvb-t-pci.c b/drivers/media/rc/keymaps/rc-adstech-dvb-t-pci.c index 2d303c2cee3b..732687ce0637 100644 --- a/drivers/media/rc/keymaps/rc-adstech-dvb-t-pci.c +++ b/drivers/media/rc/keymaps/rc-adstech-dvb-t-pci.c @@ -1,14 +1,9 @@ -/* adstech-dvb-t-pci.h - Keytable for adstech_dvb_t_pci Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// adstech-dvb-t-pci.h - Keytable for adstech_dvb_t_pci Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-apac-viewcomp.c b/drivers/media/rc/keymaps/rc-apac-viewcomp.c index 65bc8957d9c3..af2e7fdc7b85 100644 --- a/drivers/media/rc/keymaps/rc-apac-viewcomp.c +++ b/drivers/media/rc/keymaps/rc-apac-viewcomp.c @@ -1,14 +1,9 @@ -/* apac-viewcomp.h - Keytable for apac_viewcomp Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// apac-viewcomp.h - Keytable for apac_viewcomp Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-astrometa-t2hybrid.c b/drivers/media/rc/keymaps/rc-astrometa-t2hybrid.c new file mode 100644 index 000000000000..51690960fec4 --- /dev/null +++ b/drivers/media/rc/keymaps/rc-astrometa-t2hybrid.c @@ -0,0 +1,70 @@ +/* + * Keytable for the Astrometa T2hybrid remote controller + * + * Copyright (C) 2017 Oleh Kravchenko + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see . + */ + +#include +#include + +static struct rc_map_table t2hybrid[] = { + { 0x4d, KEY_POWER2 }, + { 0x54, KEY_VIDEO }, /* Source */ + { 0x16, KEY_MUTE }, + + { 0x4c, KEY_RECORD }, + { 0x05, KEY_CHANNELUP }, + { 0x0c, KEY_TIME}, /* Timeshift */ + + { 0x0a, KEY_VOLUMEDOWN }, + { 0x40, KEY_ZOOM }, /* Fullscreen */ + { 0x1e, KEY_VOLUMEUP }, + + { 0x12, KEY_0 }, + { 0x02, KEY_CHANNELDOWN }, + { 0x1c, KEY_AGAIN }, /* Recall */ + + { 0x09, KEY_1 }, + { 0x1d, KEY_2 }, + { 0x1f, KEY_3 }, + + { 0x0d, KEY_4 }, + { 0x19, KEY_5 }, + { 0x1b, KEY_6 }, + + { 0x11, KEY_7 }, + { 0x15, KEY_8 }, + { 0x17, KEY_9 }, +}; + +static struct rc_map_list t2hybrid_map = { + .map = { + .scan = t2hybrid, + .size = ARRAY_SIZE(t2hybrid), + .rc_proto = RC_PROTO_NEC, + .name = RC_MAP_ASTROMETA_T2HYBRID, + } +}; + +static int __init init_rc_map_t2hybrid(void) +{ + return rc_map_register(&t2hybrid_map); +} + +static void __exit exit_rc_map_t2hybrid(void) +{ + rc_map_unregister(&t2hybrid_map); +} + +module_init(init_rc_map_t2hybrid) +module_exit(exit_rc_map_t2hybrid) + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Oleh Kravchenko "); diff --git a/drivers/media/rc/keymaps/rc-asus-pc39.c b/drivers/media/rc/keymaps/rc-asus-pc39.c index 530e1d1158d1..13a935c3ac59 100644 --- a/drivers/media/rc/keymaps/rc-asus-pc39.c +++ b/drivers/media/rc/keymaps/rc-asus-pc39.c @@ -1,14 +1,9 @@ -/* asus-pc39.h - Keytable for asus_pc39 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// asus-pc39.h - Keytable for asus_pc39 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-asus-ps3-100.c b/drivers/media/rc/keymaps/rc-asus-ps3-100.c index c91ba332984c..7f836fcc68ac 100644 --- a/drivers/media/rc/keymaps/rc-asus-ps3-100.c +++ b/drivers/media/rc/keymaps/rc-asus-ps3-100.c @@ -1,14 +1,9 @@ -/* asus-ps3-100.h - Keytable for asus_ps3_100 Remote Controller - * - * Copyright (c) 2012 by Mauro Carvalho Chehab - * - * Based on a previous patch from Remi Schwartz - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// asus-ps3-100.h - Keytable for asus_ps3_100 Remote Controller +// +// Copyright (c) 2012 by Mauro Carvalho Chehab +// +// Based on a previous patch from Remi Schwartz #include #include diff --git a/drivers/media/rc/keymaps/rc-ati-tv-wonder-hd-600.c b/drivers/media/rc/keymaps/rc-ati-tv-wonder-hd-600.c index 11b4bdd2392b..b4b7932c0c5a 100644 --- a/drivers/media/rc/keymaps/rc-ati-tv-wonder-hd-600.c +++ b/drivers/media/rc/keymaps/rc-ati-tv-wonder-hd-600.c @@ -1,14 +1,9 @@ -/* ati-tv-wonder-hd-600.h - Keytable for ati_tv_wonder_hd_600 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// ati-tv-wonder-hd-600.h - Keytable for ati_tv_wonder_hd_600 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-avermedia-a16d.c b/drivers/media/rc/keymaps/rc-avermedia-a16d.c index 510dc90ebf49..5549c043cfe4 100644 --- a/drivers/media/rc/keymaps/rc-avermedia-a16d.c +++ b/drivers/media/rc/keymaps/rc-avermedia-a16d.c @@ -1,14 +1,9 @@ -/* avermedia-a16d.h - Keytable for avermedia_a16d Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// avermedia-a16d.h - Keytable for avermedia_a16d Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-avermedia-cardbus.c b/drivers/media/rc/keymaps/rc-avermedia-cardbus.c index 4bbc1e68d1b8..74edcd82e685 100644 --- a/drivers/media/rc/keymaps/rc-avermedia-cardbus.c +++ b/drivers/media/rc/keymaps/rc-avermedia-cardbus.c @@ -1,14 +1,9 @@ -/* avermedia-cardbus.h - Keytable for avermedia_cardbus Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// avermedia-cardbus.h - Keytable for avermedia_cardbus Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-avermedia-dvbt.c b/drivers/media/rc/keymaps/rc-avermedia-dvbt.c index f6b8547dbad3..796184160a48 100644 --- a/drivers/media/rc/keymaps/rc-avermedia-dvbt.c +++ b/drivers/media/rc/keymaps/rc-avermedia-dvbt.c @@ -1,14 +1,9 @@ -/* avermedia-dvbt.h - Keytable for avermedia_dvbt Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// avermedia-dvbt.h - Keytable for avermedia_dvbt Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-avermedia-m135a.c b/drivers/media/rc/keymaps/rc-avermedia-m135a.c index 9882e2cde975..d275d98d066a 100644 --- a/drivers/media/rc/keymaps/rc-avermedia-m135a.c +++ b/drivers/media/rc/keymaps/rc-avermedia-m135a.c @@ -1,13 +1,8 @@ -/* avermedia-m135a.c - Keytable for Avermedia M135A Remote Controllers - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * Copyright (c) 2010 by Herton Ronaldo Krzesinski - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// avermedia-m135a.c - Keytable for Avermedia M135A Remote Controllers +// +// Copyright (c) 2010 by Mauro Carvalho Chehab +// Copyright (c) 2010 by Herton Ronaldo Krzesinski #include #include @@ -17,7 +12,7 @@ * * On Avermedia M135A with IR model RM-JX, the same codes exist on both * Positivo (BR) and original IR, initial version and remote control codes - * added by Mauro Carvalho Chehab + * added by Mauro Carvalho Chehab * * Positivo also ships Avermedia M135A with model RM-K6, extra control * codes added by Herton Ronaldo Krzesinski @@ -43,7 +38,8 @@ static struct rc_map_table avermedia_m135a[] = { { 0x0213, KEY_RIGHT }, /* -> or L */ { 0x0212, KEY_LEFT }, /* <- or R */ - { 0x0217, KEY_SLEEP }, /* Capturar Imagem or Snapshot */ + { 0x0215, KEY_MENU }, + { 0x0217, KEY_CAMERA }, /* Capturar Imagem or Snapshot */ { 0x0210, KEY_SHUFFLE }, /* Amostra or 16 chan prev */ { 0x0303, KEY_CHANNELUP }, diff --git a/drivers/media/rc/keymaps/rc-avermedia.c b/drivers/media/rc/keymaps/rc-avermedia.c index 6503f11c7df5..631ff52564f0 100644 --- a/drivers/media/rc/keymaps/rc-avermedia.c +++ b/drivers/media/rc/keymaps/rc-avermedia.c @@ -1,14 +1,9 @@ -/* avermedia.h - Keytable for avermedia Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// avermedia.h - Keytable for avermedia Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-avertv-303.c b/drivers/media/rc/keymaps/rc-avertv-303.c index fbdd7ada57ce..47ca8b7ea532 100644 --- a/drivers/media/rc/keymaps/rc-avertv-303.c +++ b/drivers/media/rc/keymaps/rc-avertv-303.c @@ -1,14 +1,9 @@ -/* avertv-303.h - Keytable for avertv_303 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// avertv-303.h - Keytable for avertv_303 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-behold-columbus.c b/drivers/media/rc/keymaps/rc-behold-columbus.c index d256743be998..e73057945bd1 100644 --- a/drivers/media/rc/keymaps/rc-behold-columbus.c +++ b/drivers/media/rc/keymaps/rc-behold-columbus.c @@ -1,14 +1,9 @@ -/* behold-columbus.h - Keytable for behold_columbus Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// behold-columbus.h - Keytable for behold_columbus Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include @@ -35,12 +30,12 @@ static struct rc_map_table behold_columbus[] = { /* 0x01 0x02 0x03 0x0D * * 1 2 3 Stereo * - * * + * * * 0x04 0x05 0x06 0x19 * * 4 5 6 Snapshot * - * * + * * * 0x07 0x08 0x09 0x10 * - * 7 8 9 Zoom * + * 7 8 9 Zoom * * */ { 0x01, KEY_1 }, { 0x02, KEY_2 }, diff --git a/drivers/media/rc/keymaps/rc-behold.c b/drivers/media/rc/keymaps/rc-behold.c index 93dc795adc67..9b1b57e3c875 100644 --- a/drivers/media/rc/keymaps/rc-behold.c +++ b/drivers/media/rc/keymaps/rc-behold.c @@ -1,14 +1,9 @@ -/* behold.h - Keytable for behold Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// behold.h - Keytable for behold Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-budget-ci-old.c b/drivers/media/rc/keymaps/rc-budget-ci-old.c index 81ea1424d9e5..56f051af6154 100644 --- a/drivers/media/rc/keymaps/rc-budget-ci-old.c +++ b/drivers/media/rc/keymaps/rc-budget-ci-old.c @@ -1,14 +1,9 @@ -/* budget-ci-old.h - Keytable for budget_ci_old Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// budget-ci-old.h - Keytable for budget_ci_old Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-cinergy-1400.c b/drivers/media/rc/keymaps/rc-cinergy-1400.c index bcb96b3dda85..dacb13c53bb4 100644 --- a/drivers/media/rc/keymaps/rc-cinergy-1400.c +++ b/drivers/media/rc/keymaps/rc-cinergy-1400.c @@ -1,14 +1,9 @@ -/* cinergy-1400.h - Keytable for cinergy_1400 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// cinergy-1400.h - Keytable for cinergy_1400 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-cinergy.c b/drivers/media/rc/keymaps/rc-cinergy.c index fd56c402aae5..6ab2e51b764d 100644 --- a/drivers/media/rc/keymaps/rc-cinergy.c +++ b/drivers/media/rc/keymaps/rc-cinergy.c @@ -1,14 +1,9 @@ -/* cinergy.h - Keytable for cinergy Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// cinergy.h - Keytable for cinergy Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-dib0700-nec.c b/drivers/media/rc/keymaps/rc-dib0700-nec.c index 1b4df106b7b5..4ee801acb089 100644 --- a/drivers/media/rc/keymaps/rc-dib0700-nec.c +++ b/drivers/media/rc/keymaps/rc-dib0700-nec.c @@ -1,19 +1,14 @@ -/* rc-dvb0700-big.c - Keytable for devices in dvb0700 - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * TODO: This table is a real mess, as it merges RC codes from several - * devices into a big table. It also has both RC-5 and NEC codes inside. - * It should be broken into small tables, and the protocols should properly - * be identificated. - * - * The table were imported from dib0700_devices.c. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// rc-dvb0700-big.c - Keytable for devices in dvb0700 +// +// Copyright (c) 2010 by Mauro Carvalho Chehab +// +// TODO: This table is a real mess, as it merges RC codes from several +// devices into a big table. It also has both RC-5 and NEC codes inside. +// It should be broken into small tables, and the protocols should properly +// be identificated. +// +// The table were imported from dib0700_devices.c. #include #include diff --git a/drivers/media/rc/keymaps/rc-dib0700-rc5.c b/drivers/media/rc/keymaps/rc-dib0700-rc5.c index b0f8151bb824..ef4085a0fda3 100644 --- a/drivers/media/rc/keymaps/rc-dib0700-rc5.c +++ b/drivers/media/rc/keymaps/rc-dib0700-rc5.c @@ -1,19 +1,14 @@ -/* rc-dvb0700-big.c - Keytable for devices in dvb0700 - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * TODO: This table is a real mess, as it merges RC codes from several - * devices into a big table. It also has both RC-5 and NEC codes inside. - * It should be broken into small tables, and the protocols should properly - * be identificated. - * - * The table were imported from dib0700_devices.c. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// rc-dvb0700-big.c - Keytable for devices in dvb0700 +// +// Copyright (c) 2010 by Mauro Carvalho Chehab +// +// TODO: This table is a real mess, as it merges RC codes from several +// devices into a big table. It also has both RC-5 and NEC codes inside. +// It should be broken into small tables, and the protocols should properly +// be identificated. +// +// The table were imported from dib0700_devices.c. #include #include diff --git a/drivers/media/rc/keymaps/rc-dm1105-nec.c b/drivers/media/rc/keymaps/rc-dm1105-nec.c index c353445d10ed..d853cd9a0936 100644 --- a/drivers/media/rc/keymaps/rc-dm1105-nec.c +++ b/drivers/media/rc/keymaps/rc-dm1105-nec.c @@ -1,14 +1,9 @@ -/* dm1105-nec.h - Keytable for dm1105_nec Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// dm1105-nec.h - Keytable for dm1105_nec Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-dntv-live-dvb-t.c b/drivers/media/rc/keymaps/rc-dntv-live-dvb-t.c index 5bafd5b70f5e..cdc1d8c990cb 100644 --- a/drivers/media/rc/keymaps/rc-dntv-live-dvb-t.c +++ b/drivers/media/rc/keymaps/rc-dntv-live-dvb-t.c @@ -1,14 +1,9 @@ -/* dntv-live-dvb-t.h - Keytable for dntv_live_dvb_t Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// dntv-live-dvb-t.h - Keytable for dntv_live_dvb_t Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-dntv-live-dvbt-pro.c b/drivers/media/rc/keymaps/rc-dntv-live-dvbt-pro.c index 360167c8829b..38e1d1b837da 100644 --- a/drivers/media/rc/keymaps/rc-dntv-live-dvbt-pro.c +++ b/drivers/media/rc/keymaps/rc-dntv-live-dvbt-pro.c @@ -1,14 +1,9 @@ -/* dntv-live-dvbt-pro.h - Keytable for dntv_live_dvbt_pro Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// dntv-live-dvbt-pro.h - Keytable for dntv_live_dvbt_pro Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-em-terratec.c b/drivers/media/rc/keymaps/rc-em-terratec.c index 18e1a2679c20..cbbba21484fb 100644 --- a/drivers/media/rc/keymaps/rc-em-terratec.c +++ b/drivers/media/rc/keymaps/rc-em-terratec.c @@ -1,14 +1,9 @@ -/* em-terratec.h - Keytable for em_terratec Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// em-terratec.h - Keytable for em_terratec Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-encore-enltv-fm53.c b/drivers/media/rc/keymaps/rc-encore-enltv-fm53.c index 72ffd5cb0108..057c13b765ef 100644 --- a/drivers/media/rc/keymaps/rc-encore-enltv-fm53.c +++ b/drivers/media/rc/keymaps/rc-encore-enltv-fm53.c @@ -1,20 +1,15 @@ -/* encore-enltv-fm53.h - Keytable for encore_enltv_fm53 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// encore-enltv-fm53.h - Keytable for encore_enltv_fm53 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include /* Encore ENLTV-FM v5.3 - Mauro Carvalho Chehab + Mauro Carvalho Chehab */ static struct rc_map_table encore_enltv_fm53[] = { diff --git a/drivers/media/rc/keymaps/rc-encore-enltv.c b/drivers/media/rc/keymaps/rc-encore-enltv.c index e0381e7aa964..5b4e832d5fac 100644 --- a/drivers/media/rc/keymaps/rc-encore-enltv.c +++ b/drivers/media/rc/keymaps/rc-encore-enltv.c @@ -1,14 +1,9 @@ -/* encore-enltv.h - Keytable for encore_enltv Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// encore-enltv.h - Keytable for encore_enltv Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-encore-enltv2.c b/drivers/media/rc/keymaps/rc-encore-enltv2.c index e9b0bfba319c..cd0555924456 100644 --- a/drivers/media/rc/keymaps/rc-encore-enltv2.c +++ b/drivers/media/rc/keymaps/rc-encore-enltv2.c @@ -1,20 +1,15 @@ -/* encore-enltv2.h - Keytable for encore_enltv2 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// encore-enltv2.h - Keytable for encore_enltv2 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include /* Encore ENLTV2-FM - silver plastic - "Wand Media" written at the botton - Mauro Carvalho Chehab */ + Mauro Carvalho Chehab */ static struct rc_map_table encore_enltv2[] = { { 0x4c, KEY_POWER2 }, diff --git a/drivers/media/rc/keymaps/rc-evga-indtube.c b/drivers/media/rc/keymaps/rc-evga-indtube.c index b77c5e908668..f4398444330b 100644 --- a/drivers/media/rc/keymaps/rc-evga-indtube.c +++ b/drivers/media/rc/keymaps/rc-evga-indtube.c @@ -1,14 +1,9 @@ -/* evga-indtube.h - Keytable for evga_indtube Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// evga-indtube.h - Keytable for evga_indtube Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-eztv.c b/drivers/media/rc/keymaps/rc-eztv.c index 5013b3b2aa93..0e481d51fcb5 100644 --- a/drivers/media/rc/keymaps/rc-eztv.c +++ b/drivers/media/rc/keymaps/rc-eztv.c @@ -1,14 +1,9 @@ -/* eztv.h - Keytable for eztv Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// eztv.h - Keytable for eztv Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-flydvb.c b/drivers/media/rc/keymaps/rc-flydvb.c index 418b32521273..45940d7c92d0 100644 --- a/drivers/media/rc/keymaps/rc-flydvb.c +++ b/drivers/media/rc/keymaps/rc-flydvb.c @@ -1,14 +1,9 @@ -/* flydvb.h - Keytable for flydvb Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// flydvb.h - Keytable for flydvb Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-flyvideo.c b/drivers/media/rc/keymaps/rc-flyvideo.c index 93fb87ecf061..b2d4e4c7b192 100644 --- a/drivers/media/rc/keymaps/rc-flyvideo.c +++ b/drivers/media/rc/keymaps/rc-flyvideo.c @@ -1,14 +1,9 @@ -/* flyvideo.h - Keytable for flyvideo Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// flyvideo.h - Keytable for flyvideo Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-fusionhdtv-mce.c b/drivers/media/rc/keymaps/rc-fusionhdtv-mce.c index 9ed3f749262b..1c63fc7d4576 100644 --- a/drivers/media/rc/keymaps/rc-fusionhdtv-mce.c +++ b/drivers/media/rc/keymaps/rc-fusionhdtv-mce.c @@ -1,14 +1,9 @@ -/* fusionhdtv-mce.h - Keytable for fusionhdtv_mce Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// fusionhdtv-mce.h - Keytable for fusionhdtv_mce Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-gadmei-rm008z.c b/drivers/media/rc/keymaps/rc-gadmei-rm008z.c index 3443b721d092..4a0a9786914f 100644 --- a/drivers/media/rc/keymaps/rc-gadmei-rm008z.c +++ b/drivers/media/rc/keymaps/rc-gadmei-rm008z.c @@ -1,14 +1,9 @@ -/* gadmei-rm008z.h - Keytable for gadmei_rm008z Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// gadmei-rm008z.h - Keytable for gadmei_rm008z Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-genius-tvgo-a11mce.c b/drivers/media/rc/keymaps/rc-genius-tvgo-a11mce.c index d140e8d45bcc..cc876a85cc31 100644 --- a/drivers/media/rc/keymaps/rc-genius-tvgo-a11mce.c +++ b/drivers/media/rc/keymaps/rc-genius-tvgo-a11mce.c @@ -1,14 +1,9 @@ -/* genius-tvgo-a11mce.h - Keytable for genius_tvgo_a11mce Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// genius-tvgo-a11mce.h - Keytable for genius_tvgo_a11mce Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-gotview7135.c b/drivers/media/rc/keymaps/rc-gotview7135.c index 51230fbb52ba..6b94bd39d977 100644 --- a/drivers/media/rc/keymaps/rc-gotview7135.c +++ b/drivers/media/rc/keymaps/rc-gotview7135.c @@ -1,14 +1,9 @@ -/* gotview7135.h - Keytable for gotview7135 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// gotview7135.h - Keytable for gotview7135 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-hauppauge.c b/drivers/media/rc/keymaps/rc-hauppauge.c index 890164b68d64..582aa9012443 100644 --- a/drivers/media/rc/keymaps/rc-hauppauge.c +++ b/drivers/media/rc/keymaps/rc-hauppauge.c @@ -1,20 +1,15 @@ -/* rc-hauppauge.c - Keytable for Hauppauge Remote Controllers - * - * keymap imported from ir-keymaps.c - * - * This map currently contains the code for four different RCs: - * - New Hauppauge Gray; - * - Old Hauppauge Gray (with a golden screen for media keys); - * - Hauppauge Black; - * - DSR-0112 remote bundled with Haupauge MiniStick. - * - * Copyright (c) 2010-2011 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// rc-hauppauge.c - Keytable for Hauppauge Remote Controllers +// +// keymap imported from ir-keymaps.c +// +// This map currently contains the code for four different RCs: +// - New Hauppauge Gray; +// - Old Hauppauge Gray (with a golden screen for media keys); +// - Hauppauge Black; +// - DSR-0112 remote bundled with Haupauge MiniStick. +// +// Copyright (c) 2010-2011 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-hisi-poplar.c b/drivers/media/rc/keymaps/rc-hisi-poplar.c new file mode 100644 index 000000000000..78728bc7f63a --- /dev/null +++ b/drivers/media/rc/keymaps/rc-hisi-poplar.c @@ -0,0 +1,69 @@ +/* + * Keytable for remote controller of HiSilicon poplar board. + * + * Copyright (c) 2017 HiSilicon Technologies Co., Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ + +#include +#include + +static struct rc_map_table hisi_poplar_keymap[] = { + { 0x0000b292, KEY_1}, + { 0x0000b293, KEY_2}, + { 0x0000b2cc, KEY_3}, + { 0x0000b28e, KEY_4}, + { 0x0000b28f, KEY_5}, + { 0x0000b2c8, KEY_6}, + { 0x0000b28a, KEY_7}, + { 0x0000b28b, KEY_8}, + { 0x0000b2c4, KEY_9}, + { 0x0000b287, KEY_0}, + { 0x0000b282, KEY_HOMEPAGE}, + { 0x0000b2ca, KEY_UP}, + { 0x0000b299, KEY_LEFT}, + { 0x0000b2c1, KEY_RIGHT}, + { 0x0000b2d2, KEY_DOWN}, + { 0x0000b2c5, KEY_DELETE}, + { 0x0000b29c, KEY_MUTE}, + { 0x0000b281, KEY_VOLUMEDOWN}, + { 0x0000b280, KEY_VOLUMEUP}, + { 0x0000b2dc, KEY_POWER}, + { 0x0000b29a, KEY_MENU}, + { 0x0000b28d, KEY_SETUP}, + { 0x0000b2c5, KEY_BACK}, + { 0x0000b295, KEY_PLAYPAUSE}, + { 0x0000b2ce, KEY_ENTER}, + { 0x0000b285, KEY_CHANNELUP}, + { 0x0000b286, KEY_CHANNELDOWN}, + { 0x0000b2da, KEY_NUMERIC_STAR}, + { 0x0000b2d0, KEY_NUMERIC_POUND}, +}; + +static struct rc_map_list hisi_poplar_map = { + .map = { + .scan = hisi_poplar_keymap, + .size = ARRAY_SIZE(hisi_poplar_keymap), + .rc_proto = RC_PROTO_NEC, + .name = RC_MAP_HISI_POPLAR, + } +}; + +static int __init init_rc_map_hisi_poplar(void) +{ + return rc_map_register(&hisi_poplar_map); +} + +static void __exit exit_rc_map_hisi_poplar(void) +{ + rc_map_unregister(&hisi_poplar_map); +} + +module_init(init_rc_map_hisi_poplar) +module_exit(exit_rc_map_hisi_poplar) + +MODULE_LICENSE("GPL v2"); diff --git a/drivers/media/rc/keymaps/rc-hisi-tv-demo.c b/drivers/media/rc/keymaps/rc-hisi-tv-demo.c new file mode 100644 index 000000000000..4816e3a4a18d --- /dev/null +++ b/drivers/media/rc/keymaps/rc-hisi-tv-demo.c @@ -0,0 +1,81 @@ +/* + * Keytable for remote controller of HiSilicon tv demo board. + * + * Copyright (c) 2017 HiSilicon Technologies Co., Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ + +#include +#include + +static struct rc_map_table hisi_tv_demo_keymap[] = { + { 0x00000092, KEY_1}, + { 0x00000093, KEY_2}, + { 0x000000cc, KEY_3}, + { 0x0000009f, KEY_4}, + { 0x0000008e, KEY_5}, + { 0x0000008f, KEY_6}, + { 0x000000c8, KEY_7}, + { 0x00000094, KEY_8}, + { 0x0000008a, KEY_9}, + { 0x0000008b, KEY_0}, + { 0x000000ce, KEY_ENTER}, + { 0x000000ca, KEY_UP}, + { 0x00000099, KEY_LEFT}, + { 0x00000084, KEY_PAGEUP}, + { 0x000000c1, KEY_RIGHT}, + { 0x000000d2, KEY_DOWN}, + { 0x00000089, KEY_PAGEDOWN}, + { 0x000000d1, KEY_MUTE}, + { 0x00000098, KEY_VOLUMEDOWN}, + { 0x00000090, KEY_VOLUMEUP}, + { 0x0000009c, KEY_POWER}, + { 0x000000d6, KEY_STOP}, + { 0x00000097, KEY_MENU}, + { 0x000000cb, KEY_BACK}, + { 0x000000da, KEY_PLAYPAUSE}, + { 0x00000080, KEY_INFO}, + { 0x000000c3, KEY_REWIND}, + { 0x00000087, KEY_HOMEPAGE}, + { 0x000000d0, KEY_FASTFORWARD}, + { 0x000000c4, KEY_SOUND}, + { 0x00000082, BTN_1}, + { 0x000000c7, BTN_2}, + { 0x00000086, KEY_PROGRAM}, + { 0x000000d9, KEY_SUBTITLE}, + { 0x00000085, KEY_ZOOM}, + { 0x0000009b, KEY_RED}, + { 0x0000009a, KEY_GREEN}, + { 0x000000c0, KEY_YELLOW}, + { 0x000000c2, KEY_BLUE}, + { 0x0000009d, KEY_CHANNELDOWN}, + { 0x000000cf, KEY_CHANNELUP}, +}; + +static struct rc_map_list hisi_tv_demo_map = { + .map = { + .scan = hisi_tv_demo_keymap, + .size = ARRAY_SIZE(hisi_tv_demo_keymap), + .rc_proto = RC_PROTO_NEC, + .name = RC_MAP_HISI_TV_DEMO, + } +}; + +static int __init init_rc_map_hisi_tv_demo(void) +{ + return rc_map_register(&hisi_tv_demo_map); +} + +static void __exit exit_rc_map_hisi_tv_demo(void) +{ + rc_map_unregister(&hisi_tv_demo_map); +} + +module_init(init_rc_map_hisi_tv_demo) +module_exit(exit_rc_map_hisi_tv_demo) + +MODULE_LICENSE("GPL v2"); diff --git a/drivers/media/rc/keymaps/rc-imon-pad.c b/drivers/media/rc/keymaps/rc-imon-pad.c index a7296ffbf218..8501cf0a3253 100644 --- a/drivers/media/rc/keymaps/rc-imon-pad.c +++ b/drivers/media/rc/keymaps/rc-imon-pad.c @@ -134,8 +134,7 @@ static struct rc_map_list imon_pad_map = { .map = { .scan = imon_pad, .size = ARRAY_SIZE(imon_pad), - /* actual protocol details unknown, hardware decoder */ - .rc_proto = RC_PROTO_OTHER, + .rc_proto = RC_PROTO_IMON, .name = RC_MAP_IMON_PAD, } }; diff --git a/drivers/media/rc/keymaps/rc-imon-rsc.c b/drivers/media/rc/keymaps/rc-imon-rsc.c new file mode 100644 index 000000000000..83e4564aaa22 --- /dev/null +++ b/drivers/media/rc/keymaps/rc-imon-rsc.c @@ -0,0 +1,81 @@ +// SPDX-License-Identifier: GPL-2.0+ +// +// Copyright (C) 2018 Sean Young + +#include +#include + +// +// Note that this remote has a stick which its own IR protocol, +// with 16 directions. This is not supported yet. +// +static struct rc_map_table imon_rsc[] = { + { 0x801010, KEY_EXIT }, + { 0x80102f, KEY_POWER }, + { 0x80104a, KEY_SCREENSAVER }, /* Screensaver */ + { 0x801049, KEY_TIME }, /* Timer */ + { 0x801054, KEY_NUMERIC_1 }, + { 0x801055, KEY_NUMERIC_2 }, + { 0x801056, KEY_NUMERIC_3 }, + { 0x801057, KEY_NUMERIC_4 }, + { 0x801058, KEY_NUMERIC_5 }, + { 0x801059, KEY_NUMERIC_6 }, + { 0x80105a, KEY_NUMERIC_7 }, + { 0x80105b, KEY_NUMERIC_8 }, + { 0x80105c, KEY_NUMERIC_9 }, + { 0x801081, KEY_SCREEN }, /* Desktop */ + { 0x80105d, KEY_NUMERIC_0 }, + { 0x801082, KEY_MAX }, + { 0x801048, KEY_ESC }, + { 0x80104b, KEY_MEDIA }, /* Windows key */ + { 0x801083, KEY_MENU }, + { 0x801045, KEY_APPSELECT }, /* app launcher */ + { 0x801084, KEY_STOP }, + { 0x801046, KEY_CYCLEWINDOWS }, + { 0x801085, KEY_BACKSPACE }, + { 0x801086, KEY_KEYBOARD }, + { 0x801087, KEY_SPACE }, + { 0x80101e, KEY_RESERVED }, /* shift tab */ + { 0x801098, BTN_0 }, + { 0x80101f, KEY_TAB }, + { 0x80101b, BTN_LEFT }, + { 0x80101d, BTN_RIGHT }, + { 0x801016, BTN_MIDDLE }, /* drag and drop */ + { 0x801088, KEY_MUTE }, + { 0x80105e, KEY_VOLUMEDOWN }, + { 0x80105f, KEY_VOLUMEUP }, + { 0x80104c, KEY_PLAY }, + { 0x80104d, KEY_PAUSE }, + { 0x80104f, KEY_EJECTCD }, + { 0x801050, KEY_PREVIOUS }, + { 0x801051, KEY_NEXT }, + { 0x80104e, KEY_STOP }, + { 0x801052, KEY_REWIND }, + { 0x801053, KEY_FASTFORWARD }, + { 0x801089, KEY_ZOOM } /* full screen */ +}; + +static struct rc_map_list imon_rsc_map = { + .map = { + .scan = imon_rsc, + .size = ARRAY_SIZE(imon_rsc), + .rc_proto = RC_PROTO_NEC, + .name = RC_MAP_IMON_RSC, + } +}; + +static int __init init_rc_map_imon_rsc(void) +{ + return rc_map_register(&imon_rsc_map); +} + +static void __exit exit_rc_map_imon_rsc(void) +{ + rc_map_unregister(&imon_rsc_map); +} + +module_init(init_rc_map_imon_rsc) +module_exit(exit_rc_map_imon_rsc) + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Sean Young "); diff --git a/drivers/media/rc/keymaps/rc-iodata-bctv7e.c b/drivers/media/rc/keymaps/rc-iodata-bctv7e.c index 8cf87a15c4f2..6ced43458f2a 100644 --- a/drivers/media/rc/keymaps/rc-iodata-bctv7e.c +++ b/drivers/media/rc/keymaps/rc-iodata-bctv7e.c @@ -1,14 +1,9 @@ -/* iodata-bctv7e.h - Keytable for iodata_bctv7e Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// iodata-bctv7e.h - Keytable for iodata_bctv7e Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-kaiomy.c b/drivers/media/rc/keymaps/rc-kaiomy.c index e791f1e1b43b..a00051339842 100644 --- a/drivers/media/rc/keymaps/rc-kaiomy.c +++ b/drivers/media/rc/keymaps/rc-kaiomy.c @@ -1,20 +1,15 @@ -/* kaiomy.h - Keytable for kaiomy Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// kaiomy.h - Keytable for kaiomy Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include /* Kaiomy TVnPC U2 - Mauro Carvalho Chehab + Mauro Carvalho Chehab */ static struct rc_map_table kaiomy[] = { diff --git a/drivers/media/rc/keymaps/rc-kworld-315u.c b/drivers/media/rc/keymaps/rc-kworld-315u.c index 71dce0138f0e..ed0e0586dea2 100644 --- a/drivers/media/rc/keymaps/rc-kworld-315u.c +++ b/drivers/media/rc/keymaps/rc-kworld-315u.c @@ -1,14 +1,9 @@ -/* kworld-315u.h - Keytable for kworld_315u Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// kworld-315u.h - Keytable for kworld_315u Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-kworld-plus-tv-analog.c b/drivers/media/rc/keymaps/rc-kworld-plus-tv-analog.c index e0322ed16c94..db5edde3eeb1 100644 --- a/drivers/media/rc/keymaps/rc-kworld-plus-tv-analog.c +++ b/drivers/media/rc/keymaps/rc-kworld-plus-tv-analog.c @@ -1,20 +1,15 @@ -/* kworld-plus-tv-analog.h - Keytable for kworld_plus_tv_analog Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// kworld-plus-tv-analog.h - Keytable for kworld_plus_tv_analog Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include /* Kworld Plus TV Analog Lite PCI IR - Mauro Carvalho Chehab + Mauro Carvalho Chehab */ static struct rc_map_table kworld_plus_tv_analog[] = { diff --git a/drivers/media/rc/keymaps/rc-manli.c b/drivers/media/rc/keymaps/rc-manli.c index da566902a4dd..29c9feaf413b 100644 --- a/drivers/media/rc/keymaps/rc-manli.c +++ b/drivers/media/rc/keymaps/rc-manli.c @@ -1,14 +1,9 @@ -/* manli.h - Keytable for manli Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// manli.h - Keytable for manli Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-msi-tvanywhere-plus.c b/drivers/media/rc/keymaps/rc-msi-tvanywhere-plus.c index dfa0ed1d7667..78cf2c286083 100644 --- a/drivers/media/rc/keymaps/rc-msi-tvanywhere-plus.c +++ b/drivers/media/rc/keymaps/rc-msi-tvanywhere-plus.c @@ -1,14 +1,9 @@ -/* msi-tvanywhere-plus.h - Keytable for msi_tvanywhere_plus Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// msi-tvanywhere-plus.h - Keytable for msi_tvanywhere_plus Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-msi-tvanywhere.c b/drivers/media/rc/keymaps/rc-msi-tvanywhere.c index 2111816a3f59..359a57be3a66 100644 --- a/drivers/media/rc/keymaps/rc-msi-tvanywhere.c +++ b/drivers/media/rc/keymaps/rc-msi-tvanywhere.c @@ -1,14 +1,9 @@ -/* msi-tvanywhere.h - Keytable for msi_tvanywhere Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// msi-tvanywhere.h - Keytable for msi_tvanywhere Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-nebula.c b/drivers/media/rc/keymaps/rc-nebula.c index 109b6e1a8b1a..17d7c1b324da 100644 --- a/drivers/media/rc/keymaps/rc-nebula.c +++ b/drivers/media/rc/keymaps/rc-nebula.c @@ -1,14 +1,9 @@ -/* nebula.h - Keytable for nebula Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// nebula.h - Keytable for nebula Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-nec-terratec-cinergy-xs.c b/drivers/media/rc/keymaps/rc-nec-terratec-cinergy-xs.c index bb2d3a2962c0..76beef44a8d7 100644 --- a/drivers/media/rc/keymaps/rc-nec-terratec-cinergy-xs.c +++ b/drivers/media/rc/keymaps/rc-nec-terratec-cinergy-xs.c @@ -1,14 +1,9 @@ -/* nec-terratec-cinergy-xs.h - Keytable for nec_terratec_cinergy_xs Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// nec-terratec-cinergy-xs.h - Keytable for nec_terratec_cinergy_xs Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-norwood.c b/drivers/media/rc/keymaps/rc-norwood.c index cd25df336749..3765705c5549 100644 --- a/drivers/media/rc/keymaps/rc-norwood.c +++ b/drivers/media/rc/keymaps/rc-norwood.c @@ -1,14 +1,9 @@ -/* norwood.h - Keytable for norwood Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// norwood.h - Keytable for norwood Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-npgtech.c b/drivers/media/rc/keymaps/rc-npgtech.c index 140bbc20a764..abaf7f6d4cb7 100644 --- a/drivers/media/rc/keymaps/rc-npgtech.c +++ b/drivers/media/rc/keymaps/rc-npgtech.c @@ -1,14 +1,9 @@ -/* npgtech.h - Keytable for npgtech Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// npgtech.h - Keytable for npgtech Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-pctv-sedna.c b/drivers/media/rc/keymaps/rc-pctv-sedna.c index 52b4558b7bd0..e3462c5c8984 100644 --- a/drivers/media/rc/keymaps/rc-pctv-sedna.c +++ b/drivers/media/rc/keymaps/rc-pctv-sedna.c @@ -1,14 +1,9 @@ -/* pctv-sedna.h - Keytable for pctv_sedna Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// pctv-sedna.h - Keytable for pctv_sedna Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-pinnacle-color.c b/drivers/media/rc/keymaps/rc-pinnacle-color.c index 973c9c34e304..63c2851e9dfe 100644 --- a/drivers/media/rc/keymaps/rc-pinnacle-color.c +++ b/drivers/media/rc/keymaps/rc-pinnacle-color.c @@ -1,14 +1,9 @@ -/* pinnacle-color.h - Keytable for pinnacle_color Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// pinnacle-color.h - Keytable for pinnacle_color Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-pinnacle-grey.c b/drivers/media/rc/keymaps/rc-pinnacle-grey.c index 22e44b0d2a93..31794d4180db 100644 --- a/drivers/media/rc/keymaps/rc-pinnacle-grey.c +++ b/drivers/media/rc/keymaps/rc-pinnacle-grey.c @@ -1,14 +1,9 @@ -/* pinnacle-grey.h - Keytable for pinnacle_grey Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// pinnacle-grey.h - Keytable for pinnacle_grey Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-pinnacle-pctv-hd.c b/drivers/media/rc/keymaps/rc-pinnacle-pctv-hd.c index 186dcf8e0491..876aeb6e1d9c 100644 --- a/drivers/media/rc/keymaps/rc-pinnacle-pctv-hd.c +++ b/drivers/media/rc/keymaps/rc-pinnacle-pctv-hd.c @@ -1,14 +1,9 @@ -/* pinnacle-pctv-hd.h - Keytable for pinnacle_pctv_hd Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// pinnacle-pctv-hd.h - Keytable for pinnacle_pctv_hd Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-pixelview-002t.c b/drivers/media/rc/keymaps/rc-pixelview-002t.c index b235ada2e28f..4ed85f61d0ee 100644 --- a/drivers/media/rc/keymaps/rc-pixelview-002t.c +++ b/drivers/media/rc/keymaps/rc-pixelview-002t.c @@ -1,14 +1,9 @@ -/* rc-pixelview-mk12.h - Keytable for pixelview Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// rc-pixelview-mk12.h - Keytable for pixelview Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-pixelview-mk12.c b/drivers/media/rc/keymaps/rc-pixelview-mk12.c index 453d52d663fe..6ded64b732a5 100644 --- a/drivers/media/rc/keymaps/rc-pixelview-mk12.c +++ b/drivers/media/rc/keymaps/rc-pixelview-mk12.c @@ -1,14 +1,9 @@ -/* rc-pixelview-mk12.h - Keytable for pixelview Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// rc-pixelview-mk12.h - Keytable for pixelview Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-pixelview-new.c b/drivers/media/rc/keymaps/rc-pixelview-new.c index ef97095ec8f1..e4e34f2ccf74 100644 --- a/drivers/media/rc/keymaps/rc-pixelview-new.c +++ b/drivers/media/rc/keymaps/rc-pixelview-new.c @@ -1,20 +1,15 @@ -/* pixelview-new.h - Keytable for pixelview_new Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// pixelview-new.h - Keytable for pixelview_new Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include /* - Mauro Carvalho Chehab + Mauro Carvalho Chehab present on PV MPEG 8000GT */ diff --git a/drivers/media/rc/keymaps/rc-pixelview.c b/drivers/media/rc/keymaps/rc-pixelview.c index cfd8f80d3617..988919735165 100644 --- a/drivers/media/rc/keymaps/rc-pixelview.c +++ b/drivers/media/rc/keymaps/rc-pixelview.c @@ -1,14 +1,9 @@ -/* pixelview.h - Keytable for pixelview Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// pixelview.h - Keytable for pixelview Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-powercolor-real-angel.c b/drivers/media/rc/keymaps/rc-powercolor-real-angel.c index b63f82bcf29a..4988e71c524c 100644 --- a/drivers/media/rc/keymaps/rc-powercolor-real-angel.c +++ b/drivers/media/rc/keymaps/rc-powercolor-real-angel.c @@ -1,14 +1,9 @@ -/* powercolor-real-angel.h - Keytable for powercolor_real_angel Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// powercolor-real-angel.h - Keytable for powercolor_real_angel Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-proteus-2309.c b/drivers/media/rc/keymaps/rc-proteus-2309.c index be34c517e4e1..d2c13d0e7bff 100644 --- a/drivers/media/rc/keymaps/rc-proteus-2309.c +++ b/drivers/media/rc/keymaps/rc-proteus-2309.c @@ -1,14 +1,9 @@ -/* proteus-2309.h - Keytable for proteus_2309 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// proteus-2309.h - Keytable for proteus_2309 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-purpletv.c b/drivers/media/rc/keymaps/rc-purpletv.c index 84c40b97ee00..c8011f4d96ea 100644 --- a/drivers/media/rc/keymaps/rc-purpletv.c +++ b/drivers/media/rc/keymaps/rc-purpletv.c @@ -1,14 +1,9 @@ -/* purpletv.h - Keytable for purpletv Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// purpletv.h - Keytable for purpletv Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-pv951.c b/drivers/media/rc/keymaps/rc-pv951.c index be190ddebfc4..5235ee899c30 100644 --- a/drivers/media/rc/keymaps/rc-pv951.c +++ b/drivers/media/rc/keymaps/rc-pv951.c @@ -1,14 +1,9 @@ -/* pv951.h - Keytable for pv951 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// pv951.h - Keytable for pv951 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-real-audio-220-32-keys.c b/drivers/media/rc/keymaps/rc-real-audio-220-32-keys.c index 957fa21747ea..1cf786649675 100644 --- a/drivers/media/rc/keymaps/rc-real-audio-220-32-keys.c +++ b/drivers/media/rc/keymaps/rc-real-audio-220-32-keys.c @@ -1,14 +1,9 @@ -/* real-audio-220-32-keys.h - Keytable for real_audio_220_32_keys Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// real-audio-220-32-keys.h - Keytable for real_audio_220_32_keys Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-tango.c b/drivers/media/rc/keymaps/rc-tango.c new file mode 100644 index 000000000000..1c6e8875d46f --- /dev/null +++ b/drivers/media/rc/keymaps/rc-tango.c @@ -0,0 +1,92 @@ +/* + * Copyright (C) 2017 Sigma Designs + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * version 2 as published by the Free Software Foundation. + */ + +#include +#include + +static struct rc_map_table tango_table[] = { + { 0x4cb4a, KEY_POWER }, + { 0x4cb48, KEY_FILE }, + { 0x4cb0f, KEY_SETUP }, + { 0x4cb4d, KEY_SUSPEND }, + { 0x4cb4e, KEY_VOLUMEUP }, + { 0x4cb44, KEY_EJECTCD }, + { 0x4cb13, KEY_TV }, + { 0x4cb51, KEY_MUTE }, + { 0x4cb52, KEY_VOLUMEDOWN }, + + { 0x4cb41, KEY_1 }, + { 0x4cb03, KEY_2 }, + { 0x4cb42, KEY_3 }, + { 0x4cb45, KEY_4 }, + { 0x4cb07, KEY_5 }, + { 0x4cb46, KEY_6 }, + { 0x4cb55, KEY_7 }, + { 0x4cb17, KEY_8 }, + { 0x4cb56, KEY_9 }, + { 0x4cb1b, KEY_0 }, + { 0x4cb59, KEY_DELETE }, + { 0x4cb5a, KEY_CAPSLOCK }, + + { 0x4cb47, KEY_BACK }, + { 0x4cb05, KEY_SWITCHVIDEOMODE }, + { 0x4cb06, KEY_UP }, + { 0x4cb43, KEY_LEFT }, + { 0x4cb01, KEY_RIGHT }, + { 0x4cb0a, KEY_DOWN }, + { 0x4cb02, KEY_ENTER }, + { 0x4cb4b, KEY_INFO }, + { 0x4cb09, KEY_HOME }, + + { 0x4cb53, KEY_MENU }, + { 0x4cb12, KEY_PREVIOUS }, + { 0x4cb50, KEY_PLAY }, + { 0x4cb11, KEY_NEXT }, + { 0x4cb4f, KEY_TITLE }, + { 0x4cb0e, KEY_REWIND }, + { 0x4cb4c, KEY_STOP }, + { 0x4cb0d, KEY_FORWARD }, + { 0x4cb57, KEY_MEDIA_REPEAT }, + { 0x4cb16, KEY_ANGLE }, + { 0x4cb54, KEY_PAUSE }, + { 0x4cb15, KEY_SLOW }, + { 0x4cb5b, KEY_TIME }, + { 0x4cb1a, KEY_AUDIO }, + { 0x4cb58, KEY_SUBTITLE }, + { 0x4cb19, KEY_ZOOM }, + + { 0x4cb5f, KEY_RED }, + { 0x4cb1e, KEY_GREEN }, + { 0x4cb5c, KEY_YELLOW }, + { 0x4cb1d, KEY_BLUE }, +}; + +static struct rc_map_list tango_map = { + .map = { + .scan = tango_table, + .size = ARRAY_SIZE(tango_table), + .rc_proto = RC_PROTO_NECX, + .name = RC_MAP_TANGO, + } +}; + +static int __init init_rc_map_tango(void) +{ + return rc_map_register(&tango_map); +} + +static void __exit exit_rc_map_tango(void) +{ + rc_map_unregister(&tango_map); +} + +module_init(init_rc_map_tango) +module_exit(exit_rc_map_tango) + +MODULE_AUTHOR("Sigma Designs"); +MODULE_LICENSE("GPL"); diff --git a/drivers/media/rc/keymaps/rc-tbs-nec.c b/drivers/media/rc/keymaps/rc-tbs-nec.c index 05facc043272..42766cb877c3 100644 --- a/drivers/media/rc/keymaps/rc-tbs-nec.c +++ b/drivers/media/rc/keymaps/rc-tbs-nec.c @@ -1,14 +1,9 @@ -/* tbs-nec.h - Keytable for tbs_nec Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// tbs-nec.h - Keytable for tbs_nec Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-terratec-cinergy-xs.c b/drivers/media/rc/keymaps/rc-terratec-cinergy-xs.c index 3d0f6f7e5bea..6cf53a56bce4 100644 --- a/drivers/media/rc/keymaps/rc-terratec-cinergy-xs.c +++ b/drivers/media/rc/keymaps/rc-terratec-cinergy-xs.c @@ -1,14 +1,9 @@ -/* terratec-cinergy-xs.h - Keytable for terratec_cinergy_xs Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// terratec-cinergy-xs.h - Keytable for terratec_cinergy_xs Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-tevii-nec.c b/drivers/media/rc/keymaps/rc-tevii-nec.c index 31f8a0fd1f2c..58fcc72f528e 100644 --- a/drivers/media/rc/keymaps/rc-tevii-nec.c +++ b/drivers/media/rc/keymaps/rc-tevii-nec.c @@ -1,14 +1,9 @@ -/* tevii-nec.h - Keytable for tevii_nec Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// tevii-nec.h - Keytable for tevii_nec Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-tt-1500.c b/drivers/media/rc/keymaps/rc-tt-1500.c index 374c230705d2..52f239d2c025 100644 --- a/drivers/media/rc/keymaps/rc-tt-1500.c +++ b/drivers/media/rc/keymaps/rc-tt-1500.c @@ -1,14 +1,9 @@ -/* tt-1500.h - Keytable for tt_1500 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// tt-1500.h - Keytable for tt_1500 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-twinhan1027.c b/drivers/media/rc/keymaps/rc-twinhan1027.c index 2275b37c61d2..78bb3143a1a8 100644 --- a/drivers/media/rc/keymaps/rc-twinhan1027.c +++ b/drivers/media/rc/keymaps/rc-twinhan1027.c @@ -66,7 +66,7 @@ static struct rc_map_list twinhan_vp1027_map = { .map = { .scan = twinhan_vp1027, .size = ARRAY_SIZE(twinhan_vp1027), - .rc_proto = RC_PROTO_UNKNOWN, /* Legacy IR type */ + .rc_proto = RC_PROTO_NEC, .name = RC_MAP_TWINHAN_VP1027_DVBS, } }; diff --git a/drivers/media/rc/keymaps/rc-videomate-s350.c b/drivers/media/rc/keymaps/rc-videomate-s350.c index b4f103269872..e4d4dff06a24 100644 --- a/drivers/media/rc/keymaps/rc-videomate-s350.c +++ b/drivers/media/rc/keymaps/rc-videomate-s350.c @@ -1,14 +1,9 @@ -/* videomate-s350.h - Keytable for videomate_s350 Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// videomate-s350.h - Keytable for videomate_s350 Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-videomate-tv-pvr.c b/drivers/media/rc/keymaps/rc-videomate-tv-pvr.c index c431fdf44057..7c4890944407 100644 --- a/drivers/media/rc/keymaps/rc-videomate-tv-pvr.c +++ b/drivers/media/rc/keymaps/rc-videomate-tv-pvr.c @@ -1,14 +1,9 @@ -/* videomate-tv-pvr.h - Keytable for videomate_tv_pvr Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// videomate-tv-pvr.h - Keytable for videomate_tv_pvr Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/keymaps/rc-winfast-usbii-deluxe.c b/drivers/media/rc/keymaps/rc-winfast-usbii-deluxe.c index 5a437e61bd5d..e443192dbe14 100644 --- a/drivers/media/rc/keymaps/rc-winfast-usbii-deluxe.c +++ b/drivers/media/rc/keymaps/rc-winfast-usbii-deluxe.c @@ -1,14 +1,9 @@ -/* winfast-usbii-deluxe.h - Keytable for winfast_usbii_deluxe Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// winfast-usbii-deluxe.h - Keytable for winfast_usbii_deluxe Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include @@ -42,7 +37,7 @@ static struct rc_map_table winfast_usbii_deluxe[] = { { 0x60, KEY_CHANNELDOWN}, /* CHANNELDOWN */ { 0x61, KEY_LAST}, /* LAST CHANNEL (RECALL) */ - { 0x72, KEY_VIDEO}, /* INPUT MODES (TV/FM) */ + { 0x72, KEY_VIDEO}, /* INPUT MODES (TV/FM) */ { 0x70, KEY_POWER2}, /* TV ON/OFF */ diff --git a/drivers/media/rc/keymaps/rc-winfast.c b/drivers/media/rc/keymaps/rc-winfast.c index 53685d1f9a47..ee7f4c349fd6 100644 --- a/drivers/media/rc/keymaps/rc-winfast.c +++ b/drivers/media/rc/keymaps/rc-winfast.c @@ -1,14 +1,9 @@ -/* winfast.h - Keytable for winfast Remote Controller - * - * keymap imported from ir-keymaps.c - * - * Copyright (c) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - */ +// SPDX-License-Identifier: GPL-2.0+ +// winfast.h - Keytable for winfast Remote Controller +// +// keymap imported from ir-keymaps.c +// +// Copyright (c) 2010 by Mauro Carvalho Chehab #include #include diff --git a/drivers/media/rc/lirc_dev.c b/drivers/media/rc/lirc_dev.c index 9080e39ea391..87311ce13da7 100644 --- a/drivers/media/rc/lirc_dev.c +++ b/drivers/media/rc/lirc_dev.c @@ -18,569 +18,812 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include -#include -#include -#include #include #include -#include +#include +#include +#include +#include +#include -#include -#include -#include +#include "rc-core-priv.h" +#include -#define NOPLUG -1 -#define LOGHEAD "lirc_dev (%s[%d]): " +#define LIRCBUF_SIZE 1024 static dev_t lirc_base_dev; -struct irctl { - struct lirc_driver d; - int attached; - int open; - - struct mutex irctl_lock; - struct lirc_buffer *buf; - bool buf_internal; - unsigned int chunk_size; - - struct device dev; - struct cdev cdev; -}; - -static DEFINE_MUTEX(lirc_dev_lock); - -static struct irctl *irctls[MAX_IRCTL_DEVICES]; +/* Used to keep track of allocated lirc devices */ +static DEFINE_IDA(lirc_ida); /* Only used for sysfs but defined to void otherwise */ static struct class *lirc_class; -static void lirc_release(struct device *ld) +/** + * ir_lirc_raw_event() - Send raw IR data to lirc to be relayed to userspace + * + * @dev: the struct rc_dev descriptor of the device + * @ev: the struct ir_raw_event descriptor of the pulse/space + */ +void ir_lirc_raw_event(struct rc_dev *dev, struct ir_raw_event ev) { - struct irctl *ir = container_of(ld, struct irctl, dev); + unsigned long flags; + struct lirc_fh *fh; + int sample; - put_device(ir->dev.parent); + /* Packet start */ + if (ev.reset) { + /* + * Userspace expects a long space event before the start of + * the signal to use as a sync. This may be done with repeat + * packets and normal samples. But if a reset has been sent + * then we assume that a long time has passed, so we send a + * space with the maximum time value. + */ + sample = LIRC_SPACE(LIRC_VALUE_MASK); + dev_dbg(&dev->dev, "delivering reset sync space to lirc_dev\n"); - if (ir->buf_internal) { - lirc_buffer_free(ir->buf); - kfree(ir->buf); - } + /* Carrier reports */ + } else if (ev.carrier_report) { + sample = LIRC_FREQUENCY(ev.carrier); + dev_dbg(&dev->dev, "carrier report (freq: %d)\n", sample); - mutex_lock(&lirc_dev_lock); - irctls[ir->d.minor] = NULL; - mutex_unlock(&lirc_dev_lock); - kfree(ir); -} + /* Packet end */ + } else if (ev.timeout) { + if (dev->gap) + return; -static int lirc_allocate_buffer(struct irctl *ir) -{ - int err = 0; - int bytes_in_key; - unsigned int chunk_size; - unsigned int buffer_size; - struct lirc_driver *d = &ir->d; + dev->gap_start = ktime_get(); + dev->gap = true; + dev->gap_duration = ev.duration; - bytes_in_key = BITS_TO_LONGS(d->code_length) + - (d->code_length % 8 ? 1 : 0); - buffer_size = d->buffer_size ? d->buffer_size : BUFLEN / bytes_in_key; - chunk_size = d->chunk_size ? d->chunk_size : bytes_in_key; + sample = LIRC_TIMEOUT(ev.duration / 1000); + dev_dbg(&dev->dev, "timeout report (duration: %d)\n", sample); - if (d->rbuf) { - ir->buf = d->rbuf; - ir->buf_internal = false; + /* Normal sample */ } else { - ir->buf = kmalloc(sizeof(struct lirc_buffer), GFP_KERNEL); - if (!ir->buf) { - err = -ENOMEM; - goto out; + if (dev->gap) { + dev->gap_duration += ktime_to_ns(ktime_sub(ktime_get(), + dev->gap_start)); + + /* Convert to ms and cap by LIRC_VALUE_MASK */ + do_div(dev->gap_duration, 1000); + dev->gap_duration = min_t(u64, dev->gap_duration, + LIRC_VALUE_MASK); + + spin_lock_irqsave(&dev->lirc_fh_lock, flags); + list_for_each_entry(fh, &dev->lirc_fh, list) + kfifo_put(&fh->rawir, + LIRC_SPACE(dev->gap_duration)); + spin_unlock_irqrestore(&dev->lirc_fh_lock, flags); + dev->gap = false; } - err = lirc_buffer_init(ir->buf, chunk_size, buffer_size); - if (err) { - kfree(ir->buf); - ir->buf = NULL; - goto out; - } - - ir->buf_internal = true; - d->rbuf = ir->buf; + sample = ev.pulse ? LIRC_PULSE(ev.duration / 1000) : + LIRC_SPACE(ev.duration / 1000); + dev_dbg(&dev->dev, "delivering %uus %s to lirc_dev\n", + TO_US(ev.duration), TO_STR(ev.pulse)); } - ir->chunk_size = ir->buf->chunk_size; -out: - return err; + /* + * bpf does not care about the gap generated above; that exists + * for backwards compatibility + */ + lirc_bpf_run(dev, sample); + + spin_lock_irqsave(&dev->lirc_fh_lock, flags); + list_for_each_entry(fh, &dev->lirc_fh, list) { + if (LIRC_IS_TIMEOUT(sample) && !fh->send_timeout_reports) + continue; + if (kfifo_put(&fh->rawir, sample)) + wake_up_poll(&fh->wait_poll, EPOLLIN | EPOLLRDNORM); + } + spin_unlock_irqrestore(&dev->lirc_fh_lock, flags); } -int lirc_register_driver(struct lirc_driver *d) +/** + * ir_lirc_scancode_event() - Send scancode data to lirc to be relayed to + * userspace. This can be called in atomic context. + * @dev: the struct rc_dev descriptor of the device + * @lsc: the struct lirc_scancode describing the decoded scancode + */ +void ir_lirc_scancode_event(struct rc_dev *dev, struct lirc_scancode *lsc) { - struct irctl *ir; - int minor; - int err; + unsigned long flags; + struct lirc_fh *fh; - if (!d) { - pr_err("driver pointer must be not NULL!\n"); - return -EBADRQC; + lsc->timestamp = ktime_get_ns(); + + spin_lock_irqsave(&dev->lirc_fh_lock, flags); + list_for_each_entry(fh, &dev->lirc_fh, list) { + if (kfifo_put(&fh->scancodes, *lsc)) + wake_up_poll(&fh->wait_poll, EPOLLIN | EPOLLRDNORM); } - - if (!d->dev) { - pr_err("dev pointer not filled in!\n"); - return -EINVAL; - } - - if (!d->fops) { - pr_err("fops pointer not filled in!\n"); - return -EINVAL; - } - - if (d->minor >= MAX_IRCTL_DEVICES) { - dev_err(d->dev, "minor must be between 0 and %d!\n", - MAX_IRCTL_DEVICES - 1); - return -EBADRQC; - } - - if (d->code_length < 1 || d->code_length > (BUFLEN * 8)) { - dev_err(d->dev, "code length must be less than %d bits\n", - BUFLEN * 8); - return -EBADRQC; - } - - if (!d->rbuf && !(d->fops && d->fops->read && - d->fops->poll && d->fops->unlocked_ioctl)) { - dev_err(d->dev, "undefined read, poll, ioctl\n"); - return -EBADRQC; - } - - mutex_lock(&lirc_dev_lock); - - minor = d->minor; - - if (minor < 0) { - /* find first free slot for driver */ - for (minor = 0; minor < MAX_IRCTL_DEVICES; minor++) - if (!irctls[minor]) - break; - if (minor == MAX_IRCTL_DEVICES) { - dev_err(d->dev, "no free slots for drivers!\n"); - err = -ENOMEM; - goto out_lock; - } - } else if (irctls[minor]) { - dev_err(d->dev, "minor (%d) just registered!\n", minor); - err = -EBUSY; - goto out_lock; - } - - ir = kzalloc(sizeof(struct irctl), GFP_KERNEL); - if (!ir) { - err = -ENOMEM; - goto out_lock; - } - - mutex_init(&ir->irctl_lock); - irctls[minor] = ir; - d->minor = minor; - - /* some safety check 8-) */ - d->name[sizeof(d->name)-1] = '\0'; - - if (d->features == 0) - d->features = LIRC_CAN_REC_LIRCCODE; - - ir->d = *d; - - if (LIRC_CAN_REC(d->features)) { - err = lirc_allocate_buffer(irctls[minor]); - if (err) { - kfree(ir); - goto out_lock; - } - d->rbuf = ir->buf; - } - - device_initialize(&ir->dev); - ir->dev.devt = MKDEV(MAJOR(lirc_base_dev), ir->d.minor); - ir->dev.class = lirc_class; - ir->dev.parent = d->dev; - ir->dev.release = lirc_release; - dev_set_name(&ir->dev, "lirc%d", ir->d.minor); - - cdev_init(&ir->cdev, d->fops); - ir->cdev.owner = ir->d.owner; - ir->cdev.kobj.parent = &ir->dev.kobj; - - err = cdev_add(&ir->cdev, ir->dev.devt, 1); - if (err) - goto out_free_dev; - - ir->attached = 1; - - err = device_add(&ir->dev); - if (err) - goto out_cdev; - - mutex_unlock(&lirc_dev_lock); - - get_device(ir->dev.parent); - - dev_info(ir->d.dev, "lirc_dev: driver %s registered at minor = %d\n", - ir->d.name, ir->d.minor); - - return minor; - -out_cdev: - cdev_del(&ir->cdev); -out_free_dev: - put_device(&ir->dev); -out_lock: - mutex_unlock(&lirc_dev_lock); - - return err; + spin_unlock_irqrestore(&dev->lirc_fh_lock, flags); } -EXPORT_SYMBOL(lirc_register_driver); +EXPORT_SYMBOL_GPL(ir_lirc_scancode_event); -int lirc_unregister_driver(int minor) +static int ir_lirc_open(struct inode *inode, struct file *file) { - struct irctl *ir; + struct rc_dev *dev = container_of(inode->i_cdev, struct rc_dev, + lirc_cdev); + struct lirc_fh *fh = kzalloc(sizeof(*fh), GFP_KERNEL); + unsigned long flags; + int retval; - if (minor < 0 || minor >= MAX_IRCTL_DEVICES) { - pr_err("minor (%d) must be between 0 and %d!\n", - minor, MAX_IRCTL_DEVICES - 1); - return -EBADRQC; + if (!fh) + return -ENOMEM; + + get_device(&dev->dev); + + if (!dev->registered) { + retval = -ENODEV; + goto out_fh; } - ir = irctls[minor]; - if (!ir) { - pr_err("failed to get irctl\n"); - return -ENOENT; + if (dev->driver_type == RC_DRIVER_IR_RAW) { + if (kfifo_alloc(&fh->rawir, MAX_IR_EVENT_SIZE, GFP_KERNEL)) { + retval = -ENOMEM; + goto out_fh; + } } - mutex_lock(&lirc_dev_lock); - - if (ir->d.minor != minor) { - dev_err(ir->d.dev, "lirc_dev: minor %d device not registered\n", - minor); - mutex_unlock(&lirc_dev_lock); - return -ENOENT; + if (dev->driver_type != RC_DRIVER_IR_RAW_TX) { + if (kfifo_alloc(&fh->scancodes, 32, GFP_KERNEL)) { + retval = -ENOMEM; + goto out_rawir; + } } - dev_dbg(ir->d.dev, "lirc_dev: driver %s unregistered from minor = %d\n", - ir->d.name, ir->d.minor); + fh->send_mode = LIRC_MODE_PULSE; + fh->rc = dev; + fh->send_timeout_reports = true; - ir->attached = 0; - if (ir->open) { - dev_dbg(ir->d.dev, LOGHEAD "releasing opened driver\n", - ir->d.name, ir->d.minor); - wake_up_interruptible(&ir->buf->wait_poll); - } + if (dev->driver_type == RC_DRIVER_SCANCODE) + fh->rec_mode = LIRC_MODE_SCANCODE; + else + fh->rec_mode = LIRC_MODE_MODE2; - mutex_unlock(&lirc_dev_lock); + retval = rc_open(dev); + if (retval) + goto out_kfifo; - device_del(&ir->dev); - cdev_del(&ir->cdev); - put_device(&ir->dev); + init_waitqueue_head(&fh->wait_poll); + + file->private_data = fh; + spin_lock_irqsave(&dev->lirc_fh_lock, flags); + list_add(&fh->list, &dev->lirc_fh); + spin_unlock_irqrestore(&dev->lirc_fh_lock, flags); + + nonseekable_open(inode, file); return 0; -} -EXPORT_SYMBOL(lirc_unregister_driver); - -int lirc_dev_fop_open(struct inode *inode, struct file *file) -{ - struct irctl *ir; - int retval = 0; - - if (iminor(inode) >= MAX_IRCTL_DEVICES) { - pr_err("open result for %d is -ENODEV\n", iminor(inode)); - return -ENODEV; - } - - if (mutex_lock_interruptible(&lirc_dev_lock)) - return -ERESTARTSYS; - - ir = irctls[iminor(inode)]; - mutex_unlock(&lirc_dev_lock); - - if (!ir) { - retval = -ENODEV; - goto error; - } - - dev_dbg(ir->d.dev, LOGHEAD "open called\n", ir->d.name, ir->d.minor); - - if (ir->d.minor == NOPLUG) { - retval = -ENODEV; - goto error; - } - - if (ir->open) { - retval = -EBUSY; - goto error; - } - - if (ir->d.rdev) { - retval = rc_open(ir->d.rdev); - if (retval) - goto error; - } - - if (ir->buf) - lirc_buffer_clear(ir->buf); - - ir->open++; - -error: - nonseekable_open(inode, file); +out_kfifo: + if (dev->driver_type != RC_DRIVER_IR_RAW_TX) + kfifo_free(&fh->scancodes); +out_rawir: + if (dev->driver_type == RC_DRIVER_IR_RAW) + kfifo_free(&fh->rawir); +out_fh: + kfree(fh); + put_device(&dev->dev); return retval; } -EXPORT_SYMBOL(lirc_dev_fop_open); -int lirc_dev_fop_close(struct inode *inode, struct file *file) +static int ir_lirc_close(struct inode *inode, struct file *file) { - struct irctl *ir = irctls[iminor(inode)]; - int ret; + struct lirc_fh *fh = file->private_data; + struct rc_dev *dev = fh->rc; + unsigned long flags; - if (!ir) { - pr_err("called with invalid irctl\n"); - return -EINVAL; - } + spin_lock_irqsave(&dev->lirc_fh_lock, flags); + list_del(&fh->list); + spin_unlock_irqrestore(&dev->lirc_fh_lock, flags); - ret = mutex_lock_killable(&lirc_dev_lock); - WARN_ON(ret); + if (dev->driver_type == RC_DRIVER_IR_RAW) + kfifo_free(&fh->rawir); + if (dev->driver_type != RC_DRIVER_IR_RAW_TX) + kfifo_free(&fh->scancodes); + kfree(fh); - rc_close(ir->d.rdev); - - ir->open--; - if (!ret) - mutex_unlock(&lirc_dev_lock); + rc_close(dev); + put_device(&dev->dev); return 0; } -EXPORT_SYMBOL(lirc_dev_fop_close); -unsigned int lirc_dev_fop_poll(struct file *file, poll_table *wait) +static ssize_t ir_lirc_transmit_ir(struct file *file, const char __user *buf, + size_t n, loff_t *ppos) { - struct irctl *ir = irctls[iminor(file_inode(file))]; - unsigned int ret; + struct lirc_fh *fh = file->private_data; + struct rc_dev *dev = fh->rc; + unsigned int *txbuf; + struct ir_raw_event *raw = NULL; + ssize_t ret; + size_t count; + ktime_t start; + s64 towait; + unsigned int duration = 0; /* signal duration in us */ + int i; - if (!ir) { - pr_err("called with invalid irctl\n"); - return POLLERR; + ret = mutex_lock_interruptible(&dev->lock); + if (ret) + return ret; + + if (!dev->registered) { + ret = -ENODEV; + goto out_unlock; } - if (!ir->attached) - return POLLHUP | POLLERR; + if (!dev->tx_ir) { + ret = -EINVAL; + goto out_unlock; + } - if (ir->buf) { - poll_wait(file, &ir->buf->wait_poll, wait); + if (fh->send_mode == LIRC_MODE_SCANCODE) { + struct lirc_scancode scan; - if (lirc_buffer_empty(ir->buf)) - ret = 0; + if (n != sizeof(scan)) { + ret = -EINVAL; + goto out_unlock; + } + + if (copy_from_user(&scan, buf, sizeof(scan))) { + ret = -EFAULT; + goto out_unlock; + } + + if (scan.flags || scan.keycode || scan.timestamp) { + ret = -EINVAL; + goto out_unlock; + } + + /* + * The scancode field in lirc_scancode is 64-bit simply + * to future-proof it, since there are IR protocols encode + * use more than 32 bits. For now only 32-bit protocols + * are supported. + */ + if (scan.scancode > U32_MAX || + !rc_validate_scancode(scan.rc_proto, scan.scancode)) { + ret = -EINVAL; + goto out_unlock; + } + + raw = kmalloc_array(LIRCBUF_SIZE, sizeof(*raw), GFP_KERNEL); + if (!raw) { + ret = -ENOMEM; + goto out_unlock; + } + + ret = ir_raw_encode_scancode(scan.rc_proto, scan.scancode, + raw, LIRCBUF_SIZE); + if (ret < 0) + goto out_kfree_raw; + + /* drop trailing space */ + if (!(ret % 2)) + count = ret - 1; else - ret = POLLIN | POLLRDNORM; - } else - ret = POLLERR; + count = ret; - dev_dbg(ir->d.dev, LOGHEAD "poll result = %d\n", - ir->d.name, ir->d.minor, ret); + txbuf = kmalloc_array(count, sizeof(unsigned int), GFP_KERNEL); + if (!txbuf) { + ret = -ENOMEM; + goto out_kfree_raw; + } + for (i = 0; i < count; i++) + /* Convert from NS to US */ + txbuf[i] = DIV_ROUND_UP(raw[i].duration, 1000); + + if (dev->s_tx_carrier) { + int carrier = ir_raw_encode_carrier(scan.rc_proto); + + if (carrier > 0) + dev->s_tx_carrier(dev, carrier); + } + } else { + if (n < sizeof(unsigned int) || n % sizeof(unsigned int)) { + ret = -EINVAL; + goto out_unlock; + } + + count = n / sizeof(unsigned int); + if (count > LIRCBUF_SIZE || count % 2 == 0) { + ret = -EINVAL; + goto out_unlock; + } + + txbuf = memdup_user(buf, n); + if (IS_ERR(txbuf)) { + ret = PTR_ERR(txbuf); + goto out_unlock; + } + } + + for (i = 0; i < count; i++) { + if (txbuf[i] > IR_MAX_DURATION / 1000 - duration || !txbuf[i]) { + ret = -EINVAL; + goto out_kfree; + } + + duration += txbuf[i]; + } + + start = ktime_get(); + + ret = dev->tx_ir(dev, txbuf, count); + if (ret < 0) + goto out_kfree; + + kfree(txbuf); + kfree(raw); + mutex_unlock(&dev->lock); + + /* + * The lircd gap calculation expects the write function to + * wait for the actual IR signal to be transmitted before + * returning. + */ + towait = ktime_us_delta(ktime_add_us(start, duration), + ktime_get()); + if (towait > 0) { + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(usecs_to_jiffies(towait)); + } + + return n; +out_kfree: + kfree(txbuf); +out_kfree_raw: + kfree(raw); +out_unlock: + mutex_unlock(&dev->lock); return ret; } -EXPORT_SYMBOL(lirc_dev_fop_poll); -long lirc_dev_fop_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +static long ir_lirc_ioctl(struct file *file, unsigned int cmd, + unsigned long arg) { - __u32 mode; - int result = 0; - struct irctl *ir = irctls[iminor(file_inode(file))]; + struct lirc_fh *fh = file->private_data; + struct rc_dev *dev = fh->rc; + u32 __user *argp = (u32 __user *)(arg); + u32 val = 0; + int ret; - if (!ir) { - pr_err("no irctl found!\n"); - return -ENODEV; + if (_IOC_DIR(cmd) & _IOC_WRITE) { + ret = get_user(val, argp); + if (ret) + return ret; } - dev_dbg(ir->d.dev, LOGHEAD "ioctl called (0x%x)\n", - ir->d.name, ir->d.minor, cmd); + ret = mutex_lock_interruptible(&dev->lock); + if (ret) + return ret; - if (ir->d.minor == NOPLUG || !ir->attached) { - dev_err(ir->d.dev, LOGHEAD "ioctl result = -ENODEV\n", - ir->d.name, ir->d.minor); - return -ENODEV; + if (!dev->registered) { + ret = -ENODEV; + goto out; } - mutex_lock(&ir->irctl_lock); - switch (cmd) { case LIRC_GET_FEATURES: - result = put_user(ir->d.features, (__u32 __user *)arg); + if (dev->driver_type == RC_DRIVER_SCANCODE) + val |= LIRC_CAN_REC_SCANCODE; + + if (dev->driver_type == RC_DRIVER_IR_RAW) { + val |= LIRC_CAN_REC_MODE2; + if (dev->rx_resolution) + val |= LIRC_CAN_GET_REC_RESOLUTION; + } + + if (dev->tx_ir) { + val |= LIRC_CAN_SEND_PULSE; + if (dev->s_tx_mask) + val |= LIRC_CAN_SET_TRANSMITTER_MASK; + if (dev->s_tx_carrier) + val |= LIRC_CAN_SET_SEND_CARRIER; + if (dev->s_tx_duty_cycle) + val |= LIRC_CAN_SET_SEND_DUTY_CYCLE; + } + + if (dev->s_rx_carrier_range) + val |= LIRC_CAN_SET_REC_CARRIER | + LIRC_CAN_SET_REC_CARRIER_RANGE; + + if (dev->s_learning_mode) + val |= LIRC_CAN_USE_WIDEBAND_RECEIVER; + + if (dev->s_carrier_report) + val |= LIRC_CAN_MEASURE_CARRIER; + + if (dev->max_timeout) + val |= LIRC_CAN_SET_REC_TIMEOUT; + break; + + /* mode support */ case LIRC_GET_REC_MODE: - if (!LIRC_CAN_REC(ir->d.features)) { - result = -ENOTTY; - break; - } - - result = put_user(LIRC_REC2MODE - (ir->d.features & LIRC_CAN_REC_MASK), - (__u32 __user *)arg); + if (dev->driver_type == RC_DRIVER_IR_RAW_TX) + ret = -ENOTTY; + else + val = fh->rec_mode; break; + case LIRC_SET_REC_MODE: - if (!LIRC_CAN_REC(ir->d.features)) { - result = -ENOTTY; + switch (dev->driver_type) { + case RC_DRIVER_IR_RAW_TX: + ret = -ENOTTY; + break; + case RC_DRIVER_SCANCODE: + if (val != LIRC_MODE_SCANCODE) + ret = -EINVAL; + break; + case RC_DRIVER_IR_RAW: + if (!(val == LIRC_MODE_MODE2 || + val == LIRC_MODE_SCANCODE)) + ret = -EINVAL; break; } - result = get_user(mode, (__u32 __user *)arg); - if (!result && !(LIRC_MODE2REC(mode) & ir->d.features)) - result = -EINVAL; - /* - * FIXME: We should actually set the mode somehow but - * for now, lirc_serial doesn't support mode changing either - */ + if (!ret) + fh->rec_mode = val; break; - case LIRC_GET_LENGTH: - result = put_user(ir->d.code_length, (__u32 __user *)arg); + + case LIRC_GET_SEND_MODE: + if (!dev->tx_ir) + ret = -ENOTTY; + else + val = fh->send_mode; break; + + case LIRC_SET_SEND_MODE: + if (!dev->tx_ir) + ret = -ENOTTY; + else if (!(val == LIRC_MODE_PULSE || val == LIRC_MODE_SCANCODE)) + ret = -EINVAL; + else + fh->send_mode = val; + break; + + /* TX settings */ + case LIRC_SET_TRANSMITTER_MASK: + if (!dev->s_tx_mask) + ret = -ENOTTY; + else + ret = dev->s_tx_mask(dev, val); + break; + + case LIRC_SET_SEND_CARRIER: + if (!dev->s_tx_carrier) + ret = -ENOTTY; + else + ret = dev->s_tx_carrier(dev, val); + break; + + case LIRC_SET_SEND_DUTY_CYCLE: + if (!dev->s_tx_duty_cycle) + ret = -ENOTTY; + else if (val <= 0 || val >= 100) + ret = -EINVAL; + else + ret = dev->s_tx_duty_cycle(dev, val); + break; + + /* RX settings */ + case LIRC_SET_REC_CARRIER: + if (!dev->s_rx_carrier_range) + ret = -ENOTTY; + else if (val <= 0) + ret = -EINVAL; + else + ret = dev->s_rx_carrier_range(dev, fh->carrier_low, + val); + break; + + case LIRC_SET_REC_CARRIER_RANGE: + if (!dev->s_rx_carrier_range) + ret = -ENOTTY; + else if (val <= 0) + ret = -EINVAL; + else + fh->carrier_low = val; + break; + + case LIRC_GET_REC_RESOLUTION: + if (!dev->rx_resolution) + ret = -ENOTTY; + else + val = dev->rx_resolution / 1000; + break; + + case LIRC_SET_WIDEBAND_RECEIVER: + if (!dev->s_learning_mode) + ret = -ENOTTY; + else + ret = dev->s_learning_mode(dev, !!val); + break; + + case LIRC_SET_MEASURE_CARRIER_MODE: + if (!dev->s_carrier_report) + ret = -ENOTTY; + else + ret = dev->s_carrier_report(dev, !!val); + break; + + /* Generic timeout support */ case LIRC_GET_MIN_TIMEOUT: - if (!(ir->d.features & LIRC_CAN_SET_REC_TIMEOUT) || - ir->d.min_timeout == 0) { - result = -ENOTTY; - break; - } - - result = put_user(ir->d.min_timeout, (__u32 __user *)arg); + if (!dev->max_timeout) + ret = -ENOTTY; + else + val = DIV_ROUND_UP(dev->min_timeout, 1000); break; + case LIRC_GET_MAX_TIMEOUT: - if (!(ir->d.features & LIRC_CAN_SET_REC_TIMEOUT) || - ir->d.max_timeout == 0) { - result = -ENOTTY; - break; - } - - result = put_user(ir->d.max_timeout, (__u32 __user *)arg); + if (!dev->max_timeout) + ret = -ENOTTY; + else + val = dev->max_timeout / 1000; break; + + case LIRC_SET_REC_TIMEOUT: + if (!dev->max_timeout) { + ret = -ENOTTY; + } else if (val > U32_MAX / 1000) { + /* Check for multiply overflow */ + ret = -EINVAL; + } else { + u32 tmp = val * 1000; + + if (tmp < dev->min_timeout || tmp > dev->max_timeout) + ret = -EINVAL; + else if (dev->s_timeout) + ret = dev->s_timeout(dev, tmp); + else + dev->timeout = tmp; + } + break; + + case LIRC_GET_REC_TIMEOUT: + if (!dev->timeout) + ret = -ENOTTY; + else + val = DIV_ROUND_UP(dev->timeout, 1000); + break; + + case LIRC_SET_REC_TIMEOUT_REPORTS: + if (dev->driver_type != RC_DRIVER_IR_RAW) + ret = -ENOTTY; + else + fh->send_timeout_reports = !!val; + break; + default: - result = -ENOTTY; + ret = -ENOTTY; } - mutex_unlock(&ir->irctl_lock); + if (!ret && _IOC_DIR(cmd) & _IOC_READ) + ret = put_user(val, argp); - return result; +out: + mutex_unlock(&dev->lock); + return ret; } -EXPORT_SYMBOL(lirc_dev_fop_ioctl); -ssize_t lirc_dev_fop_read(struct file *file, - char __user *buffer, - size_t length, - loff_t *ppos) +static unsigned int ir_lirc_poll(struct file *file, + struct poll_table_struct *wait) { - struct irctl *ir = irctls[iminor(file_inode(file))]; - unsigned char *buf; - int ret = 0, written = 0; - DECLARE_WAITQUEUE(wait, current); + struct lirc_fh *fh = file->private_data; + struct rc_dev *rcdev = fh->rc; + unsigned int events = 0; - if (!ir) { - pr_err("called with invalid irctl\n"); - return -ENODEV; + poll_wait(file, &fh->wait_poll, wait); + + if (!rcdev->registered) { + events = EPOLLHUP | EPOLLERR; + } else if (rcdev->driver_type != RC_DRIVER_IR_RAW_TX) { + if (fh->rec_mode == LIRC_MODE_SCANCODE && + !kfifo_is_empty(&fh->scancodes)) + events = EPOLLIN | EPOLLRDNORM; + + if (fh->rec_mode == LIRC_MODE_MODE2 && + !kfifo_is_empty(&fh->rawir)) + events = EPOLLIN | EPOLLRDNORM; } - if (!LIRC_CAN_REC(ir->d.features)) + return events; +} + +static ssize_t ir_lirc_read_mode2(struct file *file, char __user *buffer, + size_t length) +{ + struct lirc_fh *fh = file->private_data; + struct rc_dev *rcdev = fh->rc; + unsigned int copied; + int ret; + + if (length < sizeof(unsigned int) || length % sizeof(unsigned int)) return -EINVAL; - dev_dbg(ir->d.dev, LOGHEAD "read called\n", ir->d.name, ir->d.minor); + do { + if (kfifo_is_empty(&fh->rawir)) { + if (file->f_flags & O_NONBLOCK) + return -EAGAIN; - buf = kzalloc(ir->chunk_size, GFP_KERNEL); - if (!buf) - return -ENOMEM; - - if (mutex_lock_interruptible(&ir->irctl_lock)) { - ret = -ERESTARTSYS; - goto out_unlocked; - } - if (!ir->attached) { - ret = -ENODEV; - goto out_locked; - } - - if (length % ir->chunk_size) { - ret = -EINVAL; - goto out_locked; - } - - /* - * we add ourselves to the task queue before buffer check - * to avoid losing scan code (in case when queue is awaken somewhere - * between while condition checking and scheduling) - */ - add_wait_queue(&ir->buf->wait_poll, &wait); - - /* - * while we didn't provide 'length' bytes, device is opened in blocking - * mode and 'copy_to_user' is happy, wait for data. - */ - while (written < length && ret == 0) { - if (lirc_buffer_empty(ir->buf)) { - /* According to the read(2) man page, 'written' can be - * returned as less than 'length', instead of blocking - * again, returning -EWOULDBLOCK, or returning - * -ERESTARTSYS - */ - if (written) - break; - if (file->f_flags & O_NONBLOCK) { - ret = -EWOULDBLOCK; - break; - } - if (signal_pending(current)) { - ret = -ERESTARTSYS; - break; - } - - mutex_unlock(&ir->irctl_lock); - set_current_state(TASK_INTERRUPTIBLE); - schedule(); - set_current_state(TASK_RUNNING); - - if (mutex_lock_interruptible(&ir->irctl_lock)) { - ret = -ERESTARTSYS; - remove_wait_queue(&ir->buf->wait_poll, &wait); - goto out_unlocked; - } - - if (!ir->attached) { - ret = -ENODEV; - goto out_locked; - } - } else { - lirc_buffer_read(ir->buf, buf); - ret = copy_to_user((void __user *)buffer+written, buf, - ir->buf->chunk_size); - if (!ret) - written += ir->buf->chunk_size; - else - ret = -EFAULT; + ret = wait_event_interruptible(fh->wait_poll, + !kfifo_is_empty(&fh->rawir) || + !rcdev->registered); + if (ret) + return ret; } + + if (!rcdev->registered) + return -ENODEV; + + ret = mutex_lock_interruptible(&rcdev->lock); + if (ret) + return ret; + ret = kfifo_to_user(&fh->rawir, buffer, length, &copied); + mutex_unlock(&rcdev->lock); + if (ret) + return ret; + } while (copied == 0); + + return copied; +} + +static ssize_t ir_lirc_read_scancode(struct file *file, char __user *buffer, + size_t length) +{ + struct lirc_fh *fh = file->private_data; + struct rc_dev *rcdev = fh->rc; + unsigned int copied; + int ret; + + if (length < sizeof(struct lirc_scancode) || + length % sizeof(struct lirc_scancode)) + return -EINVAL; + + do { + if (kfifo_is_empty(&fh->scancodes)) { + if (file->f_flags & O_NONBLOCK) + return -EAGAIN; + + ret = wait_event_interruptible(fh->wait_poll, + !kfifo_is_empty(&fh->scancodes) || + !rcdev->registered); + if (ret) + return ret; + } + + if (!rcdev->registered) + return -ENODEV; + + ret = mutex_lock_interruptible(&rcdev->lock); + if (ret) + return ret; + ret = kfifo_to_user(&fh->scancodes, buffer, length, &copied); + mutex_unlock(&rcdev->lock); + if (ret) + return ret; + } while (copied == 0); + + return copied; +} + +static ssize_t ir_lirc_read(struct file *file, char __user *buffer, + size_t length, loff_t *ppos) +{ + struct lirc_fh *fh = file->private_data; + struct rc_dev *rcdev = fh->rc; + + if (rcdev->driver_type == RC_DRIVER_IR_RAW_TX) + return -EINVAL; + + if (!rcdev->registered) + return -ENODEV; + + if (fh->rec_mode == LIRC_MODE_MODE2) + return ir_lirc_read_mode2(file, buffer, length); + else /* LIRC_MODE_SCANCODE */ + return ir_lirc_read_scancode(file, buffer, length); +} + +static const struct file_operations lirc_fops = { + .owner = THIS_MODULE, + .write = ir_lirc_transmit_ir, + .unlocked_ioctl = ir_lirc_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ir_lirc_ioctl, +#endif + .read = ir_lirc_read, + .poll = ir_lirc_poll, + .open = ir_lirc_open, + .release = ir_lirc_close, + .llseek = no_llseek, +}; + +static void lirc_release_device(struct device *ld) +{ + struct rc_dev *rcdev = container_of(ld, struct rc_dev, lirc_dev); + + put_device(&rcdev->dev); +} + +int ir_lirc_register(struct rc_dev *dev) +{ + const char *rx_type, *tx_type; + int err, minor; + + minor = ida_simple_get(&lirc_ida, 0, RC_DEV_MAX, GFP_KERNEL); + if (minor < 0) + return minor; + + device_initialize(&dev->lirc_dev); + dev->lirc_dev.class = lirc_class; + dev->lirc_dev.parent = &dev->dev; + dev->lirc_dev.release = lirc_release_device; + dev->lirc_dev.devt = MKDEV(MAJOR(lirc_base_dev), minor); + dev_set_name(&dev->lirc_dev, "lirc%d", minor); + + INIT_LIST_HEAD(&dev->lirc_fh); + spin_lock_init(&dev->lirc_fh_lock); + + cdev_init(&dev->lirc_cdev, &lirc_fops); + + err = cdev_device_add(&dev->lirc_cdev, &dev->lirc_dev); + if (err) + goto out_ida; + + get_device(&dev->dev); + + switch (dev->driver_type) { + case RC_DRIVER_SCANCODE: + rx_type = "scancode"; + break; + case RC_DRIVER_IR_RAW: + rx_type = "raw IR"; + break; + default: + rx_type = "no"; + break; } - remove_wait_queue(&ir->buf->wait_poll, &wait); + if (dev->tx_ir) + tx_type = "raw IR"; + else + tx_type = "no"; -out_locked: - mutex_unlock(&ir->irctl_lock); + dev_info(&dev->dev, "lirc_dev: driver %s registered at minor = %d, %s receiver, %s transmitter", + dev->driver_name, minor, rx_type, tx_type); -out_unlocked: - kfree(buf); + return 0; - return ret ? ret : written; +out_ida: + ida_simple_remove(&lirc_ida, minor); + return err; } -EXPORT_SYMBOL(lirc_dev_fop_read); -void *lirc_get_pdata(struct file *file) +void ir_lirc_unregister(struct rc_dev *dev) { - return irctls[iminor(file_inode(file))]->d.data; + unsigned long flags; + struct lirc_fh *fh; + + dev_dbg(&dev->dev, "lirc_dev: driver %s unregistered from minor = %d\n", + dev->driver_name, MINOR(dev->lirc_dev.devt)); + + spin_lock_irqsave(&dev->lirc_fh_lock, flags); + list_for_each_entry(fh, &dev->lirc_fh, list) + wake_up_poll(&fh->wait_poll, EPOLLHUP | EPOLLERR); + spin_unlock_irqrestore(&dev->lirc_fh_lock, flags); + + cdev_device_del(&dev->lirc_cdev, &dev->lirc_dev); + ida_simple_remove(&lirc_ida, MINOR(dev->lirc_dev.devt)); } -EXPORT_SYMBOL(lirc_get_pdata); - -static int __init lirc_dev_init(void) +int __init lirc_dev_init(void) { int retval; @@ -590,7 +833,7 @@ static int __init lirc_dev_init(void) return PTR_ERR(lirc_class); } - retval = alloc_chrdev_region(&lirc_base_dev, 0, MAX_IRCTL_DEVICES, + retval = alloc_chrdev_region(&lirc_base_dev, 0, RC_DEV_MAX, "BaseRemoteCtl"); if (retval) { class_destroy(lirc_class); @@ -598,22 +841,39 @@ static int __init lirc_dev_init(void) return retval; } - pr_info("IR Remote Control driver registered, major %d\n", - MAJOR(lirc_base_dev)); + pr_debug("IR Remote Control driver registered, major %d\n", + MAJOR(lirc_base_dev)); return 0; } -static void __exit lirc_dev_exit(void) +void __exit lirc_dev_exit(void) { class_destroy(lirc_class); - unregister_chrdev_region(lirc_base_dev, MAX_IRCTL_DEVICES); - pr_info("module unloaded\n"); + unregister_chrdev_region(lirc_base_dev, RC_DEV_MAX); } -module_init(lirc_dev_init); -module_exit(lirc_dev_exit); +struct rc_dev *rc_dev_get_from_fd(int fd) +{ + struct fd f = fdget(fd); + struct lirc_fh *fh; + struct rc_dev *dev; -MODULE_DESCRIPTION("LIRC base driver module"); -MODULE_AUTHOR("Artur Lipowski"); -MODULE_LICENSE("GPL"); + if (!f.file) + return ERR_PTR(-EBADF); + + if (f.file->f_op != &lirc_fops) { + fdput(f); + return ERR_PTR(-EINVAL); + } + + fh = f.file->private_data; + dev = fh->rc; + + get_device(&dev->dev); + fdput(f); + + return dev; +} + +MODULE_ALIAS("lirc_dev"); diff --git a/drivers/media/rc/mceusb.c b/drivers/media/rc/mceusb.c index 53e68d02763c..0b619a2c146e 100644 --- a/drivers/media/rc/mceusb.c +++ b/drivers/media/rc/mceusb.c @@ -42,21 +42,22 @@ #include #include -#define DRIVER_VERSION "1.93" +#define DRIVER_VERSION "1.95" #define DRIVER_AUTHOR "Jarod Wilson " #define DRIVER_DESC "Windows Media Center Ed. eHome Infrared Transceiver " \ "device driver" #define DRIVER_NAME "mceusb" +#define USB_TX_TIMEOUT 1000 /* in milliseconds */ #define USB_CTRL_MSG_SZ 2 /* Size of usb ctrl msg on gen1 hw */ #define MCE_G1_INIT_MSGS 40 /* Init messages on gen1 hw to throw out */ /* MCE constants */ -#define MCE_CMDBUF_SIZE 384 /* MCE Command buffer length */ +#define MCE_IRBUF_SIZE 128 /* TX IR buffer length */ #define MCE_TIME_UNIT 50 /* Approx 50us resolution */ -#define MCE_CODE_LENGTH 5 /* Normal length of packet (with header) */ -#define MCE_PACKET_SIZE 4 /* Normal length of packet (without header) */ -#define MCE_IRDATA_HEADER 0x84 /* Actual header format is 0x80 + num_bytes */ +#define MCE_PACKET_SIZE 31 /* Max length of packet (with header) */ +#define MCE_IRDATA_HEADER (0x80 + MCE_PACKET_SIZE - 1) + /* Actual format is 0x80 + num_bytes */ #define MCE_IRDATA_TRAILER 0x80 /* End of IR data */ #define MCE_MAX_CHANNELS 2 /* Two transmitters, hardware dependent? */ #define MCE_DEFAULT_TX_MASK 0x03 /* Vals: TX1=0x01, TX2=0x02, ALL=0x03 */ @@ -181,13 +182,17 @@ enum mceusb_model_type { MCE_GEN2 = 0, /* Most boards */ MCE_GEN1, MCE_GEN3, + MCE_GEN3_BROKEN_IRTIMEOUT, MCE_GEN2_TX_INV, + MCE_GEN2_TX_INV_RX_GOOD, POLARIS_EVK, CX_HYBRID_TV, MULTIFUNCTION, TIVO_KIT, MCE_GEN2_NO_TX, HAUPPAUGE_CX_HYBRID_TV, + EVROMEDIA_FULL_HYBRID_FULLHD, + ASTROMETA_T2HYBRID, }; struct mceusb_model { @@ -196,6 +201,14 @@ struct mceusb_model { u32 mce_gen3:1; u32 tx_mask_normal:1; u32 no_tx:1; + u32 broken_irtimeout:1; + /* + * 2nd IR receiver (short-range, wideband) for learning mode: + * 0, absent 2nd receiver (rx2) + * 1, rx2 present + * 2, rx2 which under counts IR carrier cycles + */ + u32 rx2; int ir_intfnum; @@ -207,9 +220,11 @@ static const struct mceusb_model mceusb_model[] = { [MCE_GEN1] = { .mce_gen1 = 1, .tx_mask_normal = 1, + .rx2 = 2, }, [MCE_GEN2] = { .mce_gen2 = 1, + .rx2 = 2, }, [MCE_GEN2_NO_TX] = { .mce_gen2 = 1, @@ -218,10 +233,23 @@ static const struct mceusb_model mceusb_model[] = { [MCE_GEN2_TX_INV] = { .mce_gen2 = 1, .tx_mask_normal = 1, + .rx2 = 1, + }, + [MCE_GEN2_TX_INV_RX_GOOD] = { + .mce_gen2 = 1, + .tx_mask_normal = 1, + .rx2 = 2, }, [MCE_GEN3] = { .mce_gen3 = 1, .tx_mask_normal = 1, + .rx2 = 2, + }, + [MCE_GEN3_BROKEN_IRTIMEOUT] = { + .mce_gen3 = 1, + .tx_mask_normal = 1, + .rx2 = 2, + .broken_irtimeout = 1 }, [POLARIS_EVK] = { /* @@ -230,6 +258,7 @@ static const struct mceusb_model mceusb_model[] = { * to allow testing it */ .name = "Conexant Hybrid TV (cx231xx) MCE IR", + .rx2 = 2, }, [CX_HYBRID_TV] = { .no_tx = 1, /* tx isn't wired up at all */ @@ -242,14 +271,26 @@ static const struct mceusb_model mceusb_model[] = { [MULTIFUNCTION] = { .mce_gen2 = 1, .ir_intfnum = 2, + .rx2 = 2, }, [TIVO_KIT] = { .mce_gen2 = 1, .rc_map = RC_MAP_TIVO, + .rx2 = 2, }, + [EVROMEDIA_FULL_HYBRID_FULLHD] = { + .name = "Evromedia USB Full Hybrid Full HD", + .no_tx = 1, + .rc_map = RC_MAP_MSI_DIGIVOX_III, + }, + [ASTROMETA_T2HYBRID] = { + .name = "Astrometa T2Hybrid", + .no_tx = 1, + .rc_map = RC_MAP_ASTROMETA_T2HYBRID, + } }; -static struct usb_device_id mceusb_dev_table[] = { +static const struct usb_device_id mceusb_dev_table[] = { /* Original Microsoft MCE IR Transceiver (often HP-branded) */ { USB_DEVICE(VENDOR_MICROSOFT, 0x006d), .driver_info = MCE_GEN1 }, @@ -278,7 +319,7 @@ static struct usb_device_id mceusb_dev_table[] = { .driver_info = MULTIFUNCTION }, /* SMK/Toshiba G83C0004D410 */ { USB_DEVICE(VENDOR_SMK, 0x031d), - .driver_info = MCE_GEN2_TX_INV }, + .driver_info = MCE_GEN2_TX_INV_RX_GOOD }, /* SMK eHome Infrared Transceiver (Sony VAIO) */ { USB_DEVICE(VENDOR_SMK, 0x0322), .driver_info = MCE_GEN2_TX_INV }, @@ -320,7 +361,7 @@ static struct usb_device_id mceusb_dev_table[] = { .driver_info = MCE_GEN2_TX_INV }, /* Topseed eHome Infrared Transceiver */ { USB_DEVICE(VENDOR_TOPSEED, 0x0011), - .driver_info = MCE_GEN3 }, + .driver_info = MCE_GEN3_BROKEN_IRTIMEOUT }, /* Ricavision internal Infrared Transceiver */ { USB_DEVICE(VENDOR_RICAVISION, 0x0010) }, /* Itron ione Libra Q-11 */ @@ -398,6 +439,12 @@ static struct usb_device_id mceusb_dev_table[] = { .driver_info = HAUPPAUGE_CX_HYBRID_TV }, /* Adaptec / HP eHome Receiver */ { USB_DEVICE(VENDOR_ADAPTEC, 0x0094) }, + /* Evromedia USB Full Hybrid Full HD */ + { USB_DEVICE(0x1b80, 0xd3b2), + .driver_info = EVROMEDIA_FULL_HYBRID_FULLHD }, + /* Astrometa T2hybrid */ + { USB_DEVICE(0x15f4, 0x0135), + .driver_info = ASTROMETA_T2HYBRID }, /* Terminating entry */ { } @@ -409,7 +456,8 @@ struct mceusb_dev { struct rc_dev *rc; /* optional features we can enable */ - bool learning_enabled; + bool carrier_report_enabled; + bool wideband_rx_enabled; /* aka learning mode, short-range rx */ /* core device bits */ struct device *dev; @@ -440,6 +488,7 @@ struct mceusb_dev { u32 tx_mask_normal:1; u32 microsoft_gen1:1; u32 no_tx:1; + u32 rx2; } flags; /* transmit support */ @@ -456,6 +505,11 @@ struct mceusb_dev { u8 num_rxports; /* number of receive sensors */ u8 txports_cabled; /* bitmask of transmitters with cable */ u8 rxports_active; /* bitmask of active receive sensors */ + bool learning_active; /* wideband rx is active */ + + /* receiver carrier frequency detection support */ + u32 pulse_tunit; /* IR pulse "on" cumulative time units */ + u32 pulse_count; /* pulse "on" count in measurement interval */ /* * support for async error handler mceusb_deferred_kevent() @@ -519,6 +573,7 @@ static int mceusb_cmd_datasize(u8 cmd, u8 subcmd) datasize = 1; break; } + break; case MCE_CMD_PORT_IR: switch (subcmd) { case MCE_CMD_UNKNOWN: @@ -555,9 +610,9 @@ static void mceusb_dev_printdata(struct mceusb_dev *ir, u8 *buf, int buf_len, if (len <= skip) return; - dev_dbg(dev, "%cx data: %*ph (length=%d)", - (out ? 't' : 'r'), - min(len, buf_len - offset), buf + offset, len); + dev_dbg(dev, "%cx data[%d]: %*ph (len=%d sz=%d)", + (out ? 't' : 'r'), offset, + min(len, buf_len - offset), buf + offset, len, buf_len); inout = out ? "Request" : "Got"; @@ -673,8 +728,8 @@ static void mceusb_dev_printdata(struct mceusb_dev *ir, u8 *buf, int buf_len, /* aka MCE_RSP_EQIRRXCFCNT */ if (out) dev_dbg(dev, "Get receive sensor"); - else if (ir->learning_enabled) - dev_dbg(dev, "RX pulse count: %d", + else + dev_dbg(dev, "RX carrier cycle count: %d", ((data[0] << 8) | data[1])); break; case MCE_RSP_EQIRNUMPORTS: @@ -686,6 +741,9 @@ static void mceusb_dev_printdata(struct mceusb_dev *ir, u8 *buf, int buf_len, case MCE_RSP_CMD_ILLEGAL: dev_dbg(dev, "Illegal PORT_IR command"); break; + case MCE_RSP_TX_TIMEOUT: + dev_dbg(dev, "IR TX timeout (TX buffer underrun)"); + break; default: dev_dbg(dev, "Unknown command 0x%02x 0x%02x", cmd, subcmd); @@ -700,13 +758,14 @@ static void mceusb_dev_printdata(struct mceusb_dev *ir, u8 *buf, int buf_len, dev_dbg(dev, "End of raw IR data"); else if ((cmd != MCE_CMD_PORT_IR) && ((cmd & MCE_PORT_MASK) == MCE_COMMAND_IRDATA)) - dev_dbg(dev, "Raw IR data, %d pulse/space samples", ir->rem); + dev_dbg(dev, "Raw IR data, %d pulse/space samples", + cmd & MCE_PACKET_LENGTH_MASK); #endif } /* * Schedule work that can't be done in interrupt handlers - * (mceusb_dev_recv() and mce_async_callback()) nor tasklets. + * (mceusb_dev_recv() and mce_write_callback()) nor tasklets. * Invokes mceusb_deferred_kevent() for recovering from * error events specified by the kevent bit field. */ @@ -719,23 +778,80 @@ static void mceusb_defer_kevent(struct mceusb_dev *ir, int kevent) dev_dbg(ir->dev, "kevent %d scheduled", kevent); } -static void mce_async_callback(struct urb *urb) +static void mce_write_callback(struct urb *urb) { - struct mceusb_dev *ir; - int len; - if (!urb) return; - ir = urb->context; + complete(urb->context); +} + +/* + * Write (TX/send) data to MCE device USB endpoint out. + * Used for IR blaster TX and MCE device commands. + * + * Return: The number of bytes written (> 0) or errno (< 0). + */ +static int mce_write(struct mceusb_dev *ir, u8 *data, int size) +{ + int ret; + struct urb *urb; + struct device *dev = ir->dev; + unsigned char *buf_out; + struct completion tx_done; + unsigned long expire; + unsigned long ret_wait; + + mceusb_dev_printdata(ir, data, size, 0, size, true); + + urb = usb_alloc_urb(0, GFP_KERNEL); + if (unlikely(!urb)) { + dev_err(dev, "Error: mce write couldn't allocate urb"); + return -ENOMEM; + } + + buf_out = kmalloc(size, GFP_KERNEL); + if (!buf_out) { + usb_free_urb(urb); + return -ENOMEM; + } + + init_completion(&tx_done); + + /* outbound data */ + if (usb_endpoint_xfer_int(ir->usb_ep_out)) + usb_fill_int_urb(urb, ir->usbdev, ir->pipe_out, + buf_out, size, mce_write_callback, &tx_done, + ir->usb_ep_out->bInterval); + else + usb_fill_bulk_urb(urb, ir->usbdev, ir->pipe_out, + buf_out, size, mce_write_callback, &tx_done); + memcpy(buf_out, data, size); + + ret = usb_submit_urb(urb, GFP_KERNEL); + if (ret) { + dev_err(dev, "Error: mce write submit urb error = %d", ret); + kfree(buf_out); + usb_free_urb(urb); + return ret; + } + + expire = msecs_to_jiffies(USB_TX_TIMEOUT); + ret_wait = wait_for_completion_timeout(&tx_done, expire); + if (!ret_wait) { + dev_err(dev, "Error: mce write timed out (expire = %lu (%dms))", + expire, USB_TX_TIMEOUT); + usb_kill_urb(urb); + ret = (urb->status == -ENOENT ? -ETIMEDOUT : urb->status); + } else { + ret = urb->status; + } + if (ret >= 0) + ret = urb->actual_length; /* bytes written */ switch (urb->status) { /* success */ case 0: - len = urb->actual_length; - - mceusb_dev_printdata(ir, urb->transfer_buffer, len, - 0, len, true); break; case -ECONNRESET: @@ -745,140 +861,135 @@ static void mce_async_callback(struct urb *urb) break; case -EPIPE: - dev_err(ir->dev, "Error: request urb status = %d (TX HALT)", + dev_err(ir->dev, "Error: mce write urb status = %d (TX HALT)", urb->status); mceusb_defer_kevent(ir, EVENT_TX_HALT); break; default: - dev_err(ir->dev, "Error: request urb status = %d", urb->status); + dev_err(ir->dev, "Error: mce write urb status = %d", + urb->status); break; } - /* the transfer buffer and urb were allocated in mce_request_packet */ - kfree(urb->transfer_buffer); + dev_dbg(dev, "tx done status = %d (wait = %lu, expire = %lu (%dms), urb->actual_length = %d, urb->status = %d)", + ret, ret_wait, expire, USB_TX_TIMEOUT, + urb->actual_length, urb->status); + + kfree(buf_out); usb_free_urb(urb); + + return ret; } -/* request outgoing (send) usb packet - used to initialize remote */ -static void mce_request_packet(struct mceusb_dev *ir, unsigned char *data, - int size) -{ - int res; - struct urb *async_urb; - struct device *dev = ir->dev; - unsigned char *async_buf; - - async_urb = usb_alloc_urb(0, GFP_KERNEL); - if (unlikely(!async_urb)) { - dev_err(dev, "Error, couldn't allocate urb!"); - return; - } - - async_buf = kmalloc(size, GFP_KERNEL); - if (!async_buf) { - usb_free_urb(async_urb); - return; - } - - /* outbound data */ - if (usb_endpoint_xfer_int(ir->usb_ep_out)) - usb_fill_int_urb(async_urb, ir->usbdev, ir->pipe_out, - async_buf, size, mce_async_callback, ir, - ir->usb_ep_out->bInterval); - else - usb_fill_bulk_urb(async_urb, ir->usbdev, ir->pipe_out, - async_buf, size, mce_async_callback, ir); - - memcpy(async_buf, data, size); - - dev_dbg(dev, "send request called (size=%#x)", size); - - res = usb_submit_urb(async_urb, GFP_ATOMIC); - if (res) { - dev_err(dev, "send request FAILED! (res=%d)", res); - kfree(async_buf); - usb_free_urb(async_urb); - return; - } - dev_dbg(dev, "send request complete (res=%d)", res); -} - -static void mce_async_out(struct mceusb_dev *ir, unsigned char *data, int size) +static void mce_command_out(struct mceusb_dev *ir, u8 *data, int size) { int rsize = sizeof(DEVICE_RESUME); if (ir->need_reset) { ir->need_reset = false; - mce_request_packet(ir, DEVICE_RESUME, rsize); + mce_write(ir, DEVICE_RESUME, rsize); msleep(10); } - mce_request_packet(ir, data, size); + mce_write(ir, data, size); msleep(10); } -/* Send data out the IR blaster port(s) */ +/* + * Transmit IR out the MCE device IR blaster port(s). + * + * Convert IR pulse/space sequence from LIRC to MCE format. + * Break up a long IR sequence into multiple parts (MCE IR data packets). + * + * u32 txbuf[] consists of IR pulse, space, ..., and pulse times in usec. + * Pulses and spaces are implicit by their position. + * The first IR sample, txbuf[0], is always a pulse. + * + * u8 irbuf[] consists of multiple IR data packets for the MCE device. + * A packet is 1 u8 MCE_IRDATA_HEADER and up to 30 u8 IR samples. + * An IR sample is 1-bit pulse/space flag with 7-bit time + * in MCE time units (50usec). + * + * Return: The number of IR samples sent (> 0) or errno (< 0). + */ static int mceusb_tx_ir(struct rc_dev *dev, unsigned *txbuf, unsigned count) { struct mceusb_dev *ir = dev->priv; - int i, length, ret = 0; - int cmdcount = 0; - unsigned char cmdbuf[MCE_CMDBUF_SIZE]; - - /* MCE tx init header */ - cmdbuf[cmdcount++] = MCE_CMD_PORT_IR; - cmdbuf[cmdcount++] = MCE_CMD_SETIRTXPORTS; - cmdbuf[cmdcount++] = ir->tx_mask; + u8 cmdbuf[3] = { MCE_CMD_PORT_IR, MCE_CMD_SETIRTXPORTS, 0x00 }; + u8 irbuf[MCE_IRBUF_SIZE]; + int ircount = 0; + unsigned int irsample; + int i, length, ret; /* Send the set TX ports command */ - mce_async_out(ir, cmdbuf, cmdcount); - cmdcount = 0; + cmdbuf[2] = ir->tx_mask; + mce_command_out(ir, cmdbuf, sizeof(cmdbuf)); - /* Generate mce packet data */ - for (i = 0; (i < count) && (cmdcount < MCE_CMDBUF_SIZE); i++) { - txbuf[i] = txbuf[i] / MCE_TIME_UNIT; + /* Generate mce IR data packet */ + for (i = 0; i < count; i++) { + irsample = txbuf[i] / MCE_TIME_UNIT; - do { /* loop to support long pulses/spaces > 127*50us=6.35ms */ - - /* Insert mce packet header every 4th entry */ - if ((cmdcount < MCE_CMDBUF_SIZE) && - (cmdcount % MCE_CODE_LENGTH) == 0) - cmdbuf[cmdcount++] = MCE_IRDATA_HEADER; - - /* Insert mce packet data */ - if (cmdcount < MCE_CMDBUF_SIZE) - cmdbuf[cmdcount++] = - (txbuf[i] < MCE_PULSE_BIT ? - txbuf[i] : MCE_MAX_PULSE_LENGTH) | - (i & 1 ? 0x00 : MCE_PULSE_BIT); - else { - ret = -EINVAL; - goto out; + /* loop to support long pulses/spaces > 6350us (127*50us) */ + while (irsample > 0) { + /* Insert IR header every 30th entry */ + if (ircount % MCE_PACKET_SIZE == 0) { + /* Room for IR header and one IR sample? */ + if (ircount >= MCE_IRBUF_SIZE - 1) { + /* Send near full buffer */ + ret = mce_write(ir, irbuf, ircount); + if (ret < 0) + return ret; + ircount = 0; + } + irbuf[ircount++] = MCE_IRDATA_HEADER; } - } while ((txbuf[i] > MCE_MAX_PULSE_LENGTH) && - (txbuf[i] -= MCE_MAX_PULSE_LENGTH)); - } + /* Insert IR sample */ + if (irsample <= MCE_MAX_PULSE_LENGTH) { + irbuf[ircount] = irsample; + irsample = 0; + } else { + irbuf[ircount] = MCE_MAX_PULSE_LENGTH; + irsample -= MCE_MAX_PULSE_LENGTH; + } + /* + * Even i = IR pulse + * Odd i = IR space + */ + irbuf[ircount] |= (i & 1 ? 0 : MCE_PULSE_BIT); + ircount++; - /* Check if we have room for the empty packet at the end */ - if (cmdcount >= MCE_CMDBUF_SIZE) { - ret = -EINVAL; - goto out; - } + /* IR buffer full? */ + if (ircount >= MCE_IRBUF_SIZE) { + /* Fix packet length in last header */ + length = ircount % MCE_PACKET_SIZE; + if (length > 0) + irbuf[ircount - length] -= + MCE_PACKET_SIZE - length; + /* Send full buffer */ + ret = mce_write(ir, irbuf, ircount); + if (ret < 0) + return ret; + ircount = 0; + } + } + } /* after for loop, 0 <= ircount < MCE_IRBUF_SIZE */ /* Fix packet length in last header */ - length = cmdcount % MCE_CODE_LENGTH; - cmdbuf[cmdcount - length] -= MCE_CODE_LENGTH - length; + length = ircount % MCE_PACKET_SIZE; + if (length > 0) + irbuf[ircount - length] -= MCE_PACKET_SIZE - length; - /* All mce commands end with an empty packet (0x80) */ - cmdbuf[cmdcount++] = MCE_IRDATA_TRAILER; + /* Append IR trailer (0x80) to final partial (or empty) IR buffer */ + irbuf[ircount++] = MCE_IRDATA_TRAILER; - /* Transmit the command to the mce device */ - mce_async_out(ir, cmdbuf, cmdcount); + /* Send final buffer */ + ret = mce_write(ir, irbuf, ircount); + if (ret < 0) + return ret; -out: - return ret ? ret : count; + return count; } /* Sets active IR outputs -- mce devices typically have two */ @@ -918,7 +1029,7 @@ static int mceusb_set_tx_carrier(struct rc_dev *dev, u32 carrier) cmdbuf[2] = MCE_CMD_SIG_END; cmdbuf[3] = MCE_IRDATA_TRAILER; dev_dbg(ir->dev, "disabling carrier modulation"); - mce_async_out(ir, cmdbuf, sizeof(cmdbuf)); + mce_command_out(ir, cmdbuf, sizeof(cmdbuf)); return 0; } @@ -932,7 +1043,7 @@ static int mceusb_set_tx_carrier(struct rc_dev *dev, u32 carrier) carrier); /* Transmit new carrier to mce device */ - mce_async_out(ir, cmdbuf, sizeof(cmdbuf)); + mce_command_out(ir, cmdbuf, sizeof(cmdbuf)); return 0; } } @@ -944,6 +1055,86 @@ static int mceusb_set_tx_carrier(struct rc_dev *dev, u32 carrier) return 0; } +static int mceusb_set_timeout(struct rc_dev *dev, unsigned int timeout) +{ + u8 cmdbuf[4] = { MCE_CMD_PORT_IR, MCE_CMD_SETIRTIMEOUT, 0, 0 }; + struct mceusb_dev *ir = dev->priv; + unsigned int units; + + units = DIV_ROUND_CLOSEST(timeout, US_TO_NS(MCE_TIME_UNIT)); + + cmdbuf[2] = units >> 8; + cmdbuf[3] = units; + + mce_command_out(ir, cmdbuf, sizeof(cmdbuf)); + + /* get receiver timeout value */ + mce_command_out(ir, GET_RX_TIMEOUT, sizeof(GET_RX_TIMEOUT)); + + return 0; +} + +/* + * Select or deselect the 2nd receiver port. + * Second receiver is learning mode, wide-band, short-range receiver. + * Only one receiver (long or short range) may be active at a time. + */ +static int mceusb_set_rx_wideband(struct rc_dev *dev, int enable) +{ + struct mceusb_dev *ir = dev->priv; + unsigned char cmdbuf[3] = { MCE_CMD_PORT_IR, + MCE_CMD_SETIRRXPORTEN, 0x00 }; + + dev_dbg(ir->dev, "select %s-range receive sensor", + enable ? "short" : "long"); + if (enable) { + ir->wideband_rx_enabled = true; + cmdbuf[2] = 2; /* port 2 is short range receiver */ + } else { + ir->wideband_rx_enabled = false; + cmdbuf[2] = 1; /* port 1 is long range receiver */ + } + mce_command_out(ir, cmdbuf, sizeof(cmdbuf)); + /* response from device sets ir->learning_active */ + + return 0; +} + +/* + * Enable/disable receiver carrier frequency pass through reporting. + * Only the short-range receiver has carrier frequency measuring capability. + * Implicitly select this receiver when enabling carrier frequency reporting. + */ +static int mceusb_set_rx_carrier_report(struct rc_dev *dev, int enable) +{ + struct mceusb_dev *ir = dev->priv; + unsigned char cmdbuf[3] = { MCE_CMD_PORT_IR, + MCE_CMD_SETIRRXPORTEN, 0x00 }; + + dev_dbg(ir->dev, "%s short-range receiver carrier reporting", + enable ? "enable" : "disable"); + if (enable) { + ir->carrier_report_enabled = true; + if (!ir->learning_active) { + cmdbuf[2] = 2; /* port 2 is short range receiver */ + mce_command_out(ir, cmdbuf, sizeof(cmdbuf)); + } + } else { + ir->carrier_report_enabled = false; + /* + * Revert to normal (long-range) receiver only if the + * wideband (short-range) receiver wasn't explicitly + * enabled. + */ + if (ir->learning_active && !ir->wideband_rx_enabled) { + cmdbuf[2] = 1; /* port 1 is long range receiver */ + mce_command_out(ir, cmdbuf, sizeof(cmdbuf)); + } + } + + return 0; +} + /* * We don't do anything but print debug spew for many of the command bits * we receive from the hardware, but some of them are useful information @@ -951,8 +1142,11 @@ static int mceusb_set_tx_carrier(struct rc_dev *dev, u32 carrier) */ static void mceusb_handle_command(struct mceusb_dev *ir, int index) { + DEFINE_IR_RAW_EVENT(rawir); u8 hi = ir->buf_in[index + 1] & 0xff; u8 lo = ir->buf_in[index + 2] & 0xff; + u32 carrier_cycles; + u32 cycles_fix; switch (ir->buf_in[index]) { /* the one and only 5-byte return value command */ @@ -969,6 +1163,33 @@ static void mceusb_handle_command(struct mceusb_dev *ir, int index) ir->num_txports = hi; ir->num_rxports = lo; break; + case MCE_RSP_EQIRRXCFCNT: + /* + * The carrier cycle counter can overflow and wrap around + * without notice from the device. So frequency measurement + * will be inaccurate with long duration IR. + * + * The long-range (non learning) receiver always reports + * zero count so we always ignore its report. + */ + if (ir->carrier_report_enabled && ir->learning_active && + ir->pulse_tunit > 0) { + carrier_cycles = (hi << 8 | lo); + /* + * Adjust carrier cycle count by adding + * 1 missed count per pulse "on" + */ + cycles_fix = ir->flags.rx2 == 2 ? ir->pulse_count : 0; + rawir.carrier_report = 1; + rawir.carrier = (1000000u / MCE_TIME_UNIT) * + (carrier_cycles + cycles_fix) / + ir->pulse_tunit; + dev_dbg(ir->dev, "RX carrier frequency %u Hz (pulse count = %u, cycles = %u, duration = %u, rx2 = %u)", + rawir.carrier, ir->pulse_count, carrier_cycles, + ir->pulse_tunit, ir->flags.rx2); + ir_raw_event_store(ir->rc, &rawir); + } + break; /* 1-byte return value commands */ case MCE_RSP_EQEMVER: @@ -978,10 +1199,15 @@ static void mceusb_handle_command(struct mceusb_dev *ir, int index) ir->tx_mask = hi; break; case MCE_RSP_EQIRRXPORTEN: - ir->learning_enabled = ((hi & 0x02) == 0x02); - ir->rxports_active = hi; + ir->learning_active = ((hi & 0x02) == 0x02); + if (ir->rxports_active != hi) { + dev_info(ir->dev, "%s-range (0x%x) receiver active", + ir->learning_active ? "short" : "long", hi); + ir->rxports_active = hi; + } break; case MCE_RSP_CMD_ILLEGAL: + case MCE_RSP_TX_TIMEOUT: ir->need_reset = true; break; default: @@ -1016,12 +1242,21 @@ static void mceusb_process_ir_data(struct mceusb_dev *ir, int buf_len) ir->rem--; init_ir_raw_event(&rawir); rawir.pulse = ((ir->buf_in[i] & MCE_PULSE_BIT) != 0); - rawir.duration = (ir->buf_in[i] & MCE_PULSE_MASK) - * US_TO_NS(MCE_TIME_UNIT); + rawir.duration = (ir->buf_in[i] & MCE_PULSE_MASK); + if (unlikely(!rawir.duration)) { + dev_warn(ir->dev, "nonsensical irdata %02x with duration 0", + ir->buf_in[i]); + break; + } + if (rawir.pulse) { + ir->pulse_tunit += rawir.duration; + ir->pulse_count++; + } + rawir.duration *= US_TO_NS(MCE_TIME_UNIT); - dev_dbg(ir->dev, "Storing %s with duration %u", + dev_dbg(ir->dev, "Storing %s %u ns (%02x)", rawir.pulse ? "pulse" : "space", - rawir.duration); + rawir.duration, ir->buf_in[i]); if (ir_raw_event_store_with_filter(ir->rc, &rawir)) event = true; @@ -1042,10 +1277,18 @@ static void mceusb_process_ir_data(struct mceusb_dev *ir, int buf_len) ir->rem = (ir->cmd & MCE_PACKET_LENGTH_MASK); mceusb_dev_printdata(ir, ir->buf_in, buf_len, i, ir->rem + 1, false); - if (ir->rem) + if (ir->rem) { ir->parser_state = PARSE_IRDATA; - else - ir_raw_event_reset(ir->rc); + } else { + init_ir_raw_event(&rawir); + rawir.timeout = 1; + rawir.duration = ir->rc->timeout; + if (ir_raw_event_store_with_filter(ir->rc, + &rawir)) + event = true; + ir->pulse_tunit = 0; + ir->pulse_count = 0; + } break; } @@ -1103,7 +1346,7 @@ static void mceusb_get_emulator_version(struct mceusb_dev *ir) { /* If we get no reply or an illegal command reply, its ver 1, says MS */ ir->emver = 1; - mce_async_out(ir, GET_EMVER, sizeof(GET_EMVER)); + mce_command_out(ir, GET_EMVER, sizeof(GET_EMVER)); } static void mceusb_gen1_init(struct mceusb_dev *ir) @@ -1149,10 +1392,10 @@ static void mceusb_gen1_init(struct mceusb_dev *ir) dev_dbg(dev, "set handshake - retC = %d", ret); /* device resume */ - mce_async_out(ir, DEVICE_RESUME, sizeof(DEVICE_RESUME)); + mce_command_out(ir, DEVICE_RESUME, sizeof(DEVICE_RESUME)); /* get hw/sw revision? */ - mce_async_out(ir, GET_REVISION, sizeof(GET_REVISION)); + mce_command_out(ir, GET_REVISION, sizeof(GET_REVISION)); kfree(data); } @@ -1160,13 +1403,13 @@ static void mceusb_gen1_init(struct mceusb_dev *ir) static void mceusb_gen2_init(struct mceusb_dev *ir) { /* device resume */ - mce_async_out(ir, DEVICE_RESUME, sizeof(DEVICE_RESUME)); + mce_command_out(ir, DEVICE_RESUME, sizeof(DEVICE_RESUME)); /* get wake version (protocol, key, address) */ - mce_async_out(ir, GET_WAKEVERSION, sizeof(GET_WAKEVERSION)); + mce_command_out(ir, GET_WAKEVERSION, sizeof(GET_WAKEVERSION)); /* unknown what this one actually returns... */ - mce_async_out(ir, GET_UNKNOWN2, sizeof(GET_UNKNOWN2)); + mce_command_out(ir, GET_UNKNOWN2, sizeof(GET_UNKNOWN2)); } static void mceusb_get_parameters(struct mceusb_dev *ir) @@ -1180,24 +1423,24 @@ static void mceusb_get_parameters(struct mceusb_dev *ir) ir->num_rxports = 2; /* get number of tx and rx ports */ - mce_async_out(ir, GET_NUM_PORTS, sizeof(GET_NUM_PORTS)); + mce_command_out(ir, GET_NUM_PORTS, sizeof(GET_NUM_PORTS)); /* get the carrier and frequency */ - mce_async_out(ir, GET_CARRIER_FREQ, sizeof(GET_CARRIER_FREQ)); + mce_command_out(ir, GET_CARRIER_FREQ, sizeof(GET_CARRIER_FREQ)); if (ir->num_txports && !ir->flags.no_tx) /* get the transmitter bitmask */ - mce_async_out(ir, GET_TX_BITMASK, sizeof(GET_TX_BITMASK)); + mce_command_out(ir, GET_TX_BITMASK, sizeof(GET_TX_BITMASK)); /* get receiver timeout value */ - mce_async_out(ir, GET_RX_TIMEOUT, sizeof(GET_RX_TIMEOUT)); + mce_command_out(ir, GET_RX_TIMEOUT, sizeof(GET_RX_TIMEOUT)); /* get receiver sensor setting */ - mce_async_out(ir, GET_RX_SENSOR, sizeof(GET_RX_SENSOR)); + mce_command_out(ir, GET_RX_SENSOR, sizeof(GET_RX_SENSOR)); for (i = 0; i < ir->num_txports; i++) { cmdbuf[2] = i; - mce_async_out(ir, cmdbuf, sizeof(cmdbuf)); + mce_command_out(ir, cmdbuf, sizeof(cmdbuf)); } } @@ -1206,7 +1449,7 @@ static void mceusb_flash_led(struct mceusb_dev *ir) if (ir->emver < 2) return; - mce_async_out(ir, FLASH_LED, sizeof(FLASH_LED)); + mce_command_out(ir, FLASH_LED, sizeof(FLASH_LED)); } /* @@ -1276,12 +1519,27 @@ static struct rc_dev *mceusb_init_rc_dev(struct mceusb_dev *ir) rc->dev.parent = dev; rc->priv = ir; rc->allowed_protocols = RC_PROTO_BIT_ALL_IR_DECODER; + rc->min_timeout = US_TO_NS(MCE_TIME_UNIT); rc->timeout = MS_TO_NS(100); + if (!mceusb_model[ir->model].broken_irtimeout) { + rc->s_timeout = mceusb_set_timeout; + rc->max_timeout = 10 * IR_DEFAULT_TIMEOUT; + } else { + /* + * If we can't set the timeout using CMD_SETIRTIMEOUT, we can + * rely on software timeouts for timeouts < 100ms. + */ + rc->max_timeout = rc->timeout; + } if (!ir->flags.no_tx) { rc->s_tx_mask = mceusb_set_tx_mask; rc->s_tx_carrier = mceusb_set_tx_carrier; rc->tx_ir = mceusb_tx_ir; } + if (ir->flags.rx2 > 0) { + rc->s_learning_mode = mceusb_set_rx_wideband; + rc->s_carrier_report = mceusb_set_rx_carrier_report; + } rc->driver_name = DRIVER_NAME; switch (le16_to_cpu(udev->descriptor.idVendor)) { @@ -1396,6 +1654,7 @@ static int mceusb_dev_probe(struct usb_interface *intf, ir->flags.microsoft_gen1 = is_microsoft_gen1; ir->flags.tx_mask_normal = tx_mask_normal; ir->flags.no_tx = mceusb_model[model].no_tx; + ir->flags.rx2 = mceusb_model[model].rx2; ir->model = model; /* Saving usb interface data for use by the transmitter routine */ diff --git a/drivers/media/rc/meson-ir.c b/drivers/media/rc/meson-ir.c index f2204eb77e2a..f449b35d25e7 100644 --- a/drivers/media/rc/meson-ir.c +++ b/drivers/media/rc/meson-ir.c @@ -97,8 +97,7 @@ static irqreturn_t meson_ir_irq(int irqno, void *dev_id) status = readl_relaxed(ir->reg + IR_DEC_STATUS); rawir.pulse = !!(status & STATUS_IR_DEC_IN); - ir_raw_event_store(ir->rc, &rawir); - ir_raw_event_handle(ir->rc); + ir_raw_event_store_with_timeout(ir->rc, &rawir); spin_unlock(&ir->lock); @@ -145,7 +144,9 @@ static int meson_ir_probe(struct platform_device *pdev) ir->rc->map_name = map_name ? map_name : RC_MAP_EMPTY; ir->rc->allowed_protocols = RC_PROTO_BIT_ALL_IR_DECODER; ir->rc->rx_resolution = US_TO_NS(MESON_TRATE); - ir->rc->timeout = MS_TO_NS(200); + ir->rc->min_timeout = 1; + ir->rc->timeout = IR_DEFAULT_TIMEOUT; + ir->rc->max_timeout = 10 * IR_DEFAULT_TIMEOUT; ir->rc->driver_name = DRIVER_NAME; spin_lock_init(&ir->lock); diff --git a/drivers/media/rc/mtk-cir.c b/drivers/media/rc/mtk-cir.c index 00a4a0dfcab8..d37b85d2bc75 100644 --- a/drivers/media/rc/mtk-cir.c +++ b/drivers/media/rc/mtk-cir.c @@ -304,8 +304,6 @@ static int mtk_ir_probe(struct platform_device *pdev) { struct device *dev = &pdev->dev; struct device_node *dn = dev->of_node; - const struct of_device_id *of_id = - of_match_device(mtk_ir_match, &pdev->dev); struct resource *res; struct mtk_ir *ir; u32 val; @@ -317,7 +315,7 @@ static int mtk_ir_probe(struct platform_device *pdev) return -ENOMEM; ir->dev = dev; - ir->data = of_id->data; + ir->data = of_device_get_match_data(dev); ir->clk = devm_clk_get(dev, "clk"); if (IS_ERR(ir->clk)) { diff --git a/drivers/media/rc/nuvoton-cir.c b/drivers/media/rc/nuvoton-cir.c index 5e1d866a61a5..b8299c9a9744 100644 --- a/drivers/media/rc/nuvoton-cir.c +++ b/drivers/media/rc/nuvoton-cir.c @@ -535,6 +535,8 @@ static void nvt_set_cir_iren(struct nvt_dev *nvt) static void nvt_cir_regs_init(struct nvt_dev *nvt) { + nvt_enable_logical_dev(nvt, LOGICAL_DEV_CIR); + /* set sample limit count (PE interrupt raised when reached) */ nvt_cir_reg_write(nvt, CIR_RX_LIMIT_COUNT >> 8, CIR_SLCH); nvt_cir_reg_write(nvt, CIR_RX_LIMIT_COUNT & 0xff, CIR_SLCL); @@ -543,31 +545,17 @@ static void nvt_cir_regs_init(struct nvt_dev *nvt) nvt_cir_reg_write(nvt, CIR_FIFOCON_TX_TRIGGER_LEV | CIR_FIFOCON_RX_TRIGGER_LEV, CIR_FIFOCON); - /* - * Enable TX and RX, specify carrier on = low, off = high, and set - * sample period (currently 50us) - */ - nvt_cir_reg_write(nvt, - CIR_IRCON_TXEN | CIR_IRCON_RXEN | - CIR_IRCON_RXINV | CIR_IRCON_SAMPLE_PERIOD_SEL, - CIR_IRCON); - /* clear hardware rx and tx fifos */ nvt_clear_cir_fifo(nvt); nvt_clear_tx_fifo(nvt); - /* clear any and all stray interrupts */ - nvt_cir_reg_write(nvt, 0xff, CIR_IRSTS); - - /* and finally, enable interrupts */ - nvt_set_cir_iren(nvt); - - /* enable the CIR logical device */ - nvt_enable_logical_dev(nvt, LOGICAL_DEV_CIR); + nvt_disable_logical_dev(nvt, LOGICAL_DEV_CIR); } static void nvt_cir_wake_regs_init(struct nvt_dev *nvt) { + nvt_enable_logical_dev(nvt, LOGICAL_DEV_CIR_WAKE); + /* * Disable RX, set specific carrier on = low, off = high, * and sample period (currently 50us) @@ -579,9 +567,6 @@ static void nvt_cir_wake_regs_init(struct nvt_dev *nvt) /* clear any and all stray interrupts */ nvt_cir_wake_reg_write(nvt, 0xff, CIR_WAKE_IRSTS); - - /* enable the CIR WAKE logical device */ - nvt_enable_logical_dev(nvt, LOGICAL_DEV_CIR_WAKE); } static void nvt_enable_wake(struct nvt_dev *nvt) @@ -892,6 +877,32 @@ static irqreturn_t nvt_cir_isr(int irq, void *data) return IRQ_HANDLED; } +static void nvt_enable_cir(struct nvt_dev *nvt) +{ + unsigned long flags; + + /* enable the CIR logical device */ + nvt_enable_logical_dev(nvt, LOGICAL_DEV_CIR); + + spin_lock_irqsave(&nvt->lock, flags); + + /* + * Enable TX and RX, specify carrier on = low, off = high, and set + * sample period (currently 50us) + */ + nvt_cir_reg_write(nvt, CIR_IRCON_TXEN | CIR_IRCON_RXEN | + CIR_IRCON_RXINV | CIR_IRCON_SAMPLE_PERIOD_SEL, + CIR_IRCON); + + /* clear all pending interrupts */ + nvt_cir_reg_write(nvt, 0xff, CIR_IRSTS); + + /* enable interrupts */ + nvt_set_cir_iren(nvt); + + spin_unlock_irqrestore(&nvt->lock, flags); +} + static void nvt_disable_cir(struct nvt_dev *nvt) { unsigned long flags; @@ -920,25 +931,8 @@ static void nvt_disable_cir(struct nvt_dev *nvt) static int nvt_open(struct rc_dev *dev) { struct nvt_dev *nvt = dev->priv; - unsigned long flags; - spin_lock_irqsave(&nvt->lock, flags); - - /* set function enable flags */ - nvt_cir_reg_write(nvt, CIR_IRCON_TXEN | CIR_IRCON_RXEN | - CIR_IRCON_RXINV | CIR_IRCON_SAMPLE_PERIOD_SEL, - CIR_IRCON); - - /* clear all pending interrupts */ - nvt_cir_reg_write(nvt, 0xff, CIR_IRSTS); - - /* enable interrupts */ - nvt_set_cir_iren(nvt); - - spin_unlock_irqrestore(&nvt->lock, flags); - - /* enable the CIR logical device */ - nvt_enable_logical_dev(nvt, LOGICAL_DEV_CIR); + nvt_enable_cir(nvt); return 0; } @@ -1093,19 +1087,13 @@ static void nvt_remove(struct pnp_dev *pdev) static int nvt_suspend(struct pnp_dev *pdev, pm_message_t state) { struct nvt_dev *nvt = pnp_get_drvdata(pdev); - unsigned long flags; nvt_dbg("%s called", __func__); - spin_lock_irqsave(&nvt->lock, flags); - - /* disable all CIR interrupts */ - nvt_cir_reg_write(nvt, 0, CIR_IREN); - - spin_unlock_irqrestore(&nvt->lock, flags); - - /* disable cir logical dev */ - nvt_disable_logical_dev(nvt, LOGICAL_DEV_CIR); + mutex_lock(&nvt->rdev->lock); + if (nvt->rdev->users) + nvt_disable_cir(nvt); + mutex_unlock(&nvt->rdev->lock); /* make sure wake is enabled */ nvt_enable_wake(nvt); @@ -1122,6 +1110,11 @@ static int nvt_resume(struct pnp_dev *pdev) nvt_cir_regs_init(nvt); nvt_cir_wake_regs_init(nvt); + mutex_lock(&nvt->rdev->lock); + if (nvt->rdev->users) + nvt_enable_cir(nvt); + mutex_unlock(&nvt->rdev->lock); + return 0; } diff --git a/drivers/media/rc/rc-core-priv.h b/drivers/media/rc/rc-core-priv.h index 7da9c96cb058..e847bdad5c51 100644 --- a/drivers/media/rc/rc-core-priv.h +++ b/drivers/media/rc/rc-core-priv.h @@ -1,27 +1,35 @@ /* + * SPDX-License-Identifier: GPL-2.0 * Remote Controller core raw events header * * Copyright (C) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation version 2 of the License. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. */ #ifndef _RC_CORE_PRIV #define _RC_CORE_PRIV +#define RC_DEV_MAX 256 /* Define the max number of pulse/space transitions to buffer */ #define MAX_IR_EVENT_SIZE 512 #include +#include #include +/** + * rc_open - Opens a RC device + * + * @rdev: pointer to struct rc_dev. + */ +int rc_open(struct rc_dev *rdev); + +/** + * rc_close - Closes a RC device + * + * @rdev: pointer to struct rc_dev. + */ +void rc_close(struct rc_dev *rdev); + struct ir_raw_handler { struct list_head list; @@ -29,8 +37,10 @@ struct ir_raw_handler { int (*decode)(struct rc_dev *dev, struct ir_raw_event event); int (*encode)(enum rc_proto protocol, u32 scancode, struct ir_raw_event *events, unsigned int max); + u32 carrier; + u32 min_timeout; - /* These two should only be used by the lirc decoder */ + /* These two should only be used by the mce kbd decoder */ int (*raw_register)(struct rc_dev *dev); int (*raw_unregister)(struct rc_dev *dev); }; @@ -42,12 +52,18 @@ struct ir_raw_event_ctrl { DECLARE_KFIFO(kfifo, struct ir_raw_event, MAX_IR_EVENT_SIZE); ktime_t last_event; /* when last event occurred */ struct rc_dev *dev; /* pointer to the parent rc_dev */ - /* edge driver */ - struct timer_list edge_handle; + /* handle delayed ir_raw_event_store_edge processing */ + spinlock_t edge_spinlock; + struct timer_list edge_handle; /* raw decoder state follows */ struct ir_raw_event prev_ev; struct ir_raw_event this_ev; + +#ifdef CONFIG_BPF_LIRC_MODE2 + u32 bpf_sample; + struct bpf_prog_array __rcu *progs; +#endif struct nec_dec { int state; unsigned count; @@ -95,6 +111,8 @@ struct ir_raw_event_ctrl { } sharp; struct mce_kbd_dec { struct input_dev *idev; + /* locks key up timer */ + spinlock_t keylock; struct timer_list rx_timeout; char name[64]; char phys[64]; @@ -104,24 +122,25 @@ struct ir_raw_event_ctrl { unsigned count; unsigned wanted_bits; } mce_kbd; - struct lirc_codec { - struct rc_dev *dev; - struct lirc_driver *drv; - int carrier_low; - - ktime_t gap_start; - u64 gap_duration; - bool gap; - bool send_timeout_reports; - - } lirc; struct xmp_dec { int state; unsigned count; u32 durations[16]; } xmp; + struct imon_dec { + int state; + int count; + int last_chk; + unsigned int bits; + bool stick_keyboard; + struct input_dev *idev; + char name[64]; + } imon; }; +/* Mutex for locking raw IR processing and handler change */ +extern struct mutex ir_raw_handler_lock; + /* macros for IR decoders */ static inline bool geq_margin(unsigned d1, unsigned d2, unsigned margin) { @@ -156,6 +175,7 @@ static inline bool is_timing_event(struct ir_raw_event ev) #define TO_STR(is_pulse) ((is_pulse) ? "pulse" : "space") /* functions for IR encoders */ +bool rc_validate_scancode(enum rc_proto proto, u32 scancode); static inline void init_ir_raw_event_duration(struct ir_raw_event *ev, unsigned int pulse, @@ -168,17 +188,17 @@ static inline void init_ir_raw_event_duration(struct ir_raw_event *ev, /** * struct ir_raw_timings_manchester - Manchester coding timings - * @leader: duration of leader pulse (if any) 0 if continuing - * existing signal (see @pulse_space_start) - * @pulse_space_start: 1 for starting with pulse (0 for starting with space) + * @leader_pulse: duration of leader pulse (if any) 0 if continuing + * existing signal + * @leader_space: duration of leader space (if any) * @clock: duration of each pulse/space in ns * @invert: if set clock logic is inverted * (0 = space + pulse, 1 = pulse + space) * @trailer_space: duration of trailer space in ns */ struct ir_raw_timings_manchester { - unsigned int leader; - unsigned int pulse_space_start:1; + unsigned int leader_pulse; + unsigned int leader_space; unsigned int clock; unsigned int invert:1; unsigned int trailer_space; @@ -270,13 +290,40 @@ void ir_raw_event_free(struct rc_dev *dev); void ir_raw_event_unregister(struct rc_dev *dev); int ir_raw_handler_register(struct ir_raw_handler *ir_raw_handler); void ir_raw_handler_unregister(struct ir_raw_handler *ir_raw_handler); +void ir_raw_load_modules(u64 *protocols); void ir_raw_init(void); /* - * Decoder initialization code - * - * Those load logic are called during ir-core init, and automatically - * loads the compiled decoders for their usage with IR raw events + * lirc interface */ +#ifdef CONFIG_LIRC +int lirc_dev_init(void); +void lirc_dev_exit(void); +void ir_lirc_raw_event(struct rc_dev *dev, struct ir_raw_event ev); +void ir_lirc_scancode_event(struct rc_dev *dev, struct lirc_scancode *lsc); +int ir_lirc_register(struct rc_dev *dev); +void ir_lirc_unregister(struct rc_dev *dev); +struct rc_dev *rc_dev_get_from_fd(int fd); +#else +static inline int lirc_dev_init(void) { return 0; } +static inline void lirc_dev_exit(void) {} +static inline void ir_lirc_raw_event(struct rc_dev *dev, + struct ir_raw_event ev) { } +static inline void ir_lirc_scancode_event(struct rc_dev *dev, + struct lirc_scancode *lsc) { } +static inline int ir_lirc_register(struct rc_dev *dev) { return 0; } +static inline void ir_lirc_unregister(struct rc_dev *dev) { } +#endif + +/* + * bpf interface + */ +#ifdef CONFIG_BPF_LIRC_MODE2 +void lirc_bpf_free(struct rc_dev *dev); +void lirc_bpf_run(struct rc_dev *dev, u32 sample); +#else +static inline void lirc_bpf_free(struct rc_dev *dev) { } +static inline void lirc_bpf_run(struct rc_dev *dev, u32 sample) { } +#endif #endif /* _RC_CORE_PRIV */ diff --git a/drivers/media/rc/rc-ir-raw.c b/drivers/media/rc/rc-ir-raw.c index 503bc425a187..e7948908e78c 100644 --- a/drivers/media/rc/rc-ir-raw.c +++ b/drivers/media/rc/rc-ir-raw.c @@ -1,16 +1,7 @@ -/* rc-ir-raw.c - handle IR pulse/space events - * - * Copyright (C) 2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation version 2 of the License. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - */ +// SPDX-License-Identifier: GPL-2.0 +// rc-ir-raw.c - handle IR pulse/space events +// +// Copyright (C) 2010 by Mauro Carvalho Chehab #include #include @@ -23,7 +14,7 @@ static LIST_HEAD(ir_raw_client_list); /* Used to handle IR raw handler extensions */ -static DEFINE_MUTEX(ir_raw_handler_lock); +DEFINE_MUTEX(ir_raw_handler_lock); static LIST_HEAD(ir_raw_handler_list); static atomic64_t available_protocols = ATOMIC64_INIT(0); @@ -31,15 +22,27 @@ static int ir_raw_event_thread(void *data) { struct ir_raw_event ev; struct ir_raw_handler *handler; - struct ir_raw_event_ctrl *raw = (struct ir_raw_event_ctrl *)data; + struct ir_raw_event_ctrl *raw = data; + struct rc_dev *dev = raw->dev; while (1) { mutex_lock(&ir_raw_handler_lock); while (kfifo_out(&raw->kfifo, &ev, 1)) { + if (is_timing_event(ev)) { + if (ev.duration == 0) + dev_warn_once(&dev->dev, "nonsensical timing event of duration 0"); + if (is_timing_event(raw->prev_ev) && + !is_transition(&ev, &raw->prev_ev)) + dev_warn_once(&dev->dev, "two consecutive events of type %s", + TO_STR(ev.pulse)); + if (raw->prev_ev.reset && ev.pulse == 0) + dev_warn_once(&dev->dev, "timing event after reset should be pulse"); + } list_for_each_entry(handler, &ir_raw_handler_list, list) - if (raw->dev->enabled_protocols & + if (dev->enabled_protocols & handler->protocols || !handler->protocols) - handler->decode(raw->dev, ev); + handler->decode(dev, ev); + ir_lirc_raw_event(dev, ev); raw->prev_ev = ev; } mutex_unlock(&ir_raw_handler_lock); @@ -73,8 +76,8 @@ int ir_raw_event_store(struct rc_dev *dev, struct ir_raw_event *ev) if (!dev->raw) return -EINVAL; - IR_dprintk(2, "sample: (%05dus %s)\n", - TO_US(ev->duration), TO_STR(ev->pulse)); + dev_dbg(&dev->dev, "sample: (%05dus %s)\n", + TO_US(ev->duration), TO_STR(ev->pulse)); if (!kfifo_put(&dev->raw->kfifo, *ev)) { dev_err(&dev->dev, "IR event FIFO is full!\n"); @@ -100,7 +103,6 @@ int ir_raw_event_store_edge(struct rc_dev *dev, bool pulse) { ktime_t now; DEFINE_IR_RAW_EVENT(ev); - int rc = 0; if (!dev->raw) return -EINVAL; @@ -109,7 +111,33 @@ int ir_raw_event_store_edge(struct rc_dev *dev, bool pulse) ev.duration = ktime_to_ns(ktime_sub(now, dev->raw->last_event)); ev.pulse = !pulse; - rc = ir_raw_event_store(dev, &ev); + return ir_raw_event_store_with_timeout(dev, &ev); +} +EXPORT_SYMBOL_GPL(ir_raw_event_store_edge); + +/* + * ir_raw_event_store_with_timeout() - pass a pulse/space duration to the raw + * ir decoders, schedule decoding and + * timeout + * @dev: the struct rc_dev device descriptor + * @ev: the struct ir_raw_event descriptor of the pulse/space + * + * This routine (which may be called from an interrupt context) stores a + * pulse/space duration for the raw ir decoding state machines, schedules + * decoding and generates a timeout. + */ +int ir_raw_event_store_with_timeout(struct rc_dev *dev, struct ir_raw_event *ev) +{ + ktime_t now; + int rc = 0; + + if (!dev->raw) + return -EINVAL; + + now = ktime_get(); + + spin_lock(&dev->raw->edge_spinlock); + rc = ir_raw_event_store(dev, ev); dev->raw->last_event = now; @@ -120,15 +148,16 @@ int ir_raw_event_store_edge(struct rc_dev *dev, bool pulse) mod_timer(&dev->raw->edge_handle, jiffies + msecs_to_jiffies(15)); } + spin_unlock(&dev->raw->edge_spinlock); return rc; } -EXPORT_SYMBOL_GPL(ir_raw_event_store_edge); +EXPORT_SYMBOL_GPL(ir_raw_event_store_with_timeout); /** * ir_raw_event_store_with_filter() - pass next pulse/space to decoders with some processing * @dev: the struct rc_dev device descriptor - * @type: the type of the event that has occurred + * @ev: the event that has occurred * * This routine (which may be called from an interrupt context) works * in similar manner to ir_raw_event_store_edge. @@ -176,7 +205,7 @@ void ir_raw_event_set_idle(struct rc_dev *dev, bool idle) if (!dev->raw) return; - IR_dprintk(2, "%s idle mode\n", idle ? "enter" : "leave"); + dev_dbg(&dev->dev, "%s idle mode\n", idle ? "enter" : "leave"); if (idle) { dev->raw->this_ev.timeout = true; @@ -215,7 +244,49 @@ ir_raw_get_allowed_protocols(void) static int change_protocol(struct rc_dev *dev, u64 *rc_proto) { - /* the caller will update dev->enabled_protocols */ + struct ir_raw_handler *handler; + u32 timeout = 0; + + mutex_lock(&ir_raw_handler_lock); + list_for_each_entry(handler, &ir_raw_handler_list, list) { + if (!(dev->enabled_protocols & handler->protocols) && + (*rc_proto & handler->protocols) && handler->raw_register) + handler->raw_register(dev); + + if ((dev->enabled_protocols & handler->protocols) && + !(*rc_proto & handler->protocols) && + handler->raw_unregister) + handler->raw_unregister(dev); + } + mutex_unlock(&ir_raw_handler_lock); + + if (!dev->max_timeout) + return 0; + + mutex_lock(&ir_raw_handler_lock); + list_for_each_entry(handler, &ir_raw_handler_list, list) { + if (handler->protocols & *rc_proto) { + if (timeout < handler->min_timeout) + timeout = handler->min_timeout; + } + } + mutex_unlock(&ir_raw_handler_lock); + + if (timeout == 0) + timeout = IR_DEFAULT_TIMEOUT; + else + timeout += MS_TO_NS(10); + + if (timeout < dev->min_timeout) + timeout = dev->min_timeout; + else if (timeout > dev->max_timeout) + timeout = dev->max_timeout; + + if (dev->s_timeout) + dev->s_timeout(dev, timeout); + else + dev->timeout = timeout; + return 0; } @@ -254,19 +325,16 @@ int ir_raw_gen_manchester(struct ir_raw_event **ev, unsigned int max, i = BIT_ULL(n - 1); - if (timings->leader) { + if (timings->leader_pulse) { if (!max--) return ret; - if (timings->pulse_space_start) { - init_ir_raw_event_duration((*ev)++, 1, timings->leader); - + init_ir_raw_event_duration((*ev), 1, timings->leader_pulse); + if (timings->leader_space) { if (!max--) return ret; - init_ir_raw_event_duration((*ev), 0, timings->leader); - } else { - init_ir_raw_event_duration((*ev), 1, timings->leader); + init_ir_raw_event_duration(++(*ev), 0, + timings->leader_space); } - i >>= 1; } else { /* continue existing signal */ --(*ev); @@ -457,6 +525,8 @@ int ir_raw_encode_scancode(enum rc_proto protocol, u32 scancode, int ret = -EINVAL; u64 mask = 1ULL << protocol; + ir_raw_load_modules(&mask); + mutex_lock(&ir_raw_handler_lock); list_for_each_entry(handler, &ir_raw_handler_list, list) { if (handler->protocols & mask && handler->encode) { @@ -471,11 +541,26 @@ int ir_raw_encode_scancode(enum rc_proto protocol, u32 scancode, } EXPORT_SYMBOL(ir_raw_encode_scancode); -static void edge_handle(unsigned long arg) +/** + * ir_raw_edge_handle() - Handle ir_raw_event_store_edge() processing + * + * @t: timer_list + * + * This callback is armed by ir_raw_event_store_edge(). It does two things: + * first of all, rather than calling ir_raw_event_handle() for each + * edge and waking up the rc thread, 15 ms after the first edge + * ir_raw_event_handle() is called. Secondly, generate a timeout event + * no more IR is received after the rc_dev timeout. + */ +static void ir_raw_edge_handle(struct timer_list *t) { - struct rc_dev *dev = (struct rc_dev *)arg; - ktime_t interval = ktime_sub(ktime_get(), dev->raw->last_event); + struct ir_raw_event_ctrl *raw = from_timer(raw, t, edge_handle); + struct rc_dev *dev = raw->dev; + unsigned long flags; + ktime_t interval; + spin_lock_irqsave(&dev->raw->edge_spinlock, flags); + interval = ktime_sub(ktime_get(), dev->raw->last_event); if (ktime_to_ns(interval) >= dev->timeout) { DEFINE_IR_RAW_EVENT(ev); @@ -488,33 +573,58 @@ static void edge_handle(unsigned long arg) jiffies + nsecs_to_jiffies(dev->timeout - ktime_to_ns(interval))); } + spin_unlock_irqrestore(&dev->raw->edge_spinlock, flags); ir_raw_event_handle(dev); } +/** + * ir_raw_encode_carrier() - Get carrier used for protocol + * + * @protocol: protocol + * + * Attempts to find the carrier for the specified protocol + * + * Returns: The carrier in Hz + * -EINVAL if the protocol is invalid, or if no + * compatible encoder was found. + */ +int ir_raw_encode_carrier(enum rc_proto protocol) +{ + struct ir_raw_handler *handler; + int ret = -EINVAL; + u64 mask = BIT_ULL(protocol); + + mutex_lock(&ir_raw_handler_lock); + list_for_each_entry(handler, &ir_raw_handler_list, list) { + if (handler->protocols & mask && handler->encode) { + ret = handler->carrier; + break; + } + } + mutex_unlock(&ir_raw_handler_lock); + + return ret; +} +EXPORT_SYMBOL(ir_raw_encode_carrier); + /* * Used to (un)register raw event clients */ int ir_raw_event_prepare(struct rc_dev *dev) { - static bool raw_init; /* 'false' default value, raw decoders loaded? */ - if (!dev) return -EINVAL; - if (!raw_init) { - request_module("ir-lirc-codec"); - raw_init = true; - } - dev->raw = kzalloc(sizeof(*dev->raw), GFP_KERNEL); if (!dev->raw) return -ENOMEM; dev->raw->dev = dev; dev->change_protocol = change_protocol; - setup_timer(&dev->raw->edge_handle, edge_handle, - (unsigned long)dev); + dev->idle = true; + spin_lock_init(&dev->raw->edge_spinlock); + timer_setup(&dev->raw->edge_handle, ir_raw_edge_handle, 0); INIT_KFIFO(dev->raw->kfifo); return 0; @@ -522,28 +632,16 @@ int ir_raw_event_prepare(struct rc_dev *dev) int ir_raw_event_register(struct rc_dev *dev) { - struct ir_raw_handler *handler; struct task_struct *thread; - /* - * raw transmitters do not need any event registration - * because the event is coming from userspace - */ - if (dev->driver_type != RC_DRIVER_IR_RAW_TX) { - thread = kthread_run(ir_raw_event_thread, dev->raw, "rc%u", - dev->minor); + thread = kthread_run(ir_raw_event_thread, dev->raw, "rc%u", dev->minor); + if (IS_ERR(thread)) + return PTR_ERR(thread); - if (IS_ERR(thread)) - return PTR_ERR(thread); - - dev->raw->thread = thread; - } + dev->raw->thread = thread; mutex_lock(&ir_raw_handler_lock); list_add_tail(&dev->raw->list, &ir_raw_client_list); - list_for_each_entry(handler, &ir_raw_handler_list, list) - if (handler->raw_register) - handler->raw_register(dev); mutex_unlock(&ir_raw_handler_lock); return 0; @@ -571,11 +669,20 @@ void ir_raw_event_unregister(struct rc_dev *dev) mutex_lock(&ir_raw_handler_lock); list_del(&dev->raw->list); list_for_each_entry(handler, &ir_raw_handler_list, list) - if (handler->raw_unregister) + if (handler->raw_unregister && + (handler->protocols & dev->enabled_protocols)) handler->raw_unregister(dev); - mutex_unlock(&ir_raw_handler_lock); + + lirc_bpf_free(dev); ir_raw_event_free(dev); + + /* + * A user can be calling bpf(BPF_PROG_{QUERY|ATTACH|DETACH}), so + * ensure that the raw member is null on unlock; this is how + * "device gone" is checked. + */ + mutex_unlock(&ir_raw_handler_lock); } /* @@ -584,13 +691,8 @@ void ir_raw_event_unregister(struct rc_dev *dev) int ir_raw_handler_register(struct ir_raw_handler *ir_raw_handler) { - struct ir_raw_event_ctrl *raw; - mutex_lock(&ir_raw_handler_lock); list_add_tail(&ir_raw_handler->list, &ir_raw_handler_list); - if (ir_raw_handler->raw_register) - list_for_each_entry(raw, &ir_raw_client_list, list) - ir_raw_handler->raw_register(raw->dev); atomic64_or(ir_raw_handler->protocols, &available_protocols); mutex_unlock(&ir_raw_handler_lock); @@ -606,9 +708,10 @@ void ir_raw_handler_unregister(struct ir_raw_handler *ir_raw_handler) mutex_lock(&ir_raw_handler_lock); list_del(&ir_raw_handler->list); list_for_each_entry(raw, &ir_raw_client_list, list) { - ir_raw_disable_protocols(raw->dev, protocols); - if (ir_raw_handler->raw_unregister) + if (ir_raw_handler->raw_unregister && + (raw->dev->enabled_protocols & protocols)) ir_raw_handler->raw_unregister(raw->dev); + ir_raw_disable_protocols(raw->dev, protocols); } atomic64_andnot(protocols, &available_protocols); mutex_unlock(&ir_raw_handler_lock); diff --git a/drivers/media/rc/rc-main.c b/drivers/media/rc/rc-main.c index a22828713c1c..6ea6038b9d36 100644 --- a/drivers/media/rc/rc-main.c +++ b/drivers/media/rc/rc-main.c @@ -1,20 +1,12 @@ -/* rc-main.c - Remote Controller core module - * - * Copyright (C) 2009-2010 by Mauro Carvalho Chehab - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation version 2 of the License. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - */ +// SPDX-License-Identifier: GPL-2.0 +// rc-main.c - Remote Controller core module +// +// Copyright (C) 2009-2010 by Mauro Carvalho Chehab #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include +#include #include #include #include @@ -28,55 +20,56 @@ /* Sizes are in bytes, 256 bytes allows for 32 entries on x64 */ #define IR_TAB_MIN_SIZE 256 #define IR_TAB_MAX_SIZE 8192 -#define RC_DEV_MAX 256 static const struct { const char *name; unsigned int repeat_period; unsigned int scancode_bits; } protocols[] = { - [RC_PROTO_UNKNOWN] = { .name = "unknown", .repeat_period = 250 }, - [RC_PROTO_OTHER] = { .name = "other", .repeat_period = 250 }, + [RC_PROTO_UNKNOWN] = { .name = "unknown", .repeat_period = 125 }, + [RC_PROTO_OTHER] = { .name = "other", .repeat_period = 125 }, [RC_PROTO_RC5] = { .name = "rc-5", - .scancode_bits = 0x1f7f, .repeat_period = 250 }, + .scancode_bits = 0x1f7f, .repeat_period = 114 }, [RC_PROTO_RC5X_20] = { .name = "rc-5x-20", - .scancode_bits = 0x1f7f3f, .repeat_period = 250 }, + .scancode_bits = 0x1f7f3f, .repeat_period = 114 }, [RC_PROTO_RC5_SZ] = { .name = "rc-5-sz", - .scancode_bits = 0x2fff, .repeat_period = 250 }, + .scancode_bits = 0x2fff, .repeat_period = 114 }, [RC_PROTO_JVC] = { .name = "jvc", - .scancode_bits = 0xffff, .repeat_period = 250 }, + .scancode_bits = 0xffff, .repeat_period = 125 }, [RC_PROTO_SONY12] = { .name = "sony-12", - .scancode_bits = 0x1f007f, .repeat_period = 250 }, + .scancode_bits = 0x1f007f, .repeat_period = 100 }, [RC_PROTO_SONY15] = { .name = "sony-15", - .scancode_bits = 0xff007f, .repeat_period = 250 }, + .scancode_bits = 0xff007f, .repeat_period = 100 }, [RC_PROTO_SONY20] = { .name = "sony-20", - .scancode_bits = 0x1fff7f, .repeat_period = 250 }, + .scancode_bits = 0x1fff7f, .repeat_period = 100 }, [RC_PROTO_NEC] = { .name = "nec", - .scancode_bits = 0xffff, .repeat_period = 250 }, + .scancode_bits = 0xffff, .repeat_period = 110 }, [RC_PROTO_NECX] = { .name = "nec-x", - .scancode_bits = 0xffffff, .repeat_period = 250 }, + .scancode_bits = 0xffffff, .repeat_period = 110 }, [RC_PROTO_NEC32] = { .name = "nec-32", - .scancode_bits = 0xffffffff, .repeat_period = 250 }, + .scancode_bits = 0xffffffff, .repeat_period = 110 }, [RC_PROTO_SANYO] = { .name = "sanyo", - .scancode_bits = 0x1fffff, .repeat_period = 250 }, + .scancode_bits = 0x1fffff, .repeat_period = 125 }, [RC_PROTO_MCIR2_KBD] = { .name = "mcir2-kbd", - .scancode_bits = 0xffff, .repeat_period = 250 }, + .scancode_bits = 0xffffff, .repeat_period = 100 }, [RC_PROTO_MCIR2_MSE] = { .name = "mcir2-mse", - .scancode_bits = 0x1fffff, .repeat_period = 250 }, + .scancode_bits = 0x1fffff, .repeat_period = 100 }, [RC_PROTO_RC6_0] = { .name = "rc-6-0", - .scancode_bits = 0xffff, .repeat_period = 250 }, + .scancode_bits = 0xffff, .repeat_period = 114 }, [RC_PROTO_RC6_6A_20] = { .name = "rc-6-6a-20", - .scancode_bits = 0xfffff, .repeat_period = 250 }, + .scancode_bits = 0xfffff, .repeat_period = 114 }, [RC_PROTO_RC6_6A_24] = { .name = "rc-6-6a-24", - .scancode_bits = 0xffffff, .repeat_period = 250 }, + .scancode_bits = 0xffffff, .repeat_period = 114 }, [RC_PROTO_RC6_6A_32] = { .name = "rc-6-6a-32", - .scancode_bits = 0xffffffff, .repeat_period = 250 }, + .scancode_bits = 0xffffffff, .repeat_period = 114 }, [RC_PROTO_RC6_MCE] = { .name = "rc-6-mce", - .scancode_bits = 0xffff7fff, .repeat_period = 250 }, + .scancode_bits = 0xffff7fff, .repeat_period = 114 }, [RC_PROTO_SHARP] = { .name = "sharp", - .scancode_bits = 0x1fff, .repeat_period = 250 }, - [RC_PROTO_XMP] = { .name = "xmp", .repeat_period = 250 }, - [RC_PROTO_CEC] = { .name = "cec", .repeat_period = 550 }, + .scancode_bits = 0x1fff, .repeat_period = 125 }, + [RC_PROTO_XMP] = { .name = "xmp", .repeat_period = 125 }, + [RC_PROTO_CEC] = { .name = "cec", .repeat_period = 0 }, + [RC_PROTO_IMON] = { .name = "imon", + .scancode_bits = 0x7fffffff, .repeat_period = 114 }, }; /* Used to keep track of known keymaps */ @@ -165,16 +158,18 @@ static struct rc_map_list empty_map = { /** * ir_create_table() - initializes a scancode table + * @dev: the rc_dev device * @rc_map: the rc_map to initialize * @name: name to assign to the table * @rc_proto: ir type to assign to the new table * @size: initial size of the table - * @return: zero on success or a negative error code * * This routine will initialize the rc_map and will allocate * memory to hold at least the specified number of elements. + * + * return: zero on success or a negative error code */ -static int ir_create_table(struct rc_map *rc_map, +static int ir_create_table(struct rc_dev *dev, struct rc_map *rc_map, const char *name, u64 rc_proto, size_t size) { rc_map->name = kstrdup(name, GFP_KERNEL); @@ -190,8 +185,8 @@ static int ir_create_table(struct rc_map *rc_map, return -ENOMEM; } - IR_dprintk(1, "Allocated space for %u keycode entries (%u bytes)\n", - rc_map->size, rc_map->alloc); + dev_dbg(&dev->dev, "Allocated space for %u keycode entries (%u bytes)\n", + rc_map->size, rc_map->alloc); return 0; } @@ -213,14 +208,17 @@ static void ir_free_table(struct rc_map *rc_map) /** * ir_resize_table() - resizes a scancode table if necessary + * @dev: the rc_dev device * @rc_map: the rc_map to resize * @gfp_flags: gfp flags to use when allocating memory - * @return: zero on success or a negative error code * * This routine will shrink the rc_map if it has lots of * unused entries and grow it if it is full. + * + * return: zero on success or a negative error code */ -static int ir_resize_table(struct rc_map *rc_map, gfp_t gfp_flags) +static int ir_resize_table(struct rc_dev *dev, struct rc_map *rc_map, + gfp_t gfp_flags) { unsigned int oldalloc = rc_map->alloc; unsigned int newalloc = oldalloc; @@ -233,23 +231,21 @@ static int ir_resize_table(struct rc_map *rc_map, gfp_t gfp_flags) return -ENOMEM; newalloc *= 2; - IR_dprintk(1, "Growing table to %u bytes\n", newalloc); + dev_dbg(&dev->dev, "Growing table to %u bytes\n", newalloc); } if ((rc_map->len * 3 < rc_map->size) && (oldalloc > IR_TAB_MIN_SIZE)) { /* Less than 1/3 of entries in use -> shrink keytable */ newalloc /= 2; - IR_dprintk(1, "Shrinking table to %u bytes\n", newalloc); + dev_dbg(&dev->dev, "Shrinking table to %u bytes\n", newalloc); } if (newalloc == oldalloc) return 0; newscan = kmalloc(newalloc, gfp_flags); - if (!newscan) { - IR_dprintk(1, "Failed to kmalloc %u bytes\n", newalloc); + if (!newscan) return -ENOMEM; - } memcpy(newscan, rc_map->scan, rc_map->len * sizeof(struct rc_map_table)); rc_map->scan = newscan; @@ -264,11 +260,13 @@ static int ir_resize_table(struct rc_map *rc_map, gfp_t gfp_flags) * @dev: the struct rc_dev device descriptor * @rc_map: scancode table to be adjusted * @index: index of the mapping that needs to be updated - * @keycode: the desired keycode - * @return: previous keycode assigned to the mapping + * @new_keycode: the desired keycode * * This routine is used to update scancode->keycode mapping at given * position. + * + * return: previous keycode assigned to the mapping + * */ static unsigned int ir_update_mapping(struct rc_dev *dev, struct rc_map *rc_map, @@ -280,16 +278,16 @@ static unsigned int ir_update_mapping(struct rc_dev *dev, /* Did the user wish to remove the mapping? */ if (new_keycode == KEY_RESERVED || new_keycode == KEY_UNKNOWN) { - IR_dprintk(1, "#%d: Deleting scan 0x%04x\n", - index, rc_map->scan[index].scancode); + dev_dbg(&dev->dev, "#%d: Deleting scan 0x%04x\n", + index, rc_map->scan[index].scancode); rc_map->len--; memmove(&rc_map->scan[index], &rc_map->scan[index+ 1], (rc_map->len - index) * sizeof(struct rc_map_table)); } else { - IR_dprintk(1, "#%d: %s scan 0x%04x with key 0x%04x\n", - index, - old_keycode == KEY_RESERVED ? "New" : "Replacing", - rc_map->scan[index].scancode, new_keycode); + dev_dbg(&dev->dev, "#%d: %s scan 0x%04x with key 0x%04x\n", + index, + old_keycode == KEY_RESERVED ? "New" : "Replacing", + rc_map->scan[index].scancode, new_keycode); rc_map->scan[index].keycode = new_keycode; __set_bit(new_keycode, dev->input_dev->keybit); } @@ -306,7 +304,7 @@ static unsigned int ir_update_mapping(struct rc_dev *dev, } /* Possibly shrink the keytable, failure is not a problem */ - ir_resize_table(rc_map, GFP_ATOMIC); + ir_resize_table(dev, rc_map, GFP_ATOMIC); } return old_keycode; @@ -319,12 +317,13 @@ static unsigned int ir_update_mapping(struct rc_dev *dev, * @scancode: the desired scancode * @resize: controls whether we allowed to resize the table to * accommodate not yet present scancodes - * @return: index of the mapping containing scancode in question - * or -1U in case of failure. * * This routine is used to locate given scancode in rc_map. * If scancode is not yet present the routine will allocate a new slot * for it. + * + * return: index of the mapping containing scancode in question + * or -1U in case of failure. */ static unsigned int ir_establish_scancode(struct rc_dev *dev, struct rc_map *rc_map, @@ -356,7 +355,7 @@ static unsigned int ir_establish_scancode(struct rc_dev *dev, /* No previous mapping found, we might need to grow the table */ if (rc_map->size == rc_map->len) { - if (!resize || ir_resize_table(rc_map, GFP_ATOMIC)) + if (!resize || ir_resize_table(dev, rc_map, GFP_ATOMIC)) return -1U; } @@ -374,11 +373,12 @@ static unsigned int ir_establish_scancode(struct rc_dev *dev, /** * ir_setkeycode() - set a keycode in the scancode->keycode table * @idev: the struct input_dev device descriptor - * @scancode: the desired scancode - * @keycode: result - * @return: -EINVAL if the keycode could not be inserted, otherwise zero. + * @ke: Input keymap entry + * @old_keycode: result * * This routine is used to handle evdev EVIOCSKEY ioctl. + * + * return: -EINVAL if the keycode could not be inserted, otherwise zero. */ static int ir_setkeycode(struct input_dev *idev, const struct input_keymap_entry *ke, @@ -421,11 +421,11 @@ out: /** * ir_setkeytable() - sets several entries in the scancode->keycode table * @dev: the struct rc_dev device descriptor - * @to: the struct rc_map to copy entries to * @from: the struct rc_map to copy entries from - * @return: -ENOMEM if all keycodes could not be inserted, otherwise zero. * * This routine is used to handle table initialization. + * + * return: -ENOMEM if all keycodes could not be inserted, otherwise zero. */ static int ir_setkeytable(struct rc_dev *dev, const struct rc_map *from) @@ -434,14 +434,11 @@ static int ir_setkeytable(struct rc_dev *dev, unsigned int i, index; int rc; - rc = ir_create_table(rc_map, from->name, - from->rc_proto, from->size); + rc = ir_create_table(dev, rc_map, from->name, from->rc_proto, + from->size); if (rc) return rc; - IR_dprintk(1, "Allocated space for %u keycode entries (%u bytes)\n", - rc_map->size, rc_map->alloc); - for (i = 0; i < from->size; i++) { index = ir_establish_scancode(dev, rc_map, from->scan[i].scancode, false); @@ -460,43 +457,49 @@ static int ir_setkeytable(struct rc_dev *dev, return rc; } +static int rc_map_cmp(const void *key, const void *elt) +{ + const unsigned int *scancode = key; + const struct rc_map_table *e = elt; + + if (*scancode < e->scancode) + return -1; + else if (*scancode > e->scancode) + return 1; + return 0; +} + /** * ir_lookup_by_scancode() - locate mapping by scancode * @rc_map: the struct rc_map to search * @scancode: scancode to look for in the table - * @return: index in the table, -1U if not found * * This routine performs binary search in RC keykeymap table for * given scancode. + * + * return: index in the table, -1U if not found */ static unsigned int ir_lookup_by_scancode(const struct rc_map *rc_map, unsigned int scancode) { - int start = 0; - int end = rc_map->len - 1; - int mid; + struct rc_map_table *res; - while (start <= end) { - mid = (start + end) / 2; - if (rc_map->scan[mid].scancode < scancode) - start = mid + 1; - else if (rc_map->scan[mid].scancode > scancode) - end = mid - 1; - else - return mid; - } - - return -1U; + res = bsearch(&scancode, rc_map->scan, rc_map->len, + sizeof(struct rc_map_table), rc_map_cmp); + if (!res) + return -1U; + else + return res - rc_map->scan; } /** * ir_getkeycode() - get a keycode from the scancode->keycode table * @idev: the struct input_dev device descriptor - * @scancode: the desired scancode - * @keycode: used to return the keycode, if found, or KEY_RESERVED - * @return: always returns zero. + * @ke: Input keymap entry * * This routine is used to handle evdev EVIOCGKEY ioctl. + * + * return: always returns zero. */ static int ir_getkeycode(struct input_dev *idev, struct input_keymap_entry *ke) @@ -553,11 +556,12 @@ out: * rc_g_keycode_from_table() - gets the keycode that corresponds to a scancode * @dev: the struct rc_dev descriptor of the device * @scancode: the scancode to look for - * @return: the corresponding keycode, or KEY_RESERVED * * This routine is used by drivers which need to convert a scancode to a * keycode. Normally it should not be used since drivers should have no * interest in keycodes. + * + * return: the corresponding keycode, or KEY_RESERVED */ u32 rc_g_keycode_from_table(struct rc_dev *dev, u32 scancode) { @@ -575,8 +579,8 @@ u32 rc_g_keycode_from_table(struct rc_dev *dev, u32 scancode) spin_unlock_irqrestore(&rc_map->lock, flags); if (keycode != KEY_RESERVED) - IR_dprintk(1, "%s: scancode 0x%04x keycode 0x%02x\n", - dev->device_name, scancode, keycode); + dev_dbg(&dev->dev, "%s: scancode 0x%04x keycode 0x%02x\n", + dev->device_name, scancode, keycode); return keycode; } @@ -595,7 +599,8 @@ static void ir_do_keyup(struct rc_dev *dev, bool sync) if (!dev->keypressed) return; - IR_dprintk(1, "keyup key 0x%04x\n", dev->last_keycode); + dev_dbg(&dev->dev, "keyup key 0x%04x\n", dev->last_keycode); + del_timer(&dev->timer_repeat); input_report_key(dev->input_dev, dev->last_keycode, 0); led_trigger_event(led_feedback, LED_OFF); if (sync) @@ -622,14 +627,15 @@ EXPORT_SYMBOL_GPL(rc_keyup); /** * ir_timer_keyup() - generates a keyup event after a timeout - * @cookie: a pointer to the struct rc_dev for the device + * + * @t: a pointer to the struct timer_list * * This routine will generate a keyup event some time after a keydown event * is generated when no further activity has been detected. */ -static void ir_timer_keyup(unsigned long cookie) +static void ir_timer_keyup(struct timer_list *t) { - struct rc_dev *dev = (struct rc_dev *)cookie; + struct rc_dev *dev = from_timer(dev, t, timer_keyup); unsigned long flags; /* @@ -648,6 +654,39 @@ static void ir_timer_keyup(unsigned long cookie) spin_unlock_irqrestore(&dev->keylock, flags); } +/** + * ir_timer_repeat() - generates a repeat event after a timeout + * + * @t: a pointer to the struct timer_list + * + * This routine will generate a soft repeat event every REP_PERIOD + * milliseconds. + */ +static void ir_timer_repeat(struct timer_list *t) +{ + struct rc_dev *dev = from_timer(dev, t, timer_repeat); + struct input_dev *input = dev->input_dev; + unsigned long flags; + + spin_lock_irqsave(&dev->keylock, flags); + if (dev->keypressed) { + input_event(input, EV_KEY, dev->last_keycode, 2); + input_sync(input); + if (input->rep[REP_PERIOD]) + mod_timer(&dev->timer_repeat, jiffies + + msecs_to_jiffies(input->rep[REP_PERIOD])); + } + spin_unlock_irqrestore(&dev->keylock, flags); +} + +static unsigned int repeat_period(int protocol) +{ + if (protocol >= ARRAY_SIZE(protocols)) + return 100; + + return protocols[protocol].repeat_period; +} + /** * rc_repeat() - signals that a key is still pressed * @dev: the struct rc_dev descriptor of the device @@ -659,20 +698,28 @@ static void ir_timer_keyup(unsigned long cookie) void rc_repeat(struct rc_dev *dev) { unsigned long flags; - unsigned int timeout = protocols[dev->last_protocol].repeat_period; + unsigned int timeout = nsecs_to_jiffies(dev->timeout) + + msecs_to_jiffies(repeat_period(dev->last_protocol)); + struct lirc_scancode sc = { + .scancode = dev->last_scancode, .rc_proto = dev->last_protocol, + .keycode = dev->keypressed ? dev->last_keycode : KEY_RESERVED, + .flags = LIRC_SCANCODE_FLAG_REPEAT | + (dev->last_toggle ? LIRC_SCANCODE_FLAG_TOGGLE : 0) + }; + + if (dev->allowed_protocols != RC_PROTO_BIT_CEC) + ir_lirc_scancode_event(dev, &sc); spin_lock_irqsave(&dev->keylock, flags); - if (!dev->keypressed) - goto out; - input_event(dev->input_dev, EV_MSC, MSC_SCAN, dev->last_scancode); input_sync(dev->input_dev); - dev->keyup_jiffies = jiffies + msecs_to_jiffies(timeout); - mod_timer(&dev->timer_keyup, dev->keyup_jiffies); + if (dev->keypressed) { + dev->keyup_jiffies = jiffies + timeout; + mod_timer(&dev->timer_keyup, dev->keyup_jiffies); + } -out: spin_unlock_irqrestore(&dev->keylock, flags); } EXPORT_SYMBOL_GPL(rc_repeat); @@ -695,27 +742,52 @@ static void ir_do_keydown(struct rc_dev *dev, enum rc_proto protocol, dev->last_protocol != protocol || dev->last_scancode != scancode || dev->last_toggle != toggle); + struct lirc_scancode sc = { + .scancode = scancode, .rc_proto = protocol, + .flags = toggle ? LIRC_SCANCODE_FLAG_TOGGLE : 0, + .keycode = keycode + }; + + if (dev->allowed_protocols != RC_PROTO_BIT_CEC) + ir_lirc_scancode_event(dev, &sc); if (new_event && dev->keypressed) ir_do_keyup(dev, false); input_event(dev->input_dev, EV_MSC, MSC_SCAN, scancode); + dev->last_protocol = protocol; + dev->last_scancode = scancode; + dev->last_toggle = toggle; + dev->last_keycode = keycode; + if (new_event && keycode != KEY_RESERVED) { /* Register a keypress */ dev->keypressed = true; - dev->last_protocol = protocol; - dev->last_scancode = scancode; - dev->last_toggle = toggle; - dev->last_keycode = keycode; - IR_dprintk(1, "%s: key down event, key 0x%04x, protocol 0x%04x, scancode 0x%08x\n", - dev->device_name, keycode, protocol, scancode); + dev_dbg(&dev->dev, "%s: key down event, key 0x%04x, protocol 0x%04x, scancode 0x%08x\n", + dev->device_name, keycode, protocol, scancode); input_report_key(dev->input_dev, keycode, 1); led_trigger_event(led_feedback, LED_FULL); } + /* + * For CEC, start sending repeat messages as soon as the first + * repeated message is sent, as long as REP_DELAY = 0 and REP_PERIOD + * is non-zero. Otherwise, the input layer will generate repeat + * messages. + */ + if (!new_event && keycode != KEY_RESERVED && + dev->allowed_protocols == RC_PROTO_BIT_CEC && + !timer_pending(&dev->timer_repeat) && + dev->input_dev->rep[REP_PERIOD] && + !dev->input_dev->rep[REP_DELAY]) { + input_event(dev->input_dev, EV_KEY, keycode, 2); + mod_timer(&dev->timer_repeat, jiffies + + msecs_to_jiffies(dev->input_dev->rep[REP_PERIOD])); + } + input_sync(dev->input_dev); } @@ -740,8 +812,8 @@ void rc_keydown(struct rc_dev *dev, enum rc_proto protocol, u32 scancode, ir_do_keydown(dev, protocol, scancode, keycode, toggle); if (dev->keypressed) { - dev->keyup_jiffies = jiffies + - msecs_to_jiffies(protocols[protocol].repeat_period); + dev->keyup_jiffies = jiffies + nsecs_to_jiffies(dev->timeout) + + msecs_to_jiffies(repeat_period(protocol)); mod_timer(&dev->timer_keyup, dev->keyup_jiffies); } spin_unlock_irqrestore(&dev->keylock, flags); @@ -772,12 +844,58 @@ void rc_keydown_notimeout(struct rc_dev *dev, enum rc_proto protocol, } EXPORT_SYMBOL_GPL(rc_keydown_notimeout); +/** + * rc_validate_scancode() - checks that a scancode is valid for a protocol. + * For nec, it should do the opposite of ir_nec_bytes_to_scancode() + * @proto: protocol + * @scancode: scancode + */ +bool rc_validate_scancode(enum rc_proto proto, u32 scancode) +{ + switch (proto) { + /* + * NECX has a 16-bit address; if the lower 8 bits match the upper + * 8 bits inverted, then the address would match regular nec. + */ + case RC_PROTO_NECX: + if ((((scancode >> 16) ^ ~(scancode >> 8)) & 0xff) == 0) + return false; + break; + /* + * NEC32 has a 16 bit address and 16 bit command. If the lower 8 bits + * of the command match the upper 8 bits inverted, then it would + * be either NEC or NECX. + */ + case RC_PROTO_NEC32: + if ((((scancode >> 8) ^ ~scancode) & 0xff) == 0) + return false; + break; + /* + * If the customer code (top 32-bit) is 0x800f, it is MCE else it + * is regular mode-6a 32 bit + */ + case RC_PROTO_RC6_MCE: + if ((scancode & 0xffff0000) != 0x800f0000) + return false; + break; + case RC_PROTO_RC6_6A_32: + if ((scancode & 0xffff0000) == 0x800f0000) + return false; + break; + default: + break; + } + + return true; +} + /** * rc_validate_filter() - checks that the scancode and mask are valid and * provides sensible defaults * @dev: the struct rc_dev descriptor of the device * @filter: the scancode and mask - * @return: 0 or -EINVAL if the filter is not valid + * + * return: 0 or -EINVAL if the filter is not valid */ static int rc_validate_filter(struct rc_dev *dev, struct rc_scancode_filter *filter) @@ -790,26 +908,8 @@ static int rc_validate_filter(struct rc_dev *dev, mask = protocols[protocol].scancode_bits; - switch (protocol) { - case RC_PROTO_NECX: - if ((((s >> 16) ^ ~(s >> 8)) & 0xff) == 0) - return -EINVAL; - break; - case RC_PROTO_NEC32: - if ((((s >> 24) ^ ~(s >> 16)) & 0xff) == 0) - return -EINVAL; - break; - case RC_PROTO_RC6_MCE: - if ((s & 0xffff0000) != 0x800f0000) - return -EINVAL; - break; - case RC_PROTO_RC6_6A_32: - if ((s & 0xffff0000) == 0x800f0000) - return -EINVAL; - break; - default: - break; - } + if (!rc_validate_scancode(protocol, s)) + return -EINVAL; filter->data &= mask; filter->mask &= mask; @@ -832,17 +932,20 @@ int rc_open(struct rc_dev *rdev) mutex_lock(&rdev->lock); - if (!rdev->users++ && rdev->open != NULL) - rval = rdev->open(rdev); + if (!rdev->registered) { + rval = -ENODEV; + } else { + if (!rdev->users++ && rdev->open) + rval = rdev->open(rdev); - if (rval) - rdev->users--; + if (rval) + rdev->users--; + } mutex_unlock(&rdev->lock); return rval; } -EXPORT_SYMBOL_GPL(rc_open); static int ir_open(struct input_dev *idev) { @@ -856,13 +959,12 @@ void rc_close(struct rc_dev *rdev) if (rdev) { mutex_lock(&rdev->lock); - if (!--rdev->users && rdev->close != NULL) + if (!--rdev->users && rdev->close && rdev->registered) rdev->close(rdev); mutex_unlock(&rdev->lock); } } -EXPORT_SYMBOL_GPL(rc_close); static void ir_close(struct input_dev *idev) { @@ -915,6 +1017,7 @@ static const struct { RC_PROTO_BIT_MCIR2_MSE, "mce_kbd", "ir-mce_kbd-decoder" }, { RC_PROTO_BIT_XMP, "xmp", "ir-xmp-decoder" }, { RC_PROTO_BIT_CEC, "cec", NULL }, + { RC_PROTO_BIT_IMON, "imon", "ir-imon-decoder" }, }; /** @@ -937,23 +1040,6 @@ struct rc_filter_attribute { .mask = (_mask), \ } -static bool lirc_is_present(void) -{ -#if defined(CONFIG_LIRC_MODULE) - struct module *lirc; - - mutex_lock(&module_mutex); - lirc = find_module("lirc_dev"); - mutex_unlock(&module_mutex); - - return lirc ? true : false; -#elif defined(CONFIG_LIRC) - return true; -#else - return false; -#endif -} - /** * show_protocols() - shows the current IR protocol(s) * @device: the device descriptor @@ -985,8 +1071,8 @@ static ssize_t show_protocols(struct device *device, mutex_unlock(&dev->lock); - IR_dprintk(1, "%s: allowed - 0x%llx, enabled - 0x%llx\n", - __func__, (long long)allowed, (long long)enabled); + dev_dbg(&dev->dev, "%s: allowed - 0x%llx, enabled - 0x%llx\n", + __func__, (long long)allowed, (long long)enabled); for (i = 0; i < ARRAY_SIZE(proto_names); i++) { if (allowed & enabled & proto_names[i].type) @@ -998,8 +1084,10 @@ static ssize_t show_protocols(struct device *device, allowed &= ~proto_names[i].type; } - if (dev->driver_type == RC_DRIVER_IR_RAW && lirc_is_present()) +#ifdef CONFIG_LIRC + if (dev->driver_type == RC_DRIVER_IR_RAW) tmp += sprintf(tmp, "[lirc] "); +#endif if (tmp != buf) tmp--; @@ -1010,6 +1098,7 @@ static ssize_t show_protocols(struct device *device, /** * parse_protocol_change() - parses a protocol change request + * @dev: rc_dev device * @protocols: pointer to the bitmask of current protocols * @buf: pointer to the buffer with a list of changes * @@ -1019,7 +1108,8 @@ static ssize_t show_protocols(struct device *device, * Writing "none" will disable all protocols. * Returns the number of changes performed or a negative error code. */ -static int parse_protocol_change(u64 *protocols, const char *buf) +static int parse_protocol_change(struct rc_dev *dev, u64 *protocols, + const char *buf) { const char *tmp; unsigned count = 0; @@ -1055,7 +1145,8 @@ static int parse_protocol_change(u64 *protocols, const char *buf) if (!strcasecmp(tmp, "lirc")) mask = 0; else { - IR_dprintk(1, "Unknown protocol: '%s'\n", tmp); + dev_dbg(&dev->dev, "Unknown protocol: '%s'\n", + tmp); return -EINVAL; } } @@ -1071,14 +1162,14 @@ static int parse_protocol_change(u64 *protocols, const char *buf) } if (!count) { - IR_dprintk(1, "Protocol not specified\n"); + dev_dbg(&dev->dev, "Protocol not specified\n"); return -EINVAL; } return count; } -static void ir_raw_load_modules(u64 *protocols) +void ir_raw_load_modules(u64 *protocols) { u64 available; int i, ret; @@ -1144,37 +1235,41 @@ static ssize_t store_protocols(struct device *device, u64 old_protocols, new_protocols; ssize_t rc; - IR_dprintk(1, "Normal protocol change requested\n"); + dev_dbg(&dev->dev, "Normal protocol change requested\n"); current_protocols = &dev->enabled_protocols; filter = &dev->scancode_filter; if (!dev->change_protocol) { - IR_dprintk(1, "Protocol switching not supported\n"); + dev_dbg(&dev->dev, "Protocol switching not supported\n"); return -EINVAL; } mutex_lock(&dev->lock); + if (!dev->registered) { + mutex_unlock(&dev->lock); + return -ENODEV; + } old_protocols = *current_protocols; new_protocols = old_protocols; - rc = parse_protocol_change(&new_protocols, buf); + rc = parse_protocol_change(dev, &new_protocols, buf); if (rc < 0) goto out; - rc = dev->change_protocol(dev, &new_protocols); - if (rc < 0) { - IR_dprintk(1, "Error setting protocols to 0x%llx\n", - (long long)new_protocols); - goto out; - } - if (dev->driver_type == RC_DRIVER_IR_RAW) ir_raw_load_modules(&new_protocols); + rc = dev->change_protocol(dev, &new_protocols); + if (rc < 0) { + dev_dbg(&dev->dev, "Error setting protocols to 0x%llx\n", + (long long)new_protocols); + goto out; + } + if (new_protocols != old_protocols) { *current_protocols = new_protocols; - IR_dprintk(1, "Protocols changed to 0x%llx\n", - (long long)new_protocols); + dev_dbg(&dev->dev, "Protocols changed to 0x%llx\n", + (long long)new_protocols); } /* @@ -1292,6 +1387,10 @@ static ssize_t store_filter(struct device *device, return -EINVAL; mutex_lock(&dev->lock); + if (!dev->registered) { + mutex_unlock(&dev->lock); + return -ENODEV; + } new_filter = *filter; if (fattr->mask) @@ -1362,8 +1461,8 @@ static ssize_t show_wakeup_protocols(struct device *device, mutex_unlock(&dev->lock); - IR_dprintk(1, "%s: allowed - 0x%llx, enabled - %d\n", - __func__, (long long)allowed, enabled); + dev_dbg(&dev->dev, "%s: allowed - 0x%llx, enabled - %d\n", + __func__, (long long)allowed, enabled); for (i = 0; i < ARRAY_SIZE(protocols); i++) { if (allowed & (1ULL << i)) { @@ -1406,6 +1505,10 @@ static ssize_t store_wakeup_protocols(struct device *device, int i; mutex_lock(&dev->lock); + if (!dev->registered) { + mutex_unlock(&dev->lock); + return -ENODEV; + } allowed = dev->allowed_wakeup_protocols; @@ -1438,7 +1541,7 @@ static ssize_t store_wakeup_protocols(struct device *device, if (dev->wakeup_protocol != protocol) { dev->wakeup_protocol = protocol; - IR_dprintk(1, "Wakeup protocol changed to %d\n", protocol); + dev_dbg(&dev->dev, "Wakeup protocol changed to %d\n", protocol); if (protocol == RC_PROTO_RC6_MCE) dev->scancode_wakeup_filter.data = 0x800f0000; @@ -1465,29 +1568,34 @@ static void rc_dev_release(struct device *device) kfree(dev); } -#define ADD_HOTPLUG_VAR(fmt, val...) \ - do { \ - int err = add_uevent_var(env, fmt, val); \ - if (err) \ - return err; \ - } while (0) - static int rc_dev_uevent(struct device *device, struct kobj_uevent_env *env) { struct rc_dev *dev = to_rc_dev(device); + int ret = 0; - if (dev->rc_map.name) - ADD_HOTPLUG_VAR("NAME=%s", dev->rc_map.name); - if (dev->driver_name) - ADD_HOTPLUG_VAR("DRV_NAME=%s", dev->driver_name); + mutex_lock(&dev->lock); - return 0; + if (!dev->registered) + ret = -ENODEV; + if (ret == 0 && dev->rc_map.name) + ret = add_uevent_var(env, "NAME=%s", dev->rc_map.name); + if (ret == 0 && dev->driver_name) + ret = add_uevent_var(env, "DRV_NAME=%s", dev->driver_name); + if (ret == 0 && dev->device_name) + ret = add_uevent_var(env, "DEV_NAME=%s", dev->device_name); + + mutex_unlock(&dev->lock); + + return ret; } /* * Static device attribute struct with the sysfs attributes for IR's */ -static DEVICE_ATTR(protocols, 0644, show_protocols, store_protocols); +static struct device_attribute dev_attr_ro_protocols = +__ATTR(protocols, 0444, show_protocols, NULL); +static struct device_attribute dev_attr_rw_protocols = +__ATTR(protocols, 0644, show_protocols, store_protocols); static DEVICE_ATTR(wakeup_protocols, 0644, show_wakeup_protocols, store_wakeup_protocols); static RC_FILTER_ATTR(filter, S_IRUGO|S_IWUSR, @@ -1499,13 +1607,22 @@ static RC_FILTER_ATTR(wakeup_filter, S_IRUGO|S_IWUSR, static RC_FILTER_ATTR(wakeup_filter_mask, S_IRUGO|S_IWUSR, show_filter, store_filter, RC_FILTER_WAKEUP, true); -static struct attribute *rc_dev_protocol_attrs[] = { - &dev_attr_protocols.attr, +static struct attribute *rc_dev_rw_protocol_attrs[] = { + &dev_attr_rw_protocols.attr, NULL, }; -static const struct attribute_group rc_dev_protocol_attr_grp = { - .attrs = rc_dev_protocol_attrs, +static const struct attribute_group rc_dev_rw_protocol_attr_grp = { + .attrs = rc_dev_rw_protocol_attrs, +}; + +static struct attribute *rc_dev_ro_protocol_attrs[] = { + &dev_attr_ro_protocols.attr, + NULL, +}; + +static const struct attribute_group rc_dev_ro_protocol_attr_grp = { + .attrs = rc_dev_ro_protocol_attrs, }; static struct attribute *rc_dev_filter_attrs[] = { @@ -1529,7 +1646,7 @@ static const struct attribute_group rc_dev_wakeup_filter_attr_grp = { .attrs = rc_dev_wakeup_filter_attrs, }; -static struct device_type rc_dev_type = { +static const struct device_type rc_dev_type = { .release = rc_dev_release, .uevent = rc_dev_uevent, }; @@ -1553,8 +1670,9 @@ struct rc_dev *rc_allocate_device(enum rc_driver_type type) dev->input_dev->setkeycode = ir_setkeycode; input_set_drvdata(dev->input_dev, dev); - setup_timer(&dev->timer_keyup, ir_timer_keyup, - (unsigned long)dev); + dev->timeout = IR_DEFAULT_TIMEOUT; + timer_setup(&dev->timer_keyup, ir_timer_keyup, 0); + timer_setup(&dev->timer_repeat, ir_timer_repeat, 0); spin_lock_init(&dev->rc_map.lock); spin_lock_init(&dev->keylock); @@ -1638,6 +1756,12 @@ static int rc_prepare_rx_device(struct rc_dev *dev) rc_proto = BIT_ULL(rc_map->rc_proto); + if (dev->driver_type == RC_DRIVER_SCANCODE && !dev->change_protocol) + dev->enabled_protocols = dev->allowed_protocols; + + if (dev->driver_type == RC_DRIVER_IR_RAW) + ir_raw_load_modules(&rc_proto); + if (dev->change_protocol) { rc = dev->change_protocol(dev, &rc_proto); if (rc < 0) @@ -1645,9 +1769,6 @@ static int rc_prepare_rx_device(struct rc_dev *dev) dev->enabled_protocols = rc_proto; } - if (dev->driver_type == RC_DRIVER_IR_RAW) - ir_raw_load_modules(&rc_proto); - set_bit(EV_KEY, dev->input_dev->evbit); set_bit(EV_REP, dev->input_dev->evbit); set_bit(EV_MSC, dev->input_dev->evbit); @@ -1685,7 +1806,10 @@ static int rc_setup_rx_device(struct rc_dev *dev) * to avoid wrong repetition of the keycodes. Note that this must be * set after the call to input_register_device(). */ - dev->input_dev->rep[REP_DELAY] = 500; + if (dev->allowed_protocols == RC_PROTO_BIT_CEC) + dev->input_dev->rep[REP_DELAY] = 0; + else + dev->input_dev->rep[REP_DELAY] = 500; /* * As a repeat event on protocols like RC-5 and NEC take as long as @@ -1729,16 +1853,17 @@ int rc_register_device(struct rc_dev *dev) dev_set_drvdata(&dev->dev, dev); dev->dev.groups = dev->sysfs_groups; - if (dev->driver_type != RC_DRIVER_IR_RAW_TX) - dev->sysfs_groups[attr++] = &rc_dev_protocol_attr_grp; + if (dev->driver_type == RC_DRIVER_SCANCODE && !dev->change_protocol) + dev->sysfs_groups[attr++] = &rc_dev_ro_protocol_attr_grp; + else if (dev->driver_type != RC_DRIVER_IR_RAW_TX) + dev->sysfs_groups[attr++] = &rc_dev_rw_protocol_attr_grp; if (dev->s_filter) dev->sysfs_groups[attr++] = &rc_dev_filter_attr_grp; if (dev->s_wakeup_filter) dev->sysfs_groups[attr++] = &rc_dev_wakeup_filter_attr_grp; dev->sysfs_groups[attr++] = NULL; - if (dev->driver_type == RC_DRIVER_IR_RAW || - dev->driver_type == RC_DRIVER_IR_RAW_TX) { + if (dev->driver_type == RC_DRIVER_IR_RAW) { rc = ir_raw_event_prepare(dev); if (rc < 0) goto out_minor; @@ -1750,6 +1875,8 @@ int rc_register_device(struct rc_dev *dev) goto out_raw; } + dev->registered = true; + rc = device_add(&dev->dev); if (rc) goto out_rx_free; @@ -1759,27 +1886,40 @@ int rc_register_device(struct rc_dev *dev) dev->device_name ?: "Unspecified device", path ?: "N/A"); kfree(path); - if (dev->driver_type != RC_DRIVER_IR_RAW_TX) { - rc = rc_setup_rx_device(dev); - if (rc) + /* + * once the the input device is registered in rc_setup_rx_device, + * userspace can open the input device and rc_open() will be called + * as a result. This results in driver code being allowed to submit + * keycodes with rc_keydown, so lirc must be registered first. + */ + if (dev->allowed_protocols != RC_PROTO_BIT_CEC) { + rc = ir_lirc_register(dev); + if (rc < 0) goto out_dev; } - if (dev->driver_type == RC_DRIVER_IR_RAW || - dev->driver_type == RC_DRIVER_IR_RAW_TX) { + if (dev->driver_type != RC_DRIVER_IR_RAW_TX) { + rc = rc_setup_rx_device(dev); + if (rc) + goto out_lirc; + } + + if (dev->driver_type == RC_DRIVER_IR_RAW) { rc = ir_raw_event_register(dev); if (rc < 0) goto out_rx; } - IR_dprintk(1, "Registered rc%u (driver: %s)\n", - dev->minor, - dev->driver_name ? dev->driver_name : "unknown"); + dev_dbg(&dev->dev, "Registered rc%u (driver: %s)\n", dev->minor, + dev->driver_name ? dev->driver_name : "unknown"); return 0; out_rx: rc_free_rx_device(dev); +out_lirc: + if (dev->allowed_protocols != RC_PROTO_BIT_CEC) + ir_lirc_unregister(dev); out_dev: device_del(&dev->dev); out_rx_free: @@ -1828,9 +1968,23 @@ void rc_unregister_device(struct rc_dev *dev) ir_raw_event_unregister(dev); del_timer_sync(&dev->timer_keyup); + del_timer_sync(&dev->timer_repeat); + + mutex_lock(&dev->lock); + if (dev->users && dev->close) + dev->close(dev); + dev->registered = false; + mutex_unlock(&dev->lock); rc_free_rx_device(dev); + /* + * lirc device should be freed with dev->registered = false, so + * that userspace polling will get notified. + */ + if (dev->allowed_protocols != RC_PROTO_BIT_CEC) + ir_lirc_unregister(dev); + device_del(&dev->dev); ida_simple_remove(&rc_ida, dev->minor); @@ -1853,6 +2007,13 @@ static int __init rc_core_init(void) return rc; } + rc = lirc_dev_init(); + if (rc) { + pr_err("rc_core: unable to init lirc\n"); + class_unregister(&rc_class); + return 0; + } + led_trigger_register_simple("rc-feedback", &led_feedback); rc_map_register(&empty_map); @@ -1861,6 +2022,7 @@ static int __init rc_core_init(void) static void __exit rc_core_exit(void) { + lirc_dev_exit(); class_unregister(&rc_class); led_trigger_unregister_simple(led_feedback); rc_map_unregister(&empty_map); @@ -1869,9 +2031,5 @@ static void __exit rc_core_exit(void) subsys_initcall(rc_core_init); module_exit(rc_core_exit); -int rc_core_debug; /* ir_debug level (0,1,2) */ -EXPORT_SYMBOL_GPL(rc_core_debug); -module_param_named(debug, rc_core_debug, int, 0644); - MODULE_AUTHOR("Mauro Carvalho Chehab"); -MODULE_LICENSE("GPL"); +MODULE_LICENSE("GPL v2"); diff --git a/drivers/media/rc/redrat3.c b/drivers/media/rc/redrat3.c index d6fd924fea53..14be14b0b0b0 100644 --- a/drivers/media/rc/redrat3.c +++ b/drivers/media/rc/redrat3.c @@ -186,7 +186,7 @@ struct redrat3_error { } __packed; /* table of devices that work with this driver */ -static struct usb_device_id redrat3_dev_table[] = { +static const struct usb_device_id redrat3_dev_table[] = { /* Original version of the RedRat3 */ {USB_DEVICE(USB_RR3USB_VENDOR_ID, USB_RR3USB_PRODUCT_ID)}, /* Second Version/release of the RedRat3 - RetRat3-II */ diff --git a/drivers/media/rc/serial_ir.c b/drivers/media/rc/serial_ir.c index 842c121dca2d..e613c0175591 100644 --- a/drivers/media/rc/serial_ir.c +++ b/drivers/media/rc/serial_ir.c @@ -470,7 +470,7 @@ static int hardware_init_port(void) return 0; } -static void serial_ir_timeout(unsigned long arg) +static void serial_ir_timeout(struct timer_list *unused) { DEFINE_IR_RAW_EVENT(ev); @@ -540,8 +540,7 @@ static int serial_ir_probe(struct platform_device *dev) serial_ir.rcdev = rcdev; - setup_timer(&serial_ir.timeout_timer, serial_ir_timeout, - (unsigned long)&serial_ir); + timer_setup(&serial_ir.timeout_timer, serial_ir_timeout, 0); result = devm_request_irq(&dev->dev, irq, serial_ir_irq_handler, share_irq ? IRQF_SHARED : 0, diff --git a/drivers/media/rc/sir_ir.c b/drivers/media/rc/sir_ir.c index d59918878eb2..9ee2c9196b4d 100644 --- a/drivers/media/rc/sir_ir.c +++ b/drivers/media/rc/sir_ir.c @@ -120,7 +120,7 @@ static void add_read_queue(int flag, unsigned long val) } /* SECTION: Hardware */ -static void sir_timeout(unsigned long data) +static void sir_timeout(struct timer_list *unused) { /* * if last received signal was a pulse, but receiving stopped @@ -348,7 +348,7 @@ static int sir_ir_probe(struct platform_device *dev) rcdev->timeout = IR_DEFAULT_TIMEOUT; rcdev->dev.parent = &sir_ir_dev->dev; - setup_timer(&timerlist, sir_timeout, 0); + timer_setup(&timerlist, sir_timeout, 0); /* get I/O port access and IRQ line */ if (!devm_request_region(&sir_ir_dev->dev, io, 8, KBUILD_MODNAME)) { diff --git a/drivers/media/rc/st_rc.c b/drivers/media/rc/st_rc.c index a8e39c635f34..c855b177103c 100644 --- a/drivers/media/rc/st_rc.c +++ b/drivers/media/rc/st_rc.c @@ -49,7 +49,7 @@ struct st_rc_device { #define IRB_RX_NOISE_SUPPR 0x5c /* noise suppression */ #define IRB_RX_POLARITY_INV 0x68 /* polarity inverter */ -/** +/* * IRQ set: Enable full FIFO 1 -> bit 3; * Enable overrun IRQ 1 -> bit 2; * Enable last symbol IRQ 1 -> bit 1: @@ -72,7 +72,7 @@ static void st_rc_send_lirc_timeout(struct rc_dev *rdev) ir_raw_event_store(rdev, &ev); } -/** +/* * RX graphical example to better understand the difference between ST IR block * output and standard definition used by LIRC (and most of the world!) * @@ -96,19 +96,24 @@ static void st_rc_send_lirc_timeout(struct rc_dev *rdev) static irqreturn_t st_rc_rx_interrupt(int irq, void *data) { + unsigned long timeout; unsigned int symbol, mark = 0; struct st_rc_device *dev = data; int last_symbol = 0; - u32 status; + u32 status, int_status; DEFINE_IR_RAW_EVENT(ev); if (dev->irq_wake) pm_wakeup_event(dev->dev, 0); - status = readl(dev->rx_base + IRB_RX_STATUS); + /* FIXME: is 10ms good enough ? */ + timeout = jiffies + msecs_to_jiffies(10); + do { + status = readl(dev->rx_base + IRB_RX_STATUS); + if (!(status & (IRB_FIFO_NOT_EMPTY | IRB_OVERFLOW))) + break; - while (status & (IRB_FIFO_NOT_EMPTY | IRB_OVERFLOW)) { - u32 int_status = readl(dev->rx_base + IRB_RX_INT_STATUS); + int_status = readl(dev->rx_base + IRB_RX_INT_STATUS); if (unlikely(int_status & IRB_RX_OVERRUN_INT)) { /* discard the entire collection in case of errors! */ ir_raw_event_reset(dev->rdev); @@ -148,8 +153,7 @@ static irqreturn_t st_rc_rx_interrupt(int irq, void *data) } last_symbol = 0; - status = readl(dev->rx_base + IRB_RX_STATUS); - } + } while (time_is_after_jiffies(timeout)); writel(IRB_RX_INTS, dev->rx_base + IRB_RX_INT_CLEAR); @@ -317,7 +321,7 @@ static int st_rc_probe(struct platform_device *pdev) device_init_wakeup(dev, true); dev_pm_set_wake_irq(dev, rc_dev->irq); - /** + /* * for LIRC_MODE_MODE2 or LIRC_MODE_PULSE or LIRC_MODE_RAW * lircd expects a long space first before a signal train to sync. */ diff --git a/drivers/media/rc/streamzap.c b/drivers/media/rc/streamzap.c index f03a174ddf9d..c9a70fda88a8 100644 --- a/drivers/media/rc/streamzap.c +++ b/drivers/media/rc/streamzap.c @@ -43,7 +43,7 @@ #define USB_STREAMZAP_PRODUCT_ID 0x0000 /* table of devices that work with this driver */ -static struct usb_device_id streamzap_table[] = { +static const struct usb_device_id streamzap_table[] = { /* Streamzap Remote Control */ { USB_DEVICE(USB_STREAMZAP_VENDOR_ID, USB_STREAMZAP_PRODUCT_ID) }, /* Terminating entry */ @@ -191,7 +191,7 @@ static void sz_push_half_space(struct streamzap_ir *sz, sz_push_full_space(sz, value & SZ_SPACE_MASK); } -/** +/* * streamzap_callback - usb IRQ handler callback * * This procedure is invoked on reception of data from @@ -321,7 +321,7 @@ out: return NULL; } -/** +/* * streamzap_probe * * Called by usb-core to associated with a candidate device @@ -450,7 +450,7 @@ free_sz: return retval; } -/** +/* * streamzap_disconnect * * Called by the usb core when the device is removed from the system. diff --git a/drivers/media/rc/sunxi-cir.c b/drivers/media/rc/sunxi-cir.c index bc026a7116ec..0114e81fa6fa 100644 --- a/drivers/media/rc/sunxi-cir.c +++ b/drivers/media/rc/sunxi-cir.c @@ -72,12 +72,8 @@ /* CIR_REG register idle threshold */ #define REG_CIR_ITHR(val) (((val) << 8) & (GENMASK(15, 8))) -/* Required frequency for IR0 or IR1 clock in CIR mode */ +/* Required frequency for IR0 or IR1 clock in CIR mode (default) */ #define SUNXI_IR_BASE_CLK 8000000 -/* Frequency after IR internal divider */ -#define SUNXI_IR_CLK (SUNXI_IR_BASE_CLK / 64) -/* Sample period in ns */ -#define SUNXI_IR_SAMPLE (1000000000ul / SUNXI_IR_CLK) /* Noise threshold in samples */ #define SUNXI_IR_RXNOISE 1 /* Idle Threshold in samples */ @@ -122,7 +118,8 @@ static irqreturn_t sunxi_ir_irq(int irqno, void *dev_id) /* for each bit in fifo */ dt = readb(ir->base + SUNXI_IR_RXFIFO_REG); rawir.pulse = (dt & 0x80) != 0; - rawir.duration = ((dt & 0x7f) + 1) * SUNXI_IR_SAMPLE; + rawir.duration = ((dt & 0x7f) + 1) * + ir->rc->rx_resolution; ir_raw_event_store_with_filter(ir->rc, &rawir); } } @@ -150,6 +147,7 @@ static int sunxi_ir_probe(struct platform_device *pdev) struct device_node *dn = dev->of_node; struct resource *res; struct sunxi_ir *ir; + u32 b_clk_freq = SUNXI_IR_BASE_CLK; ir = devm_kzalloc(dev, sizeof(struct sunxi_ir), GFP_KERNEL); if (!ir) @@ -174,6 +172,9 @@ static int sunxi_ir_probe(struct platform_device *pdev) return PTR_ERR(ir->clk); } + /* Base clock frequency (optional) */ + of_property_read_u32(dn, "clock-frequency", &b_clk_freq); + /* Reset (optional) */ ir->rst = devm_reset_control_get_optional_exclusive(dev, NULL); if (IS_ERR(ir->rst)) @@ -182,11 +183,12 @@ static int sunxi_ir_probe(struct platform_device *pdev) if (ret) return ret; - ret = clk_set_rate(ir->clk, SUNXI_IR_BASE_CLK); + ret = clk_set_rate(ir->clk, b_clk_freq); if (ret) { dev_err(dev, "set ir base clock failed!\n"); goto exit_reset_assert; } + dev_dbg(dev, "set base clock frequency to %d Hz.\n", b_clk_freq); if (clk_prepare_enable(ir->apb_clk)) { dev_err(dev, "try to enable apb_ir_clk failed\n"); @@ -227,7 +229,8 @@ static int sunxi_ir_probe(struct platform_device *pdev) ir->rc->map_name = ir->map_name ?: RC_MAP_EMPTY; ir->rc->dev.parent = dev; ir->rc->allowed_protocols = RC_PROTO_BIT_ALL_IR_DECODER; - ir->rc->rx_resolution = SUNXI_IR_SAMPLE; + /* Frequency after IR internal divider with sample period in ns */ + ir->rc->rx_resolution = (1000000000ul / (b_clk_freq / 64)); ir->rc->timeout = MS_TO_NS(SUNXI_IR_TIMEOUT); ir->rc->driver_name = SUNXI_IR_DEV; diff --git a/drivers/media/rc/tango-ir.c b/drivers/media/rc/tango-ir.c new file mode 100644 index 000000000000..9d4c17230c3a --- /dev/null +++ b/drivers/media/rc/tango-ir.c @@ -0,0 +1,281 @@ +/* + * Copyright (C) 2015 Mans Rullgard + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; either version 2 of the License, or (at your + * option) any later version. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRIVER_NAME "tango-ir" + +#define IR_NEC_CTRL 0x00 +#define IR_NEC_DATA 0x04 +#define IR_CTRL 0x08 +#define IR_RC5_CLK_DIV 0x0c +#define IR_RC5_DATA 0x10 +#define IR_INT 0x14 + +#define NEC_TIME_BASE 560 +#define RC5_TIME_BASE 1778 + +#define RC6_CTRL 0x00 +#define RC6_CLKDIV 0x04 +#define RC6_DATA0 0x08 +#define RC6_DATA1 0x0c +#define RC6_DATA2 0x10 +#define RC6_DATA3 0x14 +#define RC6_DATA4 0x18 + +#define RC6_CARRIER 36000 +#define RC6_TIME_BASE 16 + +#define NEC_CAP(n) ((n) << 24) +#define GPIO_SEL(n) ((n) << 16) +#define DISABLE_NEC (BIT(4) | BIT(8)) +#define ENABLE_RC5 (BIT(0) | BIT(9)) +#define ENABLE_RC6 (BIT(0) | BIT(7)) +#define ACK_IR_INT (BIT(0) | BIT(1)) +#define ACK_RC6_INT (BIT(31)) + +#define NEC_ANY (RC_PROTO_BIT_NEC | RC_PROTO_BIT_NECX | RC_PROTO_BIT_NEC32) + +struct tango_ir { + void __iomem *rc5_base; + void __iomem *rc6_base; + struct rc_dev *rc; + struct clk *clk; +}; + +static void tango_ir_handle_nec(struct tango_ir *ir) +{ + u32 v, code; + enum rc_proto proto; + + v = readl_relaxed(ir->rc5_base + IR_NEC_DATA); + if (!v) { + rc_repeat(ir->rc); + return; + } + + code = ir_nec_bytes_to_scancode(v, v >> 8, v >> 16, v >> 24, &proto); + rc_keydown(ir->rc, proto, code, 0); +} + +static void tango_ir_handle_rc5(struct tango_ir *ir) +{ + u32 data, field, toggle, addr, cmd, code; + + data = readl_relaxed(ir->rc5_base + IR_RC5_DATA); + if (data & BIT(31)) + return; + + field = data >> 12 & 1; + toggle = data >> 11 & 1; + addr = data >> 6 & 0x1f; + cmd = (data & 0x3f) | (field ^ 1) << 6; + + code = RC_SCANCODE_RC5(addr, cmd); + rc_keydown(ir->rc, RC_PROTO_RC5, code, toggle); +} + +static void tango_ir_handle_rc6(struct tango_ir *ir) +{ + u32 data0, data1, toggle, mode, addr, cmd, code; + + data0 = readl_relaxed(ir->rc6_base + RC6_DATA0); + data1 = readl_relaxed(ir->rc6_base + RC6_DATA1); + + mode = data0 >> 1 & 7; + if (mode != 0) + return; + + toggle = data0 & 1; + addr = data0 >> 16; + cmd = data1; + + code = RC_SCANCODE_RC6_0(addr, cmd); + rc_keydown(ir->rc, RC_PROTO_RC6_0, code, toggle); +} + +static irqreturn_t tango_ir_irq(int irq, void *dev_id) +{ + struct tango_ir *ir = dev_id; + unsigned int rc5_stat; + unsigned int rc6_stat; + + rc5_stat = readl_relaxed(ir->rc5_base + IR_INT); + writel_relaxed(rc5_stat, ir->rc5_base + IR_INT); + + rc6_stat = readl_relaxed(ir->rc6_base + RC6_CTRL); + writel_relaxed(rc6_stat, ir->rc6_base + RC6_CTRL); + + if (!(rc5_stat & 3) && !(rc6_stat & BIT(31))) + return IRQ_NONE; + + if (rc5_stat & BIT(0)) + tango_ir_handle_rc5(ir); + + if (rc5_stat & BIT(1)) + tango_ir_handle_nec(ir); + + if (rc6_stat & BIT(31)) + tango_ir_handle_rc6(ir); + + return IRQ_HANDLED; +} + +static int tango_change_protocol(struct rc_dev *dev, u64 *rc_type) +{ + struct tango_ir *ir = dev->priv; + u32 rc5_ctrl = DISABLE_NEC; + u32 rc6_ctrl = 0; + + if (*rc_type & NEC_ANY) + rc5_ctrl = 0; + + if (*rc_type & RC_PROTO_BIT_RC5) + rc5_ctrl |= ENABLE_RC5; + + if (*rc_type & RC_PROTO_BIT_RC6_0) + rc6_ctrl = ENABLE_RC6; + + writel_relaxed(rc5_ctrl, ir->rc5_base + IR_CTRL); + writel_relaxed(rc6_ctrl, ir->rc6_base + RC6_CTRL); + + return 0; +} + +static int tango_ir_probe(struct platform_device *pdev) +{ + const char *map_name = RC_MAP_TANGO; + struct device *dev = &pdev->dev; + struct rc_dev *rc; + struct tango_ir *ir; + struct resource *rc5_res; + struct resource *rc6_res; + u64 clkrate, clkdiv; + int irq, err; + u32 val; + + rc5_res = platform_get_resource(pdev, IORESOURCE_MEM, 0); + if (!rc5_res) + return -EINVAL; + + rc6_res = platform_get_resource(pdev, IORESOURCE_MEM, 1); + if (!rc6_res) + return -EINVAL; + + irq = platform_get_irq(pdev, 0); + if (irq <= 0) + return -EINVAL; + + ir = devm_kzalloc(dev, sizeof(*ir), GFP_KERNEL); + if (!ir) + return -ENOMEM; + + ir->rc5_base = devm_ioremap_resource(dev, rc5_res); + if (IS_ERR(ir->rc5_base)) + return PTR_ERR(ir->rc5_base); + + ir->rc6_base = devm_ioremap_resource(dev, rc6_res); + if (IS_ERR(ir->rc6_base)) + return PTR_ERR(ir->rc6_base); + + ir->clk = devm_clk_get(dev, NULL); + if (IS_ERR(ir->clk)) + return PTR_ERR(ir->clk); + + rc = devm_rc_allocate_device(dev, RC_DRIVER_SCANCODE); + if (!rc) + return -ENOMEM; + + of_property_read_string(dev->of_node, "linux,rc-map-name", &map_name); + + rc->device_name = DRIVER_NAME; + rc->driver_name = DRIVER_NAME; + rc->input_phys = DRIVER_NAME "/input0"; + rc->map_name = map_name; + rc->allowed_protocols = NEC_ANY | RC_PROTO_BIT_RC5 | RC_PROTO_BIT_RC6_0; + rc->change_protocol = tango_change_protocol; + rc->priv = ir; + ir->rc = rc; + + err = clk_prepare_enable(ir->clk); + if (err) + return err; + + clkrate = clk_get_rate(ir->clk); + + clkdiv = clkrate * NEC_TIME_BASE; + do_div(clkdiv, 1000000); + + val = NEC_CAP(31) | GPIO_SEL(12) | clkdiv; + writel_relaxed(val, ir->rc5_base + IR_NEC_CTRL); + + clkdiv = clkrate * RC5_TIME_BASE; + do_div(clkdiv, 1000000); + + writel_relaxed(DISABLE_NEC, ir->rc5_base + IR_CTRL); + writel_relaxed(clkdiv, ir->rc5_base + IR_RC5_CLK_DIV); + writel_relaxed(ACK_IR_INT, ir->rc5_base + IR_INT); + + clkdiv = clkrate * RC6_TIME_BASE; + do_div(clkdiv, RC6_CARRIER); + + writel_relaxed(ACK_RC6_INT, ir->rc6_base + RC6_CTRL); + writel_relaxed((clkdiv >> 2) << 18 | clkdiv, ir->rc6_base + RC6_CLKDIV); + + err = devm_request_irq(dev, irq, tango_ir_irq, IRQF_SHARED, + dev_name(dev), ir); + if (err) + goto err_clk; + + err = devm_rc_register_device(dev, rc); + if (err) + goto err_clk; + + platform_set_drvdata(pdev, ir); + return 0; + +err_clk: + clk_disable_unprepare(ir->clk); + return err; +} + +static int tango_ir_remove(struct platform_device *pdev) +{ + struct tango_ir *ir = platform_get_drvdata(pdev); + + clk_disable_unprepare(ir->clk); + return 0; +} + +static const struct of_device_id tango_ir_dt_ids[] = { + { .compatible = "sigma,smp8642-ir" }, + { } +}; +MODULE_DEVICE_TABLE(of, tango_ir_dt_ids); + +static struct platform_driver tango_ir_driver = { + .probe = tango_ir_probe, + .remove = tango_ir_remove, + .driver = { + .name = DRIVER_NAME, + .of_match_table = tango_ir_dt_ids, + }, +}; +module_platform_driver(tango_ir_driver); + +MODULE_DESCRIPTION("SMP86xx IR decoder driver"); +MODULE_AUTHOR("Mans Rullgard "); +MODULE_LICENSE("GPL"); diff --git a/drivers/media/rc/winbond-cir.c b/drivers/media/rc/winbond-cir.c index 3ca7ab48293d..851acba9b436 100644 --- a/drivers/media/rc/winbond-cir.c +++ b/drivers/media/rc/winbond-cir.c @@ -989,8 +989,7 @@ wbcir_init_hw(struct wbcir_data *data) /* Clear RX state */ data->rxstate = WBCIR_RXSTATE_INACTIVE; - ir_raw_event_reset(data->dev); - ir_raw_event_set_idle(data->dev, true); + wbcir_idle_rx(data->dev, true); /* Clear TX state */ if (data->txstate == WBCIR_TXSTATE_ACTIVE) { @@ -1009,6 +1008,7 @@ wbcir_resume(struct pnp_dev *device) struct wbcir_data *data = pnp_get_drvdata(device); wbcir_init_hw(data); + ir_raw_event_reset(data->dev); enable_irq(data->irq); led_classdev_resume(&data->led); @@ -1044,7 +1044,7 @@ wbcir_probe(struct pnp_dev *device, const struct pnp_device_id *dev_id) data->irq = pnp_irq(device, 0); if (data->wbase == 0 || data->ebase == 0 || - data->sbase == 0 || data->irq == 0) { + data->sbase == 0 || data->irq == -1) { err = -ENODEV; dev_err(dev, "Invalid resources\n"); goto exit_free_data; diff --git a/drivers/media/usb/cx231xx/cx231xx-cards.c b/drivers/media/usb/cx231xx/cx231xx-cards.c index c30cb0fb165d..6f906b7b682c 100644 --- a/drivers/media/usb/cx231xx/cx231xx-cards.c +++ b/drivers/media/usb/cx231xx/cx231xx-cards.c @@ -847,7 +847,6 @@ struct cx231xx_board cx231xx_boards[] = { .demod_addr = 0x64, /* 0xc8 >> 1 */ .demod_i2c_master = I2C_1_MUX_3, .has_dvb = 1, - .ir_i2c_master = I2C_0, .norm = V4L2_STD_PAL, .output_mode = OUT_MODE_VIP11, .tuner_addr = 0x60, /* 0xc0 >> 1 */ diff --git a/drivers/media/usb/dvb-usb/cinergyT2-fe.c b/drivers/media/usb/dvb-usb/cinergyT2-fe.c index f9772ad0a2a5..5a2f81311fb7 100644 --- a/drivers/media/usb/dvb-usb/cinergyT2-fe.c +++ b/drivers/media/usb/dvb-usb/cinergyT2-fe.c @@ -26,7 +26,7 @@ #include "cinergyT2.h" -/** +/* * convert linux-dvb frontend parameter set into TPS. * See ETSI ETS-300744, section 4.6.2, table 9 for details. * diff --git a/drivers/media/usb/dvb-usb/dib0700_devices.c b/drivers/media/usb/dvb-usb/dib0700_devices.c index 0a65884cefe3..8bc4d4d1109d 100644 --- a/drivers/media/usb/dvb-usb/dib0700_devices.c +++ b/drivers/media/usb/dvb-usb/dib0700_devices.c @@ -1678,10 +1678,10 @@ static int dib8096_set_param_override(struct dvb_frontend *fe) return -EINVAL; } - /** Update PLL if needed ratio **/ + /* Update PLL if needed ratio */ state->dib8000_ops.update_pll(fe, &dib8090_pll_config_12mhz, fe->dtv_property_cache.bandwidth_hz / 1000, 0); - /** Get optimize PLL ratio to remove spurious **/ + /* Get optimize PLL ratio to remove spurious */ pll_ratio = dib8090_compute_pll_parameters(fe); if (pll_ratio == 17) timf = 21387946; @@ -1692,7 +1692,7 @@ static int dib8096_set_param_override(struct dvb_frontend *fe) else timf = 18179756; - /** Update ratio **/ + /* Update ratio */ state->dib8000_ops.update_pll(fe, &dib8090_pll_config_12mhz, fe->dtv_property_cache.bandwidth_hz / 1000, pll_ratio); state->dib8000_ops.ctrl_timf(fe, DEMOD_TIMF_SET, timf); @@ -3378,7 +3378,7 @@ static int novatd_sleep_override(struct dvb_frontend* fe) return state->sleep(fe); } -/** +/* * novatd_frontend_attach - Nova-TD specific attach * * Nova-TD has GPIO0, 1 and 2 for LEDs. So do not fiddle with them except for diff --git a/drivers/media/usb/dvb-usb/dvb-usb-remote.c b/drivers/media/usb/dvb-usb/dvb-usb-remote.c index 701c10835482..57ff5869144c 100644 --- a/drivers/media/usb/dvb-usb/dvb-usb-remote.c +++ b/drivers/media/usb/dvb-usb/dvb-usb-remote.c @@ -284,6 +284,7 @@ static int rc_core_dvb_usb_remote_init(struct dvb_usb_device *d) dev->input_phys = d->rc_phys; dev->dev.parent = &d->udev->dev; dev->priv = d; + dev->scancode_mask = d->props.rc.core.scancode_mask; err = rc_register_device(dev); if (err < 0) { diff --git a/drivers/media/usb/dvb-usb/dvb-usb.h b/drivers/media/usb/dvb-usb/dvb-usb.h index 21ab517417fc..f4269cebf2fe 100644 --- a/drivers/media/usb/dvb-usb/dvb-usb.h +++ b/drivers/media/usb/dvb-usb/dvb-usb.h @@ -208,6 +208,7 @@ struct dvb_rc { int (*rc_query) (struct dvb_usb_device *d); int rc_interval; bool bulk_mode; /* uses bulk mode */ + u32 scancode_mask; }; /** diff --git a/drivers/media/usb/dvb-usb/friio-fe.c b/drivers/media/usb/dvb-usb/friio-fe.c index 0251a4e91d47..0b108071197a 100644 --- a/drivers/media/usb/dvb-usb/friio-fe.c +++ b/drivers/media/usb/dvb-usb/friio-fe.c @@ -319,7 +319,7 @@ static int jdvbt90502_set_frontend(struct dvb_frontend *fe) } -/** +/* * (reg, val) commad list to initialize this module. * captured on a Windows box. */ diff --git a/drivers/media/usb/dvb-usb/friio.c b/drivers/media/usb/dvb-usb/friio.c index 62abe6c43a32..16875945e662 100644 --- a/drivers/media/usb/dvb-usb/friio.c +++ b/drivers/media/usb/dvb-usb/friio.c @@ -21,7 +21,7 @@ MODULE_PARM_DESC(debug, DVB_DEFINE_MOD_OPT_ADAPTER_NR(adapter_nr); -/** +/* * Indirect I2C access to the PLL via FE. * whole I2C protocol data to the PLL is sent via the FE's I2C register. * This is done by a control msg to the FE with the I2C data accompanied, and diff --git a/drivers/media/usb/dvb-usb/vp7045.c b/drivers/media/usb/dvb-usb/vp7045.c index 13340af0d39c..2527b88beb87 100644 --- a/drivers/media/usb/dvb-usb/vp7045.c +++ b/drivers/media/usb/dvb-usb/vp7045.c @@ -97,82 +97,22 @@ static int vp7045_power_ctrl(struct dvb_usb_device *d, int onoff) return vp7045_usb_op(d,SET_TUNER_POWER,&v,1,NULL,0,150); } -/* remote control stuff */ - -/* The keymapping struct. Somehow this should be loaded to the driver, but - * currently it is hardcoded. */ -static struct rc_map_table rc_map_vp7045_table[] = { - { 0x0016, KEY_POWER }, - { 0x0010, KEY_MUTE }, - { 0x0003, KEY_1 }, - { 0x0001, KEY_2 }, - { 0x0006, KEY_3 }, - { 0x0009, KEY_4 }, - { 0x001d, KEY_5 }, - { 0x001f, KEY_6 }, - { 0x000d, KEY_7 }, - { 0x0019, KEY_8 }, - { 0x001b, KEY_9 }, - { 0x0015, KEY_0 }, - { 0x0005, KEY_CHANNELUP }, - { 0x0002, KEY_CHANNELDOWN }, - { 0x001e, KEY_VOLUMEUP }, - { 0x000a, KEY_VOLUMEDOWN }, - { 0x0011, KEY_RECORD }, - { 0x0017, KEY_FAVORITES }, /* Heart symbol - Channel list. */ - { 0x0014, KEY_PLAY }, - { 0x001a, KEY_STOP }, - { 0x0040, KEY_REWIND }, - { 0x0012, KEY_FASTFORWARD }, - { 0x000e, KEY_PREVIOUS }, /* Recall - Previous channel. */ - { 0x004c, KEY_PAUSE }, - { 0x004d, KEY_SCREEN }, /* Full screen mode. */ - { 0x0054, KEY_AUDIO }, /* MTS - Switch to secondary audio. */ - { 0x000c, KEY_CANCEL }, /* Cancel */ - { 0x001c, KEY_EPG }, /* EPG */ - { 0x0000, KEY_TAB }, /* Tab */ - { 0x0048, KEY_INFO }, /* Preview */ - { 0x0004, KEY_LIST }, /* RecordList */ - { 0x000f, KEY_TEXT }, /* Teletext */ - { 0x0041, KEY_PREVIOUSSONG }, - { 0x0042, KEY_NEXTSONG }, - { 0x004b, KEY_UP }, - { 0x0051, KEY_DOWN }, - { 0x004e, KEY_LEFT }, - { 0x0052, KEY_RIGHT }, - { 0x004f, KEY_ENTER }, - { 0x0013, KEY_CANCEL }, - { 0x004a, KEY_CLEAR }, - { 0x0054, KEY_PRINT }, /* Capture */ - { 0x0043, KEY_SUBTITLE }, /* Subtitle/CC */ - { 0x0008, KEY_VIDEO }, /* A/V */ - { 0x0007, KEY_SLEEP }, /* Hibernate */ - { 0x0045, KEY_ZOOM }, /* Zoom+ */ - { 0x0018, KEY_RED}, - { 0x0053, KEY_GREEN}, - { 0x005e, KEY_YELLOW}, - { 0x005f, KEY_BLUE} -}; - -static int vp7045_rc_query(struct dvb_usb_device *d, u32 *event, int *state) +static int vp7045_rc_query(struct dvb_usb_device *d) { u8 key; - int i; vp7045_usb_op(d,RC_VAL_READ,NULL,0,&key,1,20); deb_rc("remote query key: %x %d\n",key,key); - if (key == 0x44) { - *state = REMOTE_NO_KEY_PRESSED; - return 0; + if (key != 0x44) { + /* + * The 8 bit address isn't available, but since the remote uses + * address 0 we'll use that. nec repeats are ignored too, even + * though the remote sends them. + */ + rc_keydown(d->rc_dev, RC_PROTO_NEC, RC_SCANCODE_NEC(0, key), 0); } - for (i = 0; i < ARRAY_SIZE(rc_map_vp7045_table); i++) - if (rc5_data(&rc_map_vp7045_table[i]) == key) { - *state = REMOTE_KEY_PRESSED; - *event = rc_map_vp7045_table[i].keycode; - break; - } return 0; } @@ -265,11 +205,13 @@ static struct dvb_usb_device_properties vp7045_properties = { .power_ctrl = vp7045_power_ctrl, .read_mac_address = vp7045_read_mac_addr, - .rc.legacy = { - .rc_interval = 400, - .rc_map_table = rc_map_vp7045_table, - .rc_map_size = ARRAY_SIZE(rc_map_vp7045_table), - .rc_query = vp7045_rc_query, + .rc.core = { + .rc_interval = 400, + .rc_codes = RC_MAP_TWINHAN_VP1027_DVBS, + .module_name = KBUILD_MODNAME, + .rc_query = vp7045_rc_query, + .allowed_protos = RC_PROTO_BIT_NEC, + .scancode_mask = 0xff, }, .num_device_descs = 2, diff --git a/drivers/media/usb/gspca/ov519.c b/drivers/media/usb/gspca/ov519.c index b51d2de1aca8..b78a2cf664fc 100644 --- a/drivers/media/usb/gspca/ov519.c +++ b/drivers/media/usb/gspca/ov519.c @@ -1,4 +1,4 @@ -/** +/* * OV519 driver * * Copyright (C) 2008-2011 Jean-François Moine diff --git a/drivers/media/usb/pwc/pwc-dec23.c b/drivers/media/usb/pwc/pwc-dec23.c index 3792fedff951..1283b3bd9800 100644 --- a/drivers/media/usb/pwc/pwc-dec23.c +++ b/drivers/media/usb/pwc/pwc-dec23.c @@ -649,11 +649,10 @@ static void DecompressBand23(struct pwc_dec23_private *pdec, } /** - * * Uncompress a pwc23 buffer. - * - * src: raw data - * dst: image output + * @pdev: pointer to pwc device's internal struct + * @src: raw data + * @dst: image output */ void pwc_dec23_decompress(struct pwc_device *pdev, const void *src, diff --git a/drivers/media/usb/ttusb-budget/dvb-ttusb-budget.c b/drivers/media/usb/ttusb-budget/dvb-ttusb-budget.c index b842f367249f..a142b9dc0feb 100644 --- a/drivers/media/usb/ttusb-budget/dvb-ttusb-budget.c +++ b/drivers/media/usb/ttusb-budget/dvb-ttusb-budget.c @@ -76,7 +76,7 @@ DVB_DEFINE_MOD_OPT_ADAPTER_NR(adapter_nr); #define TTUSB_REV_2_2 0x22 #define TTUSB_BUDGET_NAME "ttusb_stc_fw" -/** +/* * since we're casting (struct ttusb*) <-> (struct dvb_demux*) around * the dvb_demux field must be the first in struct!! */ @@ -713,7 +713,7 @@ static void ttusb_process_frame(struct ttusb *ttusb, u8 * data, int len) } } - /** + /* * if length is valid and we reached the end: * goto next muxpack */ @@ -729,7 +729,7 @@ static void ttusb_process_frame(struct ttusb *ttusb, u8 * data, int len) /* maximum bytes, until we know the length */ ttusb->muxpack_len = 2; - /** + /* * no muxpacks left? * return to search-sync state */ diff --git a/drivers/misc/memory_state_time.c b/drivers/misc/memory_state_time.c index ba94dcf09169..3e449bfced41 100644 --- a/drivers/misc/memory_state_time.c +++ b/drivers/misc/memory_state_time.c @@ -129,7 +129,7 @@ static ssize_t show_stat_show(struct kobject *kobj, } } } - pr_debug("Current Time: %llu\n", ktime_get_boot_ns()); + pr_debug("Current Time: %llu\n", ktime_get_boottime_ns()); return len; } KERNEL_ATTR_RO(show_stat); @@ -212,7 +212,7 @@ static void memory_state_freq_update(struct memory_state_update_block *ub, return; INIT_WORK(&freq_container->update_state, freq_update_do_work); - freq_container->time_now = ktime_get_boot_ns(); + freq_container->time_now = ktime_get_boottime_ns(); freq_container->value = value; pr_debug("Scheduling freq update in work queue\n"); queue_work(memory_wq, &freq_container->update_state); @@ -234,7 +234,7 @@ static void memory_state_bw_update(struct memory_state_update_block *ub, return; INIT_WORK(&bw_container->update_state, bw_update_do_work); - bw_container->time_now = ktime_get_boot_ns(); + bw_container->time_now = ktime_get_boottime_ns(); bw_container->value = value; bw_container->id = ub->id; pr_debug("Scheduling bandwidth update in work queue\n"); @@ -400,7 +400,7 @@ static int memory_state_time_probe(struct platform_device *pdev) error = freq_buckets_init(&pdev->dev); if (error) return error; - last_update = ktime_get_boot_ns(); + last_update = ktime_get_boottime_ns(); init_success = true; pr_debug("memory_state_time initialized with num_freqs %d\n", diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index df6e76e5d414..0f962cb2ca2e 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -7830,8 +7830,6 @@ static void bnxt_remove_one(struct pci_dev *pdev) bnxt_dcb_free(bp); kfree(bp->edev); bp->edev = NULL; - if (bp->xdp_prog) - bpf_prog_put(bp->xdp_prog); bnxt_cleanup_pci(bp); free_netdev(dev); } diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index ba0869bd60e0..476fc871c585 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -94,6 +94,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons, xdp.data_hard_start = *data_ptr - offset; xdp.data = *data_ptr; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = *data_ptr + *len; orig_data = xdp.data; mapping = rx_buf->mapping - bp->rx_dma_offset; @@ -217,7 +218,6 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp) rc = bnxt_xdp_set(bp, xdp->prog); break; case XDP_QUERY_PROG: - xdp->prog_attached = !!bp->xdp_prog; xdp->prog_id = bp->xdp_prog ? bp->xdp_prog->aux->id : 0; rc = 0; break; diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c index 79e476cdd97d..82791bfb3c05 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c @@ -541,6 +541,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog, xdp.data_hard_start = page_address(page); xdp.data = (void *)cpu_addr; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + len; orig_data = xdp.data; @@ -1787,7 +1788,6 @@ static int nicvf_xdp(struct net_device *netdev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: return nicvf_xdp_setup(nic, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = !!nic->xdp_prog; xdp->prog_id = nic->xdp_prog ? nic->xdp_prog->aux->id : 0; return 0; default: diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 751c931fe184..ca08e13a898a 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -9662,7 +9662,6 @@ static int i40e_xdp(struct net_device *dev, case XDP_SETUP_PROG: return i40e_xdp_setup(vsi, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = i40e_enabled_xdp_vsi(vsi); xdp->prog_id = vsi->xdp_prog ? vsi->xdp_prog->aux->id : 0; return 0; default: diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index 02871e0e2024..22fb4afb137a 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -2117,6 +2117,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget) if (!skb) { xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset; + xdp_set_data_meta_invalid(&xdp); xdp.data_hard_start = xdp.data - i40e_rx_offset(rx_ring); xdp.data_end = xdp.data + size; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 64b8cc5c6283..47c14dd3bf81 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -2339,6 +2339,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector, if (!skb) { xdp.data = page_address(rx_buffer->page) + rx_buffer->page_offset; + xdp_set_data_meta_invalid(&xdp); xdp.data_hard_start = xdp.data - ixgbe_rx_offset(rx_ring); xdp.data_end = xdp.data + size; @@ -9893,7 +9894,6 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: return ixgbe_xdp_setup(dev, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = !!(adapter->xdp_prog); xdp->prog_id = adapter->xdp_prog ? adapter->xdp_prog->aux->id : 0; return 0; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 1bc7e3497a1a..59cf6493b261 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -2937,7 +2937,6 @@ static int mlx4_xdp(struct net_device *dev, struct netdev_bpf *xdp) return mlx4_xdp_set(dev, xdp->prog); case XDP_QUERY_PROG: xdp->prog_id = mlx4_xdp_query(dev); - xdp->prog_attached = !!xdp->prog_id; return 0; default: return -EINVAL; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index bb0063e851c3..0fbed144c597 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -780,6 +780,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud xdp.data_hard_start = va - frags[0].page_offset; xdp.data = va; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + length; orig_data = xdp.data; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig index 576b61c119bb..69c4b98bd2f3 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig @@ -30,6 +30,7 @@ config MLX5_CORE_EN depends on NETDEVICES && ETHERNET && INET && PCI && MLX5_CORE depends on IPV6=y || IPV6=n || MLX5_CORE=m imply PTP_1588_CLOCK + select PAGE_POOL default n ---help--- Ethernet support in Mellanox Technologies ConnectX-4 NIC. diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 707c87f9987c..307c067cd51d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -3742,7 +3742,6 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp) return mlx5e_xdp_set(dev, xdp->prog); case XDP_QUERY_PROG: xdp->prog_id = mlx5e_xdp_query(dev); - xdp->prog_attached = !!xdp->prog_id; return 0; default: return -EINVAL; @@ -4174,9 +4173,6 @@ static void mlx5e_nic_cleanup(struct mlx5e_priv *priv) { mlx5e_ipsec_cleanup(priv); mlx5e_vxlan_cleanup(priv); - - if (priv->channels.params.xdp_prog) - bpf_prog_put(priv->channels.params.xdp_prog); } static int mlx5e_init_nic_rx(struct mlx5e_priv *priv) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index bf311a3c3e02..ce96230b98ed 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -845,6 +845,7 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq, return false; xdp.data = va + *rx_headroom; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + *len; xdp.data_hard_start = va; diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index f6050b13ff1b..95f39434f304 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1614,6 +1614,7 @@ static int nfp_net_run_xdp(struct bpf_prog *prog, void *data, void *hard_start, xdp.data_hard_start = hard_start; xdp.data = data + *off; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = data + *off + *len; orig_data = xdp.data; @@ -3432,10 +3433,8 @@ static int nfp_net_xdp(struct net_device *netdev, struct netdev_bpf *xdp) return nfp_net_xdp_setup(nn, xdp->prog, xdp->flags, xdp->extack); case XDP_QUERY_PROG: - xdp->prog_attached = !!nn->xdp_prog; - if (nn->dp.bpf_offload_xdp) - xdp->prog_attached = XDP_ATTACHED_HW; xdp->prog_id = nn->xdp_prog ? nn->xdp_prog->aux->id : 0; + xdp->flags = nn->xdp_prog ? nn->xdp_flags : 0; return 0; default: return -EINVAL; diff --git a/drivers/net/ethernet/qlogic/qede/qede.h b/drivers/net/ethernet/qlogic/qede/qede.h index 4cc9af175a76..b3cb1d89f1e9 100644 --- a/drivers/net/ethernet/qlogic/qede/qede.h +++ b/drivers/net/ethernet/qlogic/qede/qede.h @@ -40,6 +40,7 @@ #include #include #include +#include #include #include #ifdef CONFIG_RFS_ACCEL @@ -347,6 +348,7 @@ struct qede_rx_queue { u64 xdp_no_pass; void *handle; + struct xdp_rxq_info xdp_rxq; }; union db_prod { diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c b/drivers/net/ethernet/qlogic/qede/qede_filter.c index 924cb2ea664d..6a30751a7c84 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_filter.c +++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c @@ -1073,7 +1073,6 @@ int qede_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: return qede_xdp_set(edev, xdp->prog); case XDP_QUERY_PROG: - xdp->prog_attached = !!edev->xdp_prog; xdp->prog_id = edev->xdp_prog ? edev->xdp_prog->aux->id : 0; return 0; default: diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c index 8a8e1616b79a..85b71cdb78e8 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_fp.c +++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c @@ -1002,7 +1002,9 @@ static bool qede_rx_xdp(struct qede_dev *edev, xdp.data_hard_start = page_address(bd->data); xdp.data = xdp.data_hard_start + *data_offset; + xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + *len; + xdp.rxq = &rxq->xdp_rxq; /* Queues always have a full reset currently, so for the time * being until there's atomic program replace just mark read diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c index a2da52362d09..cf01fdfda5ff 100644 --- a/drivers/net/ethernet/qlogic/qede/qede_main.c +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c @@ -769,6 +769,12 @@ static void qede_free_fp_array(struct qede_dev *edev) fp = &edev->fp_array[i]; kfree(fp->sb_info); + /* Handle mem alloc failure case where qede_init_fp + * didn't register xdp_rxq_info yet. + * Implicit only (fp->type & QEDE_FASTPATH_RX) + */ + if (fp->rxq && xdp_rxq_info_is_reg(&fp->rxq->xdp_rxq)) + xdp_rxq_info_unreg(&fp->rxq->xdp_rxq); kfree(fp->rxq); kfree(fp->xdp_tx); kfree(fp->txq); @@ -1083,10 +1089,6 @@ static void __qede_remove(struct pci_dev *pdev, enum qede_remove_mode mode) pci_set_drvdata(pdev, NULL); - /* Release edev's reference to XDP's bpf if such exist */ - if (edev->xdp_prog) - bpf_prog_put(edev->xdp_prog); - /* Use global ops since we've freed edev */ qed_ops->common->slowpath_stop(cdev); if (system_state == SYSTEM_POWER_OFF) @@ -1517,6 +1519,10 @@ static void qede_init_fp(struct qede_dev *edev) else fp->rxq->data_direction = DMA_FROM_DEVICE; fp->rxq->dev = &edev->pdev->dev; + + /* Driver have no error path from here */ + WARN_ON(xdp_rxq_info_reg(&fp->rxq->xdp_rxq, edev->ndev, + fp->rxq->rxq_id) < 0); } if (fp->type & QEDE_FASTPATH_TX) { diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index a152b2902788..29badd77c8b9 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -234,7 +234,8 @@ static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs, } /* Update tunnel dst according to Geneve options. */ ip_tunnel_info_opts_set(&tun_dst->u.tun_info, - gnvh->options, gnvh->opt_len * 4); + gnvh->options, gnvh->opt_len * 4, + TUNNEL_GENEVE_OPT); } else { /* Drop packets w/ critical options, * since we don't support any... @@ -691,7 +692,8 @@ static void geneve_build_header(struct genevehdr *geneveh, geneveh->proto_type = htons(ETH_P_TEB); geneveh->rsvd2 = 0; - ip_tunnel_info_opts_get(geneveh->options, info); + if (info->key.tun_flags & TUNNEL_GENEVE_OPT) + ip_tunnel_info_opts_get(geneveh->options, info); } static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb, diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c index ab9beca09ca5..df8c6d0dae1f 100644 --- a/drivers/net/ipvlan/ipvlan_core.c +++ b/drivers/net/ipvlan/ipvlan_core.c @@ -776,7 +776,8 @@ struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb, }; skb_dst_drop(skb); - dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags); + dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, + skb, flags); skb_dst_set(skb, dst); break; } diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 757dff1c7216..ece060edc280 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -632,7 +632,6 @@ static void tun_detach(struct tun_file *tfile, bool clean) static void tun_detach_all(struct net_device *dev) { struct tun_struct *tun = netdev_priv(dev); - struct bpf_prog *xdp_prog = rtnl_dereference(tun->xdp_prog); struct tun_file *tfile, *tmp; int i, n = tun->numqueues; @@ -665,9 +664,6 @@ static void tun_detach_all(struct net_device *dev) } BUG_ON(tun->numdisabled != 0); - if (xdp_prog) - bpf_prog_put(xdp_prog); - if (tun->flags & IFF_PERSIST) module_put(THIS_MODULE); } @@ -1084,7 +1080,6 @@ static int tun_xdp(struct net_device *dev, struct netdev_bpf *xdp) return tun_xdp_set(dev, xdp->prog, xdp->extack); case XDP_QUERY_PROG: xdp->prog_id = tun_xdp_query(dev); - xdp->prog_attached = !!xdp->prog_id; return 0; default: return -EINVAL; diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 339d6c0b162a..91871e8bfd47 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2104,7 +2104,6 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) return virtnet_xdp_set(dev, xdp->prog, xdp->extack); case XDP_QUERY_PROG: xdp->prog_id = virtnet_xdp_query(dev); - xdp->prog_attached = !!xdp->prog_id; return 0; default: return -EINVAL; diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 5c7c608e24c9..1b5cce954072 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -995,6 +995,7 @@ static struct rt6_info *vrf_ip6_route_lookup(struct net *net, const struct net_device *dev, struct flowi6 *fl6, int ifindex, + const struct sk_buff *skb, int flags) { struct net_vrf *vrf = netdev_priv(dev); @@ -1013,7 +1014,7 @@ static struct rt6_info *vrf_ip6_route_lookup(struct net *net, if (!table) return NULL; - return ip6_pol_route(net, table, ifindex, fl6, flags); + return ip6_pol_route(net, table, ifindex, fl6, skb, flags); } static void vrf_ip6_input_dst(struct sk_buff *skb, struct net_device *vrf_dev, @@ -1031,7 +1032,7 @@ static void vrf_ip6_input_dst(struct sk_buff *skb, struct net_device *vrf_dev, struct net *net = dev_net(vrf_dev); struct rt6_info *rt6; - rt6 = vrf_ip6_route_lookup(net, vrf_dev, &fl6, ifindex, + rt6 = vrf_ip6_route_lookup(net, vrf_dev, &fl6, ifindex, skb, RT6_LOOKUP_F_HAS_SADDR | RT6_LOOKUP_F_IFACE); if (unlikely(!rt6)) return; @@ -1164,7 +1165,7 @@ static struct dst_entry *vrf_link_scope_lookup(const struct net_device *dev, if (!ipv6_addr_any(&fl6->saddr)) flags |= RT6_LOOKUP_F_HAS_SADDR; - rt = vrf_ip6_route_lookup(net, dev, fl6, fl6->flowi6_oif, flags); + rt = vrf_ip6_route_lookup(net, dev, fl6, fl6->flowi6_oif, NULL, flags); if (rt) dst = &rt->dst; @@ -1199,6 +1200,7 @@ static inline size_t vrf_fib_rule_nl_size(void) sz = NLMSG_ALIGN(sizeof(struct fib_rule_hdr)); sz += nla_total_size(sizeof(u8)); /* FRA_L3MDEV */ sz += nla_total_size(sizeof(u32)); /* FRA_PRIORITY */ + sz += nla_total_size(sizeof(u8)); /* FRA_PROTOCOL */ return sz; } @@ -1229,6 +1231,9 @@ static int vrf_fib_rule(const struct net_device *dev, __u8 family, bool add_it) frh->family = family; frh->action = FR_ACT_TO_TBL; + if (nla_put_u8(skb, FRA_PROTOCOL, RTPROT_KERNEL)) + goto nla_put_failure; + if (nla_put_u8(skb, FRA_L3MDEV, 1)) goto nla_put_failure; diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 3aa49417cdfd..768394d96538 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -2181,7 +2181,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev, vni = tunnel_id_to_key32(info->key.tun_id); ifindex = 0; dst_cache = &info->dst_cache; - if (info->options_len) { + if (info->options_len && + info->key.tun_flags & TUNNEL_VXLAN_OPT) { if (info->options_len < sizeof(*md)) goto drop; md = ip_tunnel_info_opts(info); diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/rx.c b/drivers/net/wireless/intel/iwlwifi/mvm/rx.c index c31303d13069..f8a481818031 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/rx.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/rx.c @@ -489,7 +489,7 @@ void iwl_mvm_rx_rx_mpdu(struct iwl_mvm *mvm, struct napi_struct *napi, if (unlikely(ieee80211_is_beacon(hdr->frame_control) || ieee80211_is_probe_resp(hdr->frame_control))) - rx_status->boottime_ns = ktime_get_boot_ns(); + rx_status->boottime_ns = ktime_get_boottime_ns(); /* Take a reference briefly to kick off a d0i3 entry delay so * we can handle bursts of RX packets without toggling the diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c b/drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c index 1a12e829e98b..aff8b2ddd5d1 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c @@ -1046,7 +1046,7 @@ void iwl_mvm_rx_mpdu_mq(struct iwl_mvm *mvm, struct napi_struct *napi, if (unlikely(ieee80211_is_beacon(hdr->frame_control) || ieee80211_is_probe_resp(hdr->frame_control))) - rx_status->boottime_ns = ktime_get_boot_ns(); + rx_status->boottime_ns = ktime_get_boottime_ns(); } if (iwl_mvm_create_skb(mvm, skb, hdr, len, crypt_len, rxb)) { diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/utils.c b/drivers/net/wireless/intel/iwlwifi/mvm/utils.c index 3303fc85d76f..0b307410c640 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/utils.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/utils.c @@ -1409,7 +1409,7 @@ void iwl_mvm_get_sync_time(struct iwl_mvm *mvm, u32 *gp2, u64 *boottime) } *gp2 = iwl_read_prph(mvm->trans, DEVICE_SYSTEM_TIME_REG); - *boottime = ktime_get_boot_ns(); + *boottime = ktime_get_boottime_ns(); if (!ps_disabled) { mvm->ps_disabled = ps_disabled; diff --git a/drivers/net/wireless/mac80211_hwsim.c b/drivers/net/wireless/mac80211_hwsim.c index d3e946de86b5..7f11f420129f 100644 --- a/drivers/net/wireless/mac80211_hwsim.c +++ b/drivers/net/wireless/mac80211_hwsim.c @@ -1343,10 +1343,12 @@ static bool mac80211_hwsim_tx_frame_no_nl(struct ieee80211_hw *hw, * probably doesn't really matter. */ if (ieee80211_is_beacon(hdr->frame_control) || - ieee80211_is_probe_resp(hdr->frame_control)) + ieee80211_is_probe_resp(hdr->frame_control)) { + rx_status.boottime_ns = ktime_get_boottime_ns(); now = data->abs_bcn_ts; - else + } else { now = mac80211_hwsim_get_tsf_raw(); + } /* Copy skb to all enabled radios that are on the current frequency */ spin_lock(&hwsim_radio_lock); diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h index 1913b51c1e80..880110b3df09 100644 --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h @@ -1,15 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright (c) 2014 - 2017 Jes Sorensen * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. - * * Register definitions taken from original Realtek rtl8723au driver */ diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192c.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192c.c index a41a29612582..27c4cb688be4 100644 --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192c.c +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192c.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * RTL8XXXU mac80211 USB driver - 8188c/8188r/8192c specific subdriver * @@ -10,15 +11,6 @@ * rtl8723au driver. As the Realtek 8xxx chips are very similar in * their programming interface, I have started adding support for * additional 8xxx chips like the 8192cu, 8188cus, etc. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c index 413f0ced960a..f6af1d401a31 100644 --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8192e.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * RTL8XXXU mac80211 USB driver - 8192e specific subdriver * @@ -10,15 +11,6 @@ * rtl8723au driver. As the Realtek 8xxx chips are very similar in * their programming interface, I have started adding support for * additional 8xxx chips like the 8192cu, 8188cus, etc. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723a.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723a.c index 174631132b96..4f93f88716a9 100644 --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723a.c +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723a.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * RTL8XXXU mac80211 USB driver - 8723a specific subdriver * @@ -10,15 +11,6 @@ * rtl8723au driver. As the Realtek 8xxx chips are very similar in * their programming interface, I have started adding support for * additional 8xxx chips like the 8192cu, 8188cus, etc. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c index 27e97df996c7..22a76910b785 100644 --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_8723b.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * RTL8XXXU mac80211 USB driver - 8723b specific subdriver * @@ -10,15 +11,6 @@ * rtl8723au driver. As the Realtek 8xxx chips are very similar in * their programming interface, I have started adding support for * additional 8xxx chips like the 8192cu, 8188cus, etc. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c index 9263a6a64788..305f9934b451 100644 --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * RTL8XXXU mac80211 USB driver * @@ -10,15 +11,6 @@ * rtl8723au driver. As the Realtek 8xxx chips are very similar in * their programming interface, I have started adding support for * additional 8xxx chips like the 8192cu, 8188cus, etc. - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include diff --git a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_regs.h b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_regs.h index 3d3e2e1ada6f..a2a31f374a82 100644 --- a/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_regs.h +++ b/drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_regs.h @@ -1,15 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright (c) 2014 - 2017 Jes Sorensen * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. - * * Register definitions taken from original Realtek rtl8723au driver */ diff --git a/drivers/net/wireless/ti/wlcore/main.c b/drivers/net/wireless/ti/wlcore/main.c index 9f568034deb3..e21fe7c6c718 100644 --- a/drivers/net/wireless/ti/wlcore/main.c +++ b/drivers/net/wireless/ti/wlcore/main.c @@ -388,7 +388,6 @@ static void wl12xx_irq_update_links_status(struct wl1271 *wl, static int wlcore_fw_status(struct wl1271 *wl, struct wl_fw_status *status) { struct wl12xx_vif *wlvif; - struct timespec ts; u32 old_tx_blk_count = wl->tx_blocks_available; int avail, freed_blocks; int i; @@ -485,8 +484,7 @@ static int wlcore_fw_status(struct wl1271 *wl, struct wl_fw_status *status) } /* update the host-chipset time offset */ - getnstimeofday(&ts); - wl->time_offset = (timespec_to_ns(&ts) >> 10) - + wl->time_offset = (ktime_get_boottime_ns() >> 10) - (s64)(status->fw_localtime); wl->fw_fast_lnk_map = status->link_fast_bitmap; diff --git a/drivers/net/wireless/ti/wlcore/rx.c b/drivers/net/wireless/ti/wlcore/rx.c index 078a4940bc5c..cb4b17bfba23 100644 --- a/drivers/net/wireless/ti/wlcore/rx.c +++ b/drivers/net/wireless/ti/wlcore/rx.c @@ -107,7 +107,7 @@ static void wl1271_rx_status(struct wl1271 *wl, } if (beacon || probe_rsp) - status->boottime_ns = ktime_get_boot_ns(); + status->boottime_ns = ktime_get_boottime_ns(); if (beacon) wlcore_set_pending_regdomain_ch(wl, (u16)desc->channel, diff --git a/drivers/net/wireless/ti/wlcore/tx.c b/drivers/net/wireless/ti/wlcore/tx.c index a3f5e9ca492a..1d12b293c8a5 100644 --- a/drivers/net/wireless/ti/wlcore/tx.c +++ b/drivers/net/wireless/ti/wlcore/tx.c @@ -264,7 +264,6 @@ static void wl1271_tx_fill_hdr(struct wl1271 *wl, struct wl12xx_vif *wlvif, struct sk_buff *skb, u32 extra, struct ieee80211_tx_info *control, u8 hlid) { - struct timespec ts; struct wl1271_tx_hw_descr *desc; int ac, rate_idx; s64 hosttime; @@ -287,8 +286,7 @@ static void wl1271_tx_fill_hdr(struct wl1271 *wl, struct wl12xx_vif *wlvif, } /* configure packet life time */ - getnstimeofday(&ts); - hosttime = (timespec_to_ns(&ts) >> 10); + hosttime = (ktime_get_boottime_ns() >> 10); desc->start_time = cpu_to_le32(hosttime - wl->time_offset); is_dummy = wl12xx_is_dummy_packet(wl, skb); diff --git a/drivers/net/wireless/virt_wifi.c b/drivers/net/wireless/virt_wifi.c index c06b11c84732..ff4837a1dc7d 100644 --- a/drivers/net/wireless/virt_wifi.c +++ b/drivers/net/wireless/virt_wifi.c @@ -174,7 +174,7 @@ static void virt_wifi_scan_result(struct work_struct *work) scan_result.work); struct wiphy *wiphy = priv_to_wiphy(priv); struct cfg80211_scan_info scan_info = { .aborted = false }; - u64 tsf = div_u64(ktime_get_boot_ns(), 1000); + u64 tsf = div_u64(ktime_get_boottime_ns(), 1000); informed_bss = cfg80211_inform_bss(wiphy, &channel_5ghz, CFG80211_BSS_FTYPE_PRESP, diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c index 76a74e292fd7..c2d17c957cf8 100644 --- a/drivers/nvdimm/btt_devs.c +++ b/drivers/nvdimm/btt_devs.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c index 17d54cabcbc6..291ccf717a64 100644 --- a/drivers/nvdimm/bus.c +++ b/drivers/nvdimm/bus.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c index 32f2aaf62f27..85380f9d6efa 100644 --- a/drivers/nvdimm/claim.c +++ b/drivers/nvdimm/claim.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c index bb71f0cf8f5d..2244b4c478f1 100644 --- a/drivers/nvdimm/core.c +++ b/drivers/nvdimm/core.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvdimm/dax_devs.c b/drivers/nvdimm/dax_devs.c index 0e9e37410c58..16ed08d76fff 100644 --- a/drivers/nvdimm/dax_devs.c +++ b/drivers/nvdimm/dax_devs.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2016 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c index 0939f064054d..a6ba9e65e2f6 100644 --- a/drivers/nvdimm/dimm.c +++ b/drivers/nvdimm/dimm.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c index 58e0dcfeb724..1674e9280c74 100644 --- a/drivers/nvdimm/dimm_devs.c +++ b/drivers/nvdimm/dimm_devs.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c index 84eb510456fd..43194f1a7e2d 100644 --- a/drivers/nvdimm/label.c +++ b/drivers/nvdimm/label.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvdimm/label.h b/drivers/nvdimm/label.h index 9ed772db6900..2d374af0429d 100644 --- a/drivers/nvdimm/label.h +++ b/drivers/nvdimm/label.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef __LABEL_H__ #define __LABEL_H__ diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c index 6ed3b4ed27dd..0182947d7c89 100644 --- a/drivers/nvdimm/namespace_devs.c +++ b/drivers/nvdimm/namespace_devs.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h index f7b0c39ac339..6a71b083eaf7 100644 --- a/drivers/nvdimm/nd-core.h +++ b/drivers/nvdimm/nd-core.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef __ND_CORE_H__ #define __ND_CORE_H__ diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h index cf30b49481a1..024ace14ea39 100644 --- a/drivers/nvdimm/nd.h +++ b/drivers/nvdimm/nd.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef __ND_H__ #define __ND_H__ diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index e2af91a91c22..cd1d0f88329c 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2016 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c index 034f0a07d627..d29084736f5a 100644 --- a/drivers/nvdimm/region.c +++ b/drivers/nvdimm/region.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c index c0e6a6d235de..57711462fa23 100644 --- a/drivers/nvdimm/region_devs.c +++ b/drivers/nvdimm/region_devs.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/drivers/nvmem/meson-efuse.c b/drivers/nvmem/meson-efuse.c index 70bfc9839bb2..e4d821ba964d 100644 --- a/drivers/nvmem/meson-efuse.c +++ b/drivers/nvmem/meson-efuse.c @@ -1,17 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Amlogic eFuse Driver * * Copyright (c) 2016 Endless Computers, Inc. * Author: Carlo Caione - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include diff --git a/drivers/nvmem/rockchip-efuse.c b/drivers/nvmem/rockchip-efuse.c index 63e3eb55f3ac..c7e5e253860f 100644 --- a/drivers/nvmem/rockchip-efuse.c +++ b/drivers/nvmem/rockchip-efuse.c @@ -1,17 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Rockchip eFuse Driver * * Copyright (c) 2015 Rockchip Electronics Co. Ltd. * Author: Caesar Wang - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - * more details. */ #include diff --git a/drivers/staging/media/Kconfig b/drivers/staging/media/Kconfig index f8c25ee082ef..3a09140700e6 100644 --- a/drivers/staging/media/Kconfig +++ b/drivers/staging/media/Kconfig @@ -31,7 +31,4 @@ source "drivers/staging/media/imx/Kconfig" source "drivers/staging/media/omap4iss/Kconfig" -# Keep LIRC at the end, as it has sub-menus -source "drivers/staging/media/lirc/Kconfig" - endif diff --git a/drivers/staging/media/Makefile b/drivers/staging/media/Makefile index be732cf932fd..f25327163c67 100644 --- a/drivers/staging/media/Makefile +++ b/drivers/staging/media/Makefile @@ -2,7 +2,6 @@ obj-$(CONFIG_I2C_BCM2048) += bcm2048/ obj-$(CONFIG_DVB_CXD2099) += cxd2099/ obj-$(CONFIG_VIDEO_IMX_MEDIA) += imx/ -obj-$(CONFIG_LIRC_STAGING) += lirc/ obj-$(CONFIG_VIDEO_DM365_VPFE) += davinci_vpfe/ obj-$(CONFIG_VIDEO_OMAP4) += omap4iss/ obj-$(CONFIG_INTEL_ATOMISP) += atomisp/ diff --git a/drivers/staging/media/lirc/Kconfig b/drivers/staging/media/lirc/Kconfig deleted file mode 100644 index 3e350a9922de..000000000000 --- a/drivers/staging/media/lirc/Kconfig +++ /dev/null @@ -1,21 +0,0 @@ -# -# LIRC driver(s) configuration -# -menuconfig LIRC_STAGING - bool "Linux Infrared Remote Control IR receiver/transmitter drivers" - depends on LIRC - help - Say Y here, and all supported Linux Infrared Remote Control IR and - RF receiver and transmitter drivers will be displayed. When paired - with a remote control and the lirc daemon, the receiver drivers - allow control of your Linux system via remote control. - -if LIRC_STAGING - -config LIRC_ZILOG - tristate "Zilog/Hauppauge IR Transmitter" - depends on LIRC && I2C - help - Driver for the Zilog/Hauppauge IR Transmitter, found on - PVR-150/500, HVR-1200/1250/1700/1800, HD-PVR and other cards -endif diff --git a/drivers/staging/media/lirc/Makefile b/drivers/staging/media/lirc/Makefile deleted file mode 100644 index 665562436e30..000000000000 --- a/drivers/staging/media/lirc/Makefile +++ /dev/null @@ -1,6 +0,0 @@ -# Makefile for the lirc drivers. -# - -# Each configuration option enables a list of files. - -obj-$(CONFIG_LIRC_ZILOG) += lirc_zilog.o diff --git a/drivers/staging/media/lirc/TODO b/drivers/staging/media/lirc/TODO deleted file mode 100644 index a97800a8e127..000000000000 --- a/drivers/staging/media/lirc/TODO +++ /dev/null @@ -1,36 +0,0 @@ -1. Both ir-kbd-i2c and lirc_zilog provide support for RX events for -the chips supported by lirc_zilog. Before moving lirc_zilog out of staging: - -a. ir-kbd-i2c needs a module parameter added to allow the user to tell - ir-kbd-i2c to ignore Z8 IR units. - -b. lirc_zilog should provide Rx key presses to the rc core like ir-kbd-i2c - does. - - -2. lirc_zilog module ref-counting need examination. It has not been -verified that cdev and lirc_dev will take the proper module references on -lirc_zilog to prevent removal of lirc_zilog when the /dev/lircN device node -is open. - -(The good news is ref-counting of lirc_zilog internal structures appears to be -complete. Testing has shown the cx18 module can be unloaded out from under -irw + lircd + lirc_dev, with the /dev/lirc0 device node open, with no adverse -effects. The cx18 module could then be reloaded and irw properly began -receiving button presses again and ir_send worked without error.) - - -3. Bridge drivers, if able, should provide a chip reset() callback -to lirc_zilog via struct IR_i2c_init_data. cx18 and ivtv already have routines -to perform Z8 chip resets via GPIO manipulations. This would allow lirc_zilog -to bring the chip back to normal when it hangs, in the same places the -original lirc_pvr150 driver code does. This is not strictly needed, so it -is not required to move lirc_zilog out of staging. - -Note: Both lirc_zilog and ir-kbd-i2c support the Zilog Z8 for IR, as programmed -and installed on Hauppauge products. When working on either module, developers -must consider at least the following bridge drivers which mention an IR Rx unit -at address 0x71 (indicative of a Z8): - - ivtv cx18 hdpvr pvrusb2 bt8xx cx88 saa7134 - diff --git a/drivers/staging/media/lirc/lirc_zilog.c b/drivers/staging/media/lirc/lirc_zilog.c deleted file mode 100644 index e35e1b2160e3..000000000000 --- a/drivers/staging/media/lirc/lirc_zilog.c +++ /dev/null @@ -1,1683 +0,0 @@ -/* - * i2c IR lirc driver for devices with zilog IR processors - * - * Copyright (c) 2000 Gerd Knorr - * modified for PixelView (BT878P+W/FM) by - * Michal Kochanowicz - * Christoph Bartelmus - * modified for KNC ONE TV Station/Anubis Typhoon TView Tuner by - * Ulrich Mueller - * modified for Asus TV-Box and Creative/VisionTek BreakOut-Box by - * Stefan Jahn - * modified for inclusion into kernel sources by - * Jerome Brock - * modified for Leadtek Winfast PVR2000 by - * Thomas Reitmayr (treitmayr@yahoo.com) - * modified for Hauppauge PVR-150 IR TX device by - * Mark Weaver - * changed name from lirc_pvr150 to lirc_zilog, works on more than pvr-150 - * Jarod Wilson - * - * parts are cut&pasted from the lirc_i2c.c driver - * - * Numerous changes updating lirc_zilog.c in kernel 2.6.38 and later are - * Copyright (C) 2011 Andy Walls - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - * - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include - -#include -#include - -/* Max transfer size done by I2C transfer functions */ -#define MAX_XFER_SIZE 64 - -struct IR; - -struct IR_rx { - struct kref ref; - struct IR *ir; - - /* RX device */ - struct mutex client_lock; - struct i2c_client *c; - - /* RX polling thread data */ - struct task_struct *task; - - /* RX read data */ - unsigned char b[3]; - bool hdpvr_data_fmt; -}; - -struct IR_tx { - struct kref ref; - struct IR *ir; - - /* TX device */ - struct mutex client_lock; - struct i2c_client *c; - - /* TX additional actions needed */ - int need_boot; - bool post_tx_ready_poll; -}; - -struct IR { - struct kref ref; - struct list_head list; - - /* FIXME spinlock access to l.features */ - struct lirc_driver l; - struct lirc_buffer rbuf; - - struct mutex ir_lock; - atomic_t open_count; - - struct i2c_adapter *adapter; - - spinlock_t rx_ref_lock; /* struct IR_rx kref get()/put() */ - struct IR_rx *rx; - - spinlock_t tx_ref_lock; /* struct IR_tx kref get()/put() */ - struct IR_tx *tx; -}; - -/* IR transceiver instance object list */ -/* - * This lock is used for the following: - * a. ir_devices_list access, insertions, deletions - * b. struct IR kref get()s and put()s - * c. serialization of ir_probe() for the two i2c_clients for a Z8 - */ -static DEFINE_MUTEX(ir_devices_lock); -static LIST_HEAD(ir_devices_list); - -/* Block size for IR transmitter */ -#define TX_BLOCK_SIZE 99 - -/* Hauppauge IR transmitter data */ -struct tx_data_struct { - /* Boot block */ - unsigned char *boot_data; - - /* Start of binary data block */ - unsigned char *datap; - - /* End of binary data block */ - unsigned char *endp; - - /* Number of installed codesets */ - unsigned int num_code_sets; - - /* Pointers to codesets */ - unsigned char **code_sets; - - /* Global fixed data template */ - int fixed[TX_BLOCK_SIZE]; -}; - -static struct tx_data_struct *tx_data; -static struct mutex tx_data_lock; - - -/* module parameters */ -static bool debug; /* debug output */ -static bool tx_only; /* only handle the IR Tx function */ - - -/* struct IR reference counting */ -static struct IR *get_ir_device(struct IR *ir, bool ir_devices_lock_held) -{ - if (ir_devices_lock_held) { - kref_get(&ir->ref); - } else { - mutex_lock(&ir_devices_lock); - kref_get(&ir->ref); - mutex_unlock(&ir_devices_lock); - } - return ir; -} - -static void release_ir_device(struct kref *ref) -{ - struct IR *ir = container_of(ref, struct IR, ref); - - /* - * Things should be in this state by now: - * ir->rx set to NULL and deallocated - happens before ir->rx->ir put() - * ir->rx->task kthread stopped - happens before ir->rx->ir put() - * ir->tx set to NULL and deallocated - happens before ir->tx->ir put() - * ir->open_count == 0 - happens on final close() - * ir_lock, tx_ref_lock, rx_ref_lock, all released - */ - if (ir->l.minor >= 0) { - lirc_unregister_driver(ir->l.minor); - ir->l.minor = -1; - } - - if (kfifo_initialized(&ir->rbuf.fifo)) - lirc_buffer_free(&ir->rbuf); - list_del(&ir->list); - kfree(ir); -} - -static int put_ir_device(struct IR *ir, bool ir_devices_lock_held) -{ - int released; - - if (ir_devices_lock_held) - return kref_put(&ir->ref, release_ir_device); - - mutex_lock(&ir_devices_lock); - released = kref_put(&ir->ref, release_ir_device); - mutex_unlock(&ir_devices_lock); - - return released; -} - -/* struct IR_rx reference counting */ -static struct IR_rx *get_ir_rx(struct IR *ir) -{ - struct IR_rx *rx; - - spin_lock(&ir->rx_ref_lock); - rx = ir->rx; - if (rx) - kref_get(&rx->ref); - spin_unlock(&ir->rx_ref_lock); - return rx; -} - -static void destroy_rx_kthread(struct IR_rx *rx, bool ir_devices_lock_held) -{ - /* end up polling thread */ - if (!IS_ERR_OR_NULL(rx->task)) { - kthread_stop(rx->task); - rx->task = NULL; - /* Put the ir ptr that ir_probe() gave to the rx poll thread */ - put_ir_device(rx->ir, ir_devices_lock_held); - } -} - -static void release_ir_rx(struct kref *ref) -{ - struct IR_rx *rx = container_of(ref, struct IR_rx, ref); - struct IR *ir = rx->ir; - - /* - * This release function can't do all the work, as we want - * to keep the rx_ref_lock a spinlock, and killing the poll thread - * and releasing the ir reference can cause a sleep. That work is - * performed by put_ir_rx() - */ - ir->l.features &= ~LIRC_CAN_REC_LIRCCODE; - /* Don't put_ir_device(rx->ir) here; lock can't be freed yet */ - ir->rx = NULL; - /* Don't do the kfree(rx) here; we still need to kill the poll thread */ -} - -static int put_ir_rx(struct IR_rx *rx, bool ir_devices_lock_held) -{ - int released; - struct IR *ir = rx->ir; - - spin_lock(&ir->rx_ref_lock); - released = kref_put(&rx->ref, release_ir_rx); - spin_unlock(&ir->rx_ref_lock); - /* Destroy the rx kthread while not holding the spinlock */ - if (released) { - destroy_rx_kthread(rx, ir_devices_lock_held); - kfree(rx); - /* Make sure we're not still in a poll_table somewhere */ - wake_up_interruptible(&ir->rbuf.wait_poll); - } - /* Do a reference put() for the rx->ir reference, if we released rx */ - if (released) - put_ir_device(ir, ir_devices_lock_held); - return released; -} - -/* struct IR_tx reference counting */ -static struct IR_tx *get_ir_tx(struct IR *ir) -{ - struct IR_tx *tx; - - spin_lock(&ir->tx_ref_lock); - tx = ir->tx; - if (tx) - kref_get(&tx->ref); - spin_unlock(&ir->tx_ref_lock); - return tx; -} - -static void release_ir_tx(struct kref *ref) -{ - struct IR_tx *tx = container_of(ref, struct IR_tx, ref); - struct IR *ir = tx->ir; - - ir->l.features &= ~LIRC_CAN_SEND_PULSE; - /* Don't put_ir_device(tx->ir) here, so our lock doesn't get freed */ - ir->tx = NULL; - kfree(tx); -} - -static int put_ir_tx(struct IR_tx *tx, bool ir_devices_lock_held) -{ - int released; - struct IR *ir = tx->ir; - - spin_lock(&ir->tx_ref_lock); - released = kref_put(&tx->ref, release_ir_tx); - spin_unlock(&ir->tx_ref_lock); - /* Do a reference put() for the tx->ir reference, if we released tx */ - if (released) - put_ir_device(ir, ir_devices_lock_held); - return released; -} - -static int add_to_buf(struct IR *ir) -{ - __u16 code; - unsigned char codes[2]; - unsigned char keybuf[6]; - int got_data = 0; - int ret; - int failures = 0; - unsigned char sendbuf[1] = { 0 }; - struct lirc_buffer *rbuf = ir->l.rbuf; - struct IR_rx *rx; - struct IR_tx *tx; - - if (lirc_buffer_full(rbuf)) { - dev_dbg(ir->l.dev, "buffer overflow\n"); - return -EOVERFLOW; - } - - rx = get_ir_rx(ir); - if (!rx) - return -ENXIO; - - /* Ensure our rx->c i2c_client remains valid for the duration */ - mutex_lock(&rx->client_lock); - if (!rx->c) { - mutex_unlock(&rx->client_lock); - put_ir_rx(rx, false); - return -ENXIO; - } - - tx = get_ir_tx(ir); - - /* - * service the device as long as it is returning - * data and we have space - */ - do { - if (kthread_should_stop()) { - ret = -ENODATA; - break; - } - - /* - * Lock i2c bus for the duration. RX/TX chips interfere so - * this is worth it - */ - mutex_lock(&ir->ir_lock); - - if (kthread_should_stop()) { - mutex_unlock(&ir->ir_lock); - ret = -ENODATA; - break; - } - - /* - * Send random "poll command" (?) Windows driver does this - * and it is a good point to detect chip failure. - */ - ret = i2c_master_send(rx->c, sendbuf, 1); - if (ret != 1) { - dev_err(ir->l.dev, "i2c_master_send failed with %d\n", - ret); - if (failures >= 3) { - mutex_unlock(&ir->ir_lock); - dev_err(ir->l.dev, - "unable to read from the IR chip after 3 resets, giving up\n"); - break; - } - - /* Looks like the chip crashed, reset it */ - dev_err(ir->l.dev, - "polling the IR receiver chip failed, trying reset\n"); - - set_current_state(TASK_UNINTERRUPTIBLE); - if (kthread_should_stop()) { - mutex_unlock(&ir->ir_lock); - ret = -ENODATA; - break; - } - schedule_timeout((100 * HZ + 999) / 1000); - if (tx) - tx->need_boot = 1; - - ++failures; - mutex_unlock(&ir->ir_lock); - ret = 0; - continue; - } - - if (kthread_should_stop()) { - mutex_unlock(&ir->ir_lock); - ret = -ENODATA; - break; - } - ret = i2c_master_recv(rx->c, keybuf, sizeof(keybuf)); - mutex_unlock(&ir->ir_lock); - if (ret != sizeof(keybuf)) { - dev_err(ir->l.dev, - "i2c_master_recv failed with %d -- keeping last read buffer\n", - ret); - } else { - rx->b[0] = keybuf[3]; - rx->b[1] = keybuf[4]; - rx->b[2] = keybuf[5]; - dev_dbg(ir->l.dev, - "key (0x%02x/0x%02x)\n", - rx->b[0], rx->b[1]); - } - - /* key pressed ? */ - if (rx->hdpvr_data_fmt) { - if (got_data && (keybuf[0] == 0x80)) { - ret = 0; - break; - } else if (got_data && (keybuf[0] == 0x00)) { - ret = -ENODATA; - break; - } - } else if ((rx->b[0] & 0x80) == 0) { - ret = got_data ? 0 : -ENODATA; - break; - } - - /* look what we have */ - code = (((__u16)rx->b[0] & 0x7f) << 6) | (rx->b[1] >> 2); - - codes[0] = (code >> 8) & 0xff; - codes[1] = code & 0xff; - - /* return it */ - lirc_buffer_write(rbuf, codes); - ++got_data; - ret = 0; - } while (!lirc_buffer_full(rbuf)); - - mutex_unlock(&rx->client_lock); - if (tx) - put_ir_tx(tx, false); - put_ir_rx(rx, false); - return ret; -} - -/* - * Main function of the polling thread -- from lirc_dev. - * We don't fit the LIRC model at all anymore. This is horrible, but - * basically we have a single RX/TX device with a nasty failure mode - * that needs to be accounted for across the pair. lirc lets us provide - * fops, but prevents us from using the internal polling, etc. if we do - * so. Hence the replication. Might be neater to extend the LIRC model - * to account for this but I'd think it's a very special case of seriously - * messed up hardware. - */ -static int lirc_thread(void *arg) -{ - struct IR *ir = arg; - struct lirc_buffer *rbuf = ir->l.rbuf; - - dev_dbg(ir->l.dev, "poll thread started\n"); - - while (!kthread_should_stop()) { - set_current_state(TASK_INTERRUPTIBLE); - - /* if device not opened, we can sleep half a second */ - if (atomic_read(&ir->open_count) == 0) { - schedule_timeout(HZ / 2); - continue; - } - - /* - * This is ~113*2 + 24 + jitter (2*repeat gap + code length). - * We use this interval as the chip resets every time you poll - * it (bad!). This is therefore just sufficient to catch all - * of the button presses. It makes the remote much more - * responsive. You can see the difference by running irw and - * holding down a button. With 100ms, the old polling - * interval, you'll notice breaks in the repeat sequence - * corresponding to lost keypresses. - */ - schedule_timeout((260 * HZ) / 1000); - if (kthread_should_stop()) - break; - if (!add_to_buf(ir)) - wake_up_interruptible(&rbuf->wait_poll); - } - - dev_dbg(ir->l.dev, "poll thread ended\n"); - return 0; -} - -/* safe read of a uint32 (always network byte order) */ -static int read_uint32(unsigned char **data, - unsigned char *endp, unsigned int *val) -{ - if (*data + 4 > endp) - return 0; - *val = ((*data)[0] << 24) | ((*data)[1] << 16) | - ((*data)[2] << 8) | (*data)[3]; - *data += 4; - return 1; -} - -/* safe read of a uint8 */ -static int read_uint8(unsigned char **data, - unsigned char *endp, unsigned char *val) -{ - if (*data + 1 > endp) - return 0; - *val = *((*data)++); - return 1; -} - -/* safe skipping of N bytes */ -static int skip(unsigned char **data, - unsigned char *endp, unsigned int distance) -{ - if (*data + distance > endp) - return 0; - *data += distance; - return 1; -} - -/* decompress key data into the given buffer */ -static int get_key_data(unsigned char *buf, - unsigned int codeset, unsigned int key) -{ - unsigned char *data, *endp, *diffs, *key_block; - unsigned char keys, ndiffs, id; - unsigned int base, lim, pos, i; - - /* Binary search for the codeset */ - for (base = 0, lim = tx_data->num_code_sets; lim; lim >>= 1) { - pos = base + (lim >> 1); - data = tx_data->code_sets[pos]; - - if (!read_uint32(&data, tx_data->endp, &i)) - goto corrupt; - - if (i == codeset) { - break; - } else if (codeset > i) { - base = pos + 1; - --lim; - } - } - /* Not found? */ - if (!lim) - return -EPROTO; - - /* Set end of data block */ - endp = pos < tx_data->num_code_sets - 1 ? - tx_data->code_sets[pos + 1] : tx_data->endp; - - /* Read the block header */ - if (!read_uint8(&data, endp, &keys) || - !read_uint8(&data, endp, &ndiffs) || - ndiffs > TX_BLOCK_SIZE || keys == 0) - goto corrupt; - - /* Save diffs & skip */ - diffs = data; - if (!skip(&data, endp, ndiffs)) - goto corrupt; - - /* Read the id of the first key */ - if (!read_uint8(&data, endp, &id)) - goto corrupt; - - /* Unpack the first key's data */ - for (i = 0; i < TX_BLOCK_SIZE; ++i) { - if (tx_data->fixed[i] == -1) { - if (!read_uint8(&data, endp, &buf[i])) - goto corrupt; - } else { - buf[i] = (unsigned char)tx_data->fixed[i]; - } - } - - /* Early out key found/not found */ - if (key == id) - return 0; - if (keys == 1) - return -EPROTO; - - /* Sanity check */ - key_block = data; - if (!skip(&data, endp, (keys - 1) * (ndiffs + 1))) - goto corrupt; - - /* Binary search for the key */ - for (base = 0, lim = keys - 1; lim; lim >>= 1) { - /* Seek to block */ - unsigned char *key_data; - - pos = base + (lim >> 1); - key_data = key_block + (ndiffs + 1) * pos; - - if (*key_data == key) { - /* skip key id */ - ++key_data; - - /* found, so unpack the diffs */ - for (i = 0; i < ndiffs; ++i) { - unsigned char val; - - if (!read_uint8(&key_data, endp, &val) || - diffs[i] >= TX_BLOCK_SIZE) - goto corrupt; - buf[diffs[i]] = val; - } - - return 0; - } else if (key > *key_data) { - base = pos + 1; - --lim; - } - } - /* Key not found */ - return -EPROTO; - -corrupt: - pr_err("firmware is corrupt\n"); - return -EFAULT; -} - -/* send a block of data to the IR TX device */ -static int send_data_block(struct IR_tx *tx, unsigned char *data_block) -{ - int i, j, ret; - unsigned char buf[5]; - - for (i = 0; i < TX_BLOCK_SIZE;) { - int tosend = TX_BLOCK_SIZE - i; - - if (tosend > 4) - tosend = 4; - buf[0] = (unsigned char)(i + 1); - for (j = 0; j < tosend; ++j) - buf[1 + j] = data_block[i + j]; - dev_dbg(tx->ir->l.dev, "%*ph", 5, buf); - ret = i2c_master_send(tx->c, buf, tosend + 1); - if (ret != tosend + 1) { - dev_err(tx->ir->l.dev, - "i2c_master_send failed with %d\n", ret); - return ret < 0 ? ret : -EFAULT; - } - i += tosend; - } - return 0; -} - -/* send boot data to the IR TX device */ -static int send_boot_data(struct IR_tx *tx) -{ - int ret, i; - unsigned char buf[4]; - - /* send the boot block */ - ret = send_data_block(tx, tx_data->boot_data); - if (ret != 0) - return ret; - - /* Hit the go button to activate the new boot data */ - buf[0] = 0x00; - buf[1] = 0x20; - ret = i2c_master_send(tx->c, buf, 2); - if (ret != 2) { - dev_err(tx->ir->l.dev, "i2c_master_send failed with %d\n", ret); - return ret < 0 ? ret : -EFAULT; - } - - /* - * Wait for zilog to settle after hitting go post boot block upload. - * Without this delay, the HD-PVR and HVR-1950 both return an -EIO - * upon attempting to get firmware revision, and tx probe thus fails. - */ - for (i = 0; i < 10; i++) { - ret = i2c_master_send(tx->c, buf, 1); - if (ret == 1) - break; - udelay(100); - } - - if (ret != 1) { - dev_err(tx->ir->l.dev, "i2c_master_send failed with %d\n", ret); - return ret < 0 ? ret : -EFAULT; - } - - /* Here comes the firmware version... (hopefully) */ - ret = i2c_master_recv(tx->c, buf, 4); - if (ret != 4) { - dev_err(tx->ir->l.dev, "i2c_master_recv failed with %d\n", ret); - return 0; - } - if ((buf[0] != 0x80) && (buf[0] != 0xa0)) { - dev_err(tx->ir->l.dev, "unexpected IR TX init response: %02x\n", - buf[0]); - return 0; - } - dev_notice(tx->ir->l.dev, - "Zilog/Hauppauge IR blaster firmware version %d.%d.%d loaded\n", - buf[1], buf[2], buf[3]); - - return 0; -} - -/* unload "firmware", lock held */ -static void fw_unload_locked(void) -{ - if (tx_data) { - vfree(tx_data->code_sets); - - vfree(tx_data->datap); - - vfree(tx_data); - tx_data = NULL; - pr_debug("successfully unloaded IR blaster firmware\n"); - } -} - -/* unload "firmware" for the IR TX device */ -static void fw_unload(void) -{ - mutex_lock(&tx_data_lock); - fw_unload_locked(); - mutex_unlock(&tx_data_lock); -} - -/* load "firmware" for the IR TX device */ -static int fw_load(struct IR_tx *tx) -{ - int ret; - unsigned int i; - unsigned char *data, version, num_global_fixed; - const struct firmware *fw_entry; - - /* Already loaded? */ - mutex_lock(&tx_data_lock); - if (tx_data) { - ret = 0; - goto out; - } - - /* Request codeset data file */ - ret = request_firmware(&fw_entry, "haup-ir-blaster.bin", tx->ir->l.dev); - if (ret != 0) { - dev_err(tx->ir->l.dev, - "firmware haup-ir-blaster.bin not available (%d)\n", - ret); - ret = ret < 0 ? ret : -EFAULT; - goto out; - } - dev_dbg(tx->ir->l.dev, "firmware of size %zu loaded\n", fw_entry->size); - - /* Parse the file */ - tx_data = vmalloc(sizeof(*tx_data)); - if (!tx_data) { - release_firmware(fw_entry); - ret = -ENOMEM; - goto out; - } - tx_data->code_sets = NULL; - - /* Copy the data so hotplug doesn't get confused and timeout */ - tx_data->datap = vmalloc(fw_entry->size); - if (!tx_data->datap) { - release_firmware(fw_entry); - vfree(tx_data); - ret = -ENOMEM; - goto out; - } - memcpy(tx_data->datap, fw_entry->data, fw_entry->size); - tx_data->endp = tx_data->datap + fw_entry->size; - release_firmware(fw_entry); fw_entry = NULL; - - /* Check version */ - data = tx_data->datap; - if (!read_uint8(&data, tx_data->endp, &version)) - goto corrupt; - if (version != 1) { - dev_err(tx->ir->l.dev, - "unsupported code set file version (%u, expected 1) -- please upgrade to a newer driver\n", - version); - fw_unload_locked(); - ret = -EFAULT; - goto out; - } - - /* Save boot block for later */ - tx_data->boot_data = data; - if (!skip(&data, tx_data->endp, TX_BLOCK_SIZE)) - goto corrupt; - - if (!read_uint32(&data, tx_data->endp, - &tx_data->num_code_sets)) - goto corrupt; - - dev_dbg(tx->ir->l.dev, "%u IR blaster codesets loaded\n", - tx_data->num_code_sets); - - tx_data->code_sets = vmalloc( - tx_data->num_code_sets * sizeof(char *)); - if (!tx_data->code_sets) { - fw_unload_locked(); - ret = -ENOMEM; - goto out; - } - - for (i = 0; i < TX_BLOCK_SIZE; ++i) - tx_data->fixed[i] = -1; - - /* Read global fixed data template */ - if (!read_uint8(&data, tx_data->endp, &num_global_fixed) || - num_global_fixed > TX_BLOCK_SIZE) - goto corrupt; - for (i = 0; i < num_global_fixed; ++i) { - unsigned char pos, val; - - if (!read_uint8(&data, tx_data->endp, &pos) || - !read_uint8(&data, tx_data->endp, &val) || - pos >= TX_BLOCK_SIZE) - goto corrupt; - tx_data->fixed[pos] = (int)val; - } - - /* Filch out the position of each code set */ - for (i = 0; i < tx_data->num_code_sets; ++i) { - unsigned int id; - unsigned char keys; - unsigned char ndiffs; - - /* Save the codeset position */ - tx_data->code_sets[i] = data; - - /* Read header */ - if (!read_uint32(&data, tx_data->endp, &id) || - !read_uint8(&data, tx_data->endp, &keys) || - !read_uint8(&data, tx_data->endp, &ndiffs) || - ndiffs > TX_BLOCK_SIZE || keys == 0) - goto corrupt; - - /* skip diff positions */ - if (!skip(&data, tx_data->endp, ndiffs)) - goto corrupt; - - /* - * After the diffs we have the first key id + data - - * global fixed - */ - if (!skip(&data, tx_data->endp, - 1 + TX_BLOCK_SIZE - num_global_fixed)) - goto corrupt; - - /* Then we have keys-1 blocks of key id+diffs */ - if (!skip(&data, tx_data->endp, - (ndiffs + 1) * (keys - 1))) - goto corrupt; - } - ret = 0; - goto out; - -corrupt: - dev_err(tx->ir->l.dev, "firmware is corrupt\n"); - fw_unload_locked(); - ret = -EFAULT; - -out: - mutex_unlock(&tx_data_lock); - return ret; -} - -/* copied from lirc_dev */ -static ssize_t read(struct file *filep, char __user *outbuf, size_t n, - loff_t *ppos) -{ - struct IR *ir = filep->private_data; - struct IR_rx *rx; - struct lirc_buffer *rbuf = ir->l.rbuf; - int ret = 0, written = 0, retries = 0; - unsigned int m; - DECLARE_WAITQUEUE(wait, current); - - dev_dbg(ir->l.dev, "read called\n"); - if (n % rbuf->chunk_size) { - dev_dbg(ir->l.dev, "read result = -EINVAL\n"); - return -EINVAL; - } - - rx = get_ir_rx(ir); - if (!rx) - return -ENXIO; - - /* - * we add ourselves to the task queue before buffer check - * to avoid losing scan code (in case when queue is awaken somewhere - * between while condition checking and scheduling) - */ - add_wait_queue(&rbuf->wait_poll, &wait); - set_current_state(TASK_INTERRUPTIBLE); - - /* - * while we didn't provide 'length' bytes, device is opened in blocking - * mode and 'copy_to_user' is happy, wait for data. - */ - while (written < n && ret == 0) { - if (lirc_buffer_empty(rbuf)) { - /* - * According to the read(2) man page, 'written' can be - * returned as less than 'n', instead of blocking - * again, returning -EWOULDBLOCK, or returning - * -ERESTARTSYS - */ - if (written) - break; - if (filep->f_flags & O_NONBLOCK) { - ret = -EWOULDBLOCK; - break; - } - if (signal_pending(current)) { - ret = -ERESTARTSYS; - break; - } - schedule(); - set_current_state(TASK_INTERRUPTIBLE); - } else { - unsigned char buf[MAX_XFER_SIZE]; - - if (rbuf->chunk_size > sizeof(buf)) { - dev_err(ir->l.dev, - "chunk_size is too big (%d)!\n", - rbuf->chunk_size); - ret = -EINVAL; - break; - } - m = lirc_buffer_read(rbuf, buf); - if (m == rbuf->chunk_size) { - ret = copy_to_user(outbuf + written, buf, - rbuf->chunk_size); - written += rbuf->chunk_size; - } else { - retries++; - } - if (retries >= 5) { - dev_err(ir->l.dev, "Buffer read failed!\n"); - ret = -EIO; - } - } - } - - remove_wait_queue(&rbuf->wait_poll, &wait); - put_ir_rx(rx, false); - set_current_state(TASK_RUNNING); - - dev_dbg(ir->l.dev, "read result = %d (%s)\n", ret, - ret ? "Error" : "OK"); - - return ret ? ret : written; -} - -/* send a keypress to the IR TX device */ -static int send_code(struct IR_tx *tx, unsigned int code, unsigned int key) -{ - unsigned char data_block[TX_BLOCK_SIZE]; - unsigned char buf[2]; - int i, ret; - - /* Get data for the codeset/key */ - ret = get_key_data(data_block, code, key); - - if (ret == -EPROTO) { - dev_err(tx->ir->l.dev, - "failed to get data for code %u, key %u -- check lircd.conf entries\n", - code, key); - return ret; - } else if (ret != 0) { - return ret; - } - - /* Send the data block */ - ret = send_data_block(tx, data_block); - if (ret != 0) - return ret; - - /* Send data block length? */ - buf[0] = 0x00; - buf[1] = 0x40; - ret = i2c_master_send(tx->c, buf, 2); - if (ret != 2) { - dev_err(tx->ir->l.dev, "i2c_master_send failed with %d\n", ret); - return ret < 0 ? ret : -EFAULT; - } - - /* Give the z8 a moment to process data block */ - for (i = 0; i < 10; i++) { - ret = i2c_master_send(tx->c, buf, 1); - if (ret == 1) - break; - udelay(100); - } - - if (ret != 1) { - dev_err(tx->ir->l.dev, "i2c_master_send failed with %d\n", ret); - return ret < 0 ? ret : -EFAULT; - } - - /* Send finished download? */ - ret = i2c_master_recv(tx->c, buf, 1); - if (ret != 1) { - dev_err(tx->ir->l.dev, "i2c_master_recv failed with %d\n", ret); - return ret < 0 ? ret : -EFAULT; - } - if (buf[0] != 0xA0) { - dev_err(tx->ir->l.dev, "unexpected IR TX response #1: %02x\n", - buf[0]); - return -EFAULT; - } - - /* Send prepare command? */ - buf[0] = 0x00; - buf[1] = 0x80; - ret = i2c_master_send(tx->c, buf, 2); - if (ret != 2) { - dev_err(tx->ir->l.dev, "i2c_master_send failed with %d\n", ret); - return ret < 0 ? ret : -EFAULT; - } - - /* - * The sleep bits aren't necessary on the HD PVR, and in fact, the - * last i2c_master_recv always fails with a -5, so for now, we're - * going to skip this whole mess and say we're done on the HD PVR - */ - if (!tx->post_tx_ready_poll) { - dev_dbg(tx->ir->l.dev, "sent code %u, key %u\n", code, key); - return 0; - } - - /* - * This bit NAKs until the device is ready, so we retry it - * sleeping a bit each time. This seems to be what the windows - * driver does, approximately. - * Try for up to 1s. - */ - for (i = 0; i < 20; ++i) { - set_current_state(TASK_UNINTERRUPTIBLE); - schedule_timeout((50 * HZ + 999) / 1000); - ret = i2c_master_send(tx->c, buf, 1); - if (ret == 1) - break; - dev_dbg(tx->ir->l.dev, - "NAK expected: i2c_master_send failed with %d (try %d)\n", - ret, i + 1); - } - if (ret != 1) { - dev_err(tx->ir->l.dev, - "IR TX chip never got ready: last i2c_master_send failed with %d\n", - ret); - return ret < 0 ? ret : -EFAULT; - } - - /* Seems to be an 'ok' response */ - i = i2c_master_recv(tx->c, buf, 1); - if (i != 1) { - dev_err(tx->ir->l.dev, "i2c_master_recv failed with %d\n", ret); - return -EFAULT; - } - if (buf[0] != 0x80) { - dev_err(tx->ir->l.dev, "unexpected IR TX response #2: %02x\n", - buf[0]); - return -EFAULT; - } - - /* Oh good, it worked */ - dev_dbg(tx->ir->l.dev, "sent code %u, key %u\n", code, key); - return 0; -} - -/* - * Write a code to the device. We take in a 32-bit number (an int) and then - * decode this to a codeset/key index. The key data is then decompressed and - * sent to the device. We have a spin lock as per i2c documentation to prevent - * multiple concurrent sends which would probably cause the device to explode. - */ -static ssize_t write(struct file *filep, const char __user *buf, size_t n, - loff_t *ppos) -{ - struct IR *ir = filep->private_data; - struct IR_tx *tx; - size_t i; - int failures = 0; - - /* Validate user parameters */ - if (n % sizeof(int)) - return -EINVAL; - - /* Get a struct IR_tx reference */ - tx = get_ir_tx(ir); - if (!tx) - return -ENXIO; - - /* Ensure our tx->c i2c_client remains valid for the duration */ - mutex_lock(&tx->client_lock); - if (!tx->c) { - mutex_unlock(&tx->client_lock); - put_ir_tx(tx, false); - return -ENXIO; - } - - /* Lock i2c bus for the duration */ - mutex_lock(&ir->ir_lock); - - /* Send each keypress */ - for (i = 0; i < n;) { - int ret = 0; - int command; - - if (copy_from_user(&command, buf + i, sizeof(command))) { - mutex_unlock(&ir->ir_lock); - mutex_unlock(&tx->client_lock); - put_ir_tx(tx, false); - return -EFAULT; - } - - /* Send boot data first if required */ - if (tx->need_boot == 1) { - /* Make sure we have the 'firmware' loaded, first */ - ret = fw_load(tx); - if (ret != 0) { - mutex_unlock(&ir->ir_lock); - mutex_unlock(&tx->client_lock); - put_ir_tx(tx, false); - if (ret != -ENOMEM) - ret = -EIO; - return ret; - } - /* Prep the chip for transmitting codes */ - ret = send_boot_data(tx); - if (ret == 0) - tx->need_boot = 0; - } - - /* Send the code */ - if (ret == 0) { - ret = send_code(tx, (unsigned int)command >> 16, - (unsigned int)command & 0xFFFF); - if (ret == -EPROTO) { - mutex_unlock(&ir->ir_lock); - mutex_unlock(&tx->client_lock); - put_ir_tx(tx, false); - return ret; - } - } - - /* - * Hmm, a failure. If we've had a few then give up, otherwise - * try a reset - */ - if (ret != 0) { - /* Looks like the chip crashed, reset it */ - dev_err(tx->ir->l.dev, - "sending to the IR transmitter chip failed, trying reset\n"); - - if (failures >= 3) { - dev_err(tx->ir->l.dev, - "unable to send to the IR chip after 3 resets, giving up\n"); - mutex_unlock(&ir->ir_lock); - mutex_unlock(&tx->client_lock); - put_ir_tx(tx, false); - return ret; - } - set_current_state(TASK_UNINTERRUPTIBLE); - schedule_timeout((100 * HZ + 999) / 1000); - tx->need_boot = 1; - ++failures; - } else { - i += sizeof(int); - } - } - - /* Release i2c bus */ - mutex_unlock(&ir->ir_lock); - - mutex_unlock(&tx->client_lock); - - /* Give back our struct IR_tx reference */ - put_ir_tx(tx, false); - - /* All looks good */ - return n; -} - -/* copied from lirc_dev */ -static unsigned int poll(struct file *filep, poll_table *wait) -{ - struct IR *ir = filep->private_data; - struct IR_rx *rx; - struct lirc_buffer *rbuf = ir->l.rbuf; - unsigned int ret; - - dev_dbg(ir->l.dev, "%s called\n", __func__); - - rx = get_ir_rx(ir); - if (!rx) { - /* - * Revisit this, if our poll function ever reports writeable - * status for Tx - */ - dev_dbg(ir->l.dev, "%s result = POLLERR\n", __func__); - return POLLERR; - } - - /* - * Add our lirc_buffer's wait_queue to the poll_table. A wake up on - * that buffer's wait queue indicates we may have a new poll status. - */ - poll_wait(filep, &rbuf->wait_poll, wait); - - /* Indicate what ops could happen immediately without blocking */ - ret = lirc_buffer_empty(rbuf) ? 0 : (POLLIN | POLLRDNORM); - - dev_dbg(ir->l.dev, "%s result = %s\n", __func__, - ret ? "POLLIN|POLLRDNORM" : "none"); - put_ir_rx(rx, false); - return ret; -} - -static long ioctl(struct file *filep, unsigned int cmd, unsigned long arg) -{ - struct IR *ir = filep->private_data; - unsigned long __user *uptr = (unsigned long __user *)arg; - int result; - unsigned long mode, features; - - features = ir->l.features; - - switch (cmd) { - case LIRC_GET_LENGTH: - result = put_user(13UL, uptr); - break; - case LIRC_GET_FEATURES: - result = put_user(features, uptr); - break; - case LIRC_GET_REC_MODE: - if (!(features & LIRC_CAN_REC_MASK)) - return -ENOTTY; - - result = put_user(LIRC_REC2MODE - (features & LIRC_CAN_REC_MASK), - uptr); - break; - case LIRC_SET_REC_MODE: - if (!(features & LIRC_CAN_REC_MASK)) - return -ENOTTY; - - result = get_user(mode, uptr); - if (!result && !(LIRC_MODE2REC(mode) & features)) - result = -ENOTTY; - break; - case LIRC_GET_SEND_MODE: - if (!(features & LIRC_CAN_SEND_MASK)) - return -ENOTTY; - - result = put_user(LIRC_MODE_PULSE, uptr); - break; - case LIRC_SET_SEND_MODE: - if (!(features & LIRC_CAN_SEND_MASK)) - return -ENOTTY; - - result = get_user(mode, uptr); - if (!result && mode != LIRC_MODE_PULSE) - return -EINVAL; - break; - default: - return -EINVAL; - } - return result; -} - -static struct IR *get_ir_device_by_minor(unsigned int minor) -{ - struct IR *ir; - struct IR *ret = NULL; - - mutex_lock(&ir_devices_lock); - - if (!list_empty(&ir_devices_list)) { - list_for_each_entry(ir, &ir_devices_list, list) { - if (ir->l.minor == minor) { - ret = get_ir_device(ir, true); - break; - } - } - } - - mutex_unlock(&ir_devices_lock); - return ret; -} - -/* - * Open the IR device. Get hold of our IR structure and - * stash it in private_data for the file - */ -static int open(struct inode *node, struct file *filep) -{ - struct IR *ir; - unsigned int minor = MINOR(node->i_rdev); - - /* find our IR struct */ - ir = get_ir_device_by_minor(minor); - - if (!ir) - return -ENODEV; - - atomic_inc(&ir->open_count); - - /* stash our IR struct */ - filep->private_data = ir; - - nonseekable_open(node, filep); - return 0; -} - -/* Close the IR device */ -static int close(struct inode *node, struct file *filep) -{ - /* find our IR struct */ - struct IR *ir = filep->private_data; - - if (!ir) { - pr_err("ir: %s: no private_data attached to the file!\n", - __func__); - return -ENODEV; - } - - atomic_dec(&ir->open_count); - - put_ir_device(ir, false); - return 0; -} - -static int ir_remove(struct i2c_client *client); -static int ir_probe(struct i2c_client *client, const struct i2c_device_id *id); - -#define ID_FLAG_TX 0x01 -#define ID_FLAG_HDPVR 0x02 - -static const struct i2c_device_id ir_transceiver_id[] = { - { "ir_tx_z8f0811_haup", ID_FLAG_TX }, - { "ir_rx_z8f0811_haup", 0 }, - { "ir_tx_z8f0811_hdpvr", ID_FLAG_HDPVR | ID_FLAG_TX }, - { "ir_rx_z8f0811_hdpvr", ID_FLAG_HDPVR }, - { } -}; -MODULE_DEVICE_TABLE(i2c, ir_transceiver_id); - -static struct i2c_driver driver = { - .driver = { - .name = "Zilog/Hauppauge i2c IR", - }, - .probe = ir_probe, - .remove = ir_remove, - .id_table = ir_transceiver_id, -}; - -static const struct file_operations lirc_fops = { - .owner = THIS_MODULE, - .llseek = no_llseek, - .read = read, - .write = write, - .poll = poll, - .unlocked_ioctl = ioctl, -#ifdef CONFIG_COMPAT - .compat_ioctl = ioctl, -#endif - .open = open, - .release = close -}; - -static struct lirc_driver lirc_template = { - .name = "lirc_zilog", - .minor = -1, - .code_length = 13, - .buffer_size = BUFLEN / 2, - .chunk_size = 2, - .fops = &lirc_fops, - .owner = THIS_MODULE, -}; - -static int ir_remove(struct i2c_client *client) -{ - if (strncmp("ir_tx_z8", client->name, 8) == 0) { - struct IR_tx *tx = i2c_get_clientdata(client); - - if (tx) { - mutex_lock(&tx->client_lock); - tx->c = NULL; - mutex_unlock(&tx->client_lock); - put_ir_tx(tx, false); - } - } else if (strncmp("ir_rx_z8", client->name, 8) == 0) { - struct IR_rx *rx = i2c_get_clientdata(client); - - if (rx) { - mutex_lock(&rx->client_lock); - rx->c = NULL; - mutex_unlock(&rx->client_lock); - put_ir_rx(rx, false); - } - } - return 0; -} - -/* ir_devices_lock must be held */ -static struct IR *get_ir_device_by_adapter(struct i2c_adapter *adapter) -{ - struct IR *ir; - - if (list_empty(&ir_devices_list)) - return NULL; - - list_for_each_entry(ir, &ir_devices_list, list) - if (ir->adapter == adapter) { - get_ir_device(ir, true); - return ir; - } - - return NULL; -} - -static int ir_probe(struct i2c_client *client, const struct i2c_device_id *id) -{ - struct IR *ir; - struct IR_tx *tx; - struct IR_rx *rx; - struct i2c_adapter *adap = client->adapter; - int ret; - bool tx_probe = false; - - dev_dbg(&client->dev, "%s: %s on i2c-%d (%s), client addr=0x%02x\n", - __func__, id->name, adap->nr, adap->name, client->addr); - - /* - * The IR receiver is at i2c address 0x71. - * The IR transmitter is at i2c address 0x70. - */ - - if (id->driver_data & ID_FLAG_TX) - tx_probe = true; - else if (tx_only) /* module option */ - return -ENXIO; - - pr_info("probing IR %s on %s (i2c-%d)\n", - tx_probe ? "Tx" : "Rx", adap->name, adap->nr); - - mutex_lock(&ir_devices_lock); - - /* Use a single struct IR instance for both the Rx and Tx functions */ - ir = get_ir_device_by_adapter(adap); - if (!ir) { - ir = kzalloc(sizeof(*ir), GFP_KERNEL); - if (!ir) { - ret = -ENOMEM; - goto out_no_ir; - } - kref_init(&ir->ref); - - /* store for use in ir_probe() again, and open() later on */ - INIT_LIST_HEAD(&ir->list); - list_add_tail(&ir->list, &ir_devices_list); - - ir->adapter = adap; - mutex_init(&ir->ir_lock); - atomic_set(&ir->open_count, 0); - spin_lock_init(&ir->tx_ref_lock); - spin_lock_init(&ir->rx_ref_lock); - - /* set lirc_dev stuff */ - memcpy(&ir->l, &lirc_template, sizeof(struct lirc_driver)); - /* - * FIXME this is a pointer reference to us, but no refcount. - * - * This OK for now, since lirc_dev currently won't touch this - * buffer as we provide our own lirc_fops. - * - * Currently our own lirc_fops rely on this ir->l.rbuf pointer - */ - ir->l.rbuf = &ir->rbuf; - ir->l.dev = &adap->dev; - ret = lirc_buffer_init(ir->l.rbuf, - ir->l.chunk_size, ir->l.buffer_size); - if (ret) - goto out_put_ir; - } - - if (tx_probe) { - /* Get the IR_rx instance for later, if already allocated */ - rx = get_ir_rx(ir); - - /* Set up a struct IR_tx instance */ - tx = kzalloc(sizeof(*tx), GFP_KERNEL); - if (!tx) { - ret = -ENOMEM; - goto out_put_xx; - } - kref_init(&tx->ref); - ir->tx = tx; - - ir->l.features |= LIRC_CAN_SEND_PULSE; - mutex_init(&tx->client_lock); - tx->c = client; - tx->need_boot = 1; - tx->post_tx_ready_poll = - (id->driver_data & ID_FLAG_HDPVR) ? false : true; - - /* An ir ref goes to the struct IR_tx instance */ - tx->ir = get_ir_device(ir, true); - - /* A tx ref goes to the i2c_client */ - i2c_set_clientdata(client, get_ir_tx(ir)); - - /* - * Load the 'firmware'. We do this before registering with - * lirc_dev, so the first firmware load attempt does not happen - * after a open() or write() call on the device. - * - * Failure here is not deemed catastrophic, so the receiver will - * still be usable. Firmware load will be retried in write(), - * if it is needed. - */ - fw_load(tx); - - /* Proceed only if the Rx client is also ready or not needed */ - if (!rx && !tx_only) { - dev_info(tx->ir->l.dev, - "probe of IR Tx on %s (i2c-%d) done. Waiting on IR Rx.\n", - adap->name, adap->nr); - goto out_ok; - } - } else { - /* Get the IR_tx instance for later, if already allocated */ - tx = get_ir_tx(ir); - - /* Set up a struct IR_rx instance */ - rx = kzalloc(sizeof(*rx), GFP_KERNEL); - if (!rx) { - ret = -ENOMEM; - goto out_put_xx; - } - kref_init(&rx->ref); - ir->rx = rx; - - ir->l.features |= LIRC_CAN_REC_LIRCCODE; - mutex_init(&rx->client_lock); - rx->c = client; - rx->hdpvr_data_fmt = - (id->driver_data & ID_FLAG_HDPVR) ? true : false; - - /* An ir ref goes to the struct IR_rx instance */ - rx->ir = get_ir_device(ir, true); - - /* An rx ref goes to the i2c_client */ - i2c_set_clientdata(client, get_ir_rx(ir)); - - /* - * Start the polling thread. - * It will only perform an empty loop around schedule_timeout() - * until we register with lirc_dev and the first user open() - */ - /* An ir ref goes to the new rx polling kthread */ - rx->task = kthread_run(lirc_thread, get_ir_device(ir, true), - "zilog-rx-i2c-%d", adap->nr); - if (IS_ERR(rx->task)) { - ret = PTR_ERR(rx->task); - dev_err(tx->ir->l.dev, - "%s: could not start IR Rx polling thread\n", - __func__); - /* Failed kthread, so put back the ir ref */ - put_ir_device(ir, true); - /* Failure exit, so put back rx ref from i2c_client */ - i2c_set_clientdata(client, NULL); - put_ir_rx(rx, true); - ir->l.features &= ~LIRC_CAN_REC_LIRCCODE; - goto out_put_tx; - } - - /* Proceed only if the Tx client is also ready */ - if (!tx) { - pr_info("probe of IR Rx on %s (i2c-%d) done. Waiting on IR Tx.\n", - adap->name, adap->nr); - goto out_ok; - } - } - - /* register with lirc */ - ir->l.minor = lirc_register_driver(&ir->l); - if (ir->l.minor < 0) { - dev_err(tx->ir->l.dev, - "%s: lirc_register_driver() failed: %i\n", - __func__, ir->l.minor); - ret = -EBADRQC; - goto out_put_xx; - } - dev_info(ir->l.dev, - "IR unit on %s (i2c-%d) registered as lirc%d and ready\n", - adap->name, adap->nr, ir->l.minor); - -out_ok: - if (rx) - put_ir_rx(rx, true); - if (tx) - put_ir_tx(tx, true); - put_ir_device(ir, true); - dev_info(ir->l.dev, - "probe of IR %s on %s (i2c-%d) done\n", - tx_probe ? "Tx" : "Rx", adap->name, adap->nr); - mutex_unlock(&ir_devices_lock); - return 0; - -out_put_xx: - if (rx) - put_ir_rx(rx, true); -out_put_tx: - if (tx) - put_ir_tx(tx, true); -out_put_ir: - put_ir_device(ir, true); -out_no_ir: - dev_err(&client->dev, - "%s: probing IR %s on %s (i2c-%d) failed with %d\n", - __func__, tx_probe ? "Tx" : "Rx", adap->name, adap->nr, ret); - mutex_unlock(&ir_devices_lock); - return ret; -} - -static int __init zilog_init(void) -{ - int ret; - - pr_notice("Zilog/Hauppauge IR driver initializing\n"); - - mutex_init(&tx_data_lock); - - request_module("firmware_class"); - - ret = i2c_add_driver(&driver); - if (ret) - pr_err("initialization failed\n"); - else - pr_notice("initialization complete\n"); - - return ret; -} - -static void __exit zilog_exit(void) -{ - i2c_del_driver(&driver); - /* if loaded */ - fw_unload(); - pr_notice("Zilog/Hauppauge IR driver unloaded\n"); -} - -module_init(zilog_init); -module_exit(zilog_exit); - -MODULE_DESCRIPTION("Zilog/Hauppauge infrared transmitter driver (i2c stack)"); -MODULE_AUTHOR("Gerd Knorr, Michal Kochanowicz, Christoph Bartelmus, Ulrich Mueller, Stefan Jahn, Jerome Brock, Mark Weaver, Andy Walls"); -MODULE_LICENSE("GPL"); -/* for compat with old name, which isn't all that accurate anymore */ -MODULE_ALIAS("lirc_pvr150"); - -module_param(debug, bool, 0644); -MODULE_PARM_DESC(debug, "Enable debugging messages"); - -module_param(tx_only, bool, 0644); -MODULE_PARM_DESC(tx_only, "Only handle the IR transmit function"); diff --git a/drivers/staging/qca-wifi-host-cmn/qdf/linux/src/i_qdf_time.h b/drivers/staging/qca-wifi-host-cmn/qdf/linux/src/i_qdf_time.h index 7f0b41e914ce..c661b2a799bd 100644 --- a/drivers/staging/qca-wifi-host-cmn/qdf/linux/src/i_qdf_time.h +++ b/drivers/staging/qca-wifi-host-cmn/qdf/linux/src/i_qdf_time.h @@ -314,7 +314,7 @@ static inline uint64_t __qdf_get_log_timestamp(void) #if (LINUX_VERSION_CODE >= KERNEL_VERSION(3, 17, 0)) static inline uint64_t __qdf_get_bootbased_boottime_ns(void) { - return ktime_get_boot_ns(); + return ktime_get_boottime_ns(); } #else diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c index bf834387fb0d..9b83b078667b 100644 --- a/drivers/tty/tty_io.c +++ b/drivers/tty/tty_io.c @@ -868,8 +868,13 @@ static ssize_t tty_read(struct file *file, char __user *buf, size_t count, i = -EIO; tty_ldisc_deref(ld); - if (i > 0) - tty_update_time(&inode->i_atime); + if (i > 0) { + struct timespec ts; + + ts = timespec64_to_timespec(inode->i_atime); + tty_update_time(&ts); + inode->i_atime = timespec_to_timespec64(ts); + } return i; } @@ -970,7 +975,11 @@ static inline ssize_t do_tty_write( cond_resched(); } if (written) { - tty_update_time(&file_inode(file)->i_mtime); + struct timespec ts; + + ts = timespec64_to_timespec(file_inode(file)->i_mtime); + tty_update_time(&ts); + file_inode(file)->i_mtime = timespec_to_timespec64(ts); ret = written; } out: diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c index 3d55cf1a0cf1..d2f08c194053 100644 --- a/drivers/usb/gadget/function/f_fs.c +++ b/drivers/usb/gadget/function/f_fs.c @@ -1469,7 +1469,7 @@ ffs_sb_make_inode(struct super_block *sb, void *data, inode = new_inode(sb); if (likely(inode)) { - struct timespec ts = current_time(inode); + struct timespec64 ts = current_time(inode); inode->i_ino = get_next_ino(); inode->i_mode = perms->mode; diff --git a/fs/adfs/inode.c b/fs/adfs/inode.c index 8dbd36f5e581..c836c425ca94 100644 --- a/fs/adfs/inode.c +++ b/fs/adfs/inode.c @@ -199,7 +199,7 @@ adfs_adfs2unix_time(struct timespec *tv, struct inode *inode) return; cur_time: - *tv = current_time(inode); + *tv = timespec64_to_timespec(current_time(inode)); return; too_early: @@ -242,6 +242,7 @@ adfs_unix2adfs_time(struct inode *inode, unsigned int secs) struct inode * adfs_iget(struct super_block *sb, struct object_info *obj) { + struct timespec ts; struct inode *inode; inode = new_inode(sb); @@ -270,7 +271,9 @@ adfs_iget(struct super_block *sb, struct object_info *obj) ADFS_I(inode)->stamped = ((obj->loadaddr & 0xfff00000) == 0xfff00000); inode->i_mode = adfs_atts2mode(sb, inode); - adfs_adfs2unix_time(&inode->i_mtime, inode); + ts = timespec64_to_timespec(inode->i_mtime); + adfs_adfs2unix_time(&ts, inode); + inode->i_mtime = timespec_to_timespec64(ts); inode->i_atime = inode->i_mtime; inode->i_ctime = inode->i_mtime; diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c index 7dc9c78a1c31..805ce7f2318a 100644 --- a/fs/afs/rxrpc.c +++ b/fs/afs/rxrpc.c @@ -617,7 +617,7 @@ static void afs_wake_up_async_call(struct sock *sk, struct rxrpc_call *rxcall, trace_afs_notify_call(rxcall, call); call->need_attention = true; - u = __atomic_add_unless(&call->usage, 1, 0); + u = atomic_fetch_add_unless(&call->usage, 1, 0); if (u != 0) { trace_afs_call(call, afs_call_trace_wake, u + 1, atomic_read(&afs_outstanding_calls), diff --git a/fs/attr.c b/fs/attr.c index 784d1625bf44..8b270ac51fe4 100644 --- a/fs/attr.c +++ b/fs/attr.c @@ -166,14 +166,14 @@ void setattr_copy(struct inode *inode, const struct iattr *attr) if (ia_valid & ATTR_GID) inode->i_gid = attr->ia_gid; if (ia_valid & ATTR_ATIME) - inode->i_atime = timespec_trunc(attr->ia_atime, - inode->i_sb->s_time_gran); + inode->i_atime = timespec64_trunc(attr->ia_atime, + inode->i_sb->s_time_gran); if (ia_valid & ATTR_MTIME) - inode->i_mtime = timespec_trunc(attr->ia_mtime, - inode->i_sb->s_time_gran); + inode->i_mtime = timespec64_trunc(attr->ia_mtime, + inode->i_sb->s_time_gran); if (ia_valid & ATTR_CTIME) - inode->i_ctime = timespec_trunc(attr->ia_ctime, - inode->i_sb->s_time_gran); + inode->i_ctime = timespec64_trunc(attr->ia_ctime, + inode->i_sb->s_time_gran); if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; @@ -210,7 +210,7 @@ int notify_change2(struct vfsmount *mnt, struct dentry * dentry, struct iattr * struct inode *inode = dentry->d_inode; umode_t mode = inode->i_mode; int error; - struct timespec now; + struct timespec64 now; unsigned int ia_valid = attr->ia_valid; WARN_ON_ONCE(!inode_is_locked(inode)); diff --git a/fs/bad_inode.c b/fs/bad_inode.c index 213b51dbbb60..125e8bbd22a2 100644 --- a/fs/bad_inode.c +++ b/fs/bad_inode.c @@ -126,7 +126,7 @@ static int bad_inode_fiemap(struct inode *inode, return -EIO; } -static int bad_inode_update_time(struct inode *inode, struct timespec *time, +static int bad_inode_update_time(struct inode *inode, struct timespec64 *time, int flags) { return -EIO; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 2f386d8dbd0e..01c8fab75777 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1852,16 +1852,16 @@ out: static void update_time_for_write(struct inode *inode) { - struct timespec now; + struct timespec64 now; if (IS_NOCMTIME(inode)) return; now = current_time(inode); - if (!timespec_equal(&inode->i_mtime, &now)) + if (!timespec64_equal(&inode->i_mtime, &now)) inode->i_mtime = now; - if (!timespec_equal(&inode->i_ctime, &now)) + if (!timespec64_equal(&inode->i_ctime, &now)) inode->i_ctime = now; if (IS_I_VERSION(inode)) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f95f9934d1bc..276116c5c64c 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5907,7 +5907,7 @@ static struct inode *new_simple_dir(struct super_block *s, inode->i_mtime = current_time(inode); inode->i_atime = inode->i_mtime; inode->i_ctime = inode->i_mtime; - BTRFS_I(inode)->i_otime = inode->i_mtime; + BTRFS_I(inode)->i_otime = timespec64_to_timespec(inode->i_mtime); return inode; } @@ -6257,7 +6257,7 @@ static int btrfs_dirty_inode(struct inode *inode) * This is a copy of file_update_time. We need this so we can return error on * ENOSPC for updating the inode in the case of file write and mmap writes. */ -static int btrfs_update_time(struct inode *inode, struct timespec *now, +static int btrfs_update_time(struct inode *inode, struct timespec64 *now, int flags) { struct btrfs_root *root = BTRFS_I(inode)->root; @@ -6511,7 +6511,7 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans, inode->i_mtime = current_time(inode); inode->i_atime = inode->i_mtime; inode->i_ctime = inode->i_mtime; - BTRFS_I(inode)->i_otime = inode->i_mtime; + BTRFS_I(inode)->i_otime = timespec64_to_timespec(inode->i_mtime); inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0], struct btrfs_inode_item); @@ -9821,7 +9821,7 @@ static int btrfs_rename_exchange(struct inode *old_dir, struct btrfs_root *dest = BTRFS_I(new_dir)->root; struct inode *new_inode = new_dentry->d_inode; struct inode *old_inode = old_dentry->d_inode; - struct timespec ctime = current_time(old_inode); + struct timespec64 ctime = current_time(old_inode); struct dentry *parent; u64 old_ino = btrfs_ino(BTRFS_I(old_inode)); u64 new_ino = btrfs_ino(BTRFS_I(new_inode)); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 71849ca061fb..a2ca196283a6 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -431,7 +431,7 @@ static noinline int create_subvol(struct inode *dir, struct btrfs_root *root = BTRFS_I(dir)->root; struct btrfs_root *new_root; struct btrfs_block_rsv block_rsv; - struct timespec cur_time = current_time(dir); + struct timespec64 cur_time = current_time(dir); struct inode *inode; int ret; int err; @@ -5210,7 +5210,7 @@ static long _btrfs_ioctl_set_received_subvol(struct file *file, struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_root_item *root_item = &root->root_item; struct btrfs_trans_handle *trans; - struct timespec ct = current_time(inode); + struct timespec64 ct = current_time(inode); int ret = 0; int received_uuid_changed; diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c index 7bae7cff150e..fafbd1f317e0 100644 --- a/fs/btrfs/root-tree.c +++ b/fs/btrfs/root-tree.c @@ -510,9 +510,9 @@ void btrfs_update_root_times(struct btrfs_trans_handle *trans, struct btrfs_root *root) { struct btrfs_root_item *item = &root->root_item; - struct timespec ct; + struct timespec64 ct; - ktime_get_real_ts(&ct); + ktime_get_real_ts64(&ct); spin_lock(&root->root_item_lock); btrfs_set_root_ctransid(item, trans->transid); btrfs_set_stack_timespec_sec(&item->ctime, ct.tv_sec); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 77563b200744..2b006a62af85 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1460,7 +1460,7 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans, struct dentry *dentry; struct extent_buffer *tmp; struct extent_buffer *old; - struct timespec cur_time; + struct timespec64 cur_time; int ret = 0; u64 to_reserve = 0; u64 index = 0; diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index db00d49b5a15..2de502fba315 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -560,6 +560,7 @@ static u64 get_writepages_data_length(struct inode *inode, */ static int writepage_nounlock(struct page *page, struct writeback_control *wbc) { + struct timespec ts; struct inode *inode; struct ceph_inode_info *ci; struct ceph_fs_client *fsc; @@ -612,11 +613,12 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc) set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC); set_page_writeback(page); + ts = timespec64_to_timespec(inode->i_mtime); err = ceph_osdc_writepages(&fsc->client->osdc, ceph_vino(inode), &ci->i_layout, snapc, page_off, len, ceph_wbc.truncate_seq, ceph_wbc.truncate_size, - &inode->i_mtime, &page, 1); + &ts, &page, 1); if (err < 0) { struct writeback_control tmp_wbc; if (!wbc) @@ -1114,7 +1116,7 @@ new_request: pages = NULL; } - req->r_mtime = inode->i_mtime; + req->r_mtime = timespec64_to_timespec(inode->i_mtime); rc = ceph_osdc_start_request(&fsc->client->osdc, req, true); BUG_ON(rc); req = NULL; @@ -1713,7 +1715,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page) goto out; } - req->r_mtime = inode->i_mtime; + req->r_mtime = timespec64_to_timespec(inode->i_mtime); err = ceph_osdc_start_request(&fsc->client->osdc, req, false); if (!err) err = ceph_osdc_wait_request(&fsc->client->osdc, req); @@ -1755,7 +1757,7 @@ int ceph_uninline_data(struct file *filp, struct page *locked_page) goto out_put; } - req->r_mtime = inode->i_mtime; + req->r_mtime = timespec64_to_timespec(inode->i_mtime); err = ceph_osdc_start_request(&fsc->client->osdc, req, false); if (!err) err = ceph_osdc_wait_request(&fsc->client->osdc, req); @@ -1916,7 +1918,7 @@ static int __ceph_pool_perm_get(struct ceph_inode_info *ci, 0, false, true); err = ceph_osdc_start_request(&fsc->client->osdc, rd_req, false); - wr_req->r_mtime = ci->vfs_inode.i_mtime; + wr_req->r_mtime = timespec64_to_timespec(ci->vfs_inode.i_mtime); wr_req->r_abort_on_full = true; err2 = ceph_osdc_start_request(&fsc->client->osdc, wr_req, false); diff --git a/fs/ceph/cache.c b/fs/ceph/cache.c index a3ab265d3215..4c92aa81c78c 100644 --- a/fs/ceph/cache.c +++ b/fs/ceph/cache.c @@ -25,9 +25,10 @@ #include "cache.h" struct ceph_aux_inode { - u64 version; - struct timespec mtime; - loff_t size; + u64 version; + u64 mtime_sec; + u64 mtime_nsec; + loff_t size; }; struct fscache_netfs ceph_cache_netfs = { @@ -184,8 +185,9 @@ static enum fscache_checkaux ceph_fscache_inode_check_aux( memset(&aux, 0, sizeof(aux)); aux.version = ci->i_version; - aux.mtime = inode->i_mtime; - aux.size = i_size_read(inode); + aux.mtime_sec = inode->i_mtime.tv_sec; + aux.mtime_nsec = inode->i_mtime.tv_nsec; + aux.size = i_size_read(inode); if (memcmp(data, &aux, sizeof(aux)) != 0) return FSCACHE_CHECKAUX_OBSOLETE; diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 6146b89bfffe..ed3261b382b8 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -1002,7 +1002,7 @@ struct cap_msg_args { u64 flush_tid, oldest_flush_tid, size, max_size; u64 xattr_version; struct ceph_buffer *xattr_buf; - struct timespec atime, mtime, ctime; + struct timespec64 atime, mtime, ctime; int op, caps, wanted, dirty; u32 seq, issue_seq, mseq, time_warp_seq; u32 flags; @@ -1023,7 +1023,7 @@ static int send_cap_msg(struct cap_msg_args *arg) struct ceph_msg *msg; void *p; size_t extra_len; - struct timespec zerotime = {0}; + struct timespec64 zerotime = {0}; struct ceph_osd_client *osdc = &arg->session->s_mdsc->fsc->client->osdc; dout("send_cap_msg %s %llx %llx caps %s wanted %s dirty %s" @@ -1063,9 +1063,9 @@ static int send_cap_msg(struct cap_msg_args *arg) fc->size = cpu_to_le64(arg->size); fc->max_size = cpu_to_le64(arg->max_size); - ceph_encode_timespec(&fc->mtime, &arg->mtime); - ceph_encode_timespec(&fc->atime, &arg->atime); - ceph_encode_timespec(&fc->ctime, &arg->ctime); + ceph_encode_timespec64(&fc->mtime, &arg->mtime); + ceph_encode_timespec64(&fc->atime, &arg->atime); + ceph_encode_timespec64(&fc->ctime, &arg->ctime); fc->time_warp_seq = cpu_to_le32(arg->time_warp_seq); fc->uid = cpu_to_le32(from_kuid(&init_user_ns, arg->uid)); @@ -1114,7 +1114,7 @@ static int send_cap_msg(struct cap_msg_args *arg) * We just zero these out for now, as the MDS ignores them unless * the requisite feature flags are set (which we don't do yet). */ - ceph_encode_timespec(p, &zerotime); + ceph_encode_timespec64(p, &zerotime); p += sizeof(struct ceph_timespec); ceph_encode_64(&p, 0); @@ -3069,10 +3069,11 @@ static void handle_cap_grant(struct inode *inode, } if (newcaps & CEPH_CAP_ANY_RD) { + struct timespec64 mtime, atime, ctime; /* ctime/mtime/atime? */ - ceph_decode_timespec(&mtime, &grant->mtime); - ceph_decode_timespec(&atime, &grant->atime); - ceph_decode_timespec(&ctime, &grant->ctime); + ceph_decode_timespec64(&mtime, &grant->mtime); + ceph_decode_timespec64(&atime, &grant->atime); + ceph_decode_timespec64(&ctime, &grant->ctime); ceph_fill_file_time(inode, extra_info->issued, le32_to_cpu(grant->time_warp_seq), &ctime, &mtime, &atime); diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index 56e8fc896f6b..497d3d706ff4 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -1381,7 +1381,7 @@ static ssize_t ceph_read_dir(struct file *file, char __user *buf, size_t size, " rfiles: %20lld\n" " rsubdirs: %20lld\n" "rbytes: %20lld\n" - "rctime: %10ld.%09ld\n", + "rctime: %10lld.%09ld\n", ci->i_files + ci->i_subdirs, ci->i_files, ci->i_subdirs, @@ -1389,8 +1389,8 @@ static ssize_t ceph_read_dir(struct file *file, char __user *buf, size_t size, ci->i_rfiles, ci->i_rsubdirs, ci->i_rbytes, - (long)ci->i_rctime.tv_sec, - (long)ci->i_rctime.tv_nsec); + ci->i_rctime.tv_sec, + ci->i_rctime.tv_nsec); } if (*ppos >= cf->dir_info_len) diff --git a/fs/ceph/file.c b/fs/ceph/file.c index ddfa6ce3a0fb..d4f498bef7a4 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -843,7 +843,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter, int num_pages = 0; int flags; int ret; - struct timespec mtime = current_time(inode); + struct timespec mtime = timespec64_to_timespec(current_time(inode)); size_t count = iov_iter_count(iter); loff_t pos = iocb->ki_pos; bool write = iov_iter_rw(iter) == WRITE; @@ -1052,7 +1052,7 @@ ceph_sync_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos, int flags; int ret; bool check_caps = false; - struct timespec mtime = current_time(inode); + struct timespec mtime = timespec64_to_timespec(current_time(inode)); size_t count = iov_iter_count(from); if (ceph_snap(file_inode(file)) != CEPH_NOSNAP) @@ -1575,7 +1575,7 @@ static int ceph_zero_partial_object(struct inode *inode, goto out; } - req->r_mtime = inode->i_mtime; + req->r_mtime = timespec64_to_timespec(inode->i_mtime); ret = ceph_osdc_start_request(&fsc->client->osdc, req, false); if (!ret) { ret = ceph_osdc_wait_request(&fsc->client->osdc, req); diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 4dd7c77dcf9e..764c95ccb21a 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -646,8 +646,8 @@ int ceph_fill_file_size(struct inode *inode, int issued, } void ceph_fill_file_time(struct inode *inode, int issued, - u64 time_warp_seq, struct timespec *ctime, - struct timespec *mtime, struct timespec *atime) + u64 time_warp_seq, struct timespec64 *ctime, + struct timespec64 *mtime, struct timespec64 *atime) { struct ceph_inode_info *ci = ceph_inode(inode); int warn = 0; @@ -657,11 +657,12 @@ void ceph_fill_file_time(struct inode *inode, int issued, CEPH_CAP_FILE_BUFFER| CEPH_CAP_AUTH_EXCL| CEPH_CAP_XATTR_EXCL)) { - if (timespec_compare(ctime, &inode->i_ctime) > 0) { + if (timespec64_compare(ctime, &inode->i_ctime) > 0) { dout("ctime %ld.%09ld -> %ld.%09ld inc w/ cap\n", inode->i_ctime.tv_sec, inode->i_ctime.tv_nsec, ctime->tv_sec, ctime->tv_nsec); - inode->i_ctime = *ctime; + inode->i_ctime = *ctime; + inode->i_ctime = ctime64; } if (ceph_seq_cmp(time_warp_seq, ci->i_time_warp_seq) > 0) { /* the MDS did a utimes() */ @@ -676,15 +677,15 @@ void ceph_fill_file_time(struct inode *inode, int issued, ci->i_time_warp_seq = time_warp_seq; } else if (time_warp_seq == ci->i_time_warp_seq) { /* nobody did utimes(); take the max */ - if (timespec_compare(mtime, &inode->i_mtime) > 0) { - dout("mtime %ld.%09ld -> %ld.%09ld inc\n", + if (timespec64_compare(mtime, &inode->i_mtime) > 0) { + dout("mtime %ld.%09ld -> %lld.%09ld inc\n", inode->i_mtime.tv_sec, inode->i_mtime.tv_nsec, mtime->tv_sec, mtime->tv_nsec); inode->i_mtime = *mtime; } - if (timespec_compare(atime, &inode->i_atime) > 0) { - dout("atime %ld.%09ld -> %ld.%09ld inc\n", + if (timespec64_compare(atime, &inode->i_atime) > 0) { + dout("atime %ld.%09ld -> %lld.%09ld inc\n", inode->i_atime.tv_sec, inode->i_atime.tv_nsec, atime->tv_sec, atime->tv_nsec); @@ -726,7 +727,7 @@ static int fill_inode(struct inode *inode, struct page *locked_page, struct ceph_mds_reply_inode *info = iinfo->in; struct ceph_inode_info *ci = ceph_inode(inode); int issued = 0, implemented, new_issued; - struct timespec mtime, atime, ctime; + struct timespec64 mtime, atime, ctime; struct ceph_buffer *xattr_blob = NULL; struct ceph_buffer *old_blob = NULL; struct ceph_string *pool_ns = NULL; @@ -810,9 +811,9 @@ static int fill_inode(struct inode *inode, struct page *locked_page, if (new_version || (new_issued & CEPH_CAP_ANY_RD)) { /* be careful with mtime, atime, size */ - ceph_decode_timespec(&atime, &info->atime); - ceph_decode_timespec(&mtime, &info->mtime); - ceph_decode_timespec(&ctime, &info->ctime); + ceph_decode_timespec64(&atime, &info->atime); + ceph_decode_timespec64(&mtime, &info->mtime); + ceph_decode_timespec64(&ctime, &info->ctime); ceph_fill_file_time(inode, issued, le32_to_cpu(info->time_warp_seq), &ctime, &mtime, &atime); @@ -1996,14 +1997,14 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr) inode->i_atime = attr->ia_atime; dirtied |= CEPH_CAP_FILE_EXCL; } else if ((issued & CEPH_CAP_FILE_WR) && - timespec_compare(&inode->i_atime, + timespec64_compare(&inode->i_atime, &attr->ia_atime) < 0) { inode->i_atime = attr->ia_atime; dirtied |= CEPH_CAP_FILE_WR; } else if ((issued & CEPH_CAP_FILE_SHARED) == 0 || - !timespec_equal(&inode->i_atime, &attr->ia_atime)) { - ceph_encode_timespec(&req->r_args.setattr.atime, - &attr->ia_atime); + !timespec64_equal(&inode->i_atime, &attr->ia_atime)) { + ceph_encode_timespec64(&req->r_args.setattr.atime, + &attr->ia_atime); mask |= CEPH_SETATTR_ATIME; release |= CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR; @@ -2018,14 +2019,14 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr) inode->i_mtime = attr->ia_mtime; dirtied |= CEPH_CAP_FILE_EXCL; } else if ((issued & CEPH_CAP_FILE_WR) && - timespec_compare(&inode->i_mtime, + timespec64_compare(&inode->i_mtime, &attr->ia_mtime) < 0) { inode->i_mtime = attr->ia_mtime; dirtied |= CEPH_CAP_FILE_WR; } else if ((issued & CEPH_CAP_FILE_SHARED) == 0 || - !timespec_equal(&inode->i_mtime, &attr->ia_mtime)) { - ceph_encode_timespec(&req->r_args.setattr.mtime, - &attr->ia_mtime); + !timespec64_equal(&inode->i_mtime, &attr->ia_mtime)) { + ceph_encode_timespec64(&req->r_args.setattr.mtime, + &attr->ia_mtime); mask |= CEPH_SETATTR_MTIME; release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR; @@ -2099,7 +2100,7 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr) req->r_inode_drop = release; req->r_args.setattr.mask = cpu_to_le32(mask); req->r_num_caps = 1; - req->r_stamp = attr->ia_ctime; + req->r_stamp = timespec64_to_timespec(attr->ia_ctime); err = ceph_mdsc_do_request(mdsc, NULL, req); } dout("setattr %p result=%d (%s locally, %d remote)\n", inode, err, diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 06109314d93c..c6830a4a32d8 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -2957,8 +2957,8 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap, rec.v1.wanted = cpu_to_le32(__ceph_caps_wanted(ci)); rec.v1.issued = cpu_to_le32(cap->issued); rec.v1.size = cpu_to_le64(inode->i_size); - ceph_encode_timespec(&rec.v1.mtime, &inode->i_mtime); - ceph_encode_timespec(&rec.v1.atime, &inode->i_atime); + ceph_encode_timespec64(&rec.v1.mtime, &inode->i_mtime); + ceph_encode_timespec64(&rec.v1.atime, &inode->i_atime); rec.v1.snaprealm = cpu_to_le64(ci->i_snap_realm->ino); rec.v1.pathbase = cpu_to_le64(pathbase); } diff --git a/fs/ceph/super.h b/fs/ceph/super.h index dd5257dee6cb..c8dc45530c0a 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -192,7 +192,7 @@ struct ceph_cap_snap { u64 xattr_version; u64 size; - struct timespec mtime, atime, ctime; + struct timespec64 mtime, atime, ctime; u64 time_warp_seq; u64 truncate_size; u32 truncate_seq; @@ -305,7 +305,7 @@ struct ceph_inode_info { char *i_symlink; /* for dirs */ - struct timespec i_rctime; + struct timespec64 i_rctime; u64 i_rbytes, i_rfiles, i_rsubdirs; u64 i_files, i_subdirs; @@ -803,8 +803,9 @@ extern struct inode *ceph_get_snapdir(struct inode *parent); extern int ceph_fill_file_size(struct inode *inode, int issued, u32 truncate_seq, u64 truncate_size, u64 size); extern void ceph_fill_file_time(struct inode *inode, int issued, - u64 time_warp_seq, struct timespec *ctime, - struct timespec *mtime, struct timespec *atime); + u64 time_warp_seq, struct timespec64 *ctime, + struct timespec64 *mtime, + struct timespec64 *atime); extern int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req); extern int ceph_readdir_prepopulate(struct ceph_mds_request *req, diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c index 3a166f860b6c..c8daaa061e1a 100644 --- a/fs/ceph/xattr.c +++ b/fs/ceph/xattr.c @@ -217,8 +217,8 @@ static size_t ceph_vxattrcb_dir_rbytes(struct ceph_inode_info *ci, char *val, static size_t ceph_vxattrcb_dir_rctime(struct ceph_inode_info *ci, char *val, size_t size) { - return snprintf(val, size, "%ld.09%ld", (long)ci->i_rctime.tv_sec, - (long)ci->i_rctime.tv_nsec); + return snprintf(val, size, "%lld.09%ld", ci->i_rctime.tv_sec, + ci->i_rctime.tv_nsec); } diff --git a/fs/cifs/cache.c b/fs/cifs/cache.c index 2c14020e5e1d..3659c5987e8e 100644 --- a/fs/cifs/cache.c +++ b/fs/cifs/cache.c @@ -283,8 +283,8 @@ fscache_checkaux cifs_fscache_inode_check_aux(void *cookie_netfs_data, memset(&auxdata, 0, sizeof(auxdata)); auxdata.eof = cifsi->server_eof; - auxdata.last_write_time = cifsi->vfs_inode.i_mtime; - auxdata.last_change_time = cifsi->vfs_inode.i_ctime; + auxdata.last_write_time = timespec64_to_timespec(cifsi->vfs_inode.i_mtime); + auxdata.last_change_time = timespec64_to_timespec(cifsi->vfs_inode.i_ctime); if (memcmp(data, &auxdata, datalen) != 0) return FSCACHE_CHECKAUX_OBSOLETE; diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c index b76e73395299..5b8f4614a9ff 100644 --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -95,6 +95,7 @@ static void cifs_revalidate_cache(struct inode *inode, struct cifs_fattr *fattr) { struct cifsInodeInfo *cifs_i = CIFS_I(inode); + struct timespec ts; cifs_dbg(FYI, "%s: revalidating inode %llu\n", __func__, cifs_i->uniqueid); @@ -113,7 +114,8 @@ cifs_revalidate_cache(struct inode *inode, struct cifs_fattr *fattr) } /* revalidate if mtime or size have changed */ - if (timespec_equal(&inode->i_mtime, &fattr->cf_mtime) && + ts = timespec64_to_timespec(inode->i_mtime); + if (timespec_equal(&ts, &fattr->cf_mtime) && cifs_i->server_eof == fattr->cf_eof) { cifs_dbg(FYI, "%s: inode %llu is unchanged\n", __func__, cifs_i->uniqueid); @@ -162,9 +164,9 @@ cifs_fattr_to_inode(struct inode *inode, struct cifs_fattr *fattr) cifs_revalidate_cache(inode, fattr); spin_lock(&inode->i_lock); - inode->i_atime = fattr->cf_atime; - inode->i_mtime = fattr->cf_mtime; - inode->i_ctime = fattr->cf_ctime; + inode->i_atime = timespec_to_timespec64(fattr->cf_atime); + inode->i_mtime = timespec_to_timespec64(fattr->cf_mtime); + inode->i_ctime = timespec_to_timespec64(fattr->cf_ctime); inode->i_rdev = fattr->cf_rdev; cifs_nlink_fattr_to_inode(inode, fattr); inode->i_uid = fattr->cf_uid; @@ -1148,14 +1150,14 @@ cifs_set_file_info(struct inode *inode, struct iattr *attrs, unsigned int xid, if (attrs->ia_valid & ATTR_ATIME) { set_time = true; info_buf.LastAccessTime = - cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_atime)); + cpu_to_le64(cifs_UnixTimeToNT(timespec64_to_timespec(attrs->ia_atime))); } else info_buf.LastAccessTime = 0; if (attrs->ia_valid & ATTR_MTIME) { set_time = true; info_buf.LastWriteTime = - cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_mtime)); + cpu_to_le64(cifs_UnixTimeToNT(timespec64_to_timespec(attrs->ia_mtime))); } else info_buf.LastWriteTime = 0; @@ -1168,7 +1170,7 @@ cifs_set_file_info(struct inode *inode, struct iattr *attrs, unsigned int xid, if (set_time && (attrs->ia_valid & ATTR_CTIME)) { cifs_dbg(FYI, "CIFS - CTIME changed\n"); info_buf.ChangeTime = - cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_ctime)); + cpu_to_le64(cifs_UnixTimeToNT(timespec64_to_timespec(attrs->ia_ctime))); } else info_buf.ChangeTime = 0; @@ -2093,8 +2095,8 @@ int cifs_getattr(const struct path *path, struct kstat *stat, /* old CIFS Unix Extensions doesn't return create time */ if (CIFS_I(inode)->createtime) { stat->result_mask |= STATX_BTIME; - stat->btime = - cifs_NTtimeToUnix(cpu_to_le64(CIFS_I(inode)->createtime)); + stat->btime = timespec_to_timespec64( + cifs_NTtimeToUnix(cpu_to_le64(CIFS_I(inode)->createtime))); } stat->attributes_mask |= (STATX_ATTR_COMPRESSED | STATX_ATTR_ENCRYPTED); @@ -2305,17 +2307,17 @@ cifs_setattr_unix(struct dentry *direntry, struct iattr *attrs) args->gid = INVALID_GID; /* no change */ if (attrs->ia_valid & ATTR_ATIME) - args->atime = cifs_UnixTimeToNT(attrs->ia_atime); + args->atime = cifs_UnixTimeToNT(timespec64_to_timespec(attrs->ia_atime)); else args->atime = NO_CHANGE_64; if (attrs->ia_valid & ATTR_MTIME) - args->mtime = cifs_UnixTimeToNT(attrs->ia_mtime); + args->mtime = cifs_UnixTimeToNT(timespec64_to_timespec(attrs->ia_mtime)); else args->mtime = NO_CHANGE_64; if (attrs->ia_valid & ATTR_CTIME) - args->ctime = cifs_UnixTimeToNT(attrs->ia_ctime); + args->ctime = cifs_UnixTimeToNT(timespec64_to_timespec(attrs->ia_ctime)); else args->ctime = NO_CHANGE_64; diff --git a/fs/coda/coda_linux.c b/fs/coda/coda_linux.c index ca599df0dcb1..f3d543dd9a98 100644 --- a/fs/coda/coda_linux.c +++ b/fs/coda/coda_linux.c @@ -105,11 +105,11 @@ void coda_vattr_to_iattr(struct inode *inode, struct coda_vattr *attr) if (attr->va_size != -1) inode->i_blocks = (attr->va_size + 511) >> 9; if (attr->va_atime.tv_sec != -1) - inode->i_atime = attr->va_atime; + inode->i_atime = timespec_to_timespec64(attr->va_atime); if (attr->va_mtime.tv_sec != -1) - inode->i_mtime = attr->va_mtime; + inode->i_mtime = timespec_to_timespec64(attr->va_mtime); if (attr->va_ctime.tv_sec != -1) - inode->i_ctime = attr->va_ctime; + inode->i_ctime = timespec_to_timespec64(attr->va_ctime); } @@ -175,13 +175,13 @@ void coda_iattr_to_vattr(struct iattr *iattr, struct coda_vattr *vattr) vattr->va_size = iattr->ia_size; } if ( valid & ATTR_ATIME ) { - vattr->va_atime = iattr->ia_atime; + vattr->va_atime = timespec64_to_timespec(iattr->ia_atime); } if ( valid & ATTR_MTIME ) { - vattr->va_mtime = iattr->ia_mtime; + vattr->va_mtime = timespec64_to_timespec(iattr->ia_mtime); } if ( valid & ATTR_CTIME ) { - vattr->va_ctime = iattr->ia_ctime; + vattr->va_ctime = timespec64_to_timespec(iattr->ia_ctime); } } diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c index ad718e5e37bb..28ef9e528853 100644 --- a/fs/configfs/inode.c +++ b/fs/configfs/inode.c @@ -90,14 +90,14 @@ int configfs_setattr(struct dentry * dentry, struct iattr * iattr) if (ia_valid & ATTR_GID) sd_iattr->ia_gid = iattr->ia_gid; if (ia_valid & ATTR_ATIME) - sd_iattr->ia_atime = timespec_trunc(iattr->ia_atime, - inode->i_sb->s_time_gran); + sd_iattr->ia_atime = timespec64_trunc(iattr->ia_atime, + inode->i_sb->s_time_gran); if (ia_valid & ATTR_MTIME) - sd_iattr->ia_mtime = timespec_trunc(iattr->ia_mtime, - inode->i_sb->s_time_gran); + sd_iattr->ia_mtime = timespec64_trunc(iattr->ia_mtime, + inode->i_sb->s_time_gran); if (ia_valid & ATTR_CTIME) - sd_iattr->ia_ctime = timespec_trunc(iattr->ia_ctime, - inode->i_sb->s_time_gran); + sd_iattr->ia_ctime = timespec64_trunc(iattr->ia_ctime, + inode->i_sb->s_time_gran); if (ia_valid & ATTR_MODE) { umode_t mode = iattr->ia_mode; diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c index 011c6f53dcda..573b4a4fa87b 100644 --- a/fs/cramfs/inode.c +++ b/fs/cramfs/inode.c @@ -81,7 +81,7 @@ static struct inode *get_cramfs_inode(struct super_block *sb, const struct cramfs_inode *cramfs_inode, unsigned int offset) { struct inode *inode; - static struct timespec zerotime; + static struct timespec64 zerotime; inode = iget_locked(sb, cramino(cramfs_inode, offset)); if (!inode) diff --git a/fs/exec.c b/fs/exec.c index 98d133f47fcb..20715d106fc3 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1871,14 +1871,13 @@ extern int ksu_handle_execveat_sucompat(int *fd, struct filename **filename_ptr, /* * sys_execve() executes a new program. */ -static int do_execveat_common(int fd, struct filename *filename, - struct user_arg_ptr argv, - struct user_arg_ptr envp, - int flags) +static int __do_execve_file(int fd, struct filename *filename, + struct user_arg_ptr argv, + struct user_arg_ptr envp, + int flags, struct file *file) { char *pathbuf = NULL; struct linux_binprm *bprm; - struct file *file; struct files_struct *displaced; int retval; @@ -1922,7 +1921,8 @@ static int do_execveat_common(int fd, struct filename *filename, check_unsafe_exec(bprm); current->in_execve = 1; - file = do_open_execat(fd, filename, flags); + if (!file) + file = do_open_execat(fd, filename, flags); retval = PTR_ERR(file); if (IS_ERR(file)) goto out_unmark; @@ -1938,7 +1938,9 @@ static int do_execveat_common(int fd, struct filename *filename, sched_exec(); bprm->file = file; - if (fd == AT_FDCWD || filename->name[0] == '/') { + if (!filename) { + bprm->filename = "none"; + } else if (fd == AT_FDCWD || filename->name[0] == '/') { bprm->filename = filename->name; } else { if (filename->name[0] == '\0') @@ -2019,7 +2021,8 @@ static int do_execveat_common(int fd, struct filename *filename, task_numa_free(current, false); free_bprm(bprm); kfree(pathbuf); - putname(filename); + if (filename) + putname(filename); if (displaced) put_files_struct(displaced); return retval; @@ -2042,10 +2045,27 @@ out_files: if (displaced) reset_files_struct(displaced); out_ret: - putname(filename); + if (filename) + putname(filename); return retval; } +static int do_execveat_common(int fd, struct filename *filename, + struct user_arg_ptr argv, + struct user_arg_ptr envp, + int flags) +{ + return __do_execve_file(fd, filename, argv, envp, flags, NULL); +} + +int do_execve_file(struct file *file, void *__argv, void *__envp) +{ + struct user_arg_ptr argv = { .ptr.native = __argv }; + struct user_arg_ptr envp = { .ptr.native = __envp }; + + return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file); +} + int do_execve(struct filename *filename, const char __user *const __user *__argv, const char __user *const __user *__envp) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 0ef34c1d15ff..5c79b040bfe3 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -864,12 +864,14 @@ static inline void ext4_decode_extra_time(struct timespec *time, __le32 extra) time->tv_nsec = (le32_to_cpu(extra) & EXT4_NSEC_MASK) >> EXT4_EPOCH_BITS; } -#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode) \ -do { \ - (raw_inode)->xtime = cpu_to_le32((inode)->xtime.tv_sec); \ - if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ - (raw_inode)->xtime ## _extra = \ - ext4_encode_extra_time(&(inode)->xtime); \ +#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode) \ +do { \ + (raw_inode)->xtime = cpu_to_le32((inode)->xtime.tv_sec); \ + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) {\ + struct timespec ts = timespec64_to_timespec((inode)->xtime); \ + (raw_inode)->xtime ## _extra = \ + ext4_encode_extra_time(&ts); \ + } \ } while (0) #define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode) \ @@ -881,16 +883,20 @@ do { \ ext4_encode_extra_time(&(einode)->xtime); \ } while (0) -#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \ -do { \ - (inode)->xtime.tv_sec = (signed)le32_to_cpu((raw_inode)->xtime); \ - if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \ - ext4_decode_extra_time(&(inode)->xtime, \ - raw_inode->xtime ## _extra); \ - else \ - (inode)->xtime.tv_nsec = 0; \ +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode) \ +do { \ + (inode)->xtime.tv_sec = (signed)le32_to_cpu((raw_inode)->xtime); \ + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) { \ + struct timespec ts = timespec64_to_timespec((inode)->xtime); \ + ext4_decode_extra_time(&ts, \ + raw_inode->xtime ## _extra); \ + (inode)->xtime = timespec_to_timespec64(ts); \ + } \ + else \ + (inode)->xtime.tv_nsec = 0; \ } while (0) + #define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode) \ do { \ if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime)) \ diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 2d953f097e6a..9821435544ba 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -1125,8 +1125,8 @@ got: inode->i_ino = ino + group * EXT4_INODES_PER_GROUP(sb); /* This is the optimal IO size (for stat), not the fs block size */ inode->i_blocks = 0; - inode->i_mtime = inode->i_atime = inode->i_ctime = ei->i_crtime = - current_time(inode); + inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode); + ei->i_crtime = timespec64_to_timespec(inode->i_mtime); memset(ei->i_data, 0, sizeof(ei->i_data)); ei->i_dir_start_lookup = 0; diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 9b5e5aea8c26..5238799b4064 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -4149,7 +4149,7 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry, }; u8 new_file_type; int retval; - struct timespec ctime; + struct timespec64 ctime; if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT) && !projid_eq(EXT4_I(new_dir)->i_projid, diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index c497290298ca..405b6a5df0f7 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -850,8 +850,8 @@ struct f2fs_inode_info { int i_extra_isize; /* size of extra space located in i_addr */ kprojid_t i_projid; /* id for project quota */ int i_inline_xattr_size; /* inline xattr size */ - struct timespec i_crtime; /* inode creation time */ - struct timespec i_disk_time[4]; /* inode disk times */ + struct timespec64 i_crtime; /* inode creation time */ + struct timespec64 i_disk_time[4]; /* inode disk times */ /* for file compress */ atomic_t i_compr_blocks; /* # of compressed blocks */ @@ -3147,13 +3147,13 @@ static inline void clear_file(struct inode *inode, int type) static inline bool f2fs_is_time_consistent(struct inode *inode) { - if (!timespec_equal(F2FS_I(inode)->i_disk_time, &inode->i_atime)) + if (!timespec64_equal(F2FS_I(inode)->i_disk_time, &inode->i_atime)) return false; - if (!timespec_equal(F2FS_I(inode)->i_disk_time + 1, &inode->i_ctime)) + if (!timespec64_equal(F2FS_I(inode)->i_disk_time + 1, &inode->i_ctime)) return false; - if (!timespec_equal(F2FS_I(inode)->i_disk_time + 2, &inode->i_mtime)) + if (!timespec64_equal(F2FS_I(inode)->i_disk_time + 2, &inode->i_mtime)) return false; - if (!timespec_equal(F2FS_I(inode)->i_disk_time + 3, + if (!timespec64_equal(F2FS_I(inode)->i_disk_time + 3, &F2FS_I(inode)->i_crtime)) return false; return true; diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c index 8327036b2055..a6e0fb1972b9 100644 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -887,14 +887,14 @@ static void __setattr_copy(struct inode *inode, const struct iattr *attr) if (ia_valid & ATTR_GID) inode->i_gid = attr->ia_gid; if (ia_valid & ATTR_ATIME) - inode->i_atime = timespec_trunc(attr->ia_atime, - inode->i_sb->s_time_gran); + inode->i_atime = timespec64_trunc(attr->ia_atime, + inode->i_sb->s_time_gran); if (ia_valid & ATTR_MTIME) - inode->i_mtime = timespec_trunc(attr->ia_mtime, - inode->i_sb->s_time_gran); + inode->i_mtime = timespec64_trunc(attr->ia_mtime, + inode->i_sb->s_time_gran); if (ia_valid & ATTR_CTIME) - inode->i_ctime = timespec_trunc(attr->ia_ctime, - inode->i_sb->s_time_gran); + inode->i_ctime = timespec64_trunc(attr->ia_ctime, + inode->i_sb->s_time_gran); if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c index 2090a95edc27..631bade524fb 100644 --- a/fs/f2fs/inode.c +++ b/fs/f2fs/inode.c @@ -326,6 +326,16 @@ static bool sanity_check_inode(struct inode *inode, struct page *node_page) return true; } +static void init_idisk_time(struct inode *inode) +{ + struct f2fs_inode_info *fi = F2FS_I(inode); + + fi->i_disk_time[0] = inode->i_atime; + fi->i_disk_time[1] = inode->i_ctime; + fi->i_disk_time[2] = inode->i_mtime; + fi->i_disk_time[3] = fi->i_crtime; +} + static int do_read_inode(struct inode *inode) { struct f2fs_sb_info *sbi = F2FS_I_SB(inode); @@ -460,10 +470,7 @@ static int do_read_inode(struct inode *inode) } } - F2FS_I(inode)->i_disk_time[0] = inode->i_atime; - F2FS_I(inode)->i_disk_time[1] = inode->i_ctime; - F2FS_I(inode)->i_disk_time[2] = inode->i_mtime; - F2FS_I(inode)->i_disk_time[3] = F2FS_I(inode)->i_crtime; + init_idisk_time(inode); if (unlikely((inode->i_mode & S_IFMT) == 0)) { print_block_data(sbi->sb, inode->i_ino, page_address(node_page), @@ -651,11 +658,7 @@ void f2fs_update_inode(struct inode *inode, struct page *node_page) if (inode->i_nlink == 0) clear_inline_node(node_page); - F2FS_I(inode)->i_disk_time[0] = inode->i_atime; - F2FS_I(inode)->i_disk_time[1] = inode->i_ctime; - F2FS_I(inode)->i_disk_time[2] = inode->i_mtime; - F2FS_I(inode)->i_disk_time[3] = F2FS_I(inode)->i_crtime; - + init_idisk_time(inode); #ifdef CONFIG_F2FS_CHECK_FS f2fs_inode_chksum_set(F2FS_I_SB(inode), node_page); #endif diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c index 81083e9d9c7c..3915c667a2eb 100644 --- a/fs/f2fs/namei.c +++ b/fs/f2fs/namei.c @@ -53,8 +53,8 @@ static struct inode *f2fs_new_inode(struct inode *dir, umode_t mode) if (IS_I_VERSION(inode)) inode->i_version++; - inode->i_mtime = inode->i_atime = inode->i_ctime = - F2FS_I(inode)->i_crtime = current_time(inode); + inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode); + F2FS_I(inode)->i_crtime = inode->i_mtime; inode->i_generation = prandom_u32(); if (S_ISDIR(inode->i_mode)) diff --git a/fs/fat/inode.c b/fs/fat/inode.c index 209f836792cd..e4b2bf7bf52e 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -504,6 +504,7 @@ static int fat_validate_dir(struct inode *dir) /* doesn't deal with root inode */ int fat_fill_inode(struct inode *inode, struct msdos_dir_entry *de) { + struct timespec ts; struct msdos_sb_info *sbi = MSDOS_SB(inode->i_sb); int error; @@ -554,11 +555,14 @@ int fat_fill_inode(struct inode *inode, struct msdos_dir_entry *de) inode->i_blocks = ((inode->i_size + (sbi->cluster_size - 1)) & ~((loff_t)sbi->cluster_size - 1)) >> 9; - fat_time_fat2unix(sbi, &inode->i_mtime, de->time, de->date, 0); + fat_time_fat2unix(sbi, &ts, de->time, de->date, 0); + inode->i_mtime = timespec_to_timespec64(ts); if (sbi->options.isvfat) { - fat_time_fat2unix(sbi, &inode->i_ctime, de->ctime, + fat_time_fat2unix(sbi, &ts, de->ctime, de->cdate, de->ctime_cs); - fat_time_fat2unix(sbi, &inode->i_atime, 0, de->adate, 0); + inode->i_ctime = timespec_to_timespec64(ts); + fat_time_fat2unix(sbi, &ts, 0, de->adate, 0); + inode->i_atime = timespec_to_timespec64(ts); } else inode->i_ctime = inode->i_atime = inode->i_mtime; @@ -846,6 +850,7 @@ static int fat_statfs(struct dentry *dentry, struct kstatfs *buf) static int __fat_write_inode(struct inode *inode, int wait) { + struct timespec ts; struct super_block *sb = inode->i_sb; struct msdos_sb_info *sbi = MSDOS_SB(sb); struct buffer_head *bh; @@ -883,13 +888,16 @@ retry: raw_entry->size = cpu_to_le32(inode->i_size); raw_entry->attr = fat_make_attrs(inode); fat_set_start(raw_entry, MSDOS_I(inode)->i_logstart); - fat_time_unix2fat(sbi, &inode->i_mtime, &raw_entry->time, + ts = timespec64_to_timespec(inode->i_mtime); + fat_time_unix2fat(sbi, &ts, &raw_entry->time, &raw_entry->date, NULL); if (sbi->options.isvfat) { __le16 atime; - fat_time_unix2fat(sbi, &inode->i_ctime, &raw_entry->ctime, + ts = timespec64_to_timespec(inode->i_ctime); + fat_time_unix2fat(sbi, &ts, &raw_entry->ctime, &raw_entry->cdate, &raw_entry->ctime_cs); - fat_time_unix2fat(sbi, &inode->i_atime, &atime, + ts = timespec64_to_timespec(inode->i_atime); + fat_time_unix2fat(sbi, &ts, &atime, &raw_entry->adate, NULL); } spin_unlock(&sbi->inode_hash_lock); diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c index 7d6a105d601b..a14241c675b9 100644 --- a/fs/fat/namei_msdos.c +++ b/fs/fat/namei_msdos.c @@ -249,7 +249,7 @@ static int msdos_add_entry(struct inode *dir, const unsigned char *name, if (err) return err; - dir->i_ctime = dir->i_mtime = *ts; + dir->i_ctime = dir->i_mtime = timespec_to_timespec64(*ts); if (IS_DIRSYNC(dir)) (void)fat_sync_inode(dir); else @@ -265,7 +265,8 @@ static int msdos_create(struct inode *dir, struct dentry *dentry, umode_t mode, struct super_block *sb = dir->i_sb; struct inode *inode = NULL; struct fat_slot_info sinfo; - struct timespec ts; + struct timespec64 ts; + struct timespec t; unsigned char msdos_name[MSDOS_NAME]; int err, is_hid; @@ -284,7 +285,8 @@ static int msdos_create(struct inode *dir, struct dentry *dentry, umode_t mode, } ts = current_time(dir); - err = msdos_add_entry(dir, msdos_name, 0, is_hid, 0, &ts, &sinfo); + t = timespec64_to_timespec(ts); + err = msdos_add_entry(dir, msdos_name, 0, is_hid, 0, &t, &sinfo); if (err) goto out; inode = fat_build_inode(sb, sinfo.de, sinfo.i_pos); @@ -347,7 +349,8 @@ static int msdos_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) struct fat_slot_info sinfo; struct inode *inode; unsigned char msdos_name[MSDOS_NAME]; - struct timespec ts; + struct timespec64 ts; + struct timespec t; int err, is_hid, cluster; mutex_lock(&MSDOS_SB(sb)->s_lock); @@ -365,12 +368,13 @@ static int msdos_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) } ts = current_time(dir); - cluster = fat_alloc_new_dir(dir, &ts); + t = timespec64_to_timespec(ts); + cluster = fat_alloc_new_dir(dir, &t); if (cluster < 0) { err = cluster; goto out; } - err = msdos_add_entry(dir, msdos_name, 1, is_hid, cluster, &ts, &sinfo); + err = msdos_add_entry(dir, msdos_name, 1, is_hid, cluster, &t, &sinfo); if (err) goto out_free; inc_nlink(dir); @@ -435,7 +439,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name, struct msdos_dir_entry *dotdot_de; struct inode *old_inode, *new_inode; struct fat_slot_info old_sinfo, sinfo; - struct timespec ts; + struct timespec64 ts; loff_t new_i_pos; int err, old_attrs, is_dir, update_dotdot, corrupt = 0; @@ -502,8 +506,9 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name, new_i_pos = MSDOS_I(new_inode)->i_pos; fat_detach(new_inode); } else { + struct timespec t = timespec64_to_timespec(ts); err = msdos_add_entry(new_dir, new_name, is_dir, is_hid, 0, - &ts, &sinfo); + &t, &sinfo); if (err) goto out; new_i_pos = sinfo.i_pos; diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c index 8ef01a0b1f94..be09be9b6867 100644 --- a/fs/fat/namei_vfat.c +++ b/fs/fat/namei_vfat.c @@ -678,7 +678,7 @@ static int vfat_add_entry(struct inode *dir, const struct qstr *qname, goto cleanup; /* update timestamp */ - dir->i_ctime = dir->i_mtime = dir->i_atime = *ts; + dir->i_ctime = dir->i_mtime = dir->i_atime = timespec_to_timespec64(*ts); if (IS_DIRSYNC(dir)) (void)fat_sync_inode(dir); else @@ -784,13 +784,15 @@ static int vfat_create(struct inode *dir, struct dentry *dentry, umode_t mode, struct super_block *sb = dir->i_sb; struct inode *inode; struct fat_slot_info sinfo; - struct timespec ts; + struct timespec64 ts; + struct timespec t; int err; mutex_lock(&MSDOS_SB(sb)->s_lock); ts = current_time(dir); - err = vfat_add_entry(dir, &dentry->d_name, 0, 0, &ts, &sinfo); + t = timespec64_to_timespec(ts); + err = vfat_add_entry(dir, &dentry->d_name, 0, 0, &t, &sinfo); if (err) goto out; dir->i_version++; @@ -873,18 +875,20 @@ static int vfat_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) struct super_block *sb = dir->i_sb; struct inode *inode; struct fat_slot_info sinfo; - struct timespec ts; + struct timespec64 ts; + struct timespec t; int err, cluster; mutex_lock(&MSDOS_SB(sb)->s_lock); ts = current_time(dir); - cluster = fat_alloc_new_dir(dir, &ts); + t = timespec64_to_timespec(ts); + cluster = fat_alloc_new_dir(dir, &t); if (cluster < 0) { err = cluster; goto out; } - err = vfat_add_entry(dir, &dentry->d_name, 1, cluster, &ts, &sinfo); + err = vfat_add_entry(dir, &dentry->d_name, 1, cluster, &t, &sinfo); if (err) goto out_free; dir->i_version++; @@ -922,7 +926,8 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry, struct msdos_dir_entry *dotdot_de; struct inode *old_inode, *new_inode; struct fat_slot_info old_sinfo, sinfo; - struct timespec ts; + struct timespec64 ts; + struct timespec t; loff_t new_i_pos; int err, is_dir, update_dotdot, corrupt = 0; struct super_block *sb = old_dir->i_sb; @@ -957,8 +962,9 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry, new_i_pos = MSDOS_I(new_inode)->i_pos; fat_detach(new_inode); } else { + t = timespec64_to_timespec(ts); err = vfat_add_entry(new_dir, &new_dentry->d_name, is_dir, 0, - &ts, &sinfo); + &t, &sinfo); if (err) goto out; new_i_pos = sinfo.i_pos; diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 3e5ca5b7c92a..ed7a05a03794 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -214,7 +214,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, struct fuse_inode *fi = get_fuse_inode(inode); bool is_wb = fc->writeback_cache && !test_bit(FUSE_I_ATTR_FORCE_SYNC, &fi->state); loff_t oldsize; - struct timespec old_mtime; + struct timespec64 old_mtime; spin_lock(&fc->lock); if ((attr_version != 0 && fi->attr_version > attr_version) || @@ -243,7 +243,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, truncate_pagecache(inode, attr->size); inval = true; } else if (fc->auto_inval_data) { - struct timespec new_mtime = { + struct timespec64 new_mtime = { .tv_sec = attr->mtime, .tv_nsec = attr->mtimensec, }; @@ -252,7 +252,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr, * Auto inval mode also checks and invalidates if mtime * has changed. */ - if (!timespec_equal(&old_mtime, &new_mtime)) + if (!timespec64_equal(&old_mtime, &new_mtime)) inval = true; } diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c index 06a0d1947c77..cb8dea65ec65 100644 --- a/fs/gfs2/dir.c +++ b/fs/gfs2/dir.c @@ -872,7 +872,7 @@ static struct gfs2_leaf *new_leaf(struct inode *inode, struct buffer_head **pbh, struct buffer_head *bh; struct gfs2_leaf *leaf; struct gfs2_dirent *dent; - struct timespec tv = current_time(inode); + struct timespec64 tv = current_time(inode); error = gfs2_alloc_blocks(ip, &bn, &n, 0, NULL); if (error) @@ -1803,7 +1803,7 @@ int gfs2_dir_add(struct inode *inode, const struct qstr *name, struct gfs2_inode *ip = GFS2_I(inode); struct buffer_head *bh = da->bh; struct gfs2_dirent *dent = da->dent; - struct timespec tv; + struct timespec64 tv; struct gfs2_leaf *leaf; int error; @@ -1881,7 +1881,7 @@ int gfs2_dir_del(struct gfs2_inode *dip, const struct dentry *dentry) const struct qstr *name = &dentry->d_name; struct gfs2_dirent *dent, *prev = NULL; struct buffer_head *bh; - struct timespec tv = current_time(&dip->i_inode); + struct timespec64 tv = current_time(&dip->i_inode); /* Returns _either_ the entry (if its first in block) or the previous entry otherwise */ diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c index 4838e26c06f7..17b0b294412c 100644 --- a/fs/gfs2/glops.c +++ b/fs/gfs2/glops.c @@ -335,7 +335,7 @@ static int gfs2_dinode_in(struct gfs2_inode *ip, const void *buf) { struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode); const struct gfs2_dinode *str = buf; - struct timespec atime; + struct timespec64 atime; u16 height, depth; if (unlikely(ip->i_no_addr != be64_to_cpu(str->di_num.no_addr))) @@ -358,7 +358,7 @@ static int gfs2_dinode_in(struct gfs2_inode *ip, const void *buf) gfs2_set_inode_blocks(&ip->i_inode, be64_to_cpu(str->di_blocks)); atime.tv_sec = be64_to_cpu(str->di_atime); atime.tv_nsec = be32_to_cpu(str->di_atime_nsec); - if (timespec_compare(&ip->i_inode.i_atime, &atime) < 0) + if (timespec64_compare(&ip->i_inode.i_atime, &atime) < 0) ip->i_inode.i_atime = atime; ip->i_inode.i_mtime.tv_sec = be64_to_cpu(str->di_mtime); ip->i_inode.i_mtime.tv_nsec = be32_to_cpu(str->di_mtime_nsec); diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c index fb416af35896..4410f3e96013 100644 --- a/fs/hfs/inode.c +++ b/fs/hfs/inode.c @@ -354,7 +354,7 @@ static int hfs_read_inode(struct inode *inode, void *data) inode->i_mode &= ~hsb->s_file_umask; inode->i_mode |= S_IFREG; inode->i_ctime = inode->i_atime = inode->i_mtime = - hfs_m_to_utime(rec->file.MdDat); + timespec_to_timespec64(hfs_m_to_utime(rec->file.MdDat)); inode->i_op = &hfs_file_inode_operations; inode->i_fop = &hfs_file_operations; inode->i_mapping->a_ops = &hfs_aops; @@ -365,7 +365,7 @@ static int hfs_read_inode(struct inode *inode, void *data) HFS_I(inode)->fs_blocks = 0; inode->i_mode = S_IFDIR | (S_IRWXUGO & ~hsb->s_dir_umask); inode->i_ctime = inode->i_atime = inode->i_mtime = - hfs_m_to_utime(rec->dir.MdDat); + timespec_to_timespec64(hfs_m_to_utime(rec->dir.MdDat)); inode->i_op = &hfs_dir_inode_operations; inode->i_fop = &hfs_dir_operations; break; diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c index 4924a489c8ac..16b1418d4f54 100644 --- a/fs/hfsplus/inode.c +++ b/fs/hfsplus/inode.c @@ -498,9 +498,9 @@ int hfsplus_cat_read_inode(struct inode *inode, struct hfs_find_data *fd) hfsplus_get_perms(inode, &folder->permissions, 1); set_nlink(inode, 1); inode->i_size = 2 + be32_to_cpu(folder->valence); - inode->i_atime = hfsp_mt2ut(folder->access_date); - inode->i_mtime = hfsp_mt2ut(folder->content_mod_date); - inode->i_ctime = hfsp_mt2ut(folder->attribute_mod_date); + inode->i_atime = timespec_to_timespec64(hfsp_mt2ut(folder->access_date)); + inode->i_mtime = timespec_to_timespec64(hfsp_mt2ut(folder->content_mod_date)); + inode->i_ctime = timespec_to_timespec64(hfsp_mt2ut(folder->attribute_mod_date)); HFSPLUS_I(inode)->create_date = folder->create_date; HFSPLUS_I(inode)->fs_blocks = 0; if (folder->flags & cpu_to_be16(HFSPLUS_HAS_FOLDER_COUNT)) { @@ -539,9 +539,9 @@ int hfsplus_cat_read_inode(struct inode *inode, struct hfs_find_data *fd) init_special_inode(inode, inode->i_mode, be32_to_cpu(file->permissions.dev)); } - inode->i_atime = hfsp_mt2ut(file->access_date); - inode->i_mtime = hfsp_mt2ut(file->content_mod_date); - inode->i_ctime = hfsp_mt2ut(file->attribute_mod_date); + inode->i_atime = timespec_to_timespec64(hfsp_mt2ut(file->access_date)); + inode->i_mtime = timespec_to_timespec64(hfsp_mt2ut(file->content_mod_date)); + inode->i_ctime = timespec_to_timespec64(hfsp_mt2ut(file->attribute_mod_date)); HFSPLUS_I(inode)->create_date = file->create_date; } else { pr_err("bad catalog entry used to create inode\n"); diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c index c148e7f4f451..6ad66e985d59 100644 --- a/fs/hostfs/hostfs_kern.c +++ b/fs/hostfs/hostfs_kern.c @@ -555,9 +555,9 @@ static int read_name(struct inode *ino, char *name) set_nlink(ino, st.nlink); i_uid_write(ino, st.uid); i_gid_write(ino, st.gid); - ino->i_atime = st.atime; - ino->i_mtime = st.mtime; - ino->i_ctime = st.ctime; + ino->i_atime = timespec_to_timespec64(st.atime); + ino->i_mtime = timespec_to_timespec64(st.mtime); + ino->i_ctime = timespec_to_timespec64(st.ctime); ino->i_size = st.size; ino->i_blocks = st.blocks; return 0; @@ -838,15 +838,15 @@ static int hostfs_setattr(struct dentry *dentry, struct iattr *attr) } if (attr->ia_valid & ATTR_ATIME) { attrs.ia_valid |= HOSTFS_ATTR_ATIME; - attrs.ia_atime = attr->ia_atime; + attrs.ia_atime = timespec64_to_timespec(attr->ia_atime); } if (attr->ia_valid & ATTR_MTIME) { attrs.ia_valid |= HOSTFS_ATTR_MTIME; - attrs.ia_mtime = attr->ia_mtime; + attrs.ia_mtime = timespec64_to_timespec(attr->ia_mtime); } if (attr->ia_valid & ATTR_CTIME) { attrs.ia_valid |= HOSTFS_ATTR_CTIME; - attrs.ia_ctime = attr->ia_ctime; + attrs.ia_ctime = timespec64_to_timespec(attr->ia_ctime); } if (attr->ia_valid & ATTR_ATIME_SET) { attrs.ia_valid |= HOSTFS_ATTR_ATIME_SET; diff --git a/fs/incfs/vfs.c b/fs/incfs/vfs.c index 4af3e8170423..51e3f9c46063 100644 --- a/fs/incfs/vfs.c +++ b/fs/incfs/vfs.c @@ -404,7 +404,7 @@ static int inode_set(struct inode *inode, void *opaque) } else if (search->ino == INCFS_PENDING_READS_INODE) { /* It's an inode for .pending_reads pseudo file. */ - inode->i_ctime = (struct timespec){}; + inode->i_ctime = (struct timespec64){}; inode->i_mtime = inode->i_ctime; inode->i_atime = inode->i_ctime; inode->i_size = 0; @@ -419,7 +419,7 @@ static int inode_set(struct inode *inode, void *opaque) } else if (search->ino == INCFS_LOG_INODE) { /* It's an inode for .log pseudo file. */ - inode->i_ctime = (struct timespec){}; + inode->i_ctime = (struct timespec64){}; inode->i_mtime = inode->i_ctime; inode->i_atime = inode->i_ctime; inode->i_size = 0; diff --git a/fs/inode.c b/fs/inode.c index d27472d69ced..9606461ffff6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -205,12 +205,28 @@ int inode_init_always(struct super_block *sb, struct inode *inode) } EXPORT_SYMBOL(inode_init_always); +void free_inode_nonrcu(struct inode *inode) +{ + kmem_cache_free(inode_cachep, inode); +} +EXPORT_SYMBOL(free_inode_nonrcu); + +static void i_callback(struct rcu_head *head) +{ + struct inode *inode = container_of(head, struct inode, i_rcu); + if (inode->free_inode) + inode->free_inode(inode); + else + free_inode_nonrcu(inode); +} + static struct inode *alloc_inode(struct super_block *sb) { + const struct super_operations *ops = sb->s_op; struct inode *inode; - if (sb->s_op->alloc_inode) - inode = sb->s_op->alloc_inode(sb); + if (ops->alloc_inode) + inode = ops->alloc_inode(sb); else inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL); @@ -218,22 +234,19 @@ static struct inode *alloc_inode(struct super_block *sb) return NULL; if (unlikely(inode_init_always(sb, inode))) { - if (inode->i_sb->s_op->destroy_inode) - inode->i_sb->s_op->destroy_inode(inode); - else - kmem_cache_free(inode_cachep, inode); + if (ops->destroy_inode) { + ops->destroy_inode(inode); + if (!ops->free_inode) + return NULL; + } + inode->free_inode = ops->free_inode; + i_callback(&inode->i_rcu); return NULL; } return inode; } -void free_inode_nonrcu(struct inode *inode) -{ - kmem_cache_free(inode_cachep, inode); -} -EXPORT_SYMBOL(free_inode_nonrcu); - void __destroy_inode(struct inode *inode) { BUG_ON(inode_has_buffers(inode)); @@ -256,20 +269,19 @@ void __destroy_inode(struct inode *inode) } EXPORT_SYMBOL(__destroy_inode); -static void i_callback(struct rcu_head *head) -{ - struct inode *inode = container_of(head, struct inode, i_rcu); - kmem_cache_free(inode_cachep, inode); -} - static void destroy_inode(struct inode *inode) { + const struct super_operations *ops = inode->i_sb->s_op; + BUG_ON(!list_empty(&inode->i_lru)); __destroy_inode(inode); - if (inode->i_sb->s_op->destroy_inode) - inode->i_sb->s_op->destroy_inode(inode); - else - call_rcu(&inode->i_rcu, i_callback); + if (ops->destroy_inode) { + ops->destroy_inode(inode); + if (!ops->free_inode) + return; + } + inode->free_inode = ops->free_inode; + call_rcu(&inode->i_rcu, i_callback); } /** @@ -1603,8 +1615,8 @@ static void update_ovl_inode_times(struct dentry *dentry, struct inode *inode, if (upperdentry) { struct inode *realinode = d_inode(upperdentry); - if ((!timespec_equal(&inode->i_mtime, &realinode->i_mtime) || - !timespec_equal(&inode->i_ctime, &realinode->i_ctime))) { + if ((!timespec64_equal(&inode->i_mtime, &realinode->i_mtime) || + !timespec64_equal(&inode->i_ctime, &realinode->i_ctime))) { inode->i_mtime = realinode->i_mtime; inode->i_ctime = realinode->i_ctime; } @@ -1627,12 +1639,12 @@ static int relatime_need_update(const struct path *path, struct inode *inode, /* * Is mtime younger than atime? If yes, update atime: */ - if (timespec_compare(&inode->i_mtime, &inode->i_atime) >= 0) + if (timespec64_compare(&inode->i_mtime, &inode->i_atime) >= 0) return 1; /* * Is ctime younger than atime? If yes, update atime: */ - if (timespec_compare(&inode->i_ctime, &inode->i_atime) >= 0) + if (timespec64_compare(&inode->i_ctime, &inode->i_atime) >= 0) return 1; /* @@ -1647,7 +1659,7 @@ static int relatime_need_update(const struct path *path, struct inode *inode, return 0; } -int generic_update_time(struct inode *inode, struct timespec *time, int flags) +int generic_update_time(struct inode *inode, struct timespec64 *time, int flags) { int iflags = I_DIRTY_TIME; @@ -1671,9 +1683,9 @@ EXPORT_SYMBOL(generic_update_time); * This does the actual work of updating an inodes time or version. Must have * had called mnt_want_write() before calling this. */ -static int update_time(struct inode *inode, struct timespec *time, int flags) +static int update_time(struct inode *inode, struct timespec64 *time, int flags) { - int (*update_time)(struct inode *, struct timespec *, int); + int (*update_time)(struct inode *, struct timespec64 *, int); update_time = inode->i_op->update_time ? inode->i_op->update_time : generic_update_time; @@ -1694,7 +1706,7 @@ bool __atime_needs_update(const struct path *path, struct inode *inode, bool rcu) { struct vfsmount *mnt = path->mnt; - struct timespec now; + struct timespec64 now; if (inode->i_flags & S_NOATIME) return false; @@ -1717,10 +1729,10 @@ bool __atime_needs_update(const struct path *path, struct inode *inode, now = current_time(inode); - if (!relatime_need_update(path, inode, now, rcu)) + if (!relatime_need_update(path, inode, timespec64_to_timespec(now), rcu)) return false; - if (timespec_equal(&inode->i_atime, &now)) + if (timespec64_equal(&inode->i_atime, &now)) return false; return true; @@ -1730,7 +1742,7 @@ void touch_atime(const struct path *path) { struct vfsmount *mnt = path->mnt; struct inode *inode = d_inode(path->dentry); - struct timespec now; + struct timespec64 now; if (!__atime_needs_update(path, inode, false)) return; @@ -1869,7 +1881,7 @@ EXPORT_SYMBOL(file_remove_privs); int file_update_time(struct file *file) { struct inode *inode = file_inode(file); - struct timespec now; + struct timespec64 now; int sync_it = 0; int need_sync = 0; int ret; @@ -1879,10 +1891,10 @@ int file_update_time(struct file *file) return 0; now = current_time(inode); - if (!timespec_equal(&inode->i_mtime, &now)) + if (!timespec64_equal(&inode->i_mtime, &now)) sync_it = S_MTIME; - if (!timespec_equal(&inode->i_ctime, &now)) + if (!timespec64_equal(&inode->i_ctime, &now)) sync_it |= S_CTIME; /* iversion impacts on "write" performance. This code just filter inodes @@ -2143,6 +2155,30 @@ void inode_nohighmem(struct inode *inode) } EXPORT_SYMBOL(inode_nohighmem); +/** + * timespec64_trunc - Truncate timespec64 to a granularity + * @t: Timespec64 + * @gran: Granularity in ns. + * + * Truncate a timespec64 to a granularity. Always rounds down. gran must + * not be 0 nor greater than a second (NSEC_PER_SEC, or 10^9 ns). + */ +struct timespec64 timespec64_trunc(struct timespec64 t, unsigned gran) +{ + /* Avoid division in the common cases 1 ns and 1 s. */ + if (gran == 1) { + /* nothing */ + } else if (gran == NSEC_PER_SEC) { + t.tv_nsec = 0; + } else if (gran > 1 && gran < NSEC_PER_SEC) { + t.tv_nsec -= t.tv_nsec % gran; + } else { + WARN(1, "illegal file time granularity: %u", gran); + } + return t; +} +EXPORT_SYMBOL(timespec64_trunc); + /** * current_time - Return FS time * @inode: inode. @@ -2153,16 +2189,16 @@ EXPORT_SYMBOL(inode_nohighmem); * Note that inode and inode->sb cannot be NULL. * Otherwise, the function warns and returns time without truncation. */ -struct timespec current_time(struct inode *inode) +struct timespec64 current_time(struct inode *inode) { - struct timespec now = current_kernel_time(); + struct timespec64 now = current_kernel_time64(); if (unlikely(!inode->i_sb)) { WARN(1, "current_time() called with uninitialized super_block in the inode"); return now; } - return timespec_trunc(now, inode->i_sb->s_time_gran); + return timespec64_trunc(now, inode->i_sb->s_time_gran); } EXPORT_SYMBOL(current_time); diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c index f4a5ec92f5dc..9258e90bb1ae 100644 --- a/fs/jffs2/dir.c +++ b/fs/jffs2/dir.c @@ -201,7 +201,7 @@ static int jffs2_create(struct inode *dir_i, struct dentry *dentry, if (ret) goto fail; - dir_i->i_mtime = dir_i->i_ctime = ITIME(je32_to_cpu(ri->ctime)); + dir_i->i_mtime = dir_i->i_ctime = timespec_to_timespec64(ITIME(je32_to_cpu(ri->ctime))); jffs2_free_raw_inode(ri); @@ -234,7 +234,7 @@ static int jffs2_unlink(struct inode *dir_i, struct dentry *dentry) if (dead_f->inocache) set_nlink(d_inode(dentry), dead_f->inocache->pino_nlink); if (!ret) - dir_i->i_mtime = dir_i->i_ctime = ITIME(now); + dir_i->i_mtime = dir_i->i_ctime = timespec_to_timespec64(ITIME(now)); return ret; } /***********************************************************************/ @@ -268,7 +268,7 @@ static int jffs2_link (struct dentry *old_dentry, struct inode *dir_i, struct de set_nlink(d_inode(old_dentry), ++f->inocache->pino_nlink); mutex_unlock(&f->sem); d_instantiate(dentry, d_inode(old_dentry)); - dir_i->i_mtime = dir_i->i_ctime = ITIME(now); + dir_i->i_mtime = dir_i->i_ctime = timespec_to_timespec64(ITIME(now)); ihold(d_inode(old_dentry)); } return ret; @@ -418,7 +418,7 @@ static int jffs2_symlink (struct inode *dir_i, struct dentry *dentry, const char goto fail; } - dir_i->i_mtime = dir_i->i_ctime = ITIME(je32_to_cpu(rd->mctime)); + dir_i->i_mtime = dir_i->i_ctime = timespec_to_timespec64(ITIME(je32_to_cpu(rd->mctime))); jffs2_free_raw_dirent(rd); @@ -561,7 +561,7 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, umode_t mode goto fail; } - dir_i->i_mtime = dir_i->i_ctime = ITIME(je32_to_cpu(rd->mctime)); + dir_i->i_mtime = dir_i->i_ctime = timespec_to_timespec64(ITIME(je32_to_cpu(rd->mctime))); inc_nlink(dir_i); jffs2_free_raw_dirent(rd); @@ -602,7 +602,7 @@ static int jffs2_rmdir (struct inode *dir_i, struct dentry *dentry) ret = jffs2_do_unlink(c, dir_f, dentry->d_name.name, dentry->d_name.len, f, now); if (!ret) { - dir_i->i_mtime = dir_i->i_ctime = ITIME(now); + dir_i->i_mtime = dir_i->i_ctime = timespec_to_timespec64(ITIME(now)); clear_nlink(d_inode(dentry)); drop_nlink(dir_i); } @@ -737,7 +737,7 @@ static int jffs2_mknod (struct inode *dir_i, struct dentry *dentry, umode_t mode goto fail; } - dir_i->i_mtime = dir_i->i_ctime = ITIME(je32_to_cpu(rd->mctime)); + dir_i->i_mtime = dir_i->i_ctime = timespec_to_timespec64(ITIME(je32_to_cpu(rd->mctime))); jffs2_free_raw_dirent(rd); @@ -857,14 +857,14 @@ static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry, * caller won't do it on its own since we are returning an error. */ d_invalidate(new_dentry); - new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now); + new_dir_i->i_mtime = new_dir_i->i_ctime = timespec_to_timespec64(ITIME(now)); return ret; } if (d_is_dir(old_dentry)) drop_nlink(old_dir_i); - new_dir_i->i_mtime = new_dir_i->i_ctime = old_dir_i->i_mtime = old_dir_i->i_ctime = ITIME(now); + new_dir_i->i_mtime = new_dir_i->i_ctime = old_dir_i->i_mtime = old_dir_i->i_ctime = timespec_to_timespec64(ITIME(now)); return 0; } diff --git a/fs/jffs2/file.c b/fs/jffs2/file.c index 221eb2bd205e..bce8fa1b32d3 100644 --- a/fs/jffs2/file.c +++ b/fs/jffs2/file.c @@ -318,7 +318,7 @@ static int jffs2_write_end(struct file *filp, struct address_space *mapping, inode->i_size = pos + writtenlen; inode->i_blocks = (inode->i_size + 511) >> 9; - inode->i_ctime = inode->i_mtime = ITIME(je32_to_cpu(ri->ctime)); + inode->i_ctime = inode->i_mtime = timespec_to_timespec64(ITIME(je32_to_cpu(ri->ctime))); } } diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c index dd7c6fbd2cc5..dc43f61d7e8d 100644 --- a/fs/jffs2/fs.c +++ b/fs/jffs2/fs.c @@ -146,9 +146,9 @@ int jffs2_do_setattr (struct inode *inode, struct iattr *iattr) return PTR_ERR(new_metadata); } /* It worked. Update the inode */ - inode->i_atime = ITIME(je32_to_cpu(ri->atime)); - inode->i_ctime = ITIME(je32_to_cpu(ri->ctime)); - inode->i_mtime = ITIME(je32_to_cpu(ri->mtime)); + inode->i_atime = timespec_to_timespec64(ITIME(je32_to_cpu(ri->atime))); + inode->i_ctime = timespec_to_timespec64(ITIME(je32_to_cpu(ri->ctime))); + inode->i_mtime = timespec_to_timespec64(ITIME(je32_to_cpu(ri->mtime))); inode->i_mode = jemode_to_cpu(ri->mode); i_uid_write(inode, je16_to_cpu(ri->uid)); i_gid_write(inode, je16_to_cpu(ri->gid)); @@ -280,9 +280,9 @@ struct inode *jffs2_iget(struct super_block *sb, unsigned long ino) i_uid_write(inode, je16_to_cpu(latest_node.uid)); i_gid_write(inode, je16_to_cpu(latest_node.gid)); inode->i_size = je32_to_cpu(latest_node.isize); - inode->i_atime = ITIME(je32_to_cpu(latest_node.atime)); - inode->i_mtime = ITIME(je32_to_cpu(latest_node.mtime)); - inode->i_ctime = ITIME(je32_to_cpu(latest_node.ctime)); + inode->i_atime = timespec_to_timespec64(ITIME(je32_to_cpu(latest_node.atime))); + inode->i_mtime = timespec_to_timespec64(ITIME(je32_to_cpu(latest_node.mtime))); + inode->i_ctime = timespec_to_timespec64(ITIME(je32_to_cpu(latest_node.ctime))); set_nlink(inode, f->inocache->pino_nlink); diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c index da25f36827eb..066cf82df21e 100644 --- a/fs/kernfs/dir.c +++ b/fs/kernfs/dir.c @@ -785,7 +785,7 @@ int kernfs_add_one(struct kernfs_node *kn) ps_iattr = parent->iattr; if (ps_iattr) { struct iattr *ps_iattrs = &ps_iattr->ia_iattr; - ktime_get_real_ts(&ps_iattrs->ia_ctime); + ktime_get_real_ts64(&ps_iattrs->ia_ctime); ps_iattrs->ia_mtime = ps_iattrs->ia_ctime; } @@ -1311,7 +1311,7 @@ static void __kernfs_remove(struct kernfs_node *kn) /* update timestamps on the parent */ if (ps_iattr) { - ktime_get_real_ts(&ps_iattr->ia_iattr.ia_ctime); + ktime_get_real_ts64(&ps_iattr->ia_iattr.ia_ctime); ps_iattr->ia_iattr.ia_mtime = ps_iattr->ia_iattr.ia_ctime; } diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c index a34303981deb..3d73fe9d56e2 100644 --- a/fs/kernfs/inode.c +++ b/fs/kernfs/inode.c @@ -52,7 +52,7 @@ static struct kernfs_iattrs *kernfs_iattrs(struct kernfs_node *kn) iattrs->ia_uid = GLOBAL_ROOT_UID; iattrs->ia_gid = GLOBAL_ROOT_GID; - ktime_get_real_ts(&iattrs->ia_atime); + ktime_get_real_ts64(&iattrs->ia_atime); iattrs->ia_mtime = iattrs->ia_atime; iattrs->ia_ctime = iattrs->ia_atime; @@ -176,9 +176,9 @@ static inline void set_inode_attr(struct inode *inode, struct iattr *iattr) struct super_block *sb = inode->i_sb; inode->i_uid = iattr->ia_uid; inode->i_gid = iattr->ia_gid; - inode->i_atime = timespec_trunc(iattr->ia_atime, sb->s_time_gran); - inode->i_mtime = timespec_trunc(iattr->ia_mtime, sb->s_time_gran); - inode->i_ctime = timespec_trunc(iattr->ia_ctime, sb->s_time_gran); + inode->i_atime = timespec64_trunc(iattr->ia_atime, sb->s_time_gran); + inode->i_mtime = timespec64_trunc(iattr->ia_mtime, sb->s_time_gran); + inode->i_ctime = timespec64_trunc(iattr->ia_ctime, sb->s_time_gran); } static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode) diff --git a/fs/locks.c b/fs/locks.c index 0b09c1bbf8b8..3d6884a54fd2 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -1562,7 +1562,7 @@ EXPORT_SYMBOL(__break_lease); * exclusive leases. The justification is that if someone has an * exclusive lease, then they could be modifying it. */ -void lease_get_mtime(struct inode *inode, struct timespec *time) +void lease_get_mtime(struct inode *inode, struct timespec64 *time) { bool has_lease = false; struct file_lock_context *ctx; diff --git a/fs/nfs/callback_proc.c b/fs/nfs/callback_proc.c index 825b3166605d..fe82f1bae47f 100644 --- a/fs/nfs/callback_proc.c +++ b/fs/nfs/callback_proc.c @@ -54,8 +54,8 @@ __be32 nfs4_callback_getattr(void *argp, void *resp, res->change_attr = delegation->change_attr; if (nfs_have_writebacks(inode)) res->change_attr++; - res->ctime = inode->i_ctime; - res->mtime = inode->i_mtime; + res->ctime = timespec64_to_timespec(inode->i_ctime); + res->mtime = timespec64_to_timespec(inode->i_mtime); res->bitmap[0] = (FATTR4_WORD0_CHANGE|FATTR4_WORD0_SIZE) & args->bitmap[0]; res->bitmap[1] = (FATTR4_WORD1_TIME_METADATA|FATTR4_WORD1_TIME_MODIFY) & diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c index 3025fe8584a0..2ac0fc0ff8e0 100644 --- a/fs/nfs/fscache-index.c +++ b/fs/nfs/fscache-index.c @@ -239,8 +239,8 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void *cookie_netfs_data, memset(&auxdata, 0, sizeof(auxdata)); auxdata.size = nfsi->vfs_inode.i_size; - auxdata.mtime = nfsi->vfs_inode.i_mtime; - auxdata.ctime = nfsi->vfs_inode.i_ctime; + auxdata.mtime = timespec64_to_timespec(nfsi->vfs_inode.i_mtime); + auxdata.ctime = timespec64_to_timespec(nfsi->vfs_inode.i_ctime); if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4) auxdata.change_attr = nfsi->vfs_inode.i_version; diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index ac836e202ff7..ae078829c200 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -496,15 +496,15 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st nfsi->read_cache_jiffies = fattr->time_start; nfsi->attr_gencount = fattr->gencount; if (fattr->valid & NFS_ATTR_FATTR_ATIME) - inode->i_atime = fattr->atime; + inode->i_atime = timespec_to_timespec64(fattr->atime); else if (nfs_server_capable(inode, NFS_CAP_ATIME)) nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR); if (fattr->valid & NFS_ATTR_FATTR_MTIME) - inode->i_mtime = fattr->mtime; + inode->i_mtime = timespec_to_timespec64(fattr->mtime); else if (nfs_server_capable(inode, NFS_CAP_MTIME)) nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR); if (fattr->valid & NFS_ATTR_FATTR_CTIME) - inode->i_ctime = fattr->ctime; + inode->i_ctime = timespec_to_timespec64(fattr->ctime); else if (nfs_server_capable(inode, NFS_CAP_CTIME)) nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR); if (fattr->valid & NFS_ATTR_FATTR_CHANGE) @@ -1288,6 +1288,7 @@ static bool nfs_file_has_buffered_writers(struct nfs_inode *nfsi) static unsigned long nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr *fattr) { unsigned long ret = 0; + struct timespec ts; if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE) && (fattr->valid & NFS_ATTR_FATTR_CHANGE) @@ -1298,16 +1299,18 @@ static unsigned long nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr ret |= NFS_INO_INVALID_ATTR; } /* If we have atomic WCC data, we may update some attributes */ + ts = timespec64_to_timespec(inode->i_ctime); if ((fattr->valid & NFS_ATTR_FATTR_PRECTIME) && (fattr->valid & NFS_ATTR_FATTR_CTIME) - && timespec_equal(&inode->i_ctime, &fattr->pre_ctime)) { + && timespec_equal(&ts, &fattr->pre_ctime)) { memcpy(&inode->i_ctime, &fattr->ctime, sizeof(inode->i_ctime)); ret |= NFS_INO_INVALID_ATTR; } + ts = timespec64_to_timespec(inode->i_mtime); if ((fattr->valid & NFS_ATTR_FATTR_PREMTIME) && (fattr->valid & NFS_ATTR_FATTR_MTIME) - && timespec_equal(&inode->i_mtime, &fattr->pre_mtime)) { + && timespec_equal(&ts, &fattr->pre_mtime)) { memcpy(&inode->i_mtime, &fattr->mtime, sizeof(inode->i_mtime)); if (S_ISDIR(inode->i_mode)) nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA); @@ -1338,7 +1341,7 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat struct nfs_inode *nfsi = NFS_I(inode); loff_t cur_size, new_isize; unsigned long invalid = 0; - + struct timespec ts; if (nfs_have_delegated_attributes(inode)) return 0; @@ -1353,10 +1356,12 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode->i_version != fattr->change_attr) invalid |= NFS_INO_INVALID_ATTR | NFS_INO_REVAL_PAGECACHE; - if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && !timespec_equal(&inode->i_mtime, &fattr->mtime)) + ts = timespec64_to_timespec(inode->i_mtime); + if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && !timespec_equal(&ts, &fattr->mtime)) invalid |= NFS_INO_INVALID_ATTR; - if ((fattr->valid & NFS_ATTR_FATTR_CTIME) && !timespec_equal(&inode->i_ctime, &fattr->ctime)) + ts = timespec64_to_timespec(inode->i_ctime); + if ((fattr->valid & NFS_ATTR_FATTR_CTIME) && !timespec_equal(&ts, &fattr->ctime)) invalid |= NFS_INO_INVALID_ATTR; if (fattr->valid & NFS_ATTR_FATTR_SIZE) { @@ -1379,7 +1384,8 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat if ((fattr->valid & NFS_ATTR_FATTR_NLINK) && inode->i_nlink != fattr->nlink) invalid |= NFS_INO_INVALID_ATTR; - if ((fattr->valid & NFS_ATTR_FATTR_ATIME) && !timespec_equal(&inode->i_atime, &fattr->atime)) + ts = timespec64_to_timespec(inode->i_atime); + if ((fattr->valid & NFS_ATTR_FATTR_ATIME) && !timespec_equal(&ts, &fattr->atime)) invalid |= NFS_INO_INVALID_ATIME; if (invalid != 0) @@ -1649,12 +1655,12 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa } if ((fattr->valid & NFS_ATTR_FATTR_CTIME) != 0 && (fattr->valid & NFS_ATTR_FATTR_PRECTIME) == 0) { - memcpy(&fattr->pre_ctime, &inode->i_ctime, sizeof(fattr->pre_ctime)); + fattr->pre_ctime = timespec64_to_timespec(inode->i_ctime); fattr->valid |= NFS_ATTR_FATTR_PRECTIME; } if ((fattr->valid & NFS_ATTR_FATTR_MTIME) != 0 && (fattr->valid & NFS_ATTR_FATTR_PREMTIME) == 0) { - memcpy(&fattr->pre_mtime, &inode->i_mtime, sizeof(fattr->pre_mtime)); + fattr->pre_mtime = timespec64_to_timespec(inode->i_mtime); fattr->valid |= NFS_ATTR_FATTR_PREMTIME; } if ((fattr->valid & NFS_ATTR_FATTR_SIZE) != 0 && @@ -1800,7 +1806,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr) } if (fattr->valid & NFS_ATTR_FATTR_MTIME) { - memcpy(&inode->i_mtime, &fattr->mtime, sizeof(inode->i_mtime)); + inode->i_mtime = timespec_to_timespec64(fattr->mtime); } else if (server->caps & NFS_CAP_MTIME) { nfsi->cache_validity |= save_cache_validity & (NFS_INO_INVALID_ATTR @@ -1809,7 +1815,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr) } if (fattr->valid & NFS_ATTR_FATTR_CTIME) { - memcpy(&inode->i_ctime, &fattr->ctime, sizeof(inode->i_ctime)); + inode->i_ctime = timespec_to_timespec64(fattr->ctime); } else if (server->caps & NFS_CAP_CTIME) { nfsi->cache_validity |= save_cache_validity & (NFS_INO_INVALID_ATTR @@ -1846,7 +1852,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr) if (fattr->valid & NFS_ATTR_FATTR_ATIME) - memcpy(&inode->i_atime, &fattr->atime, sizeof(inode->i_atime)); + inode->i_atime = timespec_to_timespec64(fattr->atime); else if (server->caps & NFS_CAP_ATIME) { nfsi->cache_validity |= save_cache_validity & (NFS_INO_INVALID_ATIME diff --git a/fs/nfs/nfs2xdr.c b/fs/nfs/nfs2xdr.c index 85e4b4a233f9..350675e3ed47 100644 --- a/fs/nfs/nfs2xdr.c +++ b/fs/nfs/nfs2xdr.c @@ -354,6 +354,7 @@ static __be32 *xdr_time_not_set(__be32 *p) static void encode_sattr(struct xdr_stream *xdr, const struct iattr *attr) { + struct timespec ts; __be32 *p; p = xdr_reserve_space(xdr, NFS_sattr_sz << 2); @@ -375,17 +376,21 @@ static void encode_sattr(struct xdr_stream *xdr, const struct iattr *attr) else *p++ = cpu_to_be32(NFS2_SATTR_NOT_SET); - if (attr->ia_valid & ATTR_ATIME_SET) - p = xdr_encode_time(p, &attr->ia_atime); - else if (attr->ia_valid & ATTR_ATIME) - p = xdr_encode_current_server_time(p, &attr->ia_atime); - else + if (attr->ia_valid & ATTR_ATIME_SET) { + ts = timespec64_to_timespec(attr->ia_atime); + p = xdr_encode_time(p, &ts); + } else if (attr->ia_valid & ATTR_ATIME) { + ts = timespec64_to_timespec(attr->ia_atime); + p = xdr_encode_current_server_time(p, &ts); + } else p = xdr_time_not_set(p); - if (attr->ia_valid & ATTR_MTIME_SET) - xdr_encode_time(p, &attr->ia_mtime); - else if (attr->ia_valid & ATTR_MTIME) - xdr_encode_current_server_time(p, &attr->ia_mtime); - else + if (attr->ia_valid & ATTR_MTIME_SET) { + ts = timespec64_to_timespec(attr->ia_atime); + xdr_encode_time(p, &ts); + } else if (attr->ia_valid & ATTR_MTIME) { + ts = timespec64_to_timespec(attr->ia_mtime); + xdr_encode_current_server_time(p, &ts); + } else xdr_time_not_set(p); } diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c index be666aee28cc..e72d451ae393 100644 --- a/fs/nfs/nfs3xdr.c +++ b/fs/nfs/nfs3xdr.c @@ -562,6 +562,7 @@ static __be32 *xdr_decode_nfstime3(__be32 *p, struct timespec *timep) */ static void encode_sattr3(struct xdr_stream *xdr, const struct iattr *attr) { + struct timespec ts; u32 nbytes; __be32 *p; @@ -611,8 +612,10 @@ static void encode_sattr3(struct xdr_stream *xdr, const struct iattr *attr) *p++ = xdr_zero; if (attr->ia_valid & ATTR_ATIME_SET) { + struct timespec ts; *p++ = xdr_two; - p = xdr_encode_nfstime3(p, &attr->ia_atime); + ts = timespec64_to_timespec(attr->ia_atime); + p = xdr_encode_nfstime3(p, &ts); } else if (attr->ia_valid & ATTR_ATIME) { *p++ = xdr_one; } else @@ -620,7 +623,8 @@ static void encode_sattr3(struct xdr_stream *xdr, const struct iattr *attr) if (attr->ia_valid & ATTR_MTIME_SET) { *p++ = xdr_two; - xdr_encode_nfstime3(p, &attr->ia_mtime); + ts = timespec64_to_timespec(attr->ia_mtime); + xdr_encode_nfstime3(p, &ts); } else if (attr->ia_valid & ATTR_MTIME) { *p = xdr_one; } else diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c index e604c0e02f4d..f148f53d3f90 100644 --- a/fs/nfs/nfs4xdr.c +++ b/fs/nfs/nfs4xdr.c @@ -1018,6 +1018,7 @@ static void encode_attrs(struct xdr_stream *xdr, const struct iattr *iap, const struct nfs_server *server, const uint32_t attrmask[]) { + struct timespec ts; char owner_name[IDMAP_NAMESZ]; char owner_group[IDMAP_NAMESZ]; int owner_namelen = 0; @@ -1119,16 +1120,16 @@ static void encode_attrs(struct xdr_stream *xdr, const struct iattr *iap, if (bmval[1] & FATTR4_WORD1_TIME_ACCESS_SET) { if (iap->ia_valid & ATTR_ATIME_SET) { *p++ = cpu_to_be32(NFS4_SET_TO_CLIENT_TIME); - p = xdr_encode_hyper(p, (s64)iap->ia_atime.tv_sec); - *p++ = cpu_to_be32(iap->ia_atime.tv_nsec); + ts = timespec64_to_timespec(iap->ia_atime); + p = xdr_encode_nfstime4(p, &ts); } else *p++ = cpu_to_be32(NFS4_SET_TO_SERVER_TIME); } if (bmval[1] & FATTR4_WORD1_TIME_MODIFY_SET) { if (iap->ia_valid & ATTR_MTIME_SET) { *p++ = cpu_to_be32(NFS4_SET_TO_CLIENT_TIME); - p = xdr_encode_hyper(p, (s64)iap->ia_mtime.tv_sec); - *p++ = cpu_to_be32(iap->ia_mtime.tv_nsec); + ts = timespec64_to_timespec(iap->ia_mtime); + p = xdr_encode_nfstime4(p, &ts); } else *p++ = cpu_to_be32(NFS4_SET_TO_SERVER_TIME); } diff --git a/fs/nfsd/blocklayout.c b/fs/nfsd/blocklayout.c index 3f880ae0966b..06266f7ac19f 100644 --- a/fs/nfsd/blocklayout.c +++ b/fs/nfsd/blocklayout.c @@ -121,13 +121,15 @@ nfsd4_block_commit_blocks(struct inode *inode, struct nfsd4_layoutcommit *lcp, { loff_t new_size = lcp->lc_last_wr + 1; struct iattr iattr = { .ia_valid = 0 }; + struct timespec ts; int error; + ts = timespec64_to_timespec(inode->i_mtime); if (lcp->lc_mtime.tv_nsec == UTIME_NOW || - timespec_compare(&lcp->lc_mtime, &inode->i_mtime) < 0) - lcp->lc_mtime = current_time(inode); + timespec_compare(&lcp->lc_mtime, &ts) < 0) + lcp->lc_mtime = timespec64_to_timespec(current_time(inode)); iattr.ia_valid |= ATTR_ATIME | ATTR_CTIME | ATTR_MTIME; - iattr.ia_atime = iattr.ia_ctime = iattr.ia_mtime = lcp->lc_mtime; + iattr.ia_atime = iattr.ia_ctime = iattr.ia_mtime = timespec_to_timespec64(lcp->lc_mtime); if (new_size > i_size_read(inode)) { iattr.ia_valid |= ATTR_SIZE; diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c index 6fbc48e074be..d431688fcf6d 100644 --- a/fs/nfsd/nfs3xdr.c +++ b/fs/nfsd/nfs3xdr.c @@ -165,6 +165,7 @@ static __be32 * encode_fattr3(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp, struct kstat *stat) { + struct timespec ts; *p++ = htonl(nfs3_ftypes[(stat->mode & S_IFMT) >> 12]); *p++ = htonl((u32) (stat->mode & S_IALLUGO)); *p++ = htonl((u32) stat->nlink); @@ -180,9 +181,12 @@ encode_fattr3(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp, *p++ = htonl((u32) MINOR(stat->rdev)); p = encode_fsid(p, fhp); p = xdr_encode_hyper(p, stat->ino); - p = encode_time3(p, &stat->atime); - p = encode_time3(p, &stat->mtime); - p = encode_time3(p, &stat->ctime); + ts = timespec64_to_timespec(stat->atime); + p = encode_time3(p, &ts); + ts = timespec64_to_timespec(stat->mtime); + p = encode_time3(p, &ts); + ts = timespec64_to_timespec(stat->ctime); + p = encode_time3(p, &ts); return p; } diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c index 997d3134beb3..21a01fef136d 100644 --- a/fs/nfsd/nfs4xdr.c +++ b/fs/nfsd/nfs4xdr.c @@ -320,6 +320,7 @@ nfsd4_decode_fattr(struct nfsd4_compoundargs *argp, u32 *bmval, struct iattr *iattr, struct nfs4_acl **acl, struct xdr_netobj *label, int *umask) { + struct timespec ts; int expected_len, len = 0; u32 dummy32; char *buf; @@ -421,7 +422,8 @@ nfsd4_decode_fattr(struct nfsd4_compoundargs *argp, u32 *bmval, switch (dummy32) { case NFS4_SET_TO_CLIENT_TIME: len += 12; - status = nfsd4_decode_time(argp, &iattr->ia_atime); + status = nfsd4_decode_time(argp, &ts); + iattr->ia_atime = timespec_to_timespec64(ts); if (status) return status; iattr->ia_valid |= (ATTR_ATIME | ATTR_ATIME_SET); @@ -440,7 +442,8 @@ nfsd4_decode_fattr(struct nfsd4_compoundargs *argp, u32 *bmval, switch (dummy32) { case NFS4_SET_TO_CLIENT_TIME: len += 12; - status = nfsd4_decode_time(argp, &iattr->ia_mtime); + status = nfsd4_decode_time(argp, &ts); + iattr->ia_mtime = timespec_to_timespec64(ts); if (status) return status; iattr->ia_valid |= (ATTR_MTIME | ATTR_MTIME_SET); diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c index 644a0342f0e0..bcd22d3efe4d 100644 --- a/fs/nfsd/nfsxdr.c +++ b/fs/nfsd/nfsxdr.c @@ -147,7 +147,7 @@ encode_fattr(struct svc_rqst *rqstp, __be32 *p, struct svc_fh *fhp, { struct dentry *dentry = fhp->fh_dentry; int type; - struct timespec time; + struct timespec64 time; u32 f; type = (stat->mode & S_IFMT); diff --git a/fs/nsfs.c b/fs/nsfs.c index 2aff289a7160..35fa13910e43 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -51,7 +51,7 @@ static void nsfs_evict(struct inode *inode) ns->ops->put(ns); } -static int __ns_get_path(struct path *path, struct ns_common *ns) +static void *__ns_get_path(struct path *path, struct ns_common *ns) { struct vfsmount *mnt = nsfs_mnt; struct dentry *dentry; @@ -70,13 +70,13 @@ static int __ns_get_path(struct path *path, struct ns_common *ns) got_it: path->mnt = mntget(mnt); path->dentry = dentry; - return 0; + return NULL; slow: rcu_read_unlock(); inode = new_inode_pseudo(mnt->mnt_sb); if (!inode) { ns->ops->put(ns); - return -ENOMEM; + return ERR_PTR(-ENOMEM); } inode->i_ino = ns->inum; inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode); @@ -88,7 +88,7 @@ slow: dentry = d_alloc_pseudo(mnt->mnt_sb, &empty_name); if (!dentry) { iput(inode); - return -ENOMEM; + return ERR_PTR(-ENOMEM); } d_instantiate(dentry, inode); dentry->d_flags |= DCACHE_RCUACCESS; @@ -98,22 +98,22 @@ slow: d_delete(dentry); /* make sure ->d_prune() does nothing */ dput(dentry); cpu_relax(); - return -EAGAIN; + return ERR_PTR(-EAGAIN); } goto got_it; } -int ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb, +void *ns_get_path_cb(struct path *path, ns_get_path_helper_t *ns_get_cb, void *private_data) { - int ret; + void *ret; do { struct ns_common *ns = ns_get_cb(private_data); if (!ns) - return -ENOENT; + return ERR_PTR(-ENOENT); ret = __ns_get_path(path, ns); - } while (ret == -EAGAIN); + } while (ret == ERR_PTR(-EAGAIN)); return ret; } @@ -126,7 +126,7 @@ static struct ns_common *ns_get_path_task(void *private_data) struct ns_get_path_task_args *args = private_data; return args->ns_ops->get(args->task); } -int ns_get_path(struct path *path, struct task_struct *task, +void *ns_get_path(struct path *path, struct task_struct *task, const struct proc_ns_operations *ns_ops) { struct ns_get_path_task_args args = { @@ -141,7 +141,7 @@ int open_related_ns(struct ns_common *ns, { struct path path = {}; struct file *f; - int err; + void *err; int fd; fd = get_unused_fd_flags(O_CLOEXEC); @@ -158,11 +158,11 @@ int open_related_ns(struct ns_common *ns, } err = __ns_get_path(&path, relative); - } while (err == -EAGAIN); + } while (err == ERR_PTR(-EAGAIN)); - if (err) { + if (IS_ERR(err)) { put_unused_fd(fd); - return err; + return PTR_ERR(err); } f = dentry_open(&path, O_RDONLY, current_cred()); diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c index 8ad21fd98198..2b65e5c6b75a 100644 --- a/fs/ntfs/inode.c +++ b/fs/ntfs/inode.c @@ -680,18 +680,18 @@ static int ntfs_read_locked_inode(struct inode *vi) * mtime is the last change of the data within the file. Not changed * when only metadata is changed, e.g. a rename doesn't affect mtime. */ - vi->i_mtime = ntfs2utc(si->last_data_change_time); + vi->i_mtime = timespec_to_timespec64(ntfs2utc(si->last_data_change_time)); /* * ctime is the last change of the metadata of the file. This obviously * always changes, when mtime is changed. ctime can be changed on its * own, mtime is then not changed, e.g. when a file is renamed. */ - vi->i_ctime = ntfs2utc(si->last_mft_change_time); + vi->i_ctime = timespec_to_timespec64(ntfs2utc(si->last_mft_change_time)); /* * Last access to the data within the file. Not changed during a rename * for example but changed whenever the file is written to. */ - vi->i_atime = ntfs2utc(si->last_access_time); + vi->i_atime = timespec_to_timespec64(ntfs2utc(si->last_access_time)); /* Find the attribute list attribute if present. */ ntfs_attr_reinit_search_ctx(ctx); @@ -2836,11 +2836,11 @@ done: * for real. */ if (!IS_NOCMTIME(VFS_I(base_ni)) && !IS_RDONLY(VFS_I(base_ni))) { - struct timespec now = current_time(VFS_I(base_ni)); + struct timespec64 now = current_time(VFS_I(base_ni)); int sync_it = 0; - if (!timespec_equal(&VFS_I(base_ni)->i_mtime, &now) || - !timespec_equal(&VFS_I(base_ni)->i_ctime, &now)) + if (!timespec64_equal(&VFS_I(base_ni)->i_mtime, &now) || + !timespec64_equal(&VFS_I(base_ni)->i_ctime, &now)) sync_it = 1; VFS_I(base_ni)->i_mtime = now; VFS_I(base_ni)->i_ctime = now; @@ -2955,14 +2955,14 @@ int ntfs_setattr(struct dentry *dentry, struct iattr *attr) } } if (ia_valid & ATTR_ATIME) - vi->i_atime = timespec_trunc(attr->ia_atime, - vi->i_sb->s_time_gran); + vi->i_atime = timespec64_trunc(attr->ia_atime, + vi->i_sb->s_time_gran); if (ia_valid & ATTR_MTIME) - vi->i_mtime = timespec_trunc(attr->ia_mtime, - vi->i_sb->s_time_gran); + vi->i_mtime = timespec64_trunc(attr->ia_mtime, + vi->i_sb->s_time_gran); if (ia_valid & ATTR_CTIME) - vi->i_ctime = timespec_trunc(attr->ia_ctime, - vi->i_sb->s_time_gran); + vi->i_ctime = timespec64_trunc(attr->ia_ctime, + vi->i_sb->s_time_gran); mark_inode_dirty(vi); out: return err; @@ -3029,7 +3029,7 @@ int __ntfs_write_inode(struct inode *vi, int sync) si = (STANDARD_INFORMATION*)((u8*)ctx->attr + le16_to_cpu(ctx->attr->data.resident.value_offset)); /* Update the access times if they have changed. */ - nt = utc2ntfs(vi->i_mtime); + nt = utc2ntfs(timespec64_to_timespec(vi->i_mtime)); if (si->last_data_change_time != nt) { ntfs_debug("Updating mtime for inode 0x%lx: old = 0x%llx, " "new = 0x%llx", vi->i_ino, (long long) @@ -3038,7 +3038,7 @@ int __ntfs_write_inode(struct inode *vi, int sync) si->last_data_change_time = nt; modified = true; } - nt = utc2ntfs(vi->i_ctime); + nt = utc2ntfs(timespec64_to_timespec(vi->i_ctime)); if (si->last_mft_change_time != nt) { ntfs_debug("Updating ctime for inode 0x%lx: old = 0x%llx, " "new = 0x%llx", vi->i_ino, (long long) @@ -3047,7 +3047,7 @@ int __ntfs_write_inode(struct inode *vi, int sync) si->last_mft_change_time = nt; modified = true; } - nt = utc2ntfs(vi->i_atime); + nt = utc2ntfs(timespec64_to_timespec(vi->i_atime)); if (si->last_access_time != nt) { ntfs_debug("Updating atime for inode 0x%lx: old = 0x%llx, " "new = 0x%llx", vi->i_ino, diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 3aff7dd3fd1a..6ddf427ead03 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -2103,6 +2103,7 @@ static void __ocfs2_stuff_meta_lvb(struct inode *inode) struct ocfs2_inode_info *oi = OCFS2_I(inode); struct ocfs2_lock_res *lockres = &oi->ip_inode_lockres; struct ocfs2_meta_lvb *lvb; + struct timespec ts; lvb = ocfs2_dlm_lvb(&lockres->l_lksb); @@ -2123,12 +2124,15 @@ static void __ocfs2_stuff_meta_lvb(struct inode *inode) lvb->lvb_igid = cpu_to_be32(i_gid_read(inode)); lvb->lvb_imode = cpu_to_be16(inode->i_mode); lvb->lvb_inlink = cpu_to_be16(inode->i_nlink); + ts = timespec64_to_timespec(inode->i_atime); lvb->lvb_iatime_packed = - cpu_to_be64(ocfs2_pack_timespec(&inode->i_atime)); + cpu_to_be64(ocfs2_pack_timespec(&ts)); + ts = timespec64_to_timespec(inode->i_ctime); lvb->lvb_ictime_packed = - cpu_to_be64(ocfs2_pack_timespec(&inode->i_ctime)); + cpu_to_be64(ocfs2_pack_timespec(&ts)); + ts = timespec64_to_timespec(inode->i_mtime); lvb->lvb_imtime_packed = - cpu_to_be64(ocfs2_pack_timespec(&inode->i_mtime)); + cpu_to_be64(ocfs2_pack_timespec(&ts)); lvb->lvb_iattr = cpu_to_be32(oi->ip_attr); lvb->lvb_idynfeatures = cpu_to_be16(oi->ip_dyn_features); lvb->lvb_igeneration = cpu_to_be32(inode->i_generation); @@ -2146,6 +2150,7 @@ static void ocfs2_unpack_timespec(struct timespec *spec, static void ocfs2_refresh_inode_from_lvb(struct inode *inode) { + struct timespec ts; struct ocfs2_inode_info *oi = OCFS2_I(inode); struct ocfs2_lock_res *lockres = &oi->ip_inode_lockres; struct ocfs2_meta_lvb *lvb; @@ -2173,12 +2178,15 @@ static void ocfs2_refresh_inode_from_lvb(struct inode *inode) i_gid_write(inode, be32_to_cpu(lvb->lvb_igid)); inode->i_mode = be16_to_cpu(lvb->lvb_imode); set_nlink(inode, be16_to_cpu(lvb->lvb_inlink)); - ocfs2_unpack_timespec(&inode->i_atime, + ocfs2_unpack_timespec(&ts, be64_to_cpu(lvb->lvb_iatime_packed)); - ocfs2_unpack_timespec(&inode->i_mtime, + inode->i_atime = timespec_to_timespec64(ts); + ocfs2_unpack_timespec(&ts, be64_to_cpu(lvb->lvb_imtime_packed)); - ocfs2_unpack_timespec(&inode->i_ctime, + inode->i_mtime = timespec_to_timespec64(ts); + ocfs2_unpack_timespec(&ts, be64_to_cpu(lvb->lvb_ictime_packed)); + inode->i_ctime = timespec_to_timespec64(ts); spin_unlock(&oi->ip_lock); } diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 7242dd43ae8b..36a1493be05c 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -220,7 +220,7 @@ static int ocfs2_sync_file(struct file *file, loff_t start, loff_t end, int ocfs2_should_update_atime(struct inode *inode, struct vfsmount *vfsmnt) { - struct timespec now; + struct timespec64 now; struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); if (ocfs2_is_hard_readonly(osb) || ocfs2_is_soft_readonly(osb)) @@ -246,8 +246,8 @@ int ocfs2_should_update_atime(struct inode *inode, return 0; if (vfsmnt->mnt_flags & MNT_RELATIME) { - if ((timespec_compare(&inode->i_atime, &inode->i_mtime) <= 0) || - (timespec_compare(&inode->i_atime, &inode->i_ctime) <= 0)) + if ((timespec64_compare(&inode->i_atime, &inode->i_mtime) <= 0) || + (timespec64_compare(&inode->i_atime, &inode->i_ctime) <= 0)) return 1; return 0; diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c index 5db4237d75d8..5765b8d82fd2 100644 --- a/fs/overlayfs/inode.c +++ b/fs/overlayfs/inode.c @@ -374,7 +374,7 @@ int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags) return err; } -int ovl_update_time(struct inode *inode, struct timespec *ts, int flags) +int ovl_update_time(struct inode *inode, struct timespec64 *ts, int flags) { struct dentry *alias; struct path upperpath; diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h index 7ff95b4fdcf6..f1ba6e0cf2ec 100644 --- a/fs/overlayfs/overlayfs.h +++ b/fs/overlayfs/overlayfs.h @@ -290,7 +290,7 @@ int __ovl_xattr_get(struct dentry *dentry, struct inode *inode, ssize_t ovl_listxattr(struct dentry *dentry, char *list, size_t size); struct posix_acl *ovl_get_acl(struct inode *inode, int type); int ovl_open_maybe_copy_up(struct dentry *dentry, unsigned int file_flags); -int ovl_update_time(struct inode *inode, struct timespec *ts, int flags); +int ovl_update_time(struct inode *inode, struct timespec64 *ts, int flags); bool ovl_is_private_xattr(const char *name); struct inode *ovl_new_inode(struct super_block *sb, umode_t mode, dev_t rdev); diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c index 6f1e43762338..59b17e509f46 100644 --- a/fs/proc/namespaces.c +++ b/fs/proc/namespaces.c @@ -42,14 +42,14 @@ static const char *proc_ns_get_link(struct dentry *dentry, const struct proc_ns_operations *ns_ops = PROC_I(inode)->ns_ops; struct task_struct *task; struct path ns_path; - int error = -EACCES; + void *error = ERR_PTR(-EACCES); if (!dentry) return ERR_PTR(-ECHILD); task = get_proc_task(inode); if (!task) - return ERR_PTR(-EACCES); + return error; if (ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS)) { error = ns_get_path(&ns_path, task, ns_ops); @@ -57,7 +57,7 @@ static const char *proc_ns_get_link(struct dentry *dentry, nd_jump_link(&ns_path); } put_task_struct(task); - return ERR_PTR(error); + return error; } static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int buflen) diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index e384b4a96191..481d4a91cd69 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -13,8 +13,8 @@ #include #include #include -#include #include +#include #include "internal.h" static const struct dentry_operations proc_sys_dentry_operations; @@ -576,8 +576,8 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf, struct inode *inode = file_inode(filp); struct ctl_table_header *head = grab_header(inode); struct ctl_table *table = PROC_I(inode)->sysctl_entry; + void *new_buf = NULL; ssize_t error; - size_t res; if (IS_ERR(head)) return PTR_ERR(head); @@ -595,15 +595,27 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *buf, if (!table->proc_handler) goto out; - error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write); + error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, buf, &count, + ppos, &new_buf); if (error) goto out; /* careful: calling conventions are nasty here */ - res = count; - error = table->proc_handler(table, write, buf, &res, ppos); + if (new_buf) { + mm_segment_t old_fs; + + old_fs = get_fs(); + set_fs(KERNEL_DS); + error = table->proc_handler(table, write, (void __user *)new_buf, + &count, ppos); + set_fs(old_fs); + kfree(new_buf); + } else { + error = table->proc_handler(table, write, buf, &count, ppos); + } + if (!error) - error = res; + error = count; out: sysctl_head_finish(head); diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index add19d71a220..f9dd404d294c 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -39,7 +39,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) { - unsigned long text, lib, swap, ptes, pmds, anon, file, shmem; + unsigned long text, lib, swap, anon, file, shmem; unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss; anon = get_mm_counter(mm, MM_ANONPAGES); @@ -63,8 +63,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10; lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text; swap = get_mm_counter(mm, MM_SWAPENTS); - ptes = PTRS_PER_PTE * sizeof(pte_t) * atomic_long_read(&mm->nr_ptes); - pmds = PTRS_PER_PMD * sizeof(pmd_t) * mm_nr_pmds(mm); seq_printf(m, "VmPeak:\t%8lu kB\n" "VmSize:\t%8lu kB\n" @@ -80,7 +78,6 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) "VmExe:\t%8lu kB\n" "VmLib:\t%8lu kB\n" "VmPTE:\t%8lu kB\n" - "VmPMD:\t%8lu kB\n" "VmSwap:\t%8lu kB\n", hiwater_vm << (PAGE_SHIFT-10), total_vm << (PAGE_SHIFT-10), @@ -93,8 +90,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) shmem << (PAGE_SHIFT-10), mm->data_vm << (PAGE_SHIFT-10), mm->stack_vm << (PAGE_SHIFT-10), text, lib, - ptes >> 10, - pmds >> 10, + mm_pgtables_bytes(mm) >> 10, swap << (PAGE_SHIFT-10)); hugetlb_report_usage(m, mm); } diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c index 95a708d83721..f18e6f949e0c 100644 --- a/fs/proc/uptime.c +++ b/fs/proc/uptime.c @@ -10,7 +10,7 @@ static int uptime_proc_show(struct seq_file *m, void *v) { struct timespec uptime; - struct timespec idle; + struct timespec64 idle; u64 nsec; u32 rem; int i; diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c index 8af9efa0ca0a..1100bef3081b 100644 --- a/fs/pstore/platform.c +++ b/fs/pstore/platform.c @@ -483,10 +483,7 @@ void pstore_record_init(struct pstore_record *record, record->psi = psinfo; /* Report zeroed timestamp if called before timekeeping has resumed. */ - if (__getnstimeofday(&record->time)) { - record->time.tv_sec = 0; - record->time.tv_nsec = 0; - } + record->time = ns_to_timespec64(ktime_get_real_fast_ns()); } /* diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index 66e55bc795e3..73aa3c297701 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -38,6 +38,11 @@ #define RAMOOPS_KERNMSG_HDR "====" #define MIN_MEM_SIZE 4096UL +#if __BITS_PER_LONG == 64 +# define TVSEC_FMT "%ld" +#else +# define TVSEC_FMT "%lld" +#endif static ulong record_size = MIN_MEM_SIZE; module_param(record_size, ulong, 0400); @@ -153,21 +158,23 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max, return prz; } -static int ramoops_read_kmsg_hdr(char *buffer, struct timespec *time, +static int ramoops_read_kmsg_hdr(char *buffer, struct timespec64 *time, bool *compressed) { char data_type; int header_length = 0; - if (sscanf(buffer, RAMOOPS_KERNMSG_HDR "%lu.%lu-%c\n%n", &time->tv_sec, - &time->tv_nsec, &data_type, &header_length) == 3) { + if (sscanf(buffer, RAMOOPS_KERNMSG_HDR TVSEC_FMT ".%lu-%c\n%n", + &time->tv_sec, &time->tv_nsec, &data_type, + &header_length) == 3) { if (data_type == 'C') *compressed = true; else *compressed = false; - } else if (sscanf(buffer, RAMOOPS_KERNMSG_HDR "%lu.%lu\n%n", - &time->tv_sec, &time->tv_nsec, &header_length) == 2) { - *compressed = false; + } else if (sscanf(buffer, RAMOOPS_KERNMSG_HDR TVSEC_FMT ".%lu\n%n", + &time->tv_sec, &time->tv_nsec, + &header_length) == 2) { + *compressed = false; } else { time->tv_sec = 0; time->tv_nsec = 0; @@ -360,7 +367,7 @@ static size_t ramoops_write_kmsg_hdr(struct persistent_ram_zone *prz, char *hdr; size_t len; - hdr = kasprintf(GFP_ATOMIC, RAMOOPS_KERNMSG_HDR "%lu.%lu-%c\n", + hdr = kasprintf(GFP_ATOMIC, RAMOOPS_KERNMSG_HDR TVSEC_FMT ".%lu-%c\n", record->time.tv_sec, record->time.tv_nsec / 1000, record->compressed ? 'C' : 'D'); diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c index ad4892837c45..2843b7cf4d7a 100644 --- a/fs/reiserfs/namei.c +++ b/fs/reiserfs/namei.c @@ -1323,7 +1323,7 @@ static int reiserfs_rename(struct inode *old_dir, struct dentry *old_dentry, int jbegin_count; umode_t old_inode_mode; unsigned long savelink = 1; - struct timespec ctime; + struct timespec64 ctime; if (flags & ~RENAME_NOREPLACE) return -EINVAL; diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c index 57a7b9a164a2..8ff806aed994 100644 --- a/fs/reiserfs/xattr.c +++ b/fs/reiserfs/xattr.c @@ -462,10 +462,10 @@ int reiserfs_commit_write(struct file *f, struct page *page, static void update_ctime(struct inode *inode) { - struct timespec now = current_time(inode); + struct timespec64 now = current_time(inode); if (inode_unhashed(inode) || !inode->i_nlink || - timespec_equal(&inode->i_ctime, &now)) + timespec64_equal(&inode->i_ctime, &now)) return; inode->i_ctime = current_time(inode); diff --git a/fs/sdfat/sdfat.h b/fs/sdfat/sdfat.h index 8824d10ef058..fa50b16971bc 100644 --- a/fs/sdfat/sdfat.h +++ b/fs/sdfat/sdfat.h @@ -212,9 +212,9 @@ struct sdfat_inode_info { struct inode vfs_inode; }; -#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 18, 0) +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 14, 0) typedef struct timespec64 sdfat_timespec_t; -#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4, 18, 0) */ +#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4, 14, 0) */ typedef struct timespec sdfat_timespec_t; #endif diff --git a/fs/select.c b/fs/select.c index 7ceaa6754c96..4c6ade444327 100644 --- a/fs/select.c +++ b/fs/select.c @@ -292,8 +292,7 @@ static int poll_select_copy_remaining(struct timespec64 *end_time, void __user *p, int timeval, int ret) { - struct timespec64 rts64; - struct timespec rts; + struct timespec64 rts; struct timeval rtv; if (!p) @@ -306,23 +305,22 @@ static int poll_select_copy_remaining(struct timespec64 *end_time, if (!end_time->tv_sec && !end_time->tv_nsec) return ret; - ktime_get_ts64(&rts64); - rts64 = timespec64_sub(*end_time, rts64); - if (rts64.tv_sec < 0) - rts64.tv_sec = rts64.tv_nsec = 0; + ktime_get_ts64(&rts); + rts = timespec64_sub(*end_time, rts); + if (rts.tv_sec < 0) + rts.tv_sec = rts.tv_nsec = 0; - rts = timespec64_to_timespec(rts64); if (timeval) { if (sizeof(rtv) > sizeof(rtv.tv_sec) + sizeof(rtv.tv_usec)) memset(&rtv, 0, sizeof(rtv)); - rtv.tv_sec = rts64.tv_sec; - rtv.tv_usec = rts64.tv_nsec / NSEC_PER_USEC; + rtv.tv_sec = rts.tv_sec; + rtv.tv_usec = rts.tv_nsec / NSEC_PER_USEC; if (!copy_to_user(p, &rtv, sizeof(rtv))) return ret; - } else if (!copy_to_user(p, &rts, sizeof(rts))) + } else if (!put_timespec64(&rts, p)) return ret; /* @@ -705,17 +703,15 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp, const sigset_t __user *sigmask, size_t sigsetsize) { sigset_t ksigmask, sigsaved; - struct timespec ts; - struct timespec64 ts64, end_time, *to = NULL; + struct timespec64 ts, end_time, *to = NULL; int ret; if (tsp) { - if (copy_from_user(&ts, tsp, sizeof(ts))) + if (get_timespec64(&ts, tsp)) return -EFAULT; - ts64 = timespec_to_timespec64(ts); to = &end_time; - if (poll_select_set_timeout(to, ts64.tv_sec, ts64.tv_nsec)) + if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec)) return -EINVAL; } @@ -1050,12 +1046,11 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, size_t, sigsetsize) { sigset_t ksigmask, sigsaved; - struct timespec ts; - struct timespec64 end_time, *to = NULL; + struct timespec64 ts, end_time, *to = NULL; int ret; if (tsp) { - if (copy_from_user(&ts, tsp, sizeof(ts))) + if (get_timespec64(&ts, tsp)) return -EFAULT; to = &end_time; @@ -1101,10 +1096,10 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, #define __COMPAT_NFDBITS (8 * sizeof(compat_ulong_t)) static -int compat_poll_select_copy_remaining(struct timespec *end_time, void __user *p, +int compat_poll_select_copy_remaining(struct timespec64 *end_time, void __user *p, int timeval, int ret) { - struct timespec ts; + struct timespec64 ts; if (!p) return ret; @@ -1116,8 +1111,8 @@ int compat_poll_select_copy_remaining(struct timespec *end_time, void __user *p, if (!end_time->tv_sec && !end_time->tv_nsec) return ret; - ktime_get_ts(&ts); - ts = timespec_sub(*end_time, ts); + ktime_get_ts64(&ts); + ts = timespec64_sub(*end_time, ts); if (ts.tv_sec < 0) ts.tv_sec = ts.tv_nsec = 0; @@ -1130,12 +1125,7 @@ int compat_poll_select_copy_remaining(struct timespec *end_time, void __user *p, if (!copy_to_user(p, &rtv, sizeof(rtv))) return ret; } else { - struct compat_timespec rts; - - rts.tv_sec = ts.tv_sec; - rts.tv_nsec = ts.tv_nsec; - - if (!copy_to_user(p, &rts, sizeof(rts))) + if (!compat_put_timespec64(&ts, p)) return ret; } /* @@ -1193,7 +1183,7 @@ int compat_set_fd_set(unsigned long nr, compat_ulong_t __user *ufdset, */ static int compat_core_sys_select(int n, compat_ulong_t __user *inp, compat_ulong_t __user *outp, compat_ulong_t __user *exp, - struct timespec *end_time) + struct timespec64 *end_time) { fd_set_bits fds; void *bits; @@ -1266,7 +1256,7 @@ COMPAT_SYSCALL_DEFINE5(select, int, n, compat_ulong_t __user *, inp, compat_ulong_t __user *, outp, compat_ulong_t __user *, exp, struct compat_timeval __user *, tvp) { - struct timespec end_time, *to = NULL; + struct timespec64 end_time, *to = NULL; struct compat_timeval tv; int ret; @@ -1312,12 +1302,11 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp, { compat_sigset_t ss32; sigset_t ksigmask, sigsaved; - struct compat_timespec ts; - struct timespec end_time, *to = NULL; + struct timespec64 ts, end_time, *to = NULL; int ret; if (tsp) { - if (copy_from_user(&ts, tsp, sizeof(ts))) + if (compat_get_timespec64(&ts, tsp)) return -EFAULT; to = &end_time; @@ -1381,12 +1370,11 @@ COMPAT_SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, { compat_sigset_t ss32; sigset_t ksigmask, sigsaved; - struct compat_timespec ts; - struct timespec end_time, *to = NULL; + struct timespec64 ts, end_time, *to = NULL; int ret; if (tsp) { - if (copy_from_user(&ts, tsp, sizeof(ts))) + if (compat_get_timespec64(&ts, tsp)) return -EFAULT; to = &end_time; diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c index c41c3b6a3251..68bc8b610e22 100644 --- a/fs/ubifs/dir.c +++ b/fs/ubifs/dir.c @@ -1298,7 +1298,7 @@ static int do_rename(struct inode *old_dir, struct dentry *old_dentry, .dirtied_ino = 3 }; struct ubifs_budget_req ino_req = { .dirtied_ino = 1, .dirtied_ino_d = ALIGN(old_inode_ui->data_len, 8) }; - struct timespec time; + struct timespec64 time; unsigned int uninitialized_var(saved_nlink); struct fscrypt_name old_nm, new_nm; @@ -1540,7 +1540,7 @@ static int ubifs_xrename(struct inode *old_dir, struct dentry *old_dentry, int sync = IS_DIRSYNC(old_dir) || IS_DIRSYNC(new_dir); struct inode *fst_inode = d_inode(old_dentry); struct inode *snd_inode = d_inode(new_dentry); - struct timespec time; + struct timespec64 time; int err; struct fscrypt_name fst_nm, snd_nm; diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c index b1ec5e20e876..c078c1f34ab2 100644 --- a/fs/ubifs/file.c +++ b/fs/ubifs/file.c @@ -1092,14 +1092,14 @@ static void do_attr_changes(struct inode *inode, const struct iattr *attr) if (attr->ia_valid & ATTR_GID) inode->i_gid = attr->ia_gid; if (attr->ia_valid & ATTR_ATIME) - inode->i_atime = timespec_trunc(attr->ia_atime, - inode->i_sb->s_time_gran); + inode->i_atime = timespec64_trunc(attr->ia_atime, + inode->i_sb->s_time_gran); if (attr->ia_valid & ATTR_MTIME) - inode->i_mtime = timespec_trunc(attr->ia_mtime, - inode->i_sb->s_time_gran); + inode->i_mtime = timespec64_trunc(attr->ia_mtime, + inode->i_sb->s_time_gran); if (attr->ia_valid & ATTR_CTIME) - inode->i_ctime = timespec_trunc(attr->ia_ctime, - inode->i_sb->s_time_gran); + inode->i_ctime = timespec64_trunc(attr->ia_ctime, + inode->i_sb->s_time_gran); if (attr->ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; @@ -1374,8 +1374,9 @@ out: static inline int mctime_update_needed(const struct inode *inode, const struct timespec *now) { - if (!timespec_equal(&inode->i_mtime, now) || - !timespec_equal(&inode->i_ctime, now)) + struct timespec64 now64 = timespec_to_timespec64(*now); + if (!timespec64_equal(&inode->i_mtime, &now64) || + !timespec64_equal(&inode->i_ctime, &now64)) return 1; return 0; } @@ -1387,7 +1388,7 @@ static inline int mctime_update_needed(const struct inode *inode, * * This function updates time of the inode. */ -int ubifs_update_time(struct inode *inode, struct timespec *time, +int ubifs_update_time(struct inode *inode, struct timespec64 *time, int flags) { struct ubifs_inode *ui = ubifs_inode(inode); @@ -1431,7 +1432,7 @@ int ubifs_update_time(struct inode *inode, struct timespec *time, */ static int update_mctime(struct inode *inode) { - struct timespec now = current_time(inode); + struct timespec now = timespec64_to_timespec(current_time(inode)); struct ubifs_inode *ui = ubifs_inode(inode); struct ubifs_info *c = inode->i_sb->s_fs_info; @@ -1525,7 +1526,7 @@ static int ubifs_vm_page_mkwrite(struct vm_fault *vmf) struct page *page = vmf->page; struct inode *inode = file_inode(vmf->vma->vm_file); struct ubifs_info *c = inode->i_sb->s_fs_info; - struct timespec now = current_time(inode); + struct timespec now = timespec64_to_timespec(current_time(inode)); struct ubifs_budget_req req = { .new_page = 1 }; int err, update_time; diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h index 3949f8c9b515..10e3884347dd 100644 --- a/fs/ubifs/ubifs.h +++ b/fs/ubifs/ubifs.h @@ -1739,7 +1739,7 @@ int ubifs_calc_dark(const struct ubifs_info *c, int spc); int ubifs_fsync(struct file *file, loff_t start, loff_t end, int datasync); int ubifs_setattr(struct dentry *dentry, struct iattr *attr); #ifdef CONFIG_UBIFS_ATIME_SUPPORT -int ubifs_update_time(struct inode *inode, struct timespec *time, int flags); +int ubifs_update_time(struct inode *inode, struct timespec64 *time, int flags); #endif /* dir.c */ diff --git a/fs/udf/ialloc.c b/fs/udf/ialloc.c index c1ed18a10ce4..13574f17bc0f 100644 --- a/fs/udf/ialloc.c +++ b/fs/udf/ialloc.c @@ -120,8 +120,8 @@ struct inode *udf_new_inode(struct inode *dir, umode_t mode) iinfo->i_alloc_type = ICBTAG_FLAG_AD_SHORT; else iinfo->i_alloc_type = ICBTAG_FLAG_AD_LONG; - inode->i_mtime = inode->i_atime = inode->i_ctime = - iinfo->i_crtime = current_time(inode); + inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode); + iinfo->i_crtime = timespec64_to_timespec(inode->i_mtime); if (unlikely(insert_inode_locked(inode) < 0)) { make_bad_inode(inode); iput(inode); diff --git a/fs/udf/inode.c b/fs/udf/inode.c index b2564517e348..d4e3d9c5ca4f 100644 --- a/fs/udf/inode.c +++ b/fs/udf/inode.c @@ -1296,6 +1296,7 @@ static int udf_read_inode(struct inode *inode, bool hidden_inode) struct udf_inode_info *iinfo = UDF_I(inode); struct udf_sb_info *sbi = UDF_SB(inode->i_sb); struct kernel_lb_addr *iloc = &iinfo->i_location; + struct timespec ts; unsigned int link_count; unsigned int indirections = 0; int bs = inode->i_sb->s_blocksize; @@ -1471,9 +1472,12 @@ reread: inode->i_blocks = le64_to_cpu(fe->logicalBlocksRecorded) << (inode->i_sb->s_blocksize_bits - 9); - udf_disk_stamp_to_time(&inode->i_atime, fe->accessTime); - udf_disk_stamp_to_time(&inode->i_mtime, fe->modificationTime); - udf_disk_stamp_to_time(&inode->i_ctime, fe->attrTime); + udf_disk_stamp_to_time(&ts, fe->accessTime); + inode->i_atime = timespec_to_timespec64(ts); + udf_disk_stamp_to_time(&ts, fe->modificationTime); + inode->i_mtime = timespec_to_timespec64(ts); + udf_disk_stamp_to_time(&ts, fe->attrTime); + inode->i_ctime = timespec_to_timespec64(ts); iinfo->i_unique = le64_to_cpu(fe->uniqueID); iinfo->i_lenEAttr = le32_to_cpu(fe->lengthExtendedAttr); @@ -1483,10 +1487,13 @@ reread: inode->i_blocks = le64_to_cpu(efe->logicalBlocksRecorded) << (inode->i_sb->s_blocksize_bits - 9); - udf_disk_stamp_to_time(&inode->i_atime, efe->accessTime); - udf_disk_stamp_to_time(&inode->i_mtime, efe->modificationTime); + udf_disk_stamp_to_time(&ts, efe->accessTime); + inode->i_atime = timespec_to_timespec64(ts); + udf_disk_stamp_to_time(&ts, efe->modificationTime); + inode->i_mtime = timespec_to_timespec64(ts); udf_disk_stamp_to_time(&iinfo->i_crtime, efe->createTime); - udf_disk_stamp_to_time(&inode->i_ctime, efe->attrTime); + udf_disk_stamp_to_time(&ts, efe->attrTime); + inode->i_ctime = timespec_to_timespec64(ts); iinfo->i_unique = le64_to_cpu(efe->uniqueID); iinfo->i_lenEAttr = le32_to_cpu(efe->lengthExtendedAttr); @@ -1736,9 +1743,12 @@ static int udf_update_inode(struct inode *inode, int do_sync) inode->i_sb->s_blocksize - sizeof(struct fileEntry)); fe->logicalBlocksRecorded = cpu_to_le64(lb_recorded); - udf_time_to_disk_stamp(&fe->accessTime, inode->i_atime); - udf_time_to_disk_stamp(&fe->modificationTime, inode->i_mtime); - udf_time_to_disk_stamp(&fe->attrTime, inode->i_ctime); + udf_time_to_disk_stamp(&fe->accessTime, + timespec64_to_timespec(inode->i_atime)); + udf_time_to_disk_stamp(&fe->modificationTime, + timespec64_to_timespec(inode->i_mtime)); + udf_time_to_disk_stamp(&fe->attrTime, + timespec64_to_timespec(inode->i_ctime)); memset(&(fe->impIdent), 0, sizeof(struct regid)); strcpy(fe->impIdent.ident, UDF_ID_DEVELOPER); fe->impIdent.identSuffix[0] = UDF_OS_CLASS_UNIX; @@ -1757,14 +1767,17 @@ static int udf_update_inode(struct inode *inode, int do_sync) efe->objectSize = cpu_to_le64(inode->i_size); efe->logicalBlocksRecorded = cpu_to_le64(lb_recorded); - udf_adjust_time(iinfo, inode->i_atime); - udf_adjust_time(iinfo, inode->i_mtime); - udf_adjust_time(iinfo, inode->i_ctime); + udf_adjust_time(iinfo, timespec64_to_timespec(inode->i_atime)); + udf_adjust_time(iinfo, timespec64_to_timespec(inode->i_mtime)); + udf_adjust_time(iinfo, timespec64_to_timespec(inode->i_ctime)); - udf_time_to_disk_stamp(&efe->accessTime, inode->i_atime); - udf_time_to_disk_stamp(&efe->modificationTime, inode->i_mtime); + udf_time_to_disk_stamp(&efe->accessTime, + timespec64_to_timespec(inode->i_atime)); + udf_time_to_disk_stamp(&efe->modificationTime, + timespec64_to_timespec(inode->i_mtime)); udf_time_to_disk_stamp(&efe->createTime, iinfo->i_crtime); - udf_time_to_disk_stamp(&efe->attrTime, inode->i_ctime); + udf_time_to_disk_stamp(&efe->attrTime, + timespec64_to_timespec(inode->i_ctime)); memset(&(efe->impIdent), 0, sizeof(efe->impIdent)); strcpy(efe->impIdent.ident, UDF_ID_DEVELOPER); diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index cfc7d6e01158..daf5786a3d58 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -778,7 +778,7 @@ xfs_ialloc( xfs_inode_t *ip; uint flags; int error; - struct timespec tv; + struct timespec64 tv; struct inode *inode; /* diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 16d5a949fb11..2d92686a0523 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1060,7 +1060,7 @@ xfs_vn_setattr( STATIC int xfs_vn_update_time( struct inode *inode, - struct timespec *now, + struct timespec64 *now, int flags) { struct xfs_inode *ip = XFS_I(inode); diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c index daa7615497f9..d791b7aaf8e6 100644 --- a/fs/xfs/xfs_trans_inode.c +++ b/fs/xfs/xfs_trans_inode.c @@ -68,7 +68,7 @@ xfs_trans_ichgtime( int flags) { struct inode *inode = VFS_I(ip); - struct timespec tv; + struct timespec64 tv; ASSERT(tp); ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); diff --git a/include/asm-generic/atomic-instrumented.h b/include/asm-generic/atomic-instrumented.h index 66f964a9a8ff..d6de08113679 100644 --- a/include/asm-generic/atomic-instrumented.h +++ b/include/asm-generic/atomic-instrumented.h @@ -84,10 +84,10 @@ static __always_inline bool atomic64_try_cmpxchg(atomic64_t *v, s64 *old, s64 ne } #endif -static __always_inline int __atomic_add_unless(atomic_t *v, int a, int u) +static __always_inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { kasan_check_write(v, sizeof(*v)); - return __arch_atomic_add_unless(v, a, u); + return arch_atomic_fetch_add_unless(v, a, u); } diff --git a/include/asm-generic/atomic.h b/include/asm-generic/atomic.h index 3f38eb03649c..d2115b9fa1b4 100644 --- a/include/asm-generic/atomic.h +++ b/include/asm-generic/atomic.h @@ -223,8 +223,8 @@ static inline void atomic_dec(atomic_t *v) #define atomic_xchg(ptr, v) (xchg(&(ptr)->counter, (v))) #define atomic_cmpxchg(v, old, new) (cmpxchg(&((v)->counter), (old), (new))) -#ifndef __atomic_add_unless -static inline int __atomic_add_unless(atomic_t *v, int a, int u) +#ifndef atomic_fetch_add_unless +static inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { int c, old; c = atomic_read(v); diff --git a/include/asm-generic/error-injection.h b/include/asm-generic/error-injection.h new file mode 100644 index 000000000000..08352c9d9f97 --- /dev/null +++ b/include/asm-generic/error-injection.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_GENERIC_ERROR_INJECTION_H +#define _ASM_GENERIC_ERROR_INJECTION_H + +#if defined(__KERNEL__) && !defined(__ASSEMBLY__) +#ifdef CONFIG_FUNCTION_ERROR_INJECTION +/* + * Whitelist ganerating macro. Specify functions which can be + * error-injectable using this macro. + */ +#define ALLOW_ERROR_INJECTION(fname) \ +static unsigned long __used \ + __attribute__((__section__("_error_injection_whitelist"))) \ + _eil_addr_##fname = (unsigned long)fname; +#else +#define ALLOW_ERROR_INJECTION(fname) +#endif +#endif + +#endif /* _ASM_GENERIC_ERROR_INJECTION_H */ diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index b89ddad8e182..c12fef18ae55 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -139,6 +139,15 @@ #define KPROBE_BLACKLIST() #endif +#ifdef CONFIG_FUNCTION_ERROR_INJECTION +#define ERROR_INJECT_WHITELIST() . = ALIGN(8); \ + VMLINUX_SYMBOL(__start_error_injection_whitelist) = .;\ + KEEP(*(_error_injection_whitelist)) \ + VMLINUX_SYMBOL(__stop_error_injection_whitelist) = .; +#else +#define ERROR_INJECT_WHITELIST() +#endif + #ifdef CONFIG_EVENT_TRACING #define FTRACE_EVENTS() . = ALIGN(8); \ VMLINUX_SYMBOL(__start_ftrace_events) = .; \ @@ -173,10 +182,10 @@ #endif #ifdef CONFIG_BPF_EVENTS -#define BPF_RAW_TP() STRUCT_ALIGN(); \ - VMLINUX_SYMBOL(__start__bpf_raw_tp) = .; \ - KEEP(*(__bpf_raw_tp_map)) \ - VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .; +#define BPF_RAW_TP() STRUCT_ALIGN(); \ + VMLINUX_SYMBOL(__start__bpf_raw_tp) = .; \ + KEEP(*(__bpf_raw_tp_map)) \ + VMLINUX_SYMBOL(__stop__bpf_raw_tp) = .; #else #define BPF_RAW_TP() #endif @@ -249,7 +258,7 @@ LIKELY_PROFILE() \ BRANCH_PROFILE() \ TRACE_PRINTKS() \ - BPF_RAW_TP() \ + BPF_RAW_TP() \ TRACEPOINT_STR() /* @@ -495,10 +504,12 @@ VMLINUX_SYMBOL(__start___modver) = .; \ KEEP(*(__modver)) \ VMLINUX_SYMBOL(__stop___modver) = .; \ - . = ALIGN((align)); \ - VMLINUX_SYMBOL(__end_rodata) = .; \ } \ - . = ALIGN((align)); + \ + BTF \ + \ + . = ALIGN((align)); \ + VMLINUX_SYMBOL(__end_rodata) = .; /* RODATA & RO_DATA provided for backward compatibility. * All archs are supposed to use RO_DATA() */ @@ -600,6 +611,20 @@ VMLINUX_SYMBOL(__stop___ex_table) = .; \ } +/* + * .BTF + */ +#ifdef CONFIG_DEBUG_INFO_BTF +#define BTF \ + .BTF : AT(ADDR(.BTF) - LOAD_OFFSET) { \ + __start_BTF = .; \ + *(.BTF) \ + __stop_BTF = .; \ + } +#else +#define BTF +#endif + /* * Init task */ @@ -631,6 +656,7 @@ FTRACE_EVENTS() \ TRACE_SYSCALLS() \ KPROBE_BLACKLIST() \ + ERROR_INJECT_WHITELIST() \ MEM_DISCARD(init.rodata) \ CLK_OF_TABLES() \ RESERVEDMEM_OF_TABLES() \ diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h index 880e6be9e95e..eac72840a7d2 100644 --- a/include/crypto/scatterwalk.h +++ b/include/crypto/scatterwalk.h @@ -22,14 +22,8 @@ #include static inline void scatterwalk_crypto_chain(struct scatterlist *head, - struct scatterlist *sg, - int chain, int num) + struct scatterlist *sg, int num) { - if (chain) { - head->length += sg->length; - sg = sg_next(sg); - } - if (sg) sg_chain(head, num, sg); else diff --git a/include/linux/atomic.h b/include/linux/atomic.h index 01ce3997cb42..9cc982936675 100644 --- a/include/linux/atomic.h +++ b/include/linux/atomic.h @@ -530,7 +530,7 @@ */ static inline int atomic_add_unless(atomic_t *v, int a, int u) { - return __atomic_add_unless(v, a, u) != u; + return atomic_fetch_add_unless(v, a, u) != u; } /** diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index b0abe21d6cc9..c783a7b9f284 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -147,5 +147,6 @@ extern int do_execveat(int, struct filename *, const char __user * const __user *, const char __user * const __user *, int); +int do_execve_file(struct file *file, void *__argv, void *__envp); #endif /* _LINUX_BINFMTS_H */ diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 4b6cd41e2675..169fd25f6bc2 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -17,23 +18,27 @@ struct bpf_map; struct bpf_prog; struct bpf_sock_ops_kern; struct bpf_cgroup_storage; - struct ctl_table; struct ctl_table_header; + #ifdef CONFIG_CGROUP_BPF extern struct static_key_false cgroup_bpf_enabled_key; #define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key) -DECLARE_PER_CPU(void*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); +DECLARE_PER_CPU(struct bpf_cgroup_storage*, + bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); + #define for_each_cgroup_storage_type(stype) \ for (stype = 0; stype < MAX_BPF_CGROUP_STORAGE_TYPE; stype++) struct bpf_cgroup_storage_map; + struct bpf_storage_buffer { struct rcu_head rcu; char data[0]; }; + struct bpf_cgroup_storage { union { struct bpf_storage_buffer *buf; @@ -67,16 +72,22 @@ struct cgroup_bpf { u32 flags[MAX_BPF_ATTACH_TYPE]; /* temp storage for effective prog array used by prog_attach/detach */ - struct bpf_prog_array __rcu *inactive; + struct bpf_prog_array *inactive; + + /* reference counter used to detach bpf programs after cgroup removal */ + struct percpu_ref refcnt; + + /* cgroup_bpf is released using a work queue */ + struct work_struct release_work; }; -void cgroup_bpf_put(struct cgroup *cgrp); int cgroup_bpf_inherit(struct cgroup *cgrp); +void cgroup_bpf_offline(struct cgroup *cgrp); int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, enum bpf_attach_type type, u32 flags); int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, - enum bpf_attach_type type, u32 flags); + enum bpf_attach_type type); int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, union bpf_attr __user *uattr); @@ -105,10 +116,12 @@ int __cgroup_bpf_run_filter_sock_ops(struct sock *sk, enum bpf_attach_type type); int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor, - short access, enum bpf_attach_type type); + short access, enum bpf_attach_type type); int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head, struct ctl_table *table, int write, + void __user *buf, size_t *pcount, + loff_t *ppos, void **new_buf, enum bpf_attach_type type); int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int *level, @@ -127,17 +140,14 @@ static inline enum bpf_cgroup_storage_type cgroup_storage_type( return BPF_CGROUP_STORAGE_SHARED; } + static inline void bpf_cgroup_storage_set(struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE]) { enum bpf_cgroup_storage_type stype; - struct bpf_storage_buffer *buf; - for_each_cgroup_storage_type(stype) { - if (!storage[stype]) - continue; - buf = READ_ONCE(storage[stype]->buf); - this_cpu_write(bpf_cgroup_storage[stype], &buf->data[0]); - } + + for_each_cgroup_storage_type(stype) + this_cpu_write(bpf_cgroup_storage[stype], storage[stype]); } struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog, @@ -180,7 +190,7 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_SK_PROG(sk, type) \ ({ \ int __ret = 0; \ - if (cgroup_bpf_enabled && sk) { \ + if (cgroup_bpf_enabled) { \ __ret = __cgroup_bpf_run_filter_sk(sk, type); \ } \ __ret; \ @@ -262,22 +272,24 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, __ret; \ }) -#define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type, major, minor, access) \ -({ \ - int __ret = 0; \ - if (cgroup_bpf_enabled) \ - __ret = __cgroup_bpf_check_dev_permission(type, major, minor, \ - access, \ - BPF_CGROUP_DEVICE); \ - \ - __ret; \ +#define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type, major, minor, access) \ +({ \ + int __ret = 0; \ + if (cgroup_bpf_enabled) \ + __ret = __cgroup_bpf_check_dev_permission(type, major, minor, \ + access, \ + BPF_CGROUP_DEVICE); \ + \ + __ret; \ }) -#define BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write) \ + +#define BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, buf, count, pos, nbuf) \ ({ \ int __ret = 0; \ if (cgroup_bpf_enabled) \ __ret = __cgroup_bpf_run_filter_sysctl(head, table, write, \ + buf, count, pos, nbuf, \ BPF_CGROUP_SYSCTL); \ __ret; \ }) @@ -314,8 +326,38 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, __ret; \ }) +int cgroup_bpf_prog_attach(const union bpf_attr *attr, + enum bpf_prog_type ptype, struct bpf_prog *prog); +int cgroup_bpf_prog_detach(const union bpf_attr *attr, + enum bpf_prog_type ptype); +int cgroup_bpf_prog_query(const union bpf_attr *attr, + union bpf_attr __user *uattr); #else +struct bpf_prog; +struct cgroup_bpf {}; +static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; } +static inline void cgroup_bpf_offline(struct cgroup *cgrp) {} + +static inline int cgroup_bpf_prog_attach(const union bpf_attr *attr, + enum bpf_prog_type ptype, + struct bpf_prog *prog) +{ + return -EINVAL; +} + +static inline int cgroup_bpf_prog_detach(const union bpf_attr *attr, + enum bpf_prog_type ptype) +{ + return -EINVAL; +} + +static inline int cgroup_bpf_prog_query(const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + return -EINVAL; +} + static inline void bpf_cgroup_storage_set( struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE]) {} static inline int bpf_cgroup_storage_assign(struct bpf_prog *prog, @@ -323,14 +365,9 @@ static inline int bpf_cgroup_storage_assign(struct bpf_prog *prog, static inline void bpf_cgroup_storage_release(struct bpf_prog *prog, struct bpf_map *map) {} static inline struct bpf_cgroup_storage *bpf_cgroup_storage_alloc( - struct bpf_prog *prog, enum bpf_cgroup_storage_type stype) { return 0; } + struct bpf_prog *prog, enum bpf_cgroup_storage_type stype) { return NULL; } static inline void bpf_cgroup_storage_free( struct bpf_cgroup_storage *storage) {} - -struct cgroup_bpf {}; -static inline void cgroup_bpf_put(struct cgroup *cgrp) {} -static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; } - static inline int bpf_percpu_cgroup_storage_copy(struct bpf_map *map, void *key, void *value) { return 0; @@ -359,7 +396,7 @@ static inline int bpf_percpu_cgroup_storage_update(struct bpf_map *map, #define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) ({ 0; }) #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; }) #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; }) -#define BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write) ({ 0; }) +#define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; }) #define BPF_CGROUP_GETSOCKOPT_MAX_OPTLEN(optlen) ({ 0; }) #define BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock, level, optname, optval, \ optlen, max_optlen, retval) ({ retval; }) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 7074ebe4c2d0..ee38019d5805 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -16,6 +16,7 @@ #include #include #include +#include struct bpf_verifier_env; struct perf_event; @@ -26,6 +27,9 @@ struct seq_file; struct btf; struct btf_type; +extern struct idr btf_idr; +extern spinlock_t btf_idr_lock; + /* map is generic key/value storage optionally accesible by eBPF programs */ struct bpf_map_ops { /* funcs callable from userspace (via syscall) */ @@ -41,15 +45,14 @@ struct bpf_map_ops { void *(*map_lookup_elem)(struct bpf_map *map, void *key); int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags); int (*map_delete_elem)(struct bpf_map *map, void *key); + int (*map_push_elem)(struct bpf_map *map, void *value, u64 flags); + int (*map_pop_elem)(struct bpf_map *map, void *value); + int (*map_peek_elem)(struct bpf_map *map, void *value); /* funcs called by prog_array and perf_event_array map */ void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file, int fd); - /* If need_defer is true, the implementation should guarantee that - * the to-be-put element is still alive before the bpf program, which - * may manipulate it, exists. - */ - void (*map_fd_put_ptr)(struct bpf_map *map, void *ptr, bool need_defer); + void (*map_fd_put_ptr)(void *ptr); u32 (*map_gen_lookup)(struct bpf_map *map, struct bpf_insn *insn_buf); u32 (*map_fd_sys_lookup_elem)(void *ptr); void (*map_seq_show_elem)(struct bpf_map *map, void *key, @@ -66,6 +69,11 @@ struct bpf_map_ops { u64 imm, u32 *off); }; +struct bpf_map_memory { + u32 pages; + struct user_struct *user; +}; + struct bpf_map { /* The first two cachelines with read-mostly members of which some * are also accessed in fast-path (e.g. ops, max_entries). @@ -86,15 +94,15 @@ struct bpf_map { u32 btf_key_type_id; u32 btf_value_type_id; struct btf *btf; - u32 pages; + struct bpf_map_memory memory; bool unpriv_array; - /* 51 bytes hole */ + bool frozen; /* write-once */ + /* 48 bytes hole */ /* The 3rd and 4th cacheline with misc members to avoid false sharing * particularly with refcounting. */ - struct user_struct *user ____cacheline_aligned; - atomic_t refcnt; + atomic_t refcnt ____cacheline_aligned; atomic_t usercnt; struct work_struct work; char name[BPF_OBJ_NAME_LEN]; @@ -118,6 +126,7 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src) { if (unlikely(map_value_has_spin_lock(map))) { u32 off = map->spin_lock_off; + memcpy(dst, src, off); memcpy(dst + off + sizeof(struct bpf_spin_lock), src + off + sizeof(struct bpf_spin_lock), @@ -126,11 +135,12 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src) memcpy(dst, src, map->value_size); } } - void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, bool lock_src); +struct bpf_offload_dev; struct bpf_offloaded_map; + struct bpf_map_dev_ops { int (*map_get_next_key)(struct bpf_offloaded_map *map, void *key, void *next_key); @@ -140,6 +150,7 @@ struct bpf_map_dev_ops { void *key, void *value, u64 flags); int (*map_delete_elem)(struct bpf_offloaded_map *map, void *key); }; + struct bpf_offloaded_map { struct bpf_map map; struct net_device *netdev; @@ -147,11 +158,17 @@ struct bpf_offloaded_map { void *dev_priv; struct list_head offloads; }; + static inline struct bpf_offloaded_map *map_to_offmap(struct bpf_map *map) { return container_of(map, struct bpf_offloaded_map, map); } +static inline bool bpf_map_offload_neutral(const struct bpf_map *map) +{ + return map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY; +} + static inline bool bpf_map_support_seq_show(const struct bpf_map *map) { return map->btf && map->ops->map_seq_show_elem; @@ -266,6 +283,7 @@ enum bpf_reg_type { PTR_TO_TCP_SOCK, /* reg points to struct tcp_sock */ PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct tcp_sock or NULL */ PTR_TO_TP_BUFFER, /* reg points to a writable raw tp's buffer */ + PTR_TO_XDP_SOCK, /* reg points to struct xdp_sock */ }; /* The information passed from prog-specific *_is_valid_access @@ -284,7 +302,7 @@ bpf_ctx_record_field_size(struct bpf_insn_access_aux *aux, u32 size) struct bpf_prog_ops { int (*test_run)(struct bpf_prog *prog, const union bpf_attr *kattr, - union bpf_attr __user *uattr); + union bpf_attr __user *uattr); }; struct bpf_verifier_ops { @@ -301,6 +319,8 @@ struct bpf_verifier_ops { struct bpf_insn_access_aux *info); int (*gen_prologue)(struct bpf_insn *insn, bool direct_write, const struct bpf_prog *prog); + int (*gen_ld_abs)(const struct bpf_insn *orig, + struct bpf_insn *insn_buf); u32 (*convert_ctx_access)(enum bpf_access_type type, const struct bpf_insn *src, struct bpf_insn *dst, @@ -308,17 +328,30 @@ struct bpf_verifier_ops { }; struct bpf_prog_offload_ops { + /* verifier basic callbacks */ int (*insn_hook)(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); + int (*finalize)(struct bpf_verifier_env *env); + /* verifier optimization callbacks (called after .finalize) */ + int (*replace_insn)(struct bpf_verifier_env *env, u32 off, + struct bpf_insn *insn); + int (*remove_insns)(struct bpf_verifier_env *env, u32 off, u32 cnt); + /* program management callbacks */ + int (*prepare)(struct bpf_prog *prog); + int (*translate)(struct bpf_prog *prog); + void (*destroy)(struct bpf_prog *prog); }; struct bpf_prog_offload { struct bpf_prog *prog; struct net_device *netdev; + struct bpf_offload_dev *offdev; void *dev_priv; struct list_head offloads; bool dev_state; - const struct bpf_prog_offload_ops *dev_ops; + bool opt_failed; + void *jited_image; + u32 jited_len; }; enum bpf_cgroup_storage_type { @@ -326,17 +359,27 @@ enum bpf_cgroup_storage_type { BPF_CGROUP_STORAGE_PERCPU, __BPF_CGROUP_STORAGE_MAX }; + #define MAX_BPF_CGROUP_STORAGE_TYPE __BPF_CGROUP_STORAGE_MAX +struct bpf_prog_stats { + u64 cnt; + u64 nsecs; + struct u64_stats_sync syncp; +}; + struct bpf_prog_aux { atomic_t refcnt; u32 used_map_cnt; u32 max_ctx_offset; + u32 max_pkt_offset; u32 max_tp_access; u32 stack_depth; u32 id; u32 func_cnt; /* used by non-func prog as the number of func progs */ u32 func_idx; /* 0 for non-func prog, the index in func array for func prog */ + bool verifier_zext; /* Zero extensions has been inserted by verifier. */ + bool offload_requested; struct bpf_prog **func; void *jit_data; /* JIT specific data. arch dependent */ struct latch_tree_node ksym_tnode; @@ -376,6 +419,7 @@ struct bpf_prog_aux { * main prog always has linfo_idx == 0 */ u32 linfo_idx; + struct bpf_prog_stats __percpu *stats; union { struct work_struct work; struct rcu_head rcu; @@ -400,8 +444,38 @@ struct bpf_array { }; }; +#define BPF_COMPLEXITY_LIMIT_INSNS 1000000 /* yes. 1M insns */ #define MAX_TAIL_CALL_CNT 32 +#define BPF_F_ACCESS_MASK (BPF_F_RDONLY | \ + BPF_F_RDONLY_PROG | \ + BPF_F_WRONLY | \ + BPF_F_WRONLY_PROG) + +#define BPF_MAP_CAN_READ BIT(0) +#define BPF_MAP_CAN_WRITE BIT(1) + +static inline u32 bpf_map_flags_to_cap(struct bpf_map *map) +{ + u32 access_flags = map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG); + + /* Combination of BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG is + * not possible. + */ + if (access_flags & BPF_F_RDONLY_PROG) + return BPF_MAP_CAN_READ; + else if (access_flags & BPF_F_WRONLY_PROG) + return BPF_MAP_CAN_WRITE; + else + return BPF_MAP_CAN_READ | BPF_MAP_CAN_WRITE; +} + +static inline bool bpf_map_flags_access_ok(u32 access_flags) +{ + return (access_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) != + (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG); +} + struct bpf_event_entry { struct perf_event *event; struct file *perf_file; @@ -409,9 +483,6 @@ struct bpf_event_entry { struct rcu_head rcu; }; -u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5); -u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); - bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp); int bpf_prog_calc_tag(struct bpf_prog *fp); @@ -428,11 +499,6 @@ typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type, u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy); -int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, - union bpf_attr __user *uattr); -int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, - union bpf_attr __user *uattr); - /* an array of programs to be executed under rcu_lock. * * Typical usage: @@ -456,48 +522,101 @@ struct bpf_prog_array { }; struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags); -void bpf_prog_array_free(struct bpf_prog_array __rcu *progs); -int bpf_prog_array_length(struct bpf_prog_array __rcu *progs); +void bpf_prog_array_free(struct bpf_prog_array *progs); +int bpf_prog_array_length(struct bpf_prog_array *progs); bool bpf_prog_array_is_empty(struct bpf_prog_array *array); -int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs, +int bpf_prog_array_copy_to_user(struct bpf_prog_array *progs, __u32 __user *prog_ids, u32 cnt); -void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs, +void bpf_prog_array_delete_safe(struct bpf_prog_array *progs, struct bpf_prog *old_prog); -int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array, +int bpf_prog_array_copy_info(struct bpf_prog_array *array, u32 *prog_ids, u32 request_cnt, u32 *prog_cnt); -int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array, +int bpf_prog_array_copy(struct bpf_prog_array *old_array, struct bpf_prog *exclude_prog, struct bpf_prog *include_prog, struct bpf_prog_array **new_array); -#define __BPF_PROG_RUN_ARRAY(array, ctx, func, check_non_null) \ +#define __BPF_PROG_RUN_ARRAY(array, ctx, func, check_non_null, set_cg_storage) \ ({ \ struct bpf_prog_array_item *_item; \ struct bpf_prog *_prog; \ struct bpf_prog_array *_array; \ u32 _ret = 1; \ + preempt_disable(); \ rcu_read_lock(); \ _array = rcu_dereference(array); \ if (unlikely(check_non_null && !_array))\ goto _out; \ _item = &_array->items[0]; \ - while ((_prog = READ_ONCE(_item->prog))) { \ - bpf_cgroup_storage_set(_item->cgroup_storage); \ + while ((_prog = READ_ONCE(_item->prog))) { \ + if (set_cg_storage) \ + bpf_cgroup_storage_set(_item->cgroup_storage); \ _ret &= func(_prog, ctx); \ _item++; \ } \ _out: \ rcu_read_unlock(); \ + preempt_enable(); \ _ret; \ }) +/* To be used by __cgroup_bpf_run_filter_skb for EGRESS BPF progs + * so BPF programs can request cwr for TCP packets. + * + * Current cgroup skb programs can only return 0 or 1 (0 to drop the + * packet. This macro changes the behavior so the low order bit + * indicates whether the packet should be dropped (0) or not (1) + * and the next bit is a congestion notification bit. This could be + * used by TCP to call tcp_enter_cwr() + * + * Hence, new allowed return values of CGROUP EGRESS BPF programs are: + * 0: drop packet + * 1: keep packet + * 2: drop packet and cn + * 3: keep packet and cn + * + * This macro then converts it to one of the NET_XMIT or an error + * code that is then interpreted as drop packet (and no cn): + * 0: NET_XMIT_SUCCESS skb should be transmitted + * 1: NET_XMIT_DROP skb should be dropped and cn + * 2: NET_XMIT_CN skb should be transmitted and cn + * 3: -EPERM skb should be dropped + */ +#define BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(array, ctx, func) \ + ({ \ + struct bpf_prog_array_item *_item; \ + struct bpf_prog *_prog; \ + struct bpf_prog_array *_array; \ + u32 ret; \ + u32 _ret = 1; \ + u32 _cn = 0; \ + preempt_disable(); \ + rcu_read_lock(); \ + _array = rcu_dereference(array); \ + _item = &_array->items[0]; \ + while ((_prog = READ_ONCE(_item->prog))) { \ + bpf_cgroup_storage_set(_item->cgroup_storage); \ + ret = func(_prog, ctx); \ + _ret &= (ret & 1); \ + _cn |= (ret & 2); \ + _item++; \ + } \ + rcu_read_unlock(); \ + preempt_enable(); \ + if (_ret) \ + _ret = (_cn ? NET_XMIT_CN : NET_XMIT_SUCCESS); \ + else \ + _ret = (_cn ? NET_XMIT_DROP : -EPERM); \ + _ret; \ + }) + #define BPF_PROG_RUN_ARRAY(array, ctx, func) \ - __BPF_PROG_RUN_ARRAY(array, ctx, func, false) + __BPF_PROG_RUN_ARRAY(array, ctx, func, false, true) #define BPF_PROG_RUN_ARRAY_CHECK(array, ctx, func) \ - __BPF_PROG_RUN_ARRAY(array, ctx, func, true) + __BPF_PROG_RUN_ARRAY(array, ctx, func, true, false) #ifdef CONFIG_BPF_SYSCALL DECLARE_PER_CPU(int, bpf_prog_active); @@ -514,11 +633,13 @@ extern const struct file_operations bpf_prog_fops; #undef BPF_PROG_TYPE #undef BPF_MAP_TYPE -extern const struct bpf_verifier_ops bpf_offload_verifier_ops; extern const struct bpf_prog_ops bpf_offload_prog_ops; +extern const struct bpf_verifier_ops tc_cls_act_analyzer_ops; +extern const struct bpf_verifier_ops xdp_analyzer_ops; struct bpf_prog *bpf_prog_get(u32 ufd); -struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type); +struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type, + bool attach_drv); struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i); void bpf_prog_sub(struct bpf_prog *prog, int i); struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog); @@ -533,12 +654,17 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock); struct bpf_map *bpf_map_get_with_uref(u32 ufd); struct bpf_map *__bpf_map_get(struct fd f); struct bpf_map * __must_check bpf_map_inc(struct bpf_map *map, bool uref); +struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map, + bool uref); void bpf_map_put_with_uref(struct bpf_map *map); void bpf_map_put(struct bpf_map *map); -int bpf_map_precharge_memlock(u32 pages); int bpf_map_charge_memlock(struct bpf_map *map, u32 pages); void bpf_map_uncharge_memlock(struct bpf_map *map, u32 pages); -void *bpf_map_area_alloc(size_t size, int numa_node); +int bpf_map_charge_init(struct bpf_map_memory *mem, u64 size); +void bpf_map_charge_finish(struct bpf_map_memory *mem); +void bpf_map_charge_move(struct bpf_map_memory *dst, + struct bpf_map_memory *src); +void *bpf_map_area_alloc(u64 size, int numa_node); void bpf_map_area_free(void *base); void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr); @@ -589,17 +715,27 @@ static inline void bpf_long_memcpy(void *dst, const void *src, u32 size) /* verify correctness of eBPF program */ int bpf_check(struct bpf_prog **fp, union bpf_attr *attr, union bpf_attr __user *uattr); + #ifndef CONFIG_BPF_JIT_ALWAYS_ON void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth); #endif -struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type); - /* Map specifics */ -struct net_device *__dev_map_lookup_elem(struct bpf_map *map, u32 key); -struct net_device *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key); -void __dev_map_insert_ctx(struct bpf_map *map, u32 index); +struct xdp_buff; +struct sk_buff; + +struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key); +struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key); void __dev_map_flush(struct bpf_map *map); +int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp, + struct net_device *dev_rx); +int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb, + struct bpf_prog *xdp_prog); + +struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key); +void __cpu_map_flush(struct bpf_map *map); +int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp, + struct net_device *dev_rx); /* Return map's numa specified by userspace */ static inline int bpf_map_attr_numa_node(const union bpf_attr *attr) @@ -608,22 +744,35 @@ static inline int bpf_map_attr_numa_node(const union bpf_attr *attr) attr->numa_node : NUMA_NO_NODE; } +struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type); +int array_map_alloc_check(union bpf_attr *attr); + +int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, + union bpf_attr __user *uattr); +int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, + union bpf_attr __user *uattr); +int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr); + static inline bool unprivileged_ebpf_enabled(void) { return !sysctl_unprivileged_bpf_disabled; } -#else +#else /* !CONFIG_BPF_SYSCALL */ static inline struct bpf_prog *bpf_prog_get(u32 ufd) { return ERR_PTR(-EOPNOTSUPP); } -static inline struct bpf_prog *bpf_prog_get_type(u32 ufd, - enum bpf_prog_type type) +static inline struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, + enum bpf_prog_type type, + bool attach_drv) { return ERR_PTR(-EOPNOTSUPP); } + static inline struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i) { @@ -663,12 +812,6 @@ static inline int bpf_obj_get_user(const char __user *pathname, int flags) return -EOPNOTSUPP; } -static inline struct bpf_prog *bpf_prog_get_type_path(const char *name, - enum bpf_prog_type type) -{ - return ERR_PTR(-EOPNOTSUPP); -} - static inline struct net_device *__dev_map_lookup_elem(struct bpf_map *map, u32 key) { @@ -681,12 +824,71 @@ static inline struct net_device *__dev_map_hash_lookup_elem(struct bpf_map *map return NULL; } -static inline void __dev_map_insert_ctx(struct bpf_map *map, u32 index) +static inline void __dev_map_flush(struct bpf_map *map) { } -static inline void __dev_map_flush(struct bpf_map *map) +struct xdp_buff; +struct bpf_dtab_netdev; + +static inline +int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp, + struct net_device *dev_rx) { + return 0; +} + +struct sk_buff; + +static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, + struct sk_buff *skb, + struct bpf_prog *xdp_prog) +{ + return 0; +} + +static inline +struct bpf_cpu_map_entry *__cpu_map_lookup_elem(struct bpf_map *map, u32 key) +{ + return NULL; +} + +static inline void __cpu_map_flush(struct bpf_map *map) +{ +} + +static inline int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, + struct xdp_buff *xdp, + struct net_device *dev_rx) +{ + return 0; +} + +static inline struct bpf_prog *bpf_prog_get_type_path(const char *name, + enum bpf_prog_type type) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +static inline int bpf_prog_test_run_xdp(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + return -ENOTSUPP; +} + +static inline int bpf_prog_test_run_skb(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + return -ENOTSUPP; +} + +static inline int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + return -ENOTSUPP; } static inline bool unprivileged_ebpf_enabled(void) @@ -696,6 +898,14 @@ static inline bool unprivileged_ebpf_enabled(void) #endif /* CONFIG_BPF_SYSCALL */ +static inline struct bpf_prog *bpf_prog_get_type(u32 ufd, + enum bpf_prog_type type) +{ + return bpf_prog_get_type_dev(ufd, type, false); +} + +bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool); + int bpf_prog_offload_compile(struct bpf_prog *prog); void bpf_prog_offload_destroy(struct bpf_prog *prog); int bpf_prog_offload_info_fill(struct bpf_prog_info *info, @@ -709,19 +919,34 @@ int bpf_map_offload_update_elem(struct bpf_map *map, int bpf_map_offload_delete_elem(struct bpf_map *map, void *key); int bpf_map_offload_get_next_key(struct bpf_map *map, void *key, void *next_key); -bool bpf_offload_dev_match(struct bpf_prog *prog, struct bpf_map *map); + +bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map); + +struct bpf_offload_dev * +bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv); +void bpf_offload_dev_destroy(struct bpf_offload_dev *offdev); +void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev); +int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev, + struct net_device *netdev); +void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev, + struct net_device *netdev); +bool bpf_offload_dev_match(struct bpf_prog *prog, struct net_device *netdev); + +void unpriv_ebpf_notify(int new_state); #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL) int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr); -static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux) +static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux) { - return aux->offload; + return aux->offload_requested; } + static inline bool bpf_map_is_dev_bound(struct bpf_map *map) { return unlikely(map->ops == &bpf_map_offload_ops); } + struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr); void bpf_map_offload_map_free(struct bpf_map *map); #else @@ -740,21 +965,26 @@ static inline bool bpf_map_is_dev_bound(struct bpf_map *map) { return false; } + static inline struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr) { return ERR_PTR(-EOPNOTSUPP); } + static inline void bpf_map_offload_map_free(struct bpf_map *map) { } #endif /* CONFIG_NET && CONFIG_BPF_SYSCALL */ #if defined(CONFIG_BPF_STREAM_PARSER) -int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog, u32 which); +int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog, + struct bpf_prog *old, u32 which); int sock_map_get_from_fd(const union bpf_attr *attr, struct bpf_prog *prog); +int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype); #else static inline int sock_map_prog_update(struct bpf_map *map, - struct bpf_prog *prog, u32 which) + struct bpf_prog *prog, + struct bpf_prog *old, u32 which) { return -EOPNOTSUPP; } @@ -764,12 +994,73 @@ static inline int sock_map_get_from_fd(const union bpf_attr *attr, { return -EINVAL; } + +static inline int sock_map_prog_detach(const union bpf_attr *attr, + enum bpf_prog_type ptype) +{ + return -EOPNOTSUPP; +} #endif +#if defined(CONFIG_XDP_SOCKETS) +struct xdp_sock; +struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key); +int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp, + struct xdp_sock *xs); +void __xsk_map_flush(struct bpf_map *map); +#else +struct xdp_sock; +static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, + u32 key) +{ + return NULL; +} + +static inline int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp, + struct xdp_sock *xs) +{ + return -EOPNOTSUPP; +} + +static inline void __xsk_map_flush(struct bpf_map *map) +{ +} +#endif + +#if defined(CONFIG_INET) && defined(CONFIG_BPF_SYSCALL) +void bpf_sk_reuseport_detach(struct sock *sk); +int bpf_fd_reuseport_array_lookup_elem(struct bpf_map *map, void *key, + void *value); +int bpf_fd_reuseport_array_update_elem(struct bpf_map *map, void *key, + void *value, u64 map_flags); +#else +static inline void bpf_sk_reuseport_detach(struct sock *sk) +{ +} + +#ifdef CONFIG_BPF_SYSCALL +static inline int bpf_fd_reuseport_array_lookup_elem(struct bpf_map *map, + void *key, void *value) +{ + return -EOPNOTSUPP; +} + +static inline int bpf_fd_reuseport_array_update_elem(struct bpf_map *map, + void *key, void *value, + u64 map_flags) +{ + return -EOPNOTSUPP; +} +#endif /* CONFIG_BPF_SYSCALL */ +#endif /* defined(CONFIG_INET) && defined(CONFIG_BPF_SYSCALL) */ + /* verifier prototypes for helper functions called from eBPF programs */ extern const struct bpf_func_proto bpf_map_lookup_elem_proto; extern const struct bpf_func_proto bpf_map_update_elem_proto; extern const struct bpf_func_proto bpf_map_delete_elem_proto; +extern const struct bpf_func_proto bpf_map_push_elem_proto; +extern const struct bpf_func_proto bpf_map_pop_elem_proto; +extern const struct bpf_func_proto bpf_map_peek_elem_proto; extern const struct bpf_func_proto bpf_get_prandom_u32_proto; extern const struct bpf_func_proto bpf_get_smp_processor_id_proto; @@ -780,19 +1071,20 @@ extern const struct bpf_func_proto bpf_ktime_get_boot_ns_proto; extern const struct bpf_func_proto bpf_get_current_pid_tgid_proto; extern const struct bpf_func_proto bpf_get_current_uid_gid_proto; extern const struct bpf_func_proto bpf_get_current_comm_proto; -extern const struct bpf_func_proto bpf_skb_vlan_push_proto; -extern const struct bpf_func_proto bpf_skb_vlan_pop_proto; extern const struct bpf_func_proto bpf_get_stackid_proto; +extern const struct bpf_func_proto bpf_get_stack_proto; extern const struct bpf_func_proto bpf_sock_map_update_proto; extern const struct bpf_func_proto bpf_sock_hash_update_proto; extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto; -extern const struct bpf_func_proto bpf_spin_lock_proto; -extern const struct bpf_func_proto bpf_spin_unlock_proto; extern const struct bpf_func_proto bpf_msg_redirect_hash_proto; extern const struct bpf_func_proto bpf_msg_redirect_map_proto; extern const struct bpf_func_proto bpf_sk_redirect_hash_proto; extern const struct bpf_func_proto bpf_sk_redirect_map_proto; +extern const struct bpf_func_proto bpf_spin_lock_proto; +extern const struct bpf_func_proto bpf_spin_unlock_proto; extern const struct bpf_func_proto bpf_get_local_storage_proto; +extern const struct bpf_func_proto bpf_strtol_proto; +extern const struct bpf_func_proto bpf_strtoul_proto; extern const struct bpf_func_proto bpf_tcp_sock_proto; /* Shared helpers among cBPF and eBPF. */ @@ -836,11 +1128,21 @@ static inline u32 bpf_sock_convert_ctx_access(enum bpf_access_type type, #ifdef CONFIG_INET bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info); + u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type, const struct bpf_insn *si, struct bpf_insn *insn_buf, struct bpf_prog *prog, u32 *target_size); + +bool bpf_xdp_sock_is_valid_access(int off, int size, enum bpf_access_type type, + struct bpf_insn_access_aux *info); + +u32 bpf_xdp_sock_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, + u32 *target_size); #else static inline bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, @@ -857,6 +1159,21 @@ static inline u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type, { return 0; } +static inline bool bpf_xdp_sock_is_valid_access(int off, int size, + enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + return false; +} + +static inline u32 bpf_xdp_sock_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, + u32 *target_size) +{ + return 0; +} #endif /* CONFIG_INET */ #endif /* _LINUX_BPF_H */ diff --git a/include/linux/bpf_lirc.h b/include/linux/bpf_lirc.h new file mode 100644 index 000000000000..9d9ff755ec29 --- /dev/null +++ b/include/linux/bpf_lirc.h @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _BPF_LIRC_H +#define _BPF_LIRC_H + +#include + +#ifdef CONFIG_BPF_LIRC_MODE2 +int lirc_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog); +int lirc_prog_detach(const union bpf_attr *attr); +int lirc_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr); +#else +static inline int lirc_prog_attach(const union bpf_attr *attr, + struct bpf_prog *prog) +{ + return -EINVAL; +} + +static inline int lirc_prog_detach(const union bpf_attr *attr) +{ + return -EINVAL; +} + +static inline int lirc_prog_query(const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + return -EINVAL; +} +#endif + +#endif /* _BPF_LIRC_H */ diff --git a/include/linux/bpf_trace.h b/include/linux/bpf_trace.h index e6fe98ae3794..ddf896abcfb6 100644 --- a/include/linux/bpf_trace.h +++ b/include/linux/bpf_trace.h @@ -2,7 +2,6 @@ #ifndef __LINUX_BPF_TRACE_H__ #define __LINUX_BPF_TRACE_H__ -#include #include #endif /* __LINUX_BPF_TRACE_H__ */ diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 80715af1e553..36a9c2325176 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -6,15 +6,19 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_FILTER, sk_filter) BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_CLS, tc_cls_act) BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act) BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp) +#ifdef CONFIG_CGROUP_BPF BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb) BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock) BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK_ADDR, cg_sock_addr) -BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout) -BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout) +#endif +BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_in) +BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_out) BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit) +BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_SEG6LOCAL, lwt_seg6local) BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops) BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb) BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg) +BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector) #endif #ifdef CONFIG_BPF_EVENTS BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe) @@ -28,8 +32,12 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev) BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SYSCTL, cg_sysctl) BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCKOPT, cg_sockopt) #endif - -BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector) +#ifdef CONFIG_BPF_LIRC_MODE2 +BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2) +#endif +#ifdef CONFIG_INET +BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport) +#endif BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) @@ -48,17 +56,25 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_HASH, htab_lru_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_PERCPU_HASH, htab_lru_percpu_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_LPM_TRIE, trie_map_ops) #ifdef CONFIG_PERF_EVENTS -BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_trace_map_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops) #ifdef CONFIG_NET BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops) -BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, dev_map_hash_ops) -#ifdef CONFIG_STREAM_PARSER +BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops) +#if defined(CONFIG_BPF_STREAM_PARSER) BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops) +#if defined(CONFIG_XDP_SOCKETS) +BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops) #endif +#ifdef CONFIG_INET +BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops) +#endif +#endif +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops) diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 0e2dda88234b..e0b52351127d 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -36,11 +36,15 @@ */ enum bpf_reg_liveness { REG_LIVE_NONE = 0, /* reg hasn't been read or written this branch */ - REG_LIVE_READ, /* reg was read, so we're sensitive to initial value */ - REG_LIVE_WRITTEN, /* reg was written first, screening off later reads */ + REG_LIVE_READ32 = 0x1, /* reg was read, so we're sensitive to initial value */ + REG_LIVE_READ64 = 0x2, /* likewise, but full 64-bit content matters */ + REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64, + REG_LIVE_WRITTEN = 0x4, /* reg was written first, screening off later reads */ + REG_LIVE_DONE = 0x8, /* liveness won't be updating this register anymore */ }; struct bpf_reg_state { + /* Ordering of fields matters. See states_equal() */ enum bpf_reg_type type; union { /* valid when type == PTR_TO_PACKET */ @@ -64,7 +68,46 @@ struct bpf_reg_state { * same reference to the socket, to determine proper reference freeing. */ u32 id; - /* Ordering of fields matters. See states_equal() */ + /* PTR_TO_SOCKET and PTR_TO_TCP_SOCK could be a ptr returned + * from a pointer-cast helper, bpf_sk_fullsock() and + * bpf_tcp_sock(). + * + * Consider the following where "sk" is a reference counted + * pointer returned from "sk = bpf_sk_lookup_tcp();": + * + * 1: sk = bpf_sk_lookup_tcp(); + * 2: if (!sk) { return 0; } + * 3: fullsock = bpf_sk_fullsock(sk); + * 4: if (!fullsock) { bpf_sk_release(sk); return 0; } + * 5: tp = bpf_tcp_sock(fullsock); + * 6: if (!tp) { bpf_sk_release(sk); return 0; } + * 7: bpf_sk_release(sk); + * 8: snd_cwnd = tp->snd_cwnd; // verifier will complain + * + * After bpf_sk_release(sk) at line 7, both "fullsock" ptr and + * "tp" ptr should be invalidated also. In order to do that, + * the reg holding "fullsock" and "sk" need to remember + * the original refcounted ptr id (i.e. sk_reg->id) in ref_obj_id + * such that the verifier can reset all regs which have + * ref_obj_id matching the sk_reg->id. + * + * sk_reg->ref_obj_id is set to sk_reg->id at line 1. + * sk_reg->id will stay as NULL-marking purpose only. + * After NULL-marking is done, sk_reg->id can be reset to 0. + * + * After "fullsock = bpf_sk_fullsock(sk);" at line 3, + * fullsock_reg->ref_obj_id is set to sk_reg->ref_obj_id. + * + * After "tp = bpf_tcp_sock(fullsock);" at line 5, + * tp_reg->ref_obj_id is set to fullsock_reg->ref_obj_id + * which is the same as sk_reg->ref_obj_id. + * + * From the verifier perspective, if sk, fullsock and tp + * are not NULL, they are the same ptr with different + * reg->type. In particular, bpf_sk_release(tp) is also + * allowed and has the same effect as bpf_sk_release(sk). + */ + u32 ref_obj_id; /* For scalar types (SCALAR_VALUE), this represents our knowledge of * the actual value. * For pointer types, this represents the variable part of the offset @@ -81,22 +124,30 @@ struct bpf_reg_state { s64 smax_value; /* maximum possible (s64)value */ u64 umin_value; /* minimum possible (u64)value */ u64 umax_value; /* maximum possible (u64)value */ + /* parentage chain for liveness checking */ + struct bpf_reg_state *parent; /* Inside the callee two registers can be both PTR_TO_STACK like * R1=fp-8 and R2=fp-8, but one of them points to this function stack * while another to the caller's stack. To differentiate them 'frameno' * is used which is an index in bpf_verifier_state->frame[] array * pointing to bpf_func_state. - * This field must be second to last, for states_equal() reasons. */ u32 frameno; - /* This field must be last, for states_equal() reasons. */ + /* Tracks subreg definition. The stored value is the insn_idx of the + * writing insn. This is safe because subreg_def is used before any insn + * patching which only happens after main verification finished. + */ + s32 subreg_def; enum bpf_reg_liveness live; + /* if (!precise && SCALAR_VALUE) min/max/tnum don't affect safety */ + bool precise; }; enum bpf_stack_slot_type { STACK_INVALID, /* nothing was stored in this stack slot */ STACK_SPILL, /* register spilled into stack */ - STACK_MISC /* BPF program wrote some data into this slot */ + STACK_MISC, /* BPF program wrote some data into this slot */ + STACK_ZERO, /* BPF program wrote constant zero */ }; #define BPF_REG_SIZE 8 /* size of eBPF register in bytes */ @@ -105,6 +156,7 @@ struct bpf_stack_state { struct bpf_reg_state spilled_ptr; u8 slot_type[BPF_REG_SIZE]; }; + struct bpf_reference_state { /* Track each reference created with a unique id, even if the same * instruction creates the reference multiple times (eg, via CALL). @@ -121,7 +173,6 @@ struct bpf_reference_state { */ struct bpf_func_state { struct bpf_reg_state regs[MAX_BPF_REG]; - struct bpf_verifier_state *parent; /* index of call instruction that called into this func */ int callsite; /* stack frame number of this function state from pov of @@ -133,6 +184,7 @@ struct bpf_func_state { * zero == main subprog */ u32 subprogno; + /* The following fields should be last. See copy_func_state() */ int acquired_refs; struct bpf_reference_state *refs; @@ -140,14 +192,84 @@ struct bpf_func_state { struct bpf_stack_state *stack; }; +struct bpf_id_pair { + u32 old; + u32 cur; +}; + +/* Maximum number of register states that can exist at once */ +#define BPF_ID_MAP_SIZE (MAX_BPF_REG + MAX_BPF_STACK / BPF_REG_SIZE) +struct bpf_idx_pair { + u32 prev_idx; + u32 idx; +}; + #define MAX_CALL_FRAMES 8 struct bpf_verifier_state { /* call stack tracking */ struct bpf_func_state *frame[MAX_CALL_FRAMES]; struct bpf_verifier_state *parent; + /* + * 'branches' field is the number of branches left to explore: + * 0 - all possible paths from this state reached bpf_exit or + * were safely pruned + * 1 - at least one path is being explored. + * This state hasn't reached bpf_exit + * 2 - at least two paths are being explored. + * This state is an immediate parent of two children. + * One is fallthrough branch with branches==1 and another + * state is pushed into stack (to be explored later) also with + * branches==1. The parent of this state has branches==1. + * The verifier state tree connected via 'parent' pointer looks like: + * 1 + * 1 + * 2 -> 1 (first 'if' pushed into stack) + * 1 + * 2 -> 1 (second 'if' pushed into stack) + * 1 + * 1 + * 1 bpf_exit. + * + * Once do_check() reaches bpf_exit, it calls update_branch_counts() + * and the verifier state tree will look: + * 1 + * 1 + * 2 -> 1 (first 'if' pushed into stack) + * 1 + * 1 -> 1 (second 'if' pushed into stack) + * 0 + * 0 + * 0 bpf_exit. + * After pop_stack() the do_check() will resume at second 'if'. + * + * If is_state_visited() sees a state with branches > 0 it means + * there is a loop. If such state is exactly equal to the current state + * it's an infinite loop. Note states_equal() checks for states + * equvalency, so two states being 'states_equal' does not mean + * infinite loop. The exact comparison is provided by + * states_maybe_looping() function. It's a stronger pre-check and + * much faster than states_equal(). + * + * This algorithm may not find all possible infinite loops or + * loop iteration count may be too high. + * In such cases BPF_COMPLEXITY_LIMIT_INSNS limit kicks in. + */ + u32 branches; + u32 insn_idx; u32 curframe; u32 active_spin_lock; bool speculative; + + /* first and last insn idx of this verifier state */ + u32 first_insn_idx; + u32 last_insn_idx; + /* jmp history recorded from first to last. + * backtracking is using it to go from last to first. + * For most states jmp_history_cnt is [0-3]. + * For loops can go up to ~40. + */ + struct bpf_idx_pair *jmp_history; + u32 jmp_history_cnt; }; #define bpf_get_spilled_reg(slot, frame) \ @@ -165,6 +287,7 @@ struct bpf_verifier_state { struct bpf_verifier_state_list { struct bpf_verifier_state state; struct bpf_verifier_state_list *next; + int miss_cnt, hit_cnt; }; /* Possible states for alu_state member. */ @@ -188,9 +311,12 @@ struct bpf_insn_aux_data { }; }; int ctx_field_size; /* the ctx field size for load insn, maybe 0 */ - int sanitize_stack_off; /* stack slot to be cleared */ bool seen; /* this insn was processed by the verifier */ + bool sanitize_stack_spill; /* subject to Spectre v4 sanitation */ + bool zext_dst; /* this insn zero extends dst reg */ u8 alu_state; /* used in combination with alu_limit */ + bool prune_point; + unsigned int orig_idx; /* original instruction index */ }; #define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */ @@ -210,23 +336,24 @@ static inline bool bpf_verifier_log_full(const struct bpf_verifier_log *log) return log->len_used >= log->len_total - 1; } +#define BPF_LOG_LEVEL1 1 +#define BPF_LOG_LEVEL2 2 +#define BPF_LOG_STATS 4 +#define BPF_LOG_LEVEL (BPF_LOG_LEVEL1 | BPF_LOG_LEVEL2) +#define BPF_LOG_MASK (BPF_LOG_LEVEL | BPF_LOG_STATS) + static inline bool bpf_verifier_log_needed(const struct bpf_verifier_log *log) { return log->level && log->ubuf && !bpf_verifier_log_full(log); } -__printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env, - const char *fmt, ...); - -void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt, - va_list args); - #define BPF_MAX_SUBPROGS 256 struct bpf_subprog_info { u32 start; /* insn idx of function entry point */ u32 linfo_idx; /* The idx to the main_prog->aux->linfo */ u16 stack_depth; /* max. stack depth used by this function */ + bool has_tail_call; }; /* single container for all structs @@ -240,22 +367,51 @@ struct bpf_verifier_env { struct bpf_verifier_stack_elem *head; /* stack of verifier states to be processed */ int stack_size; /* number of states to be processed */ bool strict_alignment; /* perform strict pointer alignment checks */ + bool test_state_freq; /* test verifier with different pruning frequency */ struct bpf_verifier_state *cur_state; /* current verifier state */ struct bpf_verifier_state_list **explored_states; /* search pruning optimization */ - const struct bpf_ext_analyzer_ops *analyzer_ops; /* external analyzer ops */ - void *analyzer_priv; /* pointer to external analyzer's private data */ + struct bpf_verifier_state_list *free_list; struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */ u32 used_map_cnt; /* number of used maps */ u32 id_gen; /* used to generate unique reg IDs */ + bool explore_alu_limits; bool allow_ptr_leaks; bool seen_direct_write; struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */ - + const struct bpf_line_info *prev_linfo; struct bpf_verifier_log log; struct bpf_subprog_info subprog_info[BPF_MAX_SUBPROGS + 1]; + struct bpf_id_pair idmap_scratch[BPF_ID_MAP_SIZE]; + struct { + int *insn_state; + int *insn_stack; + int cur_stack; + } cfg; u32 subprog_cnt; + /* number of instructions analyzed by the verifier */ + u32 prev_insn_processed, insn_processed; + /* number of jmps, calls, exits analyzed so far */ + u32 prev_jmps_processed, jmps_processed; + /* total verification time */ + u64 verification_time; + /* maximum number of verifier states kept in 'branching' instructions */ + u32 max_states_per_insn; + /* total number of allocated verifier states */ + u32 total_states; + /* some states are freed during program analysis. + * this is peak number of states. this number dominates kernel + * memory consumption during verification + */ + u32 peak_states; + /* longest register parentage chain walked for liveness marking */ + u32 longest_mark_read_walk; }; +__printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log, + const char *fmt, va_list args); +__printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env, + const char *fmt, ...); + static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env) { struct bpf_verifier_state *cur = env->cur_state; @@ -268,19 +424,14 @@ static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env) return cur_func(env)->regs; } -int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env); +int bpf_prog_offload_verifier_prep(struct bpf_prog *prog); int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx); - -int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops, - void *priv); - -#include -#include -static inline void *__compat_kvcalloc(size_t n, size_t size, gfp_t flags) -{ - return kvmalloc_array(n, size, flags | __GFP_ZERO); -} -#define kvcalloc __compat_kvcalloc +int bpf_prog_offload_finalize(struct bpf_verifier_env *env); +void +bpf_prog_offload_replace_insn(struct bpf_verifier_env *env, u32 off, + struct bpf_insn *insn); +void +bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt); #endif /* _LINUX_BPF_VERIFIER_H */ diff --git a/include/linux/bpfilter.h b/include/linux/bpfilter.h new file mode 100644 index 000000000000..d815622cd31e --- /dev/null +++ b/include/linux/bpfilter.h @@ -0,0 +1,24 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_BPFILTER_H +#define _LINUX_BPFILTER_H + +#include +#include + +struct sock; +int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval, + unsigned int optlen); +int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval, + int __user *optlen); +struct bpfilter_umh_ops { + struct umh_info info; + /* since ip_getsockopt() can run in parallel, serialize access to umh */ + struct mutex lock; + int (*sockopt)(struct sock *sk, int optname, + char __user *optval, + unsigned int optlen, bool is_set); + int (*start)(void); + bool stop; +}; +extern struct bpfilter_umh_ops bpfilter_ops; +#endif diff --git a/include/linux/btf.h b/include/linux/btf.h index 634216e1f258..64cdf2a23d42 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -50,8 +50,8 @@ u32 btf_id(const struct btf *btf); bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, const struct btf_member *m, u32 expected_offset, u32 expected_size); - int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t); +bool btf_type_is_void(const struct btf_type *t); #ifdef CONFIG_BPF_SYSCALL const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id); @@ -62,7 +62,6 @@ static inline const struct btf_type *btf_type_by_id(const struct btf *btf, { return NULL; } - static inline const char *btf_name_by_offset(const struct btf *btf, u32 offset) { diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 19290afc87bb..3afd96d06ed9 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -565,6 +565,36 @@ static inline bool cgroup_is_descendant(struct cgroup *cgrp, return cgrp->ancestor_ids[ancestor->level] == ancestor->id; } +/** + * cgroup_ancestor - find ancestor of cgroup + * @cgrp: cgroup to find ancestor of + * @ancestor_level: level of ancestor to find starting from root + * + * Find ancestor of cgroup at specified level starting from root if it exists + * and return pointer to it. Return NULL if @cgrp doesn't have ancestor at + * @ancestor_level. + * + * This function is safe to call as long as @cgrp is accessible. + */ +static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp, + int ancestor_level) +{ + struct cgroup *ptr; + + if (cgrp->level < ancestor_level) + return NULL; + + for (ptr = cgrp; + ptr && ptr->level > ancestor_level; + ptr = cgroup_parent(ptr)) + ; + + if (ptr && ptr->level == ancestor_level) + return ptr; + + return NULL; +} + /** * task_under_cgroup_hierarchy - test task's membership of cgroup ancestry * @task: the task to be tested @@ -813,4 +843,22 @@ static inline void put_cgroup_ns(struct cgroup_namespace *ns) free_cgroup_ns(ns); } +#ifdef CONFIG_CGROUP_BPF +static inline void cgroup_bpf_get(struct cgroup *cgrp) +{ + percpu_ref_get(&cgrp->bpf.refcnt); +} + +static inline void cgroup_bpf_put(struct cgroup *cgrp) +{ + percpu_ref_put(&cgrp->bpf.refcnt); +} + +#else /* CONFIG_CGROUP_BPF */ + +static inline void cgroup_bpf_get(struct cgroup *cgrp) {} +static inline void cgroup_bpf_put(struct cgroup *cgrp) {} + +#endif /* CONFIG_CGROUP_BPF */ + #endif /* _LINUX_CGROUP_H */ diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h index e94290b29e99..aa5b50b249de 100644 --- a/include/linux/cgroup_rdma.h +++ b/include/linux/cgroup_rdma.h @@ -1,9 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright (C) 2016 Parav Pandit - * - * This file is subject to the terms and conditions of version 2 of the GNU - * General Public License. See the file COPYING in the main directory of the - * Linux distribution for more details. */ #ifndef _CGROUP_RDMA_H diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 1b7a3a625b62..66269827085c 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -110,9 +110,14 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, ".pushsection .discard.unreachable\n\t" \ ".long 999b - .\n\t" \ ".popsection\n\t" + +/* Annotate a C jump table to allow objtool to follow the code flow */ +#define __annotate_jump_table __section(".rodata..c_jump_table") + #else #define annotate_reachable() #define annotate_unreachable() +#define __annotate_jump_table #endif #ifndef ASM_UNREACHABLE diff --git a/include/linux/error-injection.h b/include/linux/error-injection.h new file mode 100644 index 000000000000..130a67c50dac --- /dev/null +++ b/include/linux/error-injection.h @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_ERROR_INJECTION_H +#define _LINUX_ERROR_INJECTION_H + +#ifdef CONFIG_FUNCTION_ERROR_INJECTION + +#include + +extern bool within_error_injection_list(unsigned long addr); + +#else /* !CONFIG_FUNCTION_ERROR_INJECTION */ + +#include +static inline bool within_error_injection_list(unsigned long addr) +{ + return false; +} + +#endif + +#endif /* _LINUX_ERROR_INJECTION_H */ diff --git a/include/linux/filter.h b/include/linux/filter.h index 1f50979749e5..b9d0c3b45b96 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -19,9 +19,12 @@ #include #include #include +#include +#include #include +#include #include #include @@ -31,6 +34,7 @@ struct seccomp_data; struct bpf_prog_aux; struct xdp_rxq_info; struct xdp_buff; +struct sock_reuseport; struct ctl_table; struct ctl_table_header; @@ -49,7 +53,9 @@ struct ctl_table_header; /* Additional register mappings for converted user programs. */ #define BPF_REG_A BPF_REG_0 #define BPF_REG_X BPF_REG_7 -#define BPF_REG_TMP BPF_REG_8 +#define BPF_REG_TMP BPF_REG_2 /* scratch reg */ +#define BPF_REG_D BPF_REG_8 /* data, callee-saved */ +#define BPF_REG_H BPF_REG_9 /* hlen, callee-saved */ /* Kernel hidden auxiliary/helper register. */ #define BPF_REG_AX MAX_BPF_REG @@ -62,6 +68,11 @@ struct ctl_table_header; /* unused opcode to mark call to interpreter with arguments */ #define BPF_CALL_ARGS 0xe0 +/* unused opcode to mark speculation barrier for mitigating + * Speculative Store Bypass + */ +#define BPF_NOSPEC 0xc0 + /* As per nm, we expose JITed images as text (code) section for * kallsyms. That way, tools like perf can find it to match * addresses. @@ -179,6 +190,20 @@ struct ctl_table_header; .off = (insn).off, \ .imm = (insn).imm }) +/* Special form of mov32, used for doing explicit zero extension on dst. */ +#define BPF_ZEXT_REG(DST) \ + ((struct bpf_insn) { \ + .code = BPF_ALU | BPF_MOV | BPF_X, \ + .dst_reg = DST, \ + .src_reg = DST, \ + .off = 0, \ + .imm = 1 }) + +static inline bool insn_is_zext(const struct bpf_insn *insn) +{ + return insn->code == (BPF_ALU | BPF_MOV | BPF_X) && insn->imm == 1; +} + /* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */ #define BPF_LD_IMM64(DST, IMM) \ BPF_LD_IMM64_RAW(DST, 0, IMM) @@ -299,6 +324,26 @@ struct ctl_table_header; .off = OFF, \ .imm = IMM }) +/* Like BPF_JMP_REG, but with 32-bit wide operands for comparison. */ + +#define BPF_JMP32_REG(OP, DST, SRC, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP32 | BPF_OP(OP) | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = OFF, \ + .imm = 0 }) + +/* Like BPF_JMP_IMM, but with 32-bit wide operands for comparison. */ + +#define BPF_JMP32_IMM(OP, DST, IMM, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP32 | BPF_OP(OP) | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = OFF, \ + .imm = IMM }) + /* Unconditional jumps, goto pc + off16 */ #define BPF_JMP_A(OFF) \ @@ -311,6 +356,9 @@ struct ctl_table_header; /* Function call */ +#define BPF_CAST_CALL(x) \ + ((u64 (*)(u64, u64, u64, u64, u64))(x)) + #define BPF_EMIT_CALL(FUNC) \ ((struct bpf_insn) { \ .code = BPF_JMP | BPF_CALL, \ @@ -339,6 +387,16 @@ struct ctl_table_header; .off = 0, \ .imm = 0 }) +/* Speculation barrier */ + +#define BPF_ST_NOSPEC() \ + ((struct bpf_insn) { \ + .code = BPF_ST | BPF_NOSPEC, \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off = 0, \ + .imm = 0 }) + /* Internal classic blocks for direct assignment */ #define __BPF_STMT(CODE, K) \ @@ -432,29 +490,32 @@ struct ctl_table_header; __BPF_MAP(n, __BPF_DECL_ARGS, __BPF_N, u64, __ur_1, u64, __ur_2, \ u64, __ur_3, u64, __ur_4, u64, __ur_5) -#define BPF_CALL_x(x, name, ...) \ +#define BPF_CALL_x(x, attr, name, ...) \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ - u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)); \ - u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)) \ + typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ + attr u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)); \ + attr u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)) \ { \ - return ____##name(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\ + return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\ } \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)) -#define BPF_CALL_0(name, ...) BPF_CALL_x(0, name, __VA_ARGS__) -#define BPF_CALL_1(name, ...) BPF_CALL_x(1, name, __VA_ARGS__) -#define BPF_CALL_2(name, ...) BPF_CALL_x(2, name, __VA_ARGS__) -#define BPF_CALL_3(name, ...) BPF_CALL_x(3, name, __VA_ARGS__) -#define BPF_CALL_4(name, ...) BPF_CALL_x(4, name, __VA_ARGS__) -#define BPF_CALL_5(name, ...) BPF_CALL_x(5, name, __VA_ARGS__) +#define __NOATTR +#define BPF_CALL_0(name, ...) BPF_CALL_x(0, __NOATTR, name, __VA_ARGS__) +#define BPF_CALL_1(name, ...) BPF_CALL_x(1, __NOATTR, name, __VA_ARGS__) +#define BPF_CALL_2(name, ...) BPF_CALL_x(2, __NOATTR, name, __VA_ARGS__) +#define BPF_CALL_3(name, ...) BPF_CALL_x(3, __NOATTR, name, __VA_ARGS__) +#define BPF_CALL_4(name, ...) BPF_CALL_x(4, __NOATTR, name, __VA_ARGS__) +#define BPF_CALL_5(name, ...) BPF_CALL_x(5, __NOATTR, name, __VA_ARGS__) + +#define NOTRACE_BPF_CALL_1(name, ...) BPF_CALL_x(1, notrace, name, __VA_ARGS__) #define bpf_ctx_range(TYPE, MEMBER) \ offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1 #define bpf_ctx_range_till(TYPE, MEMBER1, MEMBER2) \ offsetof(TYPE, MEMBER1) ... offsetofend(TYPE, MEMBER2) - 1 - #if BITS_PER_LONG == 64 # define bpf_ctx_range_ptr(TYPE, MEMBER) \ offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1 @@ -497,12 +558,14 @@ struct bpf_prog { u16 pages; /* Number of allocated pages */ u16 jited:1, /* Is our filter JIT'ed? */ jit_requested:1,/* archs need to JIT the prog */ - undo_set_mem:1, /* Passed set_memory_ro() checkpoint */ gpl_compatible:1, /* Is filter GPL compatible? */ cb_access:1, /* Is control block accessed? */ dst_needed:1, /* Do we need dst entry? */ blinded:1, /* Was blinded */ - is_func:1; /* program is a bpf function */ + is_func:1, /* program is a bpf function */ + kprobe_override:1, /* Do we override a kprobe? */ + has_callchain_buf:1, /* callchain buffer allocated? */ + enforce_expected_attach_type:1; /* Enforce expected_attach_type checking at attach time */ enum bpf_prog_type type; /* Type of BPF program */ enum bpf_attach_type expected_attach_type; /* For some prog types */ u32 len; /* Number of filter blocks */ @@ -580,7 +643,24 @@ static inline void bpf_jit_set_header_magic(struct bpf_binary_header *hdr) } #endif -#define BPF_PROG_RUN(filter, ctx) bpf_call_func(filter, ctx) +DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key); + +#define BPF_PROG_RUN(prog, ctx) ({ \ + u32 ret; \ + cant_sleep(); \ + if (static_branch_unlikely(&bpf_stats_enabled_key)) { \ + struct bpf_prog_stats *stats; \ + u64 start = sched_clock(); \ + ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi); \ + stats = this_cpu_ptr(prog->aux->stats); \ + u64_stats_update_begin(&stats->syncp); \ + stats->cnt++; \ + stats->nsecs += sched_clock() - start; \ + u64_stats_update_end(&stats->syncp); \ + } else { \ + ret = (*(prog)->bpf_func)(ctx, (prog)->insnsi); \ + } \ + ret; }) #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN @@ -590,6 +670,20 @@ struct bpf_skb_data_end { void *data_end; }; +struct bpf_redirect_info { + u32 flags; + u32 tgt_index; + void *tgt_value; + struct bpf_map *map; + struct bpf_map *map_to_flush; + u32 kern_flags; +}; + +DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info); + +/* flags for bpf_redirect_info kern_flags */ +#define BPF_RI_F_RF_NO_DIRECT BIT(0) /* no napi_direct on return_frame */ + /* Compute the linear packet data range [data, data_end) which * will be accessed by various program types (cls_bpf, act_bpf, * lwt, ...). Subsystems allowing direct data access must (!) @@ -605,6 +699,27 @@ static inline void bpf_compute_data_pointers(struct sk_buff *skb) cb->data_end = skb->data + skb_headlen(skb); } +/* Similar to bpf_compute_data_pointers(), except that save orginal + * data in cb->data and cb->meta_data for restore. + */ +static inline void bpf_compute_and_save_data_end( + struct sk_buff *skb, void **saved_data_end) +{ + struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb; + + *saved_data_end = cb->data_end; + cb->data_end = skb->data + skb_headlen(skb); +} + +/* Restore data saved by bpf_compute_data_pointers(). */ +static inline void bpf_restore_data_end( + struct sk_buff *skb, void *saved_data_end) +{ + struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb; + + cb->data_end = saved_data_end; +} + static inline u8 *bpf_skb_cb(struct sk_buff *skb) { /* eBPF programs may read/write skb->cb[] area to transfer meta @@ -624,8 +739,8 @@ static inline u8 *bpf_skb_cb(struct sk_buff *skb) return qdisc_skb_cb(skb)->data; } -static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, - struct sk_buff *skb) +static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog, + struct sk_buff *skb) { u8 *cb_data = bpf_skb_cb(skb); u8 cb_saved[BPF_SKB_CB_LEN]; @@ -644,15 +759,30 @@ static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, return res; } +static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog, + struct sk_buff *skb) +{ + u32 res; + + preempt_disable(); + res = __bpf_prog_run_save_cb(prog, skb); + preempt_enable(); + return res; +} + static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog, struct sk_buff *skb) { u8 *cb_data = bpf_skb_cb(skb); + u32 res; if (unlikely(prog->cb_access)) memset(cb_data, 0, BPF_SKB_CB_LEN); - return BPF_PROG_RUN(prog, skb); + preempt_disable(); + res = BPF_PROG_RUN(prog, skb); + preempt_enable(); + return res; } static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog, @@ -694,42 +824,54 @@ static inline bool bpf_prog_was_classic(const struct bpf_prog *prog) return prog->type == BPF_PROG_TYPE_UNSPEC; } -static inline bool -bpf_ctx_narrow_access_ok(u32 off, u32 size, const u32 size_default) +static inline u32 bpf_ctx_off_adjust_machine(u32 size) { - bool off_ok; -#ifdef __LITTLE_ENDIAN - off_ok = (off & (size_default - 1)) == 0; -#else - off_ok = (off & (size_default - 1)) + size == size_default; -#endif - return off_ok && size <= size_default && (size & (size - 1)) == 0; + const u32 size_machine = sizeof(unsigned long); + + if (size > size_machine && size % size_machine == 0) + size = size_machine; + + return size; } +static inline bool +bpf_ctx_narrow_access_ok(u32 off, u32 size, u32 size_default) +{ + return size <= size_default && (size & (size - 1)) == 0; +} + +static inline u8 +bpf_ctx_narrow_access_offset(u32 off, u32 size, u32 size_default) +{ + u8 access_off = off & (size_default - 1); + +#ifdef __LITTLE_ENDIAN + return access_off; +#else + return size_default - (access_off + size); +#endif +} + +#define bpf_ctx_wide_access_ok(off, size, type, field) \ + (size == sizeof(__u64) && \ + off >= offsetof(type, field) && \ + off + sizeof(__u64) <= offsetofend(type, field) && \ + off % sizeof(__u64) == 0) + #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0])) static inline void bpf_prog_lock_ro(struct bpf_prog *fp) { - fp->undo_set_mem = 1; + set_vm_flush_reset_perms(fp); set_memory_ro((unsigned long)fp, fp->pages); } -static inline void bpf_prog_unlock_ro(struct bpf_prog *fp) -{ - if (fp->undo_set_mem) - set_memory_rw((unsigned long)fp, fp->pages); -} - static inline void bpf_jit_binary_lock_ro(struct bpf_binary_header *hdr) { + set_vm_flush_reset_perms(hdr); set_memory_ro((unsigned long)hdr, hdr->pages); } -static inline void bpf_jit_binary_unlock_ro(struct bpf_binary_header *hdr) -{ - set_memory_rw((unsigned long)hdr, hdr->pages); -} - static inline struct bpf_binary_header * bpf_jit_binary_hdr(const struct bpf_prog *fp) { @@ -748,6 +890,8 @@ static inline int sk_filter(struct sock *sk, struct sk_buff *skb) struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err); void bpf_prog_free(struct bpf_prog *fp); +bool bpf_opcode_in_insntable(u8 code); + void bpf_prog_free_linfo(struct bpf_prog *prog); void bpf_prog_fill_jited_linfo(struct bpf_prog *prog, const u32 *insn_to_jit_off); @@ -756,13 +900,13 @@ void bpf_prog_free_jited_linfo(struct bpf_prog *prog); void bpf_prog_free_unused_jited_linfo(struct bpf_prog *prog); struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags); +struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags); struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, gfp_t gfp_extra_flags); void __bpf_prog_free(struct bpf_prog *fp); static inline void bpf_prog_unlock_free(struct bpf_prog *fp) { - bpf_prog_unlock_ro(fp); __bpf_prog_free(fp); } @@ -778,6 +922,7 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk); int sk_attach_bpf(u32 ufd, struct sock *sk); int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk); int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk); +void sk_reuseport_prog_free(struct bpf_prog *prog); int sk_detach_filter(struct sock *sk); int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, unsigned int len); @@ -792,18 +937,58 @@ u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog); void bpf_jit_compile(struct bpf_prog *prog); +bool bpf_jit_needs_zext(void); bool bpf_helper_changes_pkt_data(void *func); -static inline bool bpf_dump_raw_ok(void) +static inline bool bpf_dump_raw_ok(const struct cred *cred) { /* Reconstruction of call-sites is dependent on kallsyms, * thus make dump the same restriction. */ - return kallsyms_show_value() == 1; + return kallsyms_show_value(cred); } struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, const struct bpf_insn *patch, u32 len); +int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt); + +void bpf_clear_redirect_map(struct bpf_map *map); + +static inline bool xdp_return_frame_no_direct(void) +{ + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + + return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT; +} + +static inline void xdp_set_return_frame_no_direct(void) +{ + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + + ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT; +} + +static inline void xdp_clear_return_frame_no_direct(void) +{ + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + + ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT; +} + +static inline int xdp_ok_fwd_dev(const struct net_device *fwd, + unsigned int pktlen) +{ + unsigned int len; + + if (unlikely(!(fwd->flags & IFF_UP))) + return -ENETDOWN; + + len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN; + if (pktlen > len) + return -EMSGSIZE; + + return 0; +} /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the * same cpu context. Further for best results no more than a single map @@ -812,7 +997,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, * This does not appear to be a real limitation for existing software. */ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb, - struct bpf_prog *prog); + struct xdp_buff *xdp, struct bpf_prog *prog); int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, struct bpf_prog *prog); @@ -820,6 +1005,20 @@ void xdp_do_flush_map(void); void bpf_warn_invalid_xdp_action(u32 act); +#ifdef CONFIG_INET +struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, + struct bpf_prog *prog, struct sk_buff *skb, + u32 hash); +#else +static inline struct sock * +bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, + struct bpf_prog *prog, struct sk_buff *skb, + u32 hash) +{ + return NULL; +} +#endif + #ifdef CONFIG_BPF_JIT extern int bpf_jit_enable; extern int bpf_jit_harden; @@ -839,6 +1038,10 @@ void *bpf_jit_alloc_exec(unsigned long size); void bpf_jit_free_exec(void *addr); void bpf_jit_free(struct bpf_prog *fp); +int bpf_jit_get_func_addr(const struct bpf_prog *prog, + const struct bpf_insn *insn, bool extra_pass, + u64 *func_addr, bool *func_addr_fixed); + struct bpf_prog *bpf_jit_blind_constants(struct bpf_prog *fp); void bpf_jit_prog_release_other(struct bpf_prog *fp, struct bpf_prog *fp_other); @@ -924,6 +1127,7 @@ bpf_address_lookup(unsigned long addr, unsigned long *size, void bpf_prog_kallsyms_add(struct bpf_prog *fp); void bpf_prog_kallsyms_del(struct bpf_prog *fp); +void bpf_get_prog_name(const struct bpf_prog *prog, char *sym); #else /* CONFIG_BPF_JIT */ @@ -979,8 +1183,16 @@ static inline void bpf_prog_kallsyms_add(struct bpf_prog *fp) static inline void bpf_prog_kallsyms_del(struct bpf_prog *fp) { } + +static inline void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) +{ + sym[0] = '\0'; +} + #endif /* CONFIG_BPF_JIT */ +void bpf_prog_kallsyms_del_all(struct bpf_prog *fp); + #define BPF_ANC BIT(15) static inline bool bpf_needs_clear_a(const struct sock_filter *first) @@ -1068,15 +1280,34 @@ struct bpf_sock_ops_kern { struct sock *sk; u32 op; union { + u32 args[4]; u32 reply; u32 replylong[4]; }; + u32 is_fullsock; + u64 temp; /* temp and everything after is not + * initialized to 0 before calling + * the BPF program. New fields that + * should be initialized to 0 should + * be inserted before temp. + * temp is scratch storage used by + * sock_ops_convert_ctx_access + * as temporary storage of a register. + */ }; struct bpf_sysctl_kern { struct ctl_table_header *head; struct ctl_table *table; + void *cur_val; + size_t cur_len; + void *new_val; + size_t new_len; + int new_updated; int write; + loff_t *ppos; + /* Temporary "register" for indirect stores to ppos. */ + u64 tmp_reg; }; struct bpf_sockopt_kern { diff --git a/include/linux/fs.h b/include/linux/fs.h index c2a0c49790c2..b6f5ba4a80cc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -212,9 +212,9 @@ struct iattr { kuid_t ia_uid; kgid_t ia_gid; loff_t ia_size; - struct timespec ia_atime; - struct timespec ia_mtime; - struct timespec ia_ctime; + struct timespec64 ia_atime; + struct timespec64 ia_mtime; + struct timespec64 ia_ctime; /* * Not an attribute, but an auxiliary info for filesystems wanting to @@ -620,9 +620,9 @@ struct inode { }; dev_t i_rdev; loff_t i_size; - struct timespec i_atime; - struct timespec i_mtime; - struct timespec i_ctime; + struct timespec64 i_atime; + struct timespec64 i_mtime; + struct timespec64 i_ctime; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ unsigned short i_bytes; unsigned int i_blkbits; @@ -665,7 +665,10 @@ struct inode { #ifdef CONFIG_IMA atomic_t i_readcount; /* struct files open RO */ #endif - const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ + union { + const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ + void (*free_inode)(struct inode *); + }; struct file_lock_context *i_flctx; struct address_space i_data; struct list_head i_devices; @@ -1118,7 +1121,7 @@ extern int vfs_lock_file(struct file *, unsigned int, struct file_lock *, struct extern int vfs_cancel_lock(struct file *filp, struct file_lock *fl); extern int locks_lock_inode_wait(struct inode *inode, struct file_lock *fl); extern int __break_lease(struct inode *inode, unsigned int flags, unsigned int type); -extern void lease_get_mtime(struct inode *, struct timespec *time); +extern void lease_get_mtime(struct inode *, struct timespec64 *time); extern int generic_setlease(struct file *, long, struct file_lock **, void **priv); extern int vfs_setlease(struct file *, long, struct file_lock **, void **); extern int lease_modify(struct file_lock *, int, struct list_head *); @@ -1233,7 +1236,8 @@ static inline int __break_lease(struct inode *inode, unsigned int mode, unsigned return 0; } -static inline void lease_get_mtime(struct inode *inode, struct timespec *time) +static inline void lease_get_mtime(struct inode *inode, + struct timespec64 *time) { return; } @@ -1516,7 +1520,8 @@ static inline void i_gid_write(struct inode *inode, gid_t gid) inode->i_gid = make_kgid(inode->i_sb->s_user_ns, gid); } -extern struct timespec current_time(struct inode *inode); +extern struct timespec64 timespec64_trunc(struct timespec64 t, unsigned gran); +extern struct timespec64 current_time(struct inode *inode); /* * Snapshotting support. @@ -1814,7 +1819,7 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); - int (*update_time)(struct inode *, struct timespec *, int); + int (*update_time)(struct inode *, struct timespec64 *, int); int (*atomic_open)(struct inode *, struct dentry *, struct file *, unsigned open_flag, umode_t create_mode, int *opened); @@ -1867,6 +1872,7 @@ extern int vfs_dedupe_file_range(struct file *file, struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); + void (*free_inode)(struct inode *); void (*dirty_inode) (struct inode *, int flags); int (*write_inode) (struct inode *, struct writeback_control *wbc); @@ -2272,7 +2278,7 @@ extern int current_umask(void); extern void ihold(struct inode * inode); extern void iput(struct inode *); -extern int generic_update_time(struct inode *, struct timespec *, int); +extern int generic_update_time(struct inode *, struct timespec64 *, int); /* /sys/fs */ extern struct kobject *fs_kobj; diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 7d799c4d2669..6cacf5155499 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -458,9 +458,7 @@ struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops, static inline void hmm_devmem_page_set_drvdata(struct page *page, unsigned long data) { - unsigned long *drvdata = (unsigned long *)&page->pgmap; - - drvdata[1] = data; + page->hmm_data = data; } /* @@ -469,11 +467,9 @@ static inline void hmm_devmem_page_set_drvdata(struct page *page, * @page: pointer to struct page * Return: driver data value */ -static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page) +static inline unsigned long hmm_devmem_page_get_drvdata(const struct page *page) { - unsigned long *drvdata = (unsigned long *)&page->pgmap; - - return drvdata[1]; + return page->hmm_data; } diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h index 0a777c5216b1..4d280c25d1d8 100644 --- a/include/linux/kallsyms.h +++ b/include/linux/kallsyms.h @@ -9,21 +9,74 @@ #include #include #include +#include +#include + +#include #define KSYM_NAME_LEN 128 #define KSYM_SYMBOL_LEN (sizeof("%s+%#lx/%#lx [%s]") + (KSYM_NAME_LEN - 1) + \ 2*(BITS_PER_LONG*3/10) + (MODULE_NAME_LEN - 1) + 1) -/* How and when do we show kallsyms values? */ -extern int kallsyms_show_value(void); #ifndef CONFIG_64BIT # define KALLSYM_FMT "%08lx" #else # define KALLSYM_FMT "%016lx" #endif +struct cred; struct module; +static inline int is_kernel_inittext(unsigned long addr) +{ + if (addr >= (unsigned long)_sinittext + && addr <= (unsigned long)_einittext) + return 1; + return 0; +} + +static inline int is_kernel_text(unsigned long addr) +{ + if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext) || + arch_is_kernel_text(addr)) + return 1; + return in_gate_area_no_mm(addr); +} + +static inline int is_kernel(unsigned long addr) +{ + if (addr >= (unsigned long)_stext && addr <= (unsigned long)_end) + return 1; + return in_gate_area_no_mm(addr); +} + +static inline int is_ksym_addr(unsigned long addr) +{ + if (IS_ENABLED(CONFIG_KALLSYMS_ALL)) + return is_kernel(addr); + + return is_kernel_text(addr) || is_kernel_inittext(addr); +} + +static inline void *dereference_symbol_descriptor(void *ptr) +{ +#ifdef HAVE_DEREFERENCE_FUNCTION_DESCRIPTOR + struct module *mod; + + ptr = dereference_kernel_function_descriptor(ptr); + if (is_ksym_addr((unsigned long)ptr)) + return ptr; + + preempt_disable(); + mod = __module_address((unsigned long)ptr); + preempt_enable(); + + if (mod) + ptr = dereference_module_function_descriptor(mod, ptr); +#endif + return ptr; +} + #ifdef CONFIG_KALLSYMS /* Lookup the address for a symbol. Returns 0 if not found. */ unsigned long kallsyms_lookup_name(const char *name); @@ -54,6 +107,9 @@ extern void __print_symbol(const char *fmt, unsigned long address); int lookup_symbol_name(unsigned long addr, char *symname); int lookup_symbol_attrs(unsigned long addr, unsigned long *size, unsigned long *offset, char *modname, char *name); +/* How and when do we show kallsyms values? */ +extern bool kallsyms_show_value(const struct cred *cred); + #else /* !CONFIG_KALLSYMS */ static inline unsigned long kallsyms_lookup_name(const char *name) @@ -112,6 +168,11 @@ static inline int lookup_symbol_attrs(unsigned long addr, unsigned long *size, u return -ERANGE; } +static inline bool kallsyms_show_value(const struct cred *cred) +{ + return false; +} + /* Stupid that this does nothing, but I didn't create this mess. */ #define __print_symbol(fmt, addr) #endif /*CONFIG_KALLSYMS*/ diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 2114e3e5abd1..6766c845da8b 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -212,8 +212,10 @@ extern int _cond_resched(void); #endif #ifdef CONFIG_DEBUG_ATOMIC_SLEEP - void ___might_sleep(const char *file, int line, int preempt_offset); - void __might_sleep(const char *file, int line, int preempt_offset); +extern void ___might_sleep(const char *file, int line, int preempt_offset); +extern void __might_sleep(const char *file, int line, int preempt_offset); +extern void __cant_sleep(const char *file, int line, int preempt_offset); + /** * might_sleep - annotation for functions that can sleep * @@ -226,6 +228,13 @@ extern int _cond_resched(void); */ # define might_sleep() \ do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0) +/** + * cant_sleep - annotation for functions that cannot sleep + * + * this macro will print a stack trace if it is executed with preemption enabled + */ +# define cant_sleep() \ + do { __cant_sleep(__FILE__, __LINE__, 0); } while (0) # define sched_annotate_sleep() (current->task_state_change = 0) #else static inline void ___might_sleep(const char *file, int line, @@ -233,6 +242,7 @@ extern int _cond_resched(void); static inline void __might_sleep(const char *file, int line, int preempt_offset) { } # define might_sleep() do { might_resched(); } while (0) +# define cant_sleep() do { } while (0) # define sched_annotate_sleep() do { } while (0) #endif diff --git a/include/linux/ktime.h b/include/linux/ktime.h index 0c8bd45c8206..5b9fddbaac41 100644 --- a/include/linux/ktime.h +++ b/include/linux/ktime.h @@ -270,5 +270,6 @@ static inline ktime_t ms_to_ktime(u64 ms) } # include +# include #endif diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h index 84284e3353ed..27570fa47b4d 100644 --- a/include/linux/libnvdimm.h +++ b/include/linux/libnvdimm.h @@ -1,16 +1,8 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * libnvdimm - Non-volatile-memory Devices Subsystem * * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef __LIBNVDIMM_H__ #define __LIBNVDIMM_H__ diff --git a/include/linux/list.h b/include/linux/list.h index d2c12ef7a4e3..d3d4b06691fb 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -106,6 +106,20 @@ static inline void __list_del(struct list_head * prev, struct list_head * next) WRITE_ONCE(prev->next, next); } +/* + * Delete a list entry and clear the 'prev' pointer. + * + * This is a special-purpose list clearing method used in the networking code + * for lists allocated as per-cpu, where we don't want to incur the extra + * WRITE_ONCE() overhead of a regular list_del_init(). The code that uses this + * needs to check the node 'prev' pointer instead of calling list_empty(). + */ +static inline void __list_del_clearprev(struct list_head *entry) +{ + __list_del(entry->prev, entry->next); + entry->prev = NULL; +} + /** * list_del - deletes entry from list. * @entry: the element to delete from the list. diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 1db7dde2adfe..ecd852b557a0 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -1235,7 +1235,7 @@ * @cred contains the credentials to use. * @ns contains the user namespace we want the capability in * @cap contains the capability . - * @audit contains whether to write an audit message or not + * @opts contains options for the capable check * Return 0 if the capability is granted for @tsk. * @syslog: * Check permission before accessing the kernel message ring or changing @@ -1411,8 +1411,10 @@ union security_list_options { const kernel_cap_t *effective, const kernel_cap_t *inheritable, const kernel_cap_t *permitted); - int (*capable)(const struct cred *cred, struct user_namespace *ns, - int cap, int audit); + int (*capable)(const struct cred *cred, + struct user_namespace *ns, + int cap, + unsigned int opts); int (*quotactl)(int cmds, int type, int id, struct super_block *sb); int (*quota_on)(struct dentry *dentry); int (*syslog)(int type); diff --git a/include/linux/mm.h b/include/linux/mm.h index 1a15242092ef..4c144755d2d3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -25,6 +25,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -452,17 +453,18 @@ struct vm_operations_struct { void (*close)(struct vm_area_struct * area); int (*split)(struct vm_area_struct * area, unsigned long addr); int (*mremap)(struct vm_area_struct * area); - int (*fault)(struct vm_fault *vmf); - int (*huge_fault)(struct vm_fault *vmf, enum page_entry_size pe_size); + vm_fault_t (*fault)(struct vm_fault *vmf); + vm_fault_t (*huge_fault)(struct vm_fault *vmf, + enum page_entry_size pe_size); void (*map_pages)(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff); /* notification that a previously read-only page is about to become * writable, if an error is returned it will cause a SIGBUS */ - int (*page_mkwrite)(struct vm_fault *vmf); + vm_fault_t (*page_mkwrite)(struct vm_fault *vmf); /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */ - int (*pfn_mkwrite)(struct vm_fault *vmf); + vm_fault_t (*pfn_mkwrite)(struct vm_fault *vmf); /* called by access_process_vm when get_user_pages() fails, typically * for use by special VMAs that can switch between memory and hardware @@ -624,10 +626,17 @@ static inline void *kvzalloc(size_t size, gfp_t flags) static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags) { - if (size != 0 && n > SIZE_MAX / size) + size_t bytes; + + if (unlikely(check_mul_overflow(n, size, &bytes))) return NULL; - return kvmalloc(n * size, flags); + return kvmalloc(bytes, flags); +} + +static inline void *kvcalloc(size_t n, size_t size, gfp_t flags) +{ + return kvmalloc_array(n, size, flags | __GFP_ZERO); } extern void kvfree(const void *addr); @@ -1748,14 +1757,27 @@ static inline int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address); #endif -#ifdef __PAGETABLE_PUD_FOLDED +#if defined(__PAGETABLE_PUD_FOLDED) || !defined(CONFIG_MMU) static inline int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address) { return 0; } +static inline void mm_inc_nr_puds(struct mm_struct *mm) {} +static inline void mm_dec_nr_puds(struct mm_struct *mm) {} + #else int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address); + +static inline void mm_inc_nr_puds(struct mm_struct *mm) +{ + atomic_long_add(PTRS_PER_PUD * sizeof(pud_t), &mm->pgtables_bytes); +} + +static inline void mm_dec_nr_puds(struct mm_struct *mm) +{ + atomic_long_sub(PTRS_PER_PUD * sizeof(pud_t), &mm->pgtables_bytes); +} #endif #if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU) @@ -1765,40 +1787,55 @@ static inline int __pmd_alloc(struct mm_struct *mm, pud_t *pud, return 0; } -static inline void mm_nr_pmds_init(struct mm_struct *mm) {} - -static inline unsigned long mm_nr_pmds(struct mm_struct *mm) -{ - return 0; -} - static inline void mm_inc_nr_pmds(struct mm_struct *mm) {} static inline void mm_dec_nr_pmds(struct mm_struct *mm) {} #else int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address); -static inline void mm_nr_pmds_init(struct mm_struct *mm) -{ - atomic_long_set(&mm->nr_pmds, 0); -} - -static inline unsigned long mm_nr_pmds(struct mm_struct *mm) -{ - return atomic_long_read(&mm->nr_pmds); -} - static inline void mm_inc_nr_pmds(struct mm_struct *mm) { - atomic_long_inc(&mm->nr_pmds); + atomic_long_add(PTRS_PER_PMD * sizeof(pmd_t), &mm->pgtables_bytes); } static inline void mm_dec_nr_pmds(struct mm_struct *mm) { - atomic_long_dec(&mm->nr_pmds); + atomic_long_sub(PTRS_PER_PMD * sizeof(pmd_t), &mm->pgtables_bytes); } #endif +#ifdef CONFIG_MMU +static inline void mm_pgtables_bytes_init(struct mm_struct *mm) +{ + atomic_long_set(&mm->pgtables_bytes, 0); +} + +static inline unsigned long mm_pgtables_bytes(const struct mm_struct *mm) +{ + return atomic_long_read(&mm->pgtables_bytes); +} + +static inline void mm_inc_nr_ptes(struct mm_struct *mm) +{ + atomic_long_add(PTRS_PER_PTE * sizeof(pte_t), &mm->pgtables_bytes); +} + +static inline void mm_dec_nr_ptes(struct mm_struct *mm) +{ + atomic_long_sub(PTRS_PER_PTE * sizeof(pte_t), &mm->pgtables_bytes); +} +#else + +static inline void mm_pgtables_bytes_init(struct mm_struct *mm) {} +static inline unsigned long mm_pgtables_bytes(const struct mm_struct *mm) +{ + return 0; +} + +static inline void mm_inc_nr_ptes(struct mm_struct *mm) {} +static inline void mm_dec_nr_ptes(struct mm_struct *mm) {} +#endif + int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address); int __pte_alloc_kernel(pmd_t *pmd, unsigned long address); @@ -2510,6 +2547,44 @@ int vm_insert_mixed_mkwrite(struct vm_area_struct *vma, unsigned long addr, pfn_t pfn); int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len); +static inline vm_fault_t vmf_insert_page(struct vm_area_struct *vma, + unsigned long addr, struct page *page) +{ + int err = vm_insert_page(vma, addr, page); + + if (err == -ENOMEM) + return VM_FAULT_OOM; + if (err < 0 && err != -EBUSY) + return VM_FAULT_SIGBUS; + + return VM_FAULT_NOPAGE; +} + +static inline vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, + unsigned long addr, pfn_t pfn) +{ + int err = vm_insert_mixed(vma, addr, pfn); + + if (err == -ENOMEM) + return VM_FAULT_OOM; + if (err < 0 && err != -EBUSY) + return VM_FAULT_SIGBUS; + + return VM_FAULT_NOPAGE; +} + +static inline vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, + unsigned long addr, unsigned long pfn) +{ + int err = vm_insert_pfn(vma, addr, pfn); + + if (err == -ENOMEM) + return VM_FAULT_OOM; + if (err < 0 && err != -EBUSY) + return VM_FAULT_SIGBUS; + + return VM_FAULT_NOPAGE; +} struct page *follow_page(struct vm_area_struct *vma, unsigned long address, unsigned int foll_flags); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 92f7c16cef5f..e94d52ab9388 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -22,6 +22,8 @@ #endif #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) +typedef int vm_fault_t; + struct address_space; struct mem_cgroup; struct hmm; @@ -33,161 +35,157 @@ struct hmm; * a page, though if it is a pagecache page, rmap structures can tell us * who is mapping it. * - * The objects in struct page are organized in double word blocks in - * order to allows us to use atomic double word operations on portions - * of struct page. That is currently only used by slub but the arrangement - * allows the use of atomic double word operations on the flags/mapping - * and lru list pointers also. + * If you allocate the page using alloc_pages(), you can use some of the + * space in struct page for your own purposes. The five words in the main + * union are available, except for bit 0 of the first word which must be + * kept clear. Many users use this word to store a pointer to an object + * which is guaranteed to be aligned. If you use the same storage as + * page->mapping, you must restore it to NULL before freeing the page. + * + * If your page will not be mapped to userspace, you can also use the four + * bytes in the mapcount union, but you must call page_mapcount_reset() + * before freeing it. + * + * If you want to use the refcount field, it must be used in such a way + * that other CPUs temporarily incrementing and then decrementing the + * refcount does not cause problems. On receiving the page from + * alloc_pages(), the refcount will be positive. + * + * If you allocate pages of order > 0, you can use some of the fields + * in each subpage, but you may need to restore some of their values + * afterwards. + * + * SLUB uses cmpxchg_double() to atomically update its freelist and + * counters. That requires that freelist & counters be adjacent and + * double-word aligned. We align all struct pages to double-word + * boundaries, and ensure that 'freelist' is aligned within the + * struct. */ +#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE +#define _struct_page_alignment __aligned(2 * sizeof(unsigned long)) +#else +#define _struct_page_alignment +#endif + struct page { - /* First double word block */ unsigned long flags; /* Atomic flags, some possibly * updated asynchronously */ + /* + * Five words (20/40 bytes) are available in this union. + * WARNING: bit 0 of the first word is used for PageTail(). That + * means the other users of this union MUST NOT use the bit to + * avoid collision and false-positive PageTail(). + */ union { - struct address_space *mapping; /* If low bit clear, points to - * inode address_space, or NULL. - * If page mapped as anonymous - * memory, low bit is set, and - * it points to anon_vma object: - * see PAGE_MAPPING_ANON below. - */ - void *s_mem; /* slab first object */ - atomic_t compound_mapcount; /* first tail page */ - /* page_deferred_list().next -- second tail page */ - }; - - /* Second double word */ - union { - pgoff_t index; /* Our offset within mapping. */ - void *freelist; /* sl[aou]b first free object */ - /* page_deferred_list().prev -- second tail page */ - }; - - union { -#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \ - defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE) - /* Used for cmpxchg_double in slub */ - unsigned long counters; -#else - /* - * Keep _refcount separate from slub cmpxchg_double data. - * As the rest of the double word is protected by slab_lock - * but _refcount is not. - */ - unsigned counters; -#endif - struct { - + struct { /* Page cache and anonymous pages */ + /** + * @lru: Pageout list, eg. active_list protected by + * zone_lru_lock. Sometimes used as a generic list + * by the page owner. + */ + struct list_head lru; + /* See page-flags.h for PAGE_MAPPING_FLAGS */ + struct address_space *mapping; + pgoff_t index; /* Our offset within mapping. */ + /** + * @private: Mapping-private opaque data. + * Usually used for buffer_heads if PagePrivate. + * Used for swp_entry_t if PageSwapCache. + * Indicates order in the buddy system if PageBuddy. + */ + unsigned long private; + }; + struct { /* page_pool used by netstack */ + /** + * @dma_addr: might require a 64-bit value even on + * 32-bit architectures. + */ + dma_addr_t dma_addr; + }; + struct { /* slab, slob and slub */ union { - /* - * Count of ptes mapped in mms, to show when - * page is mapped & limit reverse map searches. - * - * Extra information about page type may be - * stored here for pages that are never mapped, - * in which case the value MUST BE <= -2. - * See page-flags.h for more details. - */ - atomic_t _mapcount; - - unsigned int active; /* SLAB */ + struct list_head slab_list; /* uses lru */ + struct { /* Partial pages */ + struct page *next; +#ifdef CONFIG_64BIT + int pages; /* Nr of pages left */ + int pobjects; /* Approximate count */ +#else + short int pages; + short int pobjects; +#endif + }; + }; + struct kmem_cache *slab_cache; /* not slob */ + /* Double-word boundary */ + void *freelist; /* first free object */ + union { + void *s_mem; /* slab: first object */ + unsigned long counters; /* SLUB */ struct { /* SLUB */ unsigned inuse:16; unsigned objects:15; unsigned frozen:1; }; - int units; /* SLOB */ }; - /* - * Usage count, *USE WRAPPER FUNCTION* when manual - * accounting. See page_ref.h - */ - atomic_t _refcount; }; - }; - - /* - * Third double word block - * - * WARNING: bit 0 of the first word encode PageTail(). That means - * the rest users of the storage space MUST NOT use the bit to - * avoid collision and false-positive PageTail(). - */ - union { - struct list_head lru; /* Pageout list, eg. active_list - * protected by zone_lru_lock ! - * Can be used as a generic list - * by the page owner. - */ - struct dev_pagemap *pgmap; /* ZONE_DEVICE pages are never on an - * lru or handled by a slab - * allocator, this points to the - * hosting device page map. - */ - struct { /* slub per cpu partial pages */ - struct page *next; /* Next partial slab */ -#ifdef CONFIG_64BIT - int pages; /* Nr of partial slabs left */ - int pobjects; /* Approximate # of objects */ -#else - short int pages; - short int pobjects; -#endif - }; - - struct rcu_head rcu_head; /* Used by SLAB - * when destroying via RCU - */ - /* Tail pages of compound page */ - struct { - unsigned long compound_head; /* If bit zero is set */ + struct { /* Tail pages of compound page */ + unsigned long compound_head; /* Bit zero is set */ /* First tail page only */ -#ifdef CONFIG_64BIT - /* - * On 64 bit system we have enough space in struct page - * to encode compound_dtor and compound_order with - * unsigned int. It can help compiler generate better or - * smaller code on some archtectures. - */ - unsigned int compound_dtor; - unsigned int compound_order; -#else - unsigned short int compound_dtor; - unsigned short int compound_order; -#endif + unsigned char compound_dtor; + unsigned char compound_order; + atomic_t compound_mapcount; }; - -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS - struct { - unsigned long __pad; /* do not overlay pmd_huge_pte - * with compound_head to avoid - * possible bit 0 collision. - */ + struct { /* Second tail page of compound page */ + unsigned long _compound_pad_1; /* compound_head */ + unsigned long _compound_pad_2; + struct list_head deferred_list; + }; + struct { /* Page table pages */ + unsigned long _pt_pad_1; /* compound_head */ pgtable_t pmd_huge_pte; /* protected by page->ptl */ - }; + unsigned long _pt_pad_2; /* mapping */ + struct mm_struct *pt_mm; /* x86 pgds only */ +#if ALLOC_SPLIT_PTLOCKS + spinlock_t *ptl; +#else + spinlock_t ptl; #endif + }; + struct { /* ZONE_DEVICE pages */ + /** @pgmap: Points to the hosting device page map. */ + struct dev_pagemap *pgmap; + unsigned long hmm_data; + unsigned long _zd_pad_1; /* uses mapping */ + }; + + /** @rcu_head: You can use this to free a page by RCU. */ + struct rcu_head rcu_head; }; - /* Remainder is not double word aligned */ - union { - unsigned long private; /* Mapping-private opaque data: - * usually used for buffer_heads - * if PagePrivate set; used for - * swp_entry_t if PageSwapCache; - * indicates order in the buddy - * system if PG_buddy is set. - */ -#if USE_SPLIT_PTE_PTLOCKS -#if ALLOC_SPLIT_PTLOCKS - spinlock_t *ptl; -#else - spinlock_t ptl; -#endif -#endif - struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */ + union { /* This union is 4 bytes in size. */ + /* + * If the page can be mapped to userspace, encodes the number + * of times this page is referenced by a page table. + */ + atomic_t _mapcount; + + /* + * If the page is neither PageSlab nor mappable to userspace, + * the value stored here may help determine what this page + * is used for. See page-flags.h for a list of page types + * which are currently stored here. + */ + unsigned int page_type; + + unsigned int active; /* SLAB */ + int units; /* SLOB */ }; + /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */ + atomic_t _refcount; + #ifdef CONFIG_MEMCG struct mem_cgroup *mem_cgroup; #endif @@ -210,15 +208,7 @@ struct page { #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS int _last_cpupid; #endif -} -/* - * The struct page can be forced to be double word aligned so that atomic ops - * on double words work. The SLUB allocator can make use of such a feature. - */ -#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE - __aligned(2 * sizeof(unsigned long)) -#endif -; +} _struct_page_alignment; #define PAGE_FRAG_CACHE_MAX_SIZE __ALIGN_MASK(32768, ~PAGE_MASK) #define PAGE_FRAG_CACHE_MAX_ORDER get_order(PAGE_FRAG_CACHE_MAX_SIZE) @@ -403,9 +393,8 @@ struct mm_struct { */ atomic_t mm_count; - atomic_long_t nr_ptes; /* PTE page table pages */ -#if CONFIG_PGTABLE_LEVELS > 2 - atomic_long_t nr_pmds; /* PMD page table pages */ +#ifdef CONFIG_MMU + atomic_long_t pgtables_bytes; /* PTE page table pages */ #endif int map_count; /* number of VMAs */ @@ -644,9 +633,9 @@ struct vm_special_mapping { * If non-NULL, then this is called to resolve page faults * on the special mapping. If used, .pages is not checked. */ - int (*fault)(const struct vm_special_mapping *sm, - struct vm_area_struct *vma, - struct vm_fault *vmf); + vm_fault_t (*fault)(const struct vm_special_mapping *sm, + struct vm_area_struct *vma, + struct vm_fault *vmf); int (*mremap)(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma); diff --git a/include/linux/module.h b/include/linux/module.h index 6fd9449ca6b1..b4bd7a4d2369 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -437,6 +437,10 @@ struct module { unsigned int num_tracepoints; struct tracepoint * const *tracepoints_ptrs; #endif +#ifdef CONFIG_BPF_EVENTS + unsigned int num_bpf_raw_events; + struct bpf_raw_event_map *bpf_raw_events; +#endif #ifdef HAVE_JUMP_LABEL struct jump_entry *jump_entries; unsigned int num_jump_entries; @@ -481,6 +485,11 @@ struct module { ctor_fn_t *ctors; unsigned int num_ctors; #endif + +#ifdef CONFIG_FUNCTION_ERROR_INJECTION + unsigned int num_ei_funcs; + unsigned long *ei_funcs; +#endif } ____cacheline_aligned __randomize_layout; #ifndef MODULE_ARCH_INIT #define MODULE_ARCH_INIT {} diff --git a/include/linux/nd.h b/include/linux/nd.h index 5dc6b695437d..0bdaae2f525a 100644 --- a/include/linux/nd.h +++ b/include/linux/nd.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef __LINUX_ND_H__ #define __LINUX_ND_H__ diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index b1542e8a68a0..fa403396f850 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -808,21 +808,18 @@ enum bpf_netdev_command { */ XDP_SETUP_PROG, XDP_SETUP_PROG_HW, - /* Check if a bpf program is set on the device. The callee should - * set @prog_attached to one of XDP_ATTACHED_* values, note that "true" - * is equivalent to XDP_ATTACHED_DRV. - */ XDP_QUERY_PROG, + XDP_QUERY_PROG_HW, /* BPF program for offload callbacks, invoked at program load time. */ - BPF_OFFLOAD_VERIFIER_PREP, - BPF_OFFLOAD_TRANSLATE, - BPF_OFFLOAD_DESTROY, BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE, + XDP_QUERY_XSK_UMEM, + XDP_SETUP_XSK_UMEM, }; struct bpf_prog_offload_ops; struct netlink_ext_ack; +struct xdp_umem; struct netdev_bpf { enum bpf_netdev_command command; @@ -833,24 +830,21 @@ struct netdev_bpf { struct bpf_prog *prog; struct netlink_ext_ack *extack; }; - /* XDP_QUERY_PROG */ + /* XDP_QUERY_PROG, XDP_QUERY_PROG_HW */ struct { - u8 prog_attached; u32 prog_id; + /* flags with which program was installed */ + u32 prog_flags; }; - /* BPF_OFFLOAD_VERIFIER_PREP */ - struct { - struct bpf_prog *prog; - const struct bpf_prog_offload_ops *ops; /* callee set */ - } verifier; - /* BPF_OFFLOAD_TRANSLATE, BPF_OFFLOAD_DESTROY */ - struct { - struct bpf_prog *prog; - } offload; /* BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE */ struct { struct bpf_offloaded_map *offmap; }; + /* XDP_QUERY_XSK_UMEM, XDP_SETUP_XSK_UMEM */ + struct { + struct xdp_umem *umem; /* out for query*/ + u16 queue_id; /* in for query */ + } xsk; }; }; @@ -1188,9 +1182,13 @@ struct macsec_ops { * This function is used to set or query state related to XDP on the * netdevice and manage BPF offload. See definition of * enum bpf_netdev_command for details. - * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp); - * This function is used to submit a XDP packet for transmit on a - * netdevice. + * int (*ndo_xdp_xmit)(struct net_device *dev, int n, struct xdp_frame **xdp, + * u32 flags); + * This function is used to submit @n XDP packets for transmit on a + * netdevice. Returns number of frames successfully transmitted, frames + * that got dropped are freed/returned via xdp_return_frame(). + * Returns negative number, means general error invoking ndo, meaning + * no frames were xmit'ed and core-caller will free all frames. * void (*ndo_xdp_flush)(struct net_device *dev); * This function is used to inform the driver to flush a particular * xdp tx queue. Must be called on same CPU as xdp_xmit. @@ -1377,8 +1375,9 @@ struct net_device_ops { int needed_headroom); int (*ndo_bpf)(struct net_device *dev, struct netdev_bpf *bpf); - int (*ndo_xdp_xmit)(struct net_device *dev, - struct xdp_buff *xdp); + int (*ndo_xdp_xmit)(struct net_device *dev, int n, + struct xdp_frame **xdp, + u32 flags); void (*ndo_xdp_flush)(struct net_device *dev); }; @@ -2525,14 +2524,6 @@ void netdev_freemem(struct net_device *dev); void synchronize_net(void); int init_dummy_netdev(struct net_device *dev); -DECLARE_PER_CPU(int, xmit_recursion); -#define XMIT_RECURSION_LIMIT 8 - -static inline int dev_recursion_level(void) -{ - return this_cpu_read(xmit_recursion); -} - struct net_device *dev_get_by_index(struct net *net, int ifindex); struct net_device *__dev_get_by_index(struct net *net, int ifindex); struct net_device *dev_get_by_index_rcu(struct net *net, int ifindex); @@ -2880,9 +2871,11 @@ struct softnet_data { struct Qdisc *output_queue; struct Qdisc **output_queue_tailp; struct sk_buff *completion_queue; -#ifdef CONFIG_XFRM_OFFLOAD - struct sk_buff_head xfrm_backlog; -#endif + /* written and read only by owning cpu: */ + struct { + u16 recursion; + u8 more; + } xmit; #ifdef CONFIG_RPS /* input_queue_head should be written by cpu owning this struct, * and only read by other cpus. Worth using a cache line. @@ -2918,6 +2911,28 @@ static inline void input_queue_tail_incr_save(struct softnet_data *sd, DECLARE_PER_CPU_ALIGNED(struct softnet_data, softnet_data); +static inline int dev_recursion_level(void) +{ + return __this_cpu_read(softnet_data.xmit.recursion); +} + +#define XMIT_RECURSION_LIMIT 8 +static inline bool dev_xmit_recursion(void) +{ + return unlikely(__this_cpu_read(softnet_data.xmit.recursion) > + XMIT_RECURSION_LIMIT); +} + +static inline void dev_xmit_recursion_inc(void) +{ + __this_cpu_inc(softnet_data.xmit.recursion); +} + +static inline void dev_xmit_recursion_dec(void) +{ + __this_cpu_dec(softnet_data.xmit.recursion); +} + void __netif_schedule(struct Qdisc *q); void netif_schedule_queue(struct netdev_queue *txq); @@ -3305,6 +3320,12 @@ static inline int netif_set_real_num_rx_queues(struct net_device *dev, } #endif +static inline struct netdev_rx_queue * +__netif_get_rx_queue(struct net_device *dev, unsigned int rxq) +{ + return dev->_rx + rxq; +} + #ifdef CONFIG_SYSFS static inline unsigned int get_netdev_rx_queue_index( struct netdev_rx_queue *queue) @@ -3372,6 +3393,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb); int netif_rx(struct sk_buff *skb); int netif_rx_ni(struct sk_buff *skb); int netif_receive_skb(struct sk_buff *skb); +int netif_receive_skb_core(struct sk_buff *skb); gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb); void napi_gro_flush(struct napi_struct *napi, bool flush_old); struct sk_buff *napi_get_frags(struct napi_struct *napi); @@ -3414,14 +3436,16 @@ int dev_get_phys_port_id(struct net_device *dev, int dev_get_phys_port_name(struct net_device *dev, char *name, size_t len); int dev_change_proto_down(struct net_device *dev, bool proto_down); -struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev, bool *again); +struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev); struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq, int *ret); typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, int fd, u32 flags); -u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t xdp_op, u32 *prog_id); +u32 __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op, + enum bpf_netdev_command cmd); +int xdp_umem_query(struct net_device *dev, u16 queue_id); int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); @@ -4153,6 +4177,11 @@ static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops, return ops->ndo_start_xmit(skb, dev); } +static inline bool netdev_xmit_more(void) +{ + return __this_cpu_read(softnet_data.xmit.more); +} + static inline netdev_tx_t netdev_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq, bool more) { diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index a02794afc4dc..7da4687ca781 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -654,49 +654,56 @@ PAGEFLAG_FALSE(DoubleMap) #endif /* - * For pages that are never mapped to userspace, page->mapcount may be - * used for storing extra information about page type. Any value used - * for this purpose must be <= -2, but it's better start not too close - * to -2 so that an underflow of the page_mapcount() won't be mistaken - * for a special page. + * For pages that are never mapped to userspace (and aren't PageSlab), + * page_type may be used. Because it is initialised to -1, we invert the + * sense of the bit, so __SetPageFoo *clears* the bit used for PageFoo, and + * __ClearPageFoo *sets* the bit used for PageFoo. We reserve a few high and + * low bits so that an underflow or overflow of page_mapcount() won't be + * mistaken for a page type value. */ -#define PAGE_MAPCOUNT_OPS(uname, lname) \ + +#define PAGE_TYPE_BASE 0xf0000000 +/* Reserve 0x0000007f to catch underflows of page_mapcount */ +#define PG_buddy 0x00000080 +#define PG_balloon 0x00000100 +#define PG_kmemcg 0x00000200 + +#define PageType(page, flag) \ + ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE) + +#define PAGE_TYPE_OPS(uname, lname) \ static __always_inline int Page##uname(struct page *page) \ { \ - return atomic_read(&page->_mapcount) == \ - PAGE_##lname##_MAPCOUNT_VALUE; \ + return PageType(page, PG_##lname); \ } \ static __always_inline void __SetPage##uname(struct page *page) \ { \ - VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page); \ - atomic_set(&page->_mapcount, PAGE_##lname##_MAPCOUNT_VALUE); \ + VM_BUG_ON_PAGE(!PageType(page, 0), page); \ + page->page_type &= ~PG_##lname; \ } \ static __always_inline void __ClearPage##uname(struct page *page) \ { \ VM_BUG_ON_PAGE(!Page##uname(page), page); \ - atomic_set(&page->_mapcount, -1); \ + page->page_type |= PG_##lname; \ } /* - * PageBuddy() indicate that the page is free and in the buddy system + * PageBuddy() indicates that the page is free and in the buddy system * (see mm/page_alloc.c). */ -#define PAGE_BUDDY_MAPCOUNT_VALUE (-128) -PAGE_MAPCOUNT_OPS(Buddy, BUDDY) +PAGE_TYPE_OPS(Buddy, buddy) /* - * PageBalloon() is set on pages that are on the balloon page list + * PageBalloon() is true for pages that are on the balloon page list * (see mm/balloon_compaction.c). */ -#define PAGE_BALLOON_MAPCOUNT_VALUE (-256) -PAGE_MAPCOUNT_OPS(Balloon, BALLOON) +PAGE_TYPE_OPS(Balloon, balloon) /* * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on * pages allocated with __GFP_ACCOUNT. It gets cleared on page free. */ -#define PAGE_KMEMCG_MAPCOUNT_VALUE (-512) -PAGE_MAPCOUNT_OPS(Kmemcg, KMEMCG) +PAGE_TYPE_OPS(Kmemcg, kmemcg) extern bool is_free_buddy_page(struct page *page); diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index c132338b59cb..53c195133d2b 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -841,6 +841,7 @@ struct perf_output_handle { struct bpf_perf_event_data_kern { struct pt_regs *regs; struct perf_sample_data *data; + struct perf_event *event; }; #ifdef CONFIG_CGROUP_PERF @@ -900,6 +901,7 @@ extern void perf_event_exit_task(struct task_struct *child); extern void perf_event_free_task(struct task_struct *task); extern void perf_event_delayed_put(struct task_struct *task); extern struct file *perf_event_get(unsigned int fd); +extern const struct perf_event *perf_get_event(struct file *file); extern const struct perf_event_attr *perf_event_attrs(struct perf_event *event); extern void perf_event_print_debug(void); extern void perf_pmu_disable(struct pmu *pmu); @@ -919,7 +921,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, void *context); extern void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu); -int perf_event_read_local(struct perf_event *event, u64 *value); +int perf_event_read_local(struct perf_event *event, u64 *value, + u64 *enabled, u64 *running); extern u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running); @@ -1153,6 +1156,13 @@ static inline void perf_event_task_sched_out(struct task_struct *prev, } extern void perf_event_mmap(struct vm_area_struct *vma); + +extern void perf_event_ksymbol(u16 ksym_type, u64 addr, u32 len, + bool unregister, const char *sym); +extern void perf_event_bpf_event(struct bpf_prog *prog, + enum perf_bpf_event_type type, + u16 flags); + extern struct perf_guest_info_callbacks *perf_guest_cbs; extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks); @@ -1362,11 +1372,16 @@ static inline void perf_event_exit_task(struct task_struct *child) { } static inline void perf_event_free_task(struct task_struct *task) { } static inline void perf_event_delayed_put(struct task_struct *task) { } static inline struct file *perf_event_get(unsigned int fd) { return ERR_PTR(-EINVAL); } +static inline const struct perf_event *perf_get_event(struct file *file) +{ + return ERR_PTR(-EINVAL); +} static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *event) { return ERR_PTR(-EINVAL); } -static inline int perf_event_read_local(struct perf_event *event, u64 *value) +static inline int perf_event_read_local(struct perf_event *event, u64 *value, + u64 *enabled, u64 *running) { return -EINVAL; } @@ -1391,6 +1406,13 @@ static inline int perf_unregister_guest_info_callbacks (struct perf_guest_info_callbacks *callbacks) { return 0; } static inline void perf_event_mmap(struct vm_area_struct *vma) { } + +typedef int (perf_ksymbol_get_name_f)(char *name, int name_len, void *data); +static inline void perf_event_ksymbol(u16 ksym_type, u64 addr, u32 len, + bool unregister, const char *sym) { } +static inline void perf_event_bpf_event(struct bpf_prog *prog, + enum perf_bpf_event_type type, + u16 flags) { } static inline void perf_event_exec(void) { } static inline void perf_event_comm(struct task_struct *tsk, bool exec) { } static inline void perf_event_namespaces(struct task_struct *tsk) { } diff --git a/include/linux/platform_data/media/gpio-ir-recv.h b/include/linux/platform_data/media/gpio-ir-recv.h deleted file mode 100644 index 0c298f569d5a..000000000000 --- a/include/linux/platform_data/media/gpio-ir-recv.h +++ /dev/null @@ -1,23 +0,0 @@ -/* Copyright (c) 2012, Code Aurora Forum. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 and - * only version 2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - */ - -#ifndef __GPIO_IR_RECV_H__ -#define __GPIO_IR_RECV_H__ - -struct gpio_ir_recv_platform_data { - int gpio_nr; - bool active_low; - u64 allowed_protos; - const char *map_name; -}; - -#endif /* __GPIO_IR_RECV_H__ */ diff --git a/include/linux/platform_data/media/ir-rx51.h b/include/linux/platform_data/media/ir-rx51.h deleted file mode 100644 index 9d127aa648e7..000000000000 --- a/include/linux/platform_data/media/ir-rx51.h +++ /dev/null @@ -1,9 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _IR_RX51_H -#define _IR_RX51_H - -struct ir_rx51_platform_data { - int(*set_max_mpu_wakeup_lat)(struct device *dev, long t); -}; - -#endif diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h index 480946b3b449..912a91bb41c5 100644 --- a/include/linux/proc_ns.h +++ b/include/linux/proc_ns.h @@ -76,10 +76,10 @@ static inline int ns_alloc_inum(struct ns_common *ns) extern struct file *proc_ns_fget(int fd); #define get_proc_ns(inode) ((struct ns_common *)(inode)->i_private) -extern int ns_get_path(struct path *path, struct task_struct *task, +extern void *ns_get_path(struct path *path, struct task_struct *task, const struct proc_ns_operations *ns_ops); typedef struct ns_common *ns_get_path_helper_t(void *); -extern int ns_get_path_cb(struct path *path, ns_get_path_helper_t ns_get_cb, +extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t ns_get_cb, void *private_data); extern int ns_get_name(char *buf, size_t size, struct task_struct *task, const struct proc_ns_operations *ns_ops); diff --git a/include/linux/pstore.h b/include/linux/pstore.h index 70913ec87785..de9093d6e660 100644 --- a/include/linux/pstore.h +++ b/include/linux/pstore.h @@ -71,7 +71,7 @@ struct pstore_record { struct pstore_info *psi; enum pstore_type_id type; u64 id; - struct timespec time; + struct timespec64 time; char *buf; ssize_t size; ssize_t ecc_notice_size; diff --git a/include/linux/rculist.h b/include/linux/rculist.h index 127f534fec94..4ac3cd23539d 100644 --- a/include/linux/rculist.h +++ b/include/linux/rculist.h @@ -40,6 +40,24 @@ static inline void INIT_LIST_HEAD_RCU(struct list_head *list) */ #define list_next_rcu(list) (*((struct list_head __rcu **)(&(list)->next))) +/* + * Check during list traversal that we are within an RCU reader + */ + +#define check_arg_count_one(dummy) + +#ifdef CONFIG_PROVE_RCU_LIST +#define __list_check_rcu(dummy, cond, extra...) \ + ({ \ + check_arg_count_one(extra); \ + RCU_LOCKDEP_WARN(!cond && !rcu_read_lock_any_held(), \ + "RCU-list traversed in non-reader section!"); \ + }) +#else +#define __list_check_rcu(dummy, cond, extra...) \ + ({ check_arg_count_one(extra); }) +#endif + /* * Insert a new entry between two known consecutive entries. * @@ -343,14 +361,16 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, * @pos: the type * to use as a loop cursor. * @head: the head for your list. * @member: the name of the list_head within the struct. + * @cond: optional lockdep expression if called from non-RCU protection. * * This list-traversal primitive may safely run concurrently with * the _rcu list-mutation primitives such as list_add_rcu() * as long as the traversal is guarded by rcu_read_lock(). */ -#define list_for_each_entry_rcu(pos, head, member) \ - for (pos = list_entry_rcu((head)->next, typeof(*pos), member); \ - &pos->member != (head); \ +#define list_for_each_entry_rcu(pos, head, member, cond...) \ + for (__list_check_rcu(dummy, ## cond, 0), \ + pos = list_entry_rcu((head)->next, typeof(*pos), member); \ + &pos->member != (head); \ pos = list_entry_rcu(pos->member.next, typeof(*pos), member)) /** @@ -588,13 +608,15 @@ static inline void hlist_add_behind_rcu(struct hlist_node *n, * @pos: the type * to use as a loop cursor. * @head: the head for your list. * @member: the name of the hlist_node within the struct. + * @cond: optional lockdep expression if called from non-RCU protection. * * This list-traversal primitive may safely run concurrently with * the _rcu list-mutation primitives such as hlist_add_head_rcu() * as long as the traversal is guarded by rcu_read_lock(). */ -#define hlist_for_each_entry_rcu(pos, head, member) \ - for (pos = hlist_entry_safe (rcu_dereference_raw(hlist_first_rcu(head)),\ +#define hlist_for_each_entry_rcu(pos, head, member, cond...) \ + for (__list_check_rcu(dummy, ## cond, 0), \ + pos = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),\ typeof(*(pos)), member); \ pos; \ pos = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(\ diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 8d570190e9b4..ff9e13d16e28 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -198,6 +198,37 @@ do { \ rcu_note_voluntary_context_switch(current); \ } while (0) +/** + * rcu_softirq_qs_periodic - Report RCU and RCU-Tasks quiescent states + * @old_ts: jiffies at start of processing. + * + * This helper is for long-running softirq handlers, such as NAPI threads in + * networking. The caller should initialize the variable passed in as @old_ts + * at the beginning of the softirq handler. When invoked frequently, this macro + * will invoke rcu_softirq_qs() every 100 milliseconds thereafter, which will + * provide both RCU and RCU-Tasks quiescent states. Note that this macro + * modifies its old_ts argument. + * + * Because regions of code that have disabled softirq act as RCU read-side + * critical sections, this macro should be invoked with softirq (and + * preemption) enabled. + * + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would + * have more chance to invoke schedule() calls and provide necessary quiescent + * states. As a contrast, calling cond_resched() only won't achieve the same + * effect because cond_resched() does not provide RCU-Tasks quiescent states. + */ +#define rcu_softirq_qs_periodic(old_ts) \ +do { \ + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && \ + time_after(jiffies, (old_ts) + HZ / 10)) { \ + preempt_disable(); \ + rcu_softirq_qs(); \ + preempt_enable(); \ + (old_ts) = jiffies; \ + } \ +} while (0) + /* * Infrastructure to implement the synchronize_() primitives in * TREE_RCU and rcu_barrier_() primitives in TINY_RCU. @@ -255,6 +286,7 @@ int debug_lockdep_rcu_enabled(void); int rcu_read_lock_held(void); int rcu_read_lock_bh_held(void); int rcu_read_lock_sched_held(void); +int rcu_read_lock_any_held(void); #else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ @@ -275,6 +307,12 @@ static inline int rcu_read_lock_sched_held(void) { return !preemptible(); } + +static inline int rcu_read_lock_any_held(void) +{ + return !preemptible(); +} + #endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */ #ifdef CONFIG_PROVE_RCU diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index b3dbf9502fd0..7980be8e0080 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -90,6 +90,11 @@ static inline void kfree_call_rcu(struct rcu_head *head, call_rcu(head, func); } +static inline void rcu_softirq_qs(void) +{ + rcu_sched_qs(); +} + #define rcu_note_context_switch(preempt) \ do { \ rcu_sched_qs(); \ @@ -116,6 +121,11 @@ static inline void rcu_irq_exit_irqson(void) { } static inline void rcu_irq_enter_irqson(void) { } static inline void rcu_irq_exit(void) { } static inline void exit_rcu(void) { } +static inline bool rcu_preempt_need_deferred_qs(struct task_struct *t) +{ + return false; +} +static inline void rcu_preempt_deferred_qs(struct task_struct *t) { } #ifdef CONFIG_SRCU void rcu_scheduler_starting(void); #else /* #ifndef CONFIG_SRCU */ diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 37d6fd3b7ff8..2c44a87c328d 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -30,6 +30,7 @@ #ifndef __LINUX_RCUTREE_H #define __LINUX_RCUTREE_H +void rcu_softirq_qs(void); void rcu_note_context_switch(bool preempt); int rcu_needs_cpu(u64 basem, u64 *nextevt); void rcu_cpu_stall_reset(void); diff --git a/include/linux/sched.h b/include/linux/sched.h index 70f5317bdacd..20bcc24eca02 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1589,6 +1589,7 @@ extern struct pid *cad_pid; #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */ +#define PF_UMH 0x02000000 /* I'm an Usermodehelper process */ #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ #define PF_WAKE_UP_IDLE 0x10000000 /* TTWU on an idle CPU */ @@ -2007,4 +2008,12 @@ static inline void set_wake_up_idle(bool enabled) current->flags &= ~PF_WAKE_UP_IDLE; } +void __exit_umh(struct task_struct *tsk); + +static inline void exit_umh(struct task_struct *tsk) +{ + if (unlikely(tsk->flags & PF_UMH)) + __exit_umh(tsk); +} + #endif diff --git a/include/linux/security.h b/include/linux/security.h index a1a407bc9e92..fa06673570b8 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -57,9 +57,12 @@ struct xattr; struct xfrm_sec_ctx; struct mm_struct; +/* Default (no) options for the capable function */ +#define CAP_OPT_NONE 0x0 /* If capable should audit the security request */ -#define SECURITY_CAP_NOAUDIT 0 -#define SECURITY_CAP_AUDIT 1 +#define CAP_OPT_NOAUDIT BIT(1) +/* If capable is being called by a setid function */ +#define CAP_OPT_INSETID BIT(2) /* LSM Agnostic defines for sb_set_mnt_opts */ #define SECURITY_LSM_NATIVE_LABELS 1 @@ -103,7 +106,7 @@ enum lsm_event { /* These functions are in security/commoncap.c */ extern int cap_capable(const struct cred *cred, struct user_namespace *ns, - int cap, int audit); + int cap, unsigned int opts); extern int cap_settime(const struct timespec64 *ts, const struct timezone *tz); extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode); extern int cap_ptrace_traceme(struct task_struct *parent); @@ -242,10 +245,10 @@ int security_capset(struct cred *new, const struct cred *old, const kernel_cap_t *effective, const kernel_cap_t *inheritable, const kernel_cap_t *permitted); -int security_capable(const struct cred *cred, struct user_namespace *ns, - int cap); -int security_capable_noaudit(const struct cred *cred, struct user_namespace *ns, - int cap); +int security_capable(const struct cred *cred, + struct user_namespace *ns, + int cap, + unsigned int opts); int security_quotactl(int cmds, int type, int id, struct super_block *sb); int security_quota_on(struct dentry *dentry); int security_syslog(int type); @@ -507,14 +510,11 @@ static inline int security_capset(struct cred *new, } static inline int security_capable(const struct cred *cred, - struct user_namespace *ns, int cap) + struct user_namespace *ns, + int cap, + unsigned int opts) { - return cap_capable(cred, ns, cap, SECURITY_CAP_AUDIT); -} - -static inline int security_capable_noaudit(const struct cred *cred, - struct user_namespace *ns, int cap) { - return cap_capable(cred, ns, cap, SECURITY_CAP_NOAUDIT); + return cap_capable(cred, ns, cap, opts); } static inline int security_quotactl(int cmds, int type, int id, diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h index e5140648f638..99ebd122a23b 100644 --- a/include/linux/set_memory.h +++ b/include/linux/set_memory.h @@ -17,4 +17,15 @@ static inline int set_memory_x(unsigned long addr, int numpages) { return 0; } static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; } #endif +#ifndef CONFIG_ARCH_HAS_SET_DIRECT_MAP +static inline int set_direct_map_invalid_noflush(struct page *page) +{ + return 0; +} +static inline int set_direct_map_default_noflush(struct page *page) +{ + return 0; +} +#endif + #endif /* _LINUX_SET_MEMORY_H_ */ diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index efbfae01adfb..6817e3ee5d66 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -245,7 +245,6 @@ struct iov_iter; struct napi_struct; struct bpf_prog; union bpf_attr; -struct skb_ext; #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) struct nf_conntrack { @@ -637,7 +636,6 @@ typedef unsigned char *sk_buff_data_t; * @queue_mapping: Queue mapping for multiqueue devices * @xmit_more: More SKBs are pending for this queue * @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves - * @active_extensions: active extensions (skb_ext_id types) * @ndisc_nodetype: router type (from link layer) * @ooo_okay: allow the mapping of a socket to a queue to be changed * @l4_hash: indicate hash is a canonical 4-tuple hash over transport @@ -666,7 +664,6 @@ typedef unsigned char *sk_buff_data_t; * @data: Data head pointer * @truesize: Buffer size * @users: User count - see {datagram,tcp}.c - * @extensions: allocated extensions, valid if active_extensions is nonzero */ struct sk_buff { @@ -743,9 +740,7 @@ struct sk_buff { head_frag:1, xmit_more:1, pfmemalloc:1; -#ifdef CONFIG_SKB_EXTENSIONS - __u8 active_extensions; -#endif + /* fields enclosed in headers_start/headers_end are copied * using a single memcpy() in __copy_skb_header() */ @@ -859,10 +854,6 @@ struct sk_buff { *data; unsigned int truesize; refcount_t users; -#ifdef CONFIG_SKB_EXTENSIONS - /* only useable after checking ->active_extensions != 0 */ - struct skb_ext *extensions; -#endif }; #ifdef __KERNEL__ @@ -998,6 +989,16 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags, int node); struct sk_buff *__build_skb(void *data, unsigned int frag_size); struct sk_buff *build_skb(void *data, unsigned int frag_size); +struct sk_buff *build_skb_around(struct sk_buff *skb, + void *data, unsigned int frag_size); + +/** + * alloc_skb - allocate a network buffer + * @size: size to allocate + * @priority: allocation mask + * + * This function is a convenient wrapper around __alloc_skb(). + */ static inline struct sk_buff *alloc_skb(unsigned int size, gfp_t priority) { @@ -1186,7 +1187,7 @@ void __skb_get_hash(struct sk_buff *skb); u32 __skb_get_hash_symmetric(const struct sk_buff *skb); u32 skb_get_poff(const struct sk_buff *skb); u32 __skb_get_poff(const struct sk_buff *skb, void *data, - const struct flow_keys *keys, int hlen); + const struct flow_keys_basic *keys, int hlen); __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto, void *data, int hlen_proto); @@ -1200,12 +1201,38 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector, const struct flow_dissector_key *key, unsigned int key_count); +#ifdef CONFIG_NET +int skb_flow_dissector_prog_query(const union bpf_attr *attr, + union bpf_attr __user *uattr); int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog); int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr); +#else +static inline int skb_flow_dissector_prog_query(const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + return -EOPNOTSUPP; +} -bool __skb_flow_dissect(const struct sk_buff *skb, +static inline int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr, + struct bpf_prog *prog) +{ + return -EOPNOTSUPP; +} + +static inline int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr) +{ + return -EOPNOTSUPP; +} +#endif + +struct bpf_flow_dissector; +bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx, + __be16 proto, int nhoff, int hlen, unsigned int flags); + +bool __skb_flow_dissect(const struct net *net, + const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, void *data, __be16 proto, int nhoff, int hlen, @@ -1215,8 +1242,8 @@ static inline bool skb_flow_dissect(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, unsigned int flags) { - return __skb_flow_dissect(skb, flow_dissector, target_container, - NULL, 0, 0, 0, flags); + return __skb_flow_dissect(NULL, skb, flow_dissector, + target_container, NULL, 0, 0, 0, flags); } static inline bool skb_flow_dissect_flow_keys(const struct sk_buff *skb, @@ -1224,17 +1251,19 @@ static inline bool skb_flow_dissect_flow_keys(const struct sk_buff *skb, unsigned int flags) { memset(flow, 0, sizeof(*flow)); - return __skb_flow_dissect(skb, &flow_keys_dissector, flow, - NULL, 0, 0, 0, flags); + return __skb_flow_dissect(NULL, skb, &flow_keys_dissector, + flow, NULL, 0, 0, 0, flags); } -static inline bool skb_flow_dissect_flow_keys_buf(struct flow_keys *flow, - void *data, __be16 proto, - int nhoff, int hlen, - unsigned int flags) +static inline bool +skb_flow_dissect_flow_keys_basic(const struct net *net, + const struct sk_buff *skb, + struct flow_keys_basic *flow, void *data, + __be16 proto, int nhoff, int hlen, + unsigned int flags) { memset(flow, 0, sizeof(*flow)); - return __skb_flow_dissect(NULL, &flow_keys_buf_dissector, flow, + return __skb_flow_dissect(net, skb, &flow_keys_basic_dissector, flow, data, proto, nhoff, hlen, flags); } @@ -2104,6 +2133,14 @@ static inline void skb_set_tail_pointer(struct sk_buff *skb, const int offset) #endif /* NET_SKBUFF_DATA_USES_OFFSET */ +static inline void skb_assert_len(struct sk_buff *skb) +{ +#ifdef CONFIG_DEBUG_NET + if (WARN_ONCE(!skb->len, "%s\n", __func__)) + DO_ONCE_LITE(skb_dump, KERN_ERR, skb, false); +#endif /* CONFIG_DEBUG_NET */ +} + /* * Add data to an sk_buff */ @@ -2449,11 +2486,13 @@ static inline void skb_pop_mac_header(struct sk_buff *skb) static inline void skb_probe_transport_header(struct sk_buff *skb, const int offset_hint) { - struct flow_keys keys; + struct flow_keys_basic keys; if (skb_transport_header_was_set(skb)) return; - else if (skb_flow_dissect_flow_keys(skb, &keys, 0)) + + if (skb_flow_dissect_flow_keys_basic(NULL, skb, &keys, + NULL, 0, 0, 0, 0)) skb_set_transport_header(skb, keys.control.thoff); else if (offset_hint >= 0) skb_set_transport_header(skb, offset_hint); @@ -3568,6 +3607,7 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a, /* Using more efficient varaiant than plain call to memcmp(). */ #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64 u64 diffs = 0; + switch (meta_len) { #define __it(x, op) (x -= sizeof(u##op)) #define __it_diff(a, b, op) (*(u##op *)__it(a, op)) ^ (*(u##op *)__it(b, op)) @@ -3593,8 +3633,10 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a, { u8 len_a = skb_metadata_len(skb_a); u8 len_b = skb_metadata_len(skb_b); + if (!(len_a | len_b)) return false; + return len_a != len_b ? true : __skb_metadata_differs(skb_a, skb_b, len_a); } @@ -3929,100 +3971,6 @@ static inline void nf_conntrack_get(struct nf_conntrack *nfct) atomic_inc(&nfct->use); } #endif - -#ifdef CONFIG_SKB_EXTENSIONS -enum skb_ext_id { -#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) - SKB_EXT_BRIDGE_NF, -#endif - SKB_EXT_NUM, /* must be last */ -}; - -/** - * struct skb_ext - sk_buff extensions - * @refcnt: 1 on allocation, deallocated on 0 - * @offset: offset to add to @data to obtain extension address - * @chunks: size currently allocated, stored in SKB_EXT_ALIGN_SHIFT units - * @data: start of extension data, variable sized - * - * Note: offsets/lengths are stored in chunks of 8 bytes, this allows - * to use 'u8' types while allowing up to 2kb worth of extension data. - */ -struct skb_ext { - refcount_t refcnt; - u8 offset[SKB_EXT_NUM]; /* in chunks of 8 bytes */ - u8 chunks; /* same */ - char data[0] __aligned(8); -}; - -void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id); -void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id); -void __skb_ext_put(struct skb_ext *ext); -static inline void skb_ext_put(struct sk_buff *skb) -{ - if (skb->active_extensions) - __skb_ext_put(skb->extensions); -} - -static inline void skb_ext_get(struct sk_buff *skb) -{ - if (skb->active_extensions) { - struct skb_ext *ext = skb->extensions; - if (ext) - refcount_inc(&ext->refcnt); - } -} - -static inline void __skb_ext_copy(struct sk_buff *dst, - const struct sk_buff *src) -{ - dst->active_extensions = src->active_extensions; - if (src->active_extensions) { - struct skb_ext *ext = src->extensions; - refcount_inc(&ext->refcnt); - dst->extensions = ext; - } -} - -static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *src) -{ - skb_ext_put(dst); - __skb_ext_copy(dst, src); -} - -static inline bool __skb_ext_exist(const struct skb_ext *ext, enum skb_ext_id i) -{ - return !!ext->offset[i]; -} - -static inline bool skb_ext_exist(const struct sk_buff *skb, enum skb_ext_id id) -{ - return skb->active_extensions & (1 << id); -} - -static inline void skb_ext_del(struct sk_buff *skb, enum skb_ext_id id) -{ - if (skb_ext_exist(skb, id)) - __skb_ext_del(skb, id); -} - -static inline void *skb_ext_find(const struct sk_buff *skb, enum skb_ext_id id) -{ - if (skb_ext_exist(skb, id)) { - struct skb_ext *ext = skb->extensions; - return (void *)ext + (ext->offset[id] << 3); - } - return NULL; -} - -#else -static inline void skb_ext_put(struct sk_buff *skb) {} -static inline void skb_ext_get(struct sk_buff *skb) {} -static inline void skb_ext_del(struct sk_buff *skb, int unused) {} -static inline void __skb_ext_copy(struct sk_buff *d, const struct sk_buff *s) {} -static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {} -#endif /* CONFIG_SKB_EXTENSIONS */ - #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge) { @@ -4108,19 +4056,12 @@ static inline void skb_init_secmark(struct sk_buff *skb) { } #endif -static inline int secpath_exists(const struct sk_buff *skb) -{ -#ifdef CONFIG_XFRM - return skb->sp != NULL; -#else - return 0; -#endif -} - static inline bool skb_irq_freeable(const struct sk_buff *skb) { return !skb->destructor && - !secpath_exists(skb) && +#if IS_ENABLED(CONFIG_XFRM) + !skb->sp && +#endif !skb_nfct(skb) && !skb->_skb_refdst && !skb_has_frag_list(skb); @@ -4261,6 +4202,12 @@ static inline bool skb_is_gso_sctp(const struct sk_buff *skb) return skb_shinfo(skb)->gso_type & SKB_GSO_SCTP; } +/* Note: Should be called only if skb_is_gso(skb) is true */ +static inline bool skb_is_gso_tcp(const struct sk_buff *skb) +{ + return skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6); +} + static inline void skb_gso_reset(struct sk_buff *skb) { skb_shinfo(skb)->gso_size = 0; diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index 95678103c4a0..17beb72ab34c 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -14,6 +14,7 @@ #include #define MAX_MSG_FRAGS MAX_SKB_FRAGS +#define NR_MSG_FRAG_IDS (MAX_MSG_FRAGS + 1) enum __sk_action { __SK_DROP = 0, @@ -29,9 +30,16 @@ struct sk_msg_sg { u32 size; u32 copybreak; bool copy[MAX_MSG_FRAGS]; - struct scatterlist data[MAX_MSG_FRAGS]; + /* The extra two elements: + * 1) used for chaining the front and sections when the list becomes + * partitioned (e.g. end < start). The crypto APIs require the + * chaining; + * 2) to chain tailer SG entries after the message. + */ + struct scatterlist data[MAX_MSG_FRAGS + 2]; }; +/* UAPI in filter.c depends on struct sk_msg_sg being first element. */ struct sk_msg { struct sk_msg_sg sg; void *data; @@ -102,6 +110,8 @@ struct sk_psock { int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len, int elem_first_coalesce); +int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src, + u32 off, u32 len); void sk_msg_trim(struct sock *sk, struct sk_msg *msg, int len); int sk_msg_free(struct sock *sk, struct sk_msg *msg); int sk_msg_free_nocharge(struct sock *sk, struct sk_msg *msg); @@ -110,6 +120,7 @@ void sk_msg_free_partial_nocharge(struct sock *sk, struct sk_msg *msg, u32 bytes); void sk_msg_return(struct sock *sk, struct sk_msg *msg, int bytes); +void sk_msg_return_zero(struct sock *sk, struct sk_msg *msg, int bytes); int sk_msg_zerocopy_from_iter(struct sock *sk, struct iov_iter *from, struct sk_msg *msg, u32 bytes); @@ -131,10 +142,15 @@ static inline void sk_msg_apply_bytes(struct sk_psock *psock, u32 bytes) } } +static inline u32 sk_msg_iter_dist(u32 start, u32 end) +{ + return end >= start ? end - start : end + (NR_MSG_FRAG_IDS - start); +} + #define sk_msg_iter_var_prev(var) \ do { \ if (var == 0) \ - var = MAX_MSG_FRAGS - 1; \ + var = NR_MSG_FRAG_IDS - 1; \ else \ var--; \ } while (0) @@ -142,7 +158,7 @@ static inline void sk_msg_apply_bytes(struct sk_psock *psock, u32 bytes) #define sk_msg_iter_var_next(var) \ do { \ var++; \ - if (var == MAX_MSG_FRAGS) \ + if (var == NR_MSG_FRAG_IDS) \ var = 0; \ } while (0) @@ -159,8 +175,9 @@ static inline void sk_msg_clear_meta(struct sk_msg *msg) static inline void sk_msg_init(struct sk_msg *msg) { + BUILD_BUG_ON(ARRAY_SIZE(msg->sg.data) - 1 != NR_MSG_FRAG_IDS); memset(msg, 0, sizeof(*msg)); - sg_init_marker(msg->sg.data, ARRAY_SIZE(msg->sg.data)); + sg_init_marker(msg->sg.data, NR_MSG_FRAG_IDS); } static inline void sk_msg_xfer(struct sk_msg *dst, struct sk_msg *src, @@ -168,20 +185,25 @@ static inline void sk_msg_xfer(struct sk_msg *dst, struct sk_msg *src, { dst->sg.data[which] = src->sg.data[which]; dst->sg.data[which].length = size; + dst->sg.size += size; src->sg.data[which].length -= size; src->sg.data[which].offset += size; } -static inline u32 sk_msg_elem_used(const struct sk_msg *msg) +static inline void sk_msg_xfer_full(struct sk_msg *dst, struct sk_msg *src) { - return msg->sg.end >= msg->sg.start ? - msg->sg.end - msg->sg.start : - msg->sg.end + (MAX_MSG_FRAGS - msg->sg.start); + memcpy(dst, src, sizeof(*src)); + sk_msg_init(src); } static inline bool sk_msg_full(const struct sk_msg *msg) { - return (msg->sg.end == msg->sg.start) && msg->sg.size; + return sk_msg_iter_dist(msg->sg.start, msg->sg.end) == MAX_MSG_FRAGS; +} + +static inline u32 sk_msg_elem_used(const struct sk_msg *msg) +{ + return sk_msg_iter_dist(msg->sg.start, msg->sg.end); } static inline struct scatterlist *sk_msg_elem(struct sk_msg *msg, int which) @@ -189,6 +211,11 @@ static inline struct scatterlist *sk_msg_elem(struct sk_msg *msg, int which) return &msg->sg.data[which]; } +static inline struct scatterlist sk_msg_elem_cpy(struct sk_msg *msg, int which) +{ + return msg->sg.data[which]; +} + static inline struct page *sk_msg_page(struct sk_msg *msg, int which) { return sg_page(sk_msg_elem(msg, which)); @@ -227,6 +254,26 @@ static inline void sk_msg_page_add(struct sk_msg *msg, struct page *page, sk_msg_iter_next(msg, end); } +static inline void sk_msg_sg_copy(struct sk_msg *msg, u32 i, bool copy_state) +{ + do { + msg->sg.copy[i] = copy_state; + sk_msg_iter_var_next(i); + if (i == msg->sg.end) + break; + } while (1); +} + +static inline void sk_msg_sg_copy_set(struct sk_msg *msg, u32 start) +{ + sk_msg_sg_copy(msg, start, true); +} + +static inline void sk_msg_sg_copy_clear(struct sk_msg *msg, u32 start) +{ + sk_msg_sg_copy(msg, start, false); +} + static inline struct sk_psock *sk_psock(const struct sock *sk) { return rcu_dereference_sk_user_data(sk); @@ -243,6 +290,11 @@ static inline void sk_psock_queue_msg(struct sk_psock *psock, list_add_tail(&msg->list, &psock->ingress_msg); } +static inline bool sk_psock_queue_empty(const struct sk_psock *psock) +{ + return psock ? list_empty(&psock->ingress_msg) : true; +} + static inline void sk_psock_report_error(struct sk_psock *psock, int err) { struct sock *sk = psock->sk; @@ -361,6 +413,19 @@ static inline void psock_set_prog(struct bpf_prog **pprog, bpf_prog_put(prog); } +static inline int psock_replace_prog(struct bpf_prog **pprog, + struct bpf_prog *prog, + struct bpf_prog *old) +{ + if (cmpxchg(pprog, old, prog) != old) + return -ENOENT; + + if (old) + bpf_prog_put(old); + + return 0; +} + static inline void psock_progs_drop(struct sk_psock_progs *progs) { psock_set_prog(&progs->msg_parser, NULL); diff --git a/include/linux/socket.h b/include/linux/socket.h index add9360cec7f..a668c219e814 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -207,8 +207,9 @@ struct ucred { * PF_SMC protocol family that * reuses AF_INET address family */ +#define AF_XDP 44 /* XDP sockets */ -#define AF_MAX 44 /* For now.. */ +#define AF_MAX 45 /* For now.. */ /* Protocol families, same as address families. */ #define PF_UNSPEC AF_UNSPEC @@ -257,6 +258,7 @@ struct ucred { #define PF_KCM AF_KCM #define PF_QIPCRTR AF_QIPCRTR #define PF_SMC AF_SMC +#define PF_XDP AF_XDP #define PF_MAX AF_MAX /* Maximum queue length specifiable by listen. */ @@ -338,6 +340,7 @@ struct ucred { #define SOL_NFC 280 #define SOL_KCM 281 #define SOL_TLS 282 +#define SOL_XDP 283 /* IPX options */ #define IPX_TYPE 1 diff --git a/include/linux/stat.h b/include/linux/stat.h index 07295841fccd..528c4baad091 100644 --- a/include/linux/stat.h +++ b/include/linux/stat.h @@ -42,10 +42,10 @@ struct kstat { kuid_t uid; kgid_t gid; loff_t size; - struct timespec atime; - struct timespec mtime; - struct timespec ctime; - struct timespec btime; /* File creation time */ + struct timespec64 atime; + struct timespec64 mtime; + struct timespec64 ctime; + struct timespec64 btime; /* File creation time */ u64 blocks; }; diff --git a/include/linux/stddef.h b/include/linux/stddef.h index 2181719fd907..998a4ba28eba 100644 --- a/include/linux/stddef.h +++ b/include/linux/stddef.h @@ -19,6 +19,14 @@ enum { #define offsetof(TYPE, MEMBER) ((size_t)&((TYPE *)0)->MEMBER) #endif +/** + * sizeof_field(TYPE, MEMBER) + * + * @TYPE: The structure containing the field of interest + * @MEMBER: The field to return the size of + */ +#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER)) + /** * offsetofend(TYPE, MEMBER) * @@ -26,6 +34,6 @@ enum { * @MEMBER: The member within the structure to get the end offset of */ #define offsetofend(TYPE, MEMBER) \ - (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER)) + (offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER)) #endif diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h index 270bad0e1bed..8903f42f6e3e 100644 --- a/include/linux/sunrpc/cache.h +++ b/include/linux/sunrpc/cache.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * include/linux/sunrpc/cache.h * @@ -5,9 +6,6 @@ * used by sunrpc clients and servers. * * Copyright (C) 2002 Neil Brown - * - * Released under terms in GPL version 2. See COPYING. - * */ #ifndef _LINUX_SUNRPC_CACHE_H_ diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index 02c9b18bbbf7..9ba832445964 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -66,6 +66,9 @@ extern int proc_do_large_bitmap(struct ctl_table *, int, extern int proc_douintvec_capacity(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +extern int proc_do_static_key(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); /* * Register a set of sysctl names by calling register_sysctl_table diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 45a85277c2ea..fa76c1113529 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -178,10 +178,16 @@ struct tcp_sock { u32 data_segs_out; /* RFC4898 tcpEStatsPerfDataSegsOut * total number of data segments sent. */ + u64 bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut + * total number of data bytes sent. + */ u64 bytes_acked; /* RFC4898 tcpEStatsAppHCThruOctetsAcked * sum(delta(snd_una)), or how many bytes * were acked. */ + u32 dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups + * total number of DSACK blocks received + */ u32 snd_una; /* First byte we want an ack for */ u32 snd_sml; /* Last byte of the most recently transmitted small packet */ u32 rcv_tstamp; /* timestamp of last received ACK (for keepalives) */ @@ -233,6 +239,8 @@ struct tcp_sock { is_cwnd_limited:1;/* forward progress limited by snd_cwnd? */ u32 tlp_high_seq; /* snd_nxt at the time of TLP */ + u64 tcp_wstamp_ns; /* departure time for next sent data packet */ + /* RTT measurement */ u64 tcp_mstamp; /* most recent packet received/sent */ u32 srtt_us; /* smoothed round trip time << 3 in usecs */ @@ -272,6 +280,7 @@ struct tcp_sock { * receiver in Recovery. */ u32 prr_out; /* Total number of pkts sent during Recovery. */ u32 delivered; /* Total data packets delivered incl. rexmits */ + u32 delivered_ce; /* Like the above but only ECE marked packets */ u32 lost; /* Total data packets lost incl. rexmits */ u32 app_limited; /* limited until "delivered" reaches this val */ u64 first_tx_mstamp; /* start of window send phase */ @@ -319,6 +328,9 @@ struct tcp_sock { * the first SYN. */ u32 undo_marker; /* snd_una upon a new recovery episode. */ int undo_retrans; /* number of undoable retransmissions. */ + u64 bytes_retrans; /* RFC4898 tcpEStatsPerfOctetsRetrans + * Total data bytes retransmitted + */ u32 total_retrans; /* Total retransmits for entire connection */ u32 urg_seq; /* Seq of received urgent pointer */ @@ -327,6 +339,17 @@ struct tcp_sock { int linger2; + +/* Sock_ops bpf program related variables */ +#ifdef CONFIG_BPF + u8 bpf_sock_ops_cb_flags; /* Control calling BPF programs + * values defined in uapi/linux/tcp.h + */ +#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG) +#else +#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0 +#endif + /* Receiver side RTT estimation */ struct { u32 rtt_us; diff --git a/include/linux/time.h b/include/linux/time.h index 21086c5143d9..9f1501f83e42 100644 --- a/include/linux/time.h +++ b/include/linux/time.h @@ -18,149 +18,10 @@ int get_itimerspec64(struct itimerspec64 *it, int put_itimerspec64(const struct itimerspec64 *it, struct itimerspec __user *uit); -#define TIME_T_MAX (time_t)((1UL << ((sizeof(time_t) << 3) - 1)) - 1) - -static inline int timespec_equal(const struct timespec *a, - const struct timespec *b) -{ - return (a->tv_sec == b->tv_sec) && (a->tv_nsec == b->tv_nsec); -} - -/* - * lhs < rhs: return <0 - * lhs == rhs: return 0 - * lhs > rhs: return >0 - */ -static inline int timespec_compare(const struct timespec *lhs, const struct timespec *rhs) -{ - if (lhs->tv_sec < rhs->tv_sec) - return -1; - if (lhs->tv_sec > rhs->tv_sec) - return 1; - return lhs->tv_nsec - rhs->tv_nsec; -} - -static inline int timeval_compare(const struct timeval *lhs, const struct timeval *rhs) -{ - if (lhs->tv_sec < rhs->tv_sec) - return -1; - if (lhs->tv_sec > rhs->tv_sec) - return 1; - return lhs->tv_usec - rhs->tv_usec; -} - extern time64_t mktime64(const unsigned int year, const unsigned int mon, const unsigned int day, const unsigned int hour, const unsigned int min, const unsigned int sec); -/** - * Deprecated. Use mktime64(). - */ -static inline unsigned long mktime(const unsigned int year, - const unsigned int mon, const unsigned int day, - const unsigned int hour, const unsigned int min, - const unsigned int sec) -{ - return mktime64(year, mon, day, hour, min, sec); -} - -extern void set_normalized_timespec(struct timespec *ts, time_t sec, s64 nsec); - -/* - * timespec_add_safe assumes both values are positive and checks - * for overflow. It will return TIME_T_MAX if the reutrn would be - * smaller then either of the arguments. - */ -extern struct timespec timespec_add_safe(const struct timespec lhs, - const struct timespec rhs); - - -static inline struct timespec timespec_add(struct timespec lhs, - struct timespec rhs) -{ - struct timespec ts_delta; - set_normalized_timespec(&ts_delta, lhs.tv_sec + rhs.tv_sec, - lhs.tv_nsec + rhs.tv_nsec); - return ts_delta; -} - -/* - * sub = lhs - rhs, in normalized form - */ -static inline struct timespec timespec_sub(struct timespec lhs, - struct timespec rhs) -{ - struct timespec ts_delta; - set_normalized_timespec(&ts_delta, lhs.tv_sec - rhs.tv_sec, - lhs.tv_nsec - rhs.tv_nsec); - return ts_delta; -} - -/* - * Returns true if the timespec is norm, false if denorm: - */ -static inline bool timespec_valid(const struct timespec *ts) -{ - /* Dates before 1970 are bogus */ - if (ts->tv_sec < 0) - return false; - /* Can't have more nanoseconds then a second */ - if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC) - return false; - return true; -} - -static inline bool timespec_valid_strict(const struct timespec *ts) -{ - if (!timespec_valid(ts)) - return false; - /* Disallow values that could overflow ktime_t */ - if ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX) - return false; - return true; -} - -static inline bool timeval_valid(const struct timeval *tv) -{ - /* Dates before 1970 are bogus */ - if (tv->tv_sec < 0) - return false; - - /* Can't have more microseconds then a second */ - if (tv->tv_usec < 0 || tv->tv_usec >= USEC_PER_SEC) - return false; - - return true; -} - -extern struct timespec timespec_trunc(struct timespec t, unsigned gran); - -/* - * Validates if a timespec/timeval used to inject a time offset is valid. - * Offsets can be postive or negative. The value of the timeval/timespec - * is the sum of its fields, but *NOTE*: the field tv_usec/tv_nsec must - * always be non-negative. - */ -static inline bool timeval_inject_offset_valid(const struct timeval *tv) -{ - /* We don't check the tv_sec as it can be positive or negative */ - - /* Can't have more microseconds then a second */ - if (tv->tv_usec < 0 || tv->tv_usec >= USEC_PER_SEC) - return false; - return true; -} - -static inline bool timespec_inject_offset_valid(const struct timespec *ts) -{ - /* We don't check the tv_sec as it can be positive or negative */ - - /* Can't have more nanoseconds then a second */ - if (ts->tv_nsec < 0 || ts->tv_nsec >= NSEC_PER_SEC) - return false; - return true; -} - /* Some architectures do not supply their own clocksource. * This is mainly the case in architectures that get their * inter-tick times by reading the counter on their interval @@ -209,73 +70,7 @@ struct tm { void time64_to_tm(time64_t totalsecs, int offset, struct tm *result); -/** - * time_to_tm - converts the calendar time to local broken-down time - * - * @totalsecs the number of seconds elapsed since 00:00:00 on January 1, 1970, - * Coordinated Universal Time (UTC). - * @offset offset seconds adding to totalsecs. - * @result pointer to struct tm variable to receive broken-down time - */ -static inline void time_to_tm(time_t totalsecs, int offset, struct tm *result) -{ - time64_to_tm(totalsecs, offset, result); -} - -/** - * timespec_to_ns - Convert timespec to nanoseconds - * @ts: pointer to the timespec variable to be converted - * - * Returns the scalar nanosecond representation of the timespec - * parameter. - */ -static inline s64 timespec_to_ns(const struct timespec *ts) -{ - return ((s64) ts->tv_sec * NSEC_PER_SEC) + ts->tv_nsec; -} - -/** - * timeval_to_ns - Convert timeval to nanoseconds - * @ts: pointer to the timeval variable to be converted - * - * Returns the scalar nanosecond representation of the timeval - * parameter. - */ -static inline s64 timeval_to_ns(const struct timeval *tv) -{ - return ((s64) tv->tv_sec * NSEC_PER_SEC) + - tv->tv_usec * NSEC_PER_USEC; -} - -/** - * ns_to_timespec - Convert nanoseconds to timespec - * @nsec: the nanoseconds value to be converted - * - * Returns the timespec representation of the nsec parameter. - */ -extern struct timespec ns_to_timespec(const s64 nsec); - -/** - * ns_to_timeval - Convert nanoseconds to timeval - * @nsec: the nanoseconds value to be converted - * - * Returns the timeval representation of the nsec parameter. - */ -extern struct timeval ns_to_timeval(const s64 nsec); - -/** - * timespec_add_ns - Adds nanoseconds to a timespec - * @a: pointer to timespec to be incremented - * @ns: unsigned nanoseconds value to be added - * - * This must always be inlined because its used from the x86-64 vdso, - * which cannot call other kernel functions. - */ -static __always_inline void timespec_add_ns(struct timespec *a, u64 ns) -{ - a->tv_sec += __iter_div_u64_rem(a->tv_nsec + ns, NSEC_PER_SEC, &ns); - a->tv_nsec = ns; -} +# include static inline bool itimerspec64_valid(const struct itimerspec64 *its) { diff --git a/include/linux/time32.h b/include/linux/time32.h new file mode 100644 index 000000000000..0b14f936100a --- /dev/null +++ b/include/linux/time32.h @@ -0,0 +1,210 @@ +#ifndef _LINUX_TIME32_H +#define _LINUX_TIME32_H +/* + * These are all interfaces based on the old time_t definition + * that overflows in 2038 on 32-bit architectures. New code + * should use the replacements based on time64_t and timespec64. + * + * Any interfaces in here that become unused as we migrate + * code to time64_t should get removed. + */ + +#include + +#define TIME_T_MAX (time_t)((1UL << ((sizeof(time_t) << 3) - 1)) - 1) + +#if __BITS_PER_LONG == 64 + +/* timespec64 is defined as timespec here */ +static inline struct timespec timespec64_to_timespec(const struct timespec64 ts64) +{ + return *(const struct timespec *)&ts64; +} + +static inline struct timespec64 timespec_to_timespec64(const struct timespec ts) +{ + return *(const struct timespec64 *)&ts; +} + +#else +static inline struct timespec timespec64_to_timespec(const struct timespec64 ts64) +{ + struct timespec ret; + + ret.tv_sec = (time_t)ts64.tv_sec; + ret.tv_nsec = ts64.tv_nsec; + return ret; +} + +static inline struct timespec64 timespec_to_timespec64(const struct timespec ts) +{ + struct timespec64 ret; + + ret.tv_sec = ts.tv_sec; + ret.tv_nsec = ts.tv_nsec; + return ret; +} +#endif + +static inline int timespec_equal(const struct timespec *a, + const struct timespec *b) +{ + return (a->tv_sec == b->tv_sec) && (a->tv_nsec == b->tv_nsec); +} + +/* + * lhs < rhs: return <0 + * lhs == rhs: return 0 + * lhs > rhs: return >0 + */ +static inline int timespec_compare(const struct timespec *lhs, const struct timespec *rhs) +{ + if (lhs->tv_sec < rhs->tv_sec) + return -1; + if (lhs->tv_sec > rhs->tv_sec) + return 1; + return lhs->tv_nsec - rhs->tv_nsec; +} + +extern void set_normalized_timespec(struct timespec *ts, time_t sec, s64 nsec); + +static inline struct timespec timespec_add(struct timespec lhs, + struct timespec rhs) +{ + struct timespec ts_delta; + + set_normalized_timespec(&ts_delta, lhs.tv_sec + rhs.tv_sec, + lhs.tv_nsec + rhs.tv_nsec); + return ts_delta; +} + +/* + * sub = lhs - rhs, in normalized form + */ +static inline struct timespec timespec_sub(struct timespec lhs, + struct timespec rhs) +{ + struct timespec ts_delta; + + set_normalized_timespec(&ts_delta, lhs.tv_sec - rhs.tv_sec, + lhs.tv_nsec - rhs.tv_nsec); + return ts_delta; +} + +/* + * Returns true if the timespec is norm, false if denorm: + */ +static inline bool timespec_valid(const struct timespec *ts) +{ + /* Dates before 1970 are bogus */ + if (ts->tv_sec < 0) + return false; + /* Can't have more nanoseconds then a second */ + if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC) + return false; + return true; +} + +static inline bool timespec_valid_strict(const struct timespec *ts) +{ + if (!timespec_valid(ts)) + return false; + /* Disallow values that could overflow ktime_t */ + if ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX) + return false; + return true; +} + +/** + * timespec_to_ns - Convert timespec to nanoseconds + * @ts: pointer to the timespec variable to be converted + * + * Returns the scalar nanosecond representation of the timespec + * parameter. + */ +static inline s64 timespec_to_ns(const struct timespec *ts) +{ + return ((s64) ts->tv_sec * NSEC_PER_SEC) + ts->tv_nsec; +} + +/** + * ns_to_timespec - Convert nanoseconds to timespec + * @nsec: the nanoseconds value to be converted + * + * Returns the timespec representation of the nsec parameter. + */ +extern struct timespec ns_to_timespec(const s64 nsec); + +/** + * timespec_add_ns - Adds nanoseconds to a timespec + * @a: pointer to timespec to be incremented + * @ns: unsigned nanoseconds value to be added + * + * This must always be inlined because its used from the x86-64 vdso, + * which cannot call other kernel functions. + */ +static __always_inline void timespec_add_ns(struct timespec *a, u64 ns) +{ + a->tv_sec += __iter_div_u64_rem(a->tv_nsec + ns, NSEC_PER_SEC, &ns); + a->tv_nsec = ns; +} + +/** + * time_to_tm - converts the calendar time to local broken-down time + * + * @totalsecs the number of seconds elapsed since 00:00:00 on January 1, 1970, + * Coordinated Universal Time (UTC). + * @offset offset seconds adding to totalsecs. + * @result pointer to struct tm variable to receive broken-down time + */ +static inline void time_to_tm(time_t totalsecs, int offset, struct tm *result) +{ + time64_to_tm(totalsecs, offset, result); +} + +static inline unsigned long mktime(const unsigned int year, + const unsigned int mon, const unsigned int day, + const unsigned int hour, const unsigned int min, + const unsigned int sec) +{ + return mktime64(year, mon, day, hour, min, sec); +} + +static inline bool timeval_valid(const struct timeval *tv) +{ + /* Dates before 1970 are bogus */ + if (tv->tv_sec < 0) + return false; + + /* Can't have more microseconds then a second */ + if (tv->tv_usec < 0 || tv->tv_usec >= USEC_PER_SEC) + return false; + + return true; +} + +extern struct timespec timespec_trunc(struct timespec t, unsigned int gran); + +/** + * timeval_to_ns - Convert timeval to nanoseconds + * @ts: pointer to the timeval variable to be converted + * + * Returns the scalar nanosecond representation of the timeval + * parameter. + */ +static inline s64 timeval_to_ns(const struct timeval *tv) +{ + return ((s64) tv->tv_sec * NSEC_PER_SEC) + + tv->tv_usec * NSEC_PER_USEC; +} + +/** + * ns_to_timeval - Convert nanoseconds to timeval + * @nsec: the nanoseconds value to be converted + * + * Returns the timeval representation of the nsec parameter. + */ +extern struct timeval ns_to_timeval(const s64 nsec); +extern struct __kernel_old_timeval ns_to_kernel_old_timeval(s64 nsec); + +#endif diff --git a/include/linux/time64.h b/include/linux/time64.h index 99ab4a686c30..fbb7fd156e4b 100644 --- a/include/linux/time64.h +++ b/include/linux/time64.h @@ -2,20 +2,20 @@ #ifndef _LINUX_TIME64_H #define _LINUX_TIME64_H -#include #include typedef __s64 time64_t; typedef __u64 timeu64_t; -/* - * This wants to go into uapi/linux/time.h once we agreed about the - * userspace interfaces. +/* CONFIG_64BIT_TIME enables new 64 bit time_t syscalls in the compat path + * and 32-bit emulation. */ -#if __BITS_PER_LONG == 64 -# define timespec64 timespec -#define itimerspec64 itimerspec -#else +#ifndef CONFIG_64BIT_TIME +#define __kernel_timespec timespec +#endif + +#include + struct timespec64 { time64_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ @@ -26,8 +26,6 @@ struct itimerspec64 { struct timespec64 it_value; }; -#endif - /* Parameters used to convert the timespec values: */ #define MSEC_PER_SEC 1000L #define USEC_PER_MSEC 1000L @@ -42,77 +40,6 @@ struct itimerspec64 { #define KTIME_MAX ((s64)~((u64)1 << 63)) #define KTIME_SEC_MAX (KTIME_MAX / NSEC_PER_SEC) -#if __BITS_PER_LONG == 64 - -static inline struct timespec timespec64_to_timespec(const struct timespec64 ts64) -{ - return ts64; -} - -static inline struct timespec64 timespec_to_timespec64(const struct timespec ts) -{ - return ts; -} - -static inline struct itimerspec itimerspec64_to_itimerspec(struct itimerspec64 *its64) -{ - return *its64; -} - -static inline struct itimerspec64 itimerspec_to_itimerspec64(struct itimerspec *its) -{ - return *its; -} - -# define timespec64_equal timespec_equal -# define timespec64_compare timespec_compare -# define set_normalized_timespec64 set_normalized_timespec -# define timespec64_add timespec_add -# define timespec64_sub timespec_sub -# define timespec64_valid timespec_valid -# define timespec64_valid_strict timespec_valid_strict -# define timespec64_to_ns timespec_to_ns -# define ns_to_timespec64 ns_to_timespec -# define timespec64_add_ns timespec_add_ns - -#else - -static inline struct timespec timespec64_to_timespec(const struct timespec64 ts64) -{ - struct timespec ret; - - ret.tv_sec = (time_t)ts64.tv_sec; - ret.tv_nsec = ts64.tv_nsec; - return ret; -} - -static inline struct timespec64 timespec_to_timespec64(const struct timespec ts) -{ - struct timespec64 ret; - - ret.tv_sec = ts.tv_sec; - ret.tv_nsec = ts.tv_nsec; - return ret; -} - -static inline struct itimerspec itimerspec64_to_itimerspec(struct itimerspec64 *its64) -{ - struct itimerspec ret; - - ret.it_interval = timespec64_to_timespec(its64->it_interval); - ret.it_value = timespec64_to_timespec(its64->it_value); - return ret; -} - -static inline struct itimerspec64 itimerspec_to_itimerspec64(struct itimerspec *its) -{ - struct itimerspec64 ret; - - ret.it_interval = timespec_to_timespec64(its->it_interval); - ret.it_value = timespec_to_timespec64(its->it_value); - return ret; -} - static inline int timespec64_equal(const struct timespec64 *a, const struct timespec64 *b) { @@ -218,8 +145,6 @@ static __always_inline void timespec64_add_ns(struct timespec64 *a, u64 ns) a->tv_nsec = ns; } -#endif - /* * timespec64_add_safe assumes both values are positive and checks for * overflow. It will return TIME64_MAX in case of overflow. diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h index 97154c61e5d2..7e9011101cb0 100644 --- a/include/linux/timekeeper_internal.h +++ b/include/linux/timekeeper_internal.h @@ -14,19 +14,22 @@ /** * struct tk_read_base - base structure for timekeeping readout * @clock: Current clocksource used for timekeeping. - * @read: Read function of @clock * @mask: Bitmask for two's complement subtraction of non 64bit clocks * @cycle_last: @clock cycle value at last update * @mult: (NTP adjusted) multiplier for scaled math conversion * @shift: Shift value for scaled math conversion * @xtime_nsec: Shifted (fractional) nano seconds offset for readout * @base: ktime_t (nanoseconds) base time for readout + * @base_real: Nanoseconds base value for clock REALTIME readout * * This struct has size 56 byte on 64 bit. Together with a seqcount it * occupies a single 64byte cache line. * * The struct is separate from struct timekeeper as it is also used * for a fast NMI safe accessors. + * + * @base_real is for the fast NMI safe accessor to allow reading clock + * realtime from any context. */ struct tk_read_base { struct clocksource *clock; @@ -36,6 +39,7 @@ struct tk_read_base { u32 shift; u64 xtime_nsec; ktime_t base; + u64 base_real; }; /** diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h index 0021575fe871..acab1e03bb1f 100644 --- a/include/linux/timekeeping.h +++ b/include/linux/timekeeping.h @@ -16,150 +16,27 @@ extern void xtime_update(unsigned long ticks); /* * Get and set timeofday */ -extern void do_gettimeofday(struct timeval *tv); extern int do_settimeofday64(const struct timespec64 *ts); extern int do_sys_settimeofday64(const struct timespec64 *tv, const struct timezone *tz); -/* - * Kernel time accessors - */ -unsigned long get_seconds(void); -struct timespec64 current_kernel_time64(void); -/* does not take xtime_lock */ -struct timespec __current_kernel_time(void); - -static inline struct timespec current_kernel_time(void) -{ - struct timespec64 now = current_kernel_time64(); - - return timespec64_to_timespec(now); -} /* - * timespec based interfaces + * timespec64 based interfaces */ -struct timespec64 get_monotonic_coarse64(void); -extern void getrawmonotonic64(struct timespec64 *ts); +extern void ktime_get_raw_ts64(struct timespec64 *ts); extern void ktime_get_ts64(struct timespec64 *ts); +extern void ktime_get_real_ts64(struct timespec64 *tv); +extern void ktime_get_coarse_ts64(struct timespec64 *ts); +extern void ktime_get_coarse_real_ts64(struct timespec64 *ts); + +void getboottime64(struct timespec64 *ts); + +/* + * time64_t base interfaces + */ extern time64_t ktime_get_seconds(void); extern time64_t ktime_get_real_seconds(void); -extern int __getnstimeofday64(struct timespec64 *tv); -extern void getnstimeofday64(struct timespec64 *tv); -extern void getboottime64(struct timespec64 *ts); - -#if BITS_PER_LONG == 64 -/** - * Deprecated. Use do_settimeofday64(). - */ -static inline int do_settimeofday(const struct timespec *ts) -{ - return do_settimeofday64(ts); -} - -static inline int __getnstimeofday(struct timespec *ts) -{ - return __getnstimeofday64(ts); -} - -static inline void getnstimeofday(struct timespec *ts) -{ - getnstimeofday64(ts); -} - -static inline void ktime_get_ts(struct timespec *ts) -{ - ktime_get_ts64(ts); -} - -static inline void ktime_get_real_ts(struct timespec *ts) -{ - getnstimeofday64(ts); -} - -static inline void getrawmonotonic(struct timespec *ts) -{ - getrawmonotonic64(ts); -} - -static inline struct timespec get_monotonic_coarse(void) -{ - return get_monotonic_coarse64(); -} - -static inline void getboottime(struct timespec *ts) -{ - return getboottime64(ts); -} -#else -/** - * Deprecated. Use do_settimeofday64(). - */ -static inline int do_settimeofday(const struct timespec *ts) -{ - struct timespec64 ts64; - - ts64 = timespec_to_timespec64(*ts); - return do_settimeofday64(&ts64); -} - -static inline int __getnstimeofday(struct timespec *ts) -{ - struct timespec64 ts64; - int ret = __getnstimeofday64(&ts64); - - *ts = timespec64_to_timespec(ts64); - return ret; -} - -static inline void getnstimeofday(struct timespec *ts) -{ - struct timespec64 ts64; - - getnstimeofday64(&ts64); - *ts = timespec64_to_timespec(ts64); -} - -static inline void ktime_get_ts(struct timespec *ts) -{ - struct timespec64 ts64; - - ktime_get_ts64(&ts64); - *ts = timespec64_to_timespec(ts64); -} - -static inline void ktime_get_real_ts(struct timespec *ts) -{ - struct timespec64 ts64; - - getnstimeofday64(&ts64); - *ts = timespec64_to_timespec(ts64); -} - -static inline void getrawmonotonic(struct timespec *ts) -{ - struct timespec64 ts64; - - getrawmonotonic64(&ts64); - *ts = timespec64_to_timespec(ts64); -} - -static inline struct timespec get_monotonic_coarse(void) -{ - return timespec64_to_timespec(get_monotonic_coarse64()); -} - -static inline void getboottime(struct timespec *ts) -{ - struct timespec64 ts64; - - getboottime64(&ts64); - *ts = timespec64_to_timespec(ts64); -} -#endif - -#define ktime_get_real_ts64(ts) getnstimeofday64(ts) - /* * ktime_t based interfaces */ @@ -173,6 +50,7 @@ enum tk_offsets { extern ktime_t ktime_get(void); extern ktime_t ktime_get_with_offset(enum tk_offsets offs); +extern ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs); extern ktime_t ktime_mono_to_any(ktime_t tmono, enum tk_offsets offs); extern ktime_t ktime_get_raw(void); extern u32 ktime_get_resolution_ns(void); @@ -185,6 +63,11 @@ static inline ktime_t ktime_get_real(void) return ktime_get_with_offset(TK_OFFS_REAL); } +static inline ktime_t ktime_get_coarse_real(void) +{ + return ktime_get_coarse_with_offset(TK_OFFS_REAL); +} + /** * ktime_get_boottime - Returns monotonic time since boot in ktime_t format * @@ -196,6 +79,11 @@ static inline ktime_t ktime_get_boottime(void) return ktime_get_with_offset(TK_OFFS_BOOT); } +static inline ktime_t ktime_get_coarse_boottime(void) +{ + return ktime_get_coarse_with_offset(TK_OFFS_BOOT); +} + /** * ktime_get_clocktai - Returns the TAI time of day in ktime_t format */ @@ -204,6 +92,11 @@ static inline ktime_t ktime_get_clocktai(void) return ktime_get_with_offset(TK_OFFS_TAI); } +static inline ktime_t ktime_get_coarse_clocktai(void) +{ + return ktime_get_coarse_with_offset(TK_OFFS_TAI); +} + /** * ktime_mono_to_real - Convert monotonic time to clock realtime */ @@ -222,12 +115,12 @@ static inline u64 ktime_get_real_ns(void) return ktime_to_ns(ktime_get_real()); } -static inline u64 ktime_get_boot_ns(void) +static inline u64 ktime_get_boottime_ns(void) { return ktime_to_ns(ktime_get_boottime()); } -static inline u64 ktime_get_tai_ns(void) +static inline u64 ktime_get_clocktai_ns(void) { return ktime_to_ns(ktime_get_clocktai()); } @@ -240,26 +133,17 @@ static inline u64 ktime_get_raw_ns(void) extern u64 ktime_get_mono_fast_ns(void); extern u64 ktime_get_raw_fast_ns(void); extern u64 ktime_get_boot_fast_ns(void); +extern u64 ktime_get_real_fast_ns(void); /* - * Timespec interfaces utilizing the ktime based ones + * timespec64 interfaces utilizing the ktime based ones */ -static inline void get_monotonic_boottime(struct timespec *ts) -{ - *ts = ktime_to_timespec(ktime_get_boottime()); -} - -static inline void get_monotonic_boottime64(struct timespec64 *ts) +static inline void ktime_get_boottime_ts64(struct timespec64 *ts) { *ts = ktime_to_timespec64(ktime_get_boottime()); } -static inline void timekeeping_clocktai(struct timespec *ts) -{ - *ts = ktime_to_timespec(ktime_get_clocktai()); -} - -static inline void timekeeping_clocktai64(struct timespec64 *ts) +static inline void ktime_get_clocktai_ts64(struct timespec64 *ts) { *ts = ktime_to_timespec64(ktime_get_clocktai()); } @@ -341,11 +225,34 @@ extern void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot); */ extern int persistent_clock_is_local; -extern void read_persistent_clock(struct timespec *ts); extern void read_persistent_clock64(struct timespec64 *ts); extern void read_boot_clock64(struct timespec64 *ts); -extern int update_persistent_clock(struct timespec now); extern int update_persistent_clock64(struct timespec64 now); +/* + * deprecated aliases, don't use in new code + */ +#define getnstimeofday64(ts) ktime_get_real_ts64(ts) +#define get_monotonic_boottime64(ts) ktime_get_boottime_ts64(ts) +#define getrawmonotonic64(ts) ktime_get_raw_ts64(ts) +#define timekeeping_clocktai64(ts) ktime_get_clocktai_ts64(ts) + +static inline struct timespec64 current_kernel_time64(void) +{ + struct timespec64 ts; + + ktime_get_coarse_real_ts64(&ts); + + return ts; +} + +static inline struct timespec64 get_monotonic_coarse64(void) +{ + struct timespec64 ts; + + ktime_get_coarse_ts64(&ts); + + return ts; +} #endif diff --git a/include/linux/timekeeping32.h b/include/linux/timekeeping32.h new file mode 100644 index 000000000000..8762c2f45f8b --- /dev/null +++ b/include/linux/timekeeping32.h @@ -0,0 +1,100 @@ +#ifndef _LINUX_TIMEKEEPING32_H +#define _LINUX_TIMEKEEPING32_H +/* + * These interfaces are all based on the old timespec type + * and should get replaced with the timespec64 based versions + * over time so we can remove the file here. + */ + +extern void do_gettimeofday(struct timeval *tv); +unsigned long get_seconds(void); + +static inline struct timespec current_kernel_time(void) +{ + struct timespec64 ts64; + + ktime_get_coarse_real_ts64(&ts64); + + return timespec64_to_timespec(ts64); +} + +/** + * Deprecated. Use do_settimeofday64(). + */ +static inline int do_settimeofday(const struct timespec *ts) +{ + struct timespec64 ts64; + + ts64 = timespec_to_timespec64(*ts); + return do_settimeofday64(&ts64); +} + +static inline void getnstimeofday(struct timespec *ts) +{ + struct timespec64 ts64; + + ktime_get_real_ts64(&ts64); + *ts = timespec64_to_timespec(ts64); +} + +static inline void ktime_get_ts(struct timespec *ts) +{ + struct timespec64 ts64; + + ktime_get_ts64(&ts64); + *ts = timespec64_to_timespec(ts64); +} + +static inline void ktime_get_real_ts(struct timespec *ts) +{ + struct timespec64 ts64; + + ktime_get_real_ts64(&ts64); + *ts = timespec64_to_timespec(ts64); +} + +static inline void getrawmonotonic(struct timespec *ts) +{ + struct timespec64 ts64; + + ktime_get_raw_ts64(&ts64); + *ts = timespec64_to_timespec(ts64); +} + +static inline struct timespec get_monotonic_coarse(void) +{ + struct timespec64 ts64; + + ktime_get_coarse_ts64(&ts64); + + return timespec64_to_timespec(ts64); +} + +static inline void getboottime(struct timespec *ts) +{ + struct timespec64 ts64; + + getboottime64(&ts64); + *ts = timespec64_to_timespec(ts64); +} + +/* + * Timespec interfaces utilizing the ktime based ones + */ +static inline void get_monotonic_boottime(struct timespec *ts) +{ + *ts = ktime_to_timespec(ktime_get_boottime()); +} + +static inline void timekeeping_clocktai(struct timespec *ts) +{ + *ts = ktime_to_timespec(ktime_get_clocktai()); +} + +/* + * Persistent clock related interfaces + */ +extern void read_persistent_clock(struct timespec *ts); +extern int update_persistent_clock(struct timespec now); + +#endif diff --git a/include/linux/tnum.h b/include/linux/tnum.h index 0d2d3da46139..06b9c20cc77e 100644 --- a/include/linux/tnum.h +++ b/include/linux/tnum.h @@ -23,8 +23,10 @@ struct tnum tnum_range(u64 min, u64 max); /* Arithmetic and logical ops */ /* Shift a tnum left (by a fixed shift) */ struct tnum tnum_lshift(struct tnum a, u8 shift); -/* Shift a tnum right (by a fixed shift) */ +/* Shift (rsh) a tnum right (by a fixed shift) */ struct tnum tnum_rshift(struct tnum a, u8 shift); +/* Shift (arsh) a tnum right (by a fixed min_shift) */ +struct tnum tnum_arshift(struct tnum a, u8 min_shift, u8 insn_bitness); /* Add two tnums, return @a + @b */ struct tnum tnum_add(struct tnum a, struct tnum b); /* Subtract two tnums, return @a - @b */ diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 6bbdef44d811..173caf91cdbe 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -467,7 +467,11 @@ void perf_event_detach_bpf_prog(struct perf_event *event); int perf_event_query_prog_array(struct perf_event *event, void __user *info); int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog); int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog); -struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name); +struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name); +void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp); +int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, + u32 *fd_type, const char **buf, + u64 *probe_offset, u64 *probe_addr); #else static inline unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) { @@ -495,10 +499,20 @@ static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf { return -EOPNOTSUPP; } -static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name) +static inline struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name) { return NULL; } +static inline void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp) +{ +} +static inline int bpf_get_perf_event_info(const struct perf_event *event, + u32 *prog_id, u32 *fd_type, + const char **buf, u64 *probe_offset, + u64 *probe_addr) +{ + return -EOPNOTSUPP; +} #endif enum { @@ -546,11 +560,27 @@ do { \ struct perf_event; DECLARE_PER_CPU(struct pt_regs, perf_trace_regs); +DECLARE_PER_CPU(int, bpf_kprobe_override); extern int perf_trace_init(struct perf_event *event); extern void perf_trace_destroy(struct perf_event *event); extern int perf_trace_add(struct perf_event *event, int flags); extern void perf_trace_del(struct perf_event *event, int flags); +#ifdef CONFIG_KPROBE_EVENTS +extern int perf_kprobe_init(struct perf_event *event, bool is_retprobe); +extern void perf_kprobe_destroy(struct perf_event *event); +extern int bpf_get_kprobe_info(const struct perf_event *event, + u32 *fd_type, const char **symbol, + u64 *probe_offset, u64 *probe_addr, + bool perf_type_tracepoint); +#endif +#ifdef CONFIG_UPROBE_EVENTS +extern int perf_uprobe_init(struct perf_event *event, bool is_retprobe); +extern void perf_uprobe_destroy(struct perf_event *event); +extern int bpf_get_uprobe_info(const struct perf_event *event, + u32 *fd_type, const char **filename, + u64 *probe_offset, bool perf_type_tracepoint); +#endif extern int ftrace_profile_set_filter(struct perf_event *event, int event_id, char *filter_str); extern void ftrace_profile_free_filter(struct perf_event *event); @@ -560,30 +590,30 @@ void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp); void bpf_trace_run1(struct bpf_prog *prog, u64 arg1); void bpf_trace_run2(struct bpf_prog *prog, u64 arg1, u64 arg2); void bpf_trace_run3(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3); + u64 arg3); void bpf_trace_run4(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3, u64 arg4); + u64 arg3, u64 arg4); void bpf_trace_run5(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3, u64 arg4, u64 arg5); + u64 arg3, u64 arg4, u64 arg5); void bpf_trace_run6(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3, u64 arg4, u64 arg5, u64 arg6); + u64 arg3, u64 arg4, u64 arg5, u64 arg6); void bpf_trace_run7(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7); + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7); void bpf_trace_run8(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, - u64 arg8); + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, + u64 arg8); void bpf_trace_run9(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, - u64 arg8, u64 arg9); + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, + u64 arg8, u64 arg9); void bpf_trace_run10(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, - u64 arg8, u64 arg9, u64 arg10); + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, + u64 arg8, u64 arg9, u64 arg10); void bpf_trace_run11(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, - u64 arg8, u64 arg9, u64 arg10, u64 arg11); + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, + u64 arg8, u64 arg9, u64 arg10, u64 arg11); void bpf_trace_run12(struct bpf_prog *prog, u64 arg1, u64 arg2, - u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, - u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12); + u64 arg3, u64 arg4, u64 arg5, u64 arg6, u64 arg7, + u64 arg8, u64 arg9, u64 arg10, u64 arg11, u64 arg12); void perf_trace_run_bpf_submit(void *raw_data, int size, int rctx, struct trace_event_call *call, u64 count, struct pt_regs *regs, struct hlist_head *head, diff --git a/include/linux/tracepoint-defs.h b/include/linux/tracepoint-defs.h index 1c752459aac4..ad6ce90b8ad3 100644 --- a/include/linux/tracepoint-defs.h +++ b/include/linux/tracepoint-defs.h @@ -36,9 +36,9 @@ struct tracepoint { }; struct bpf_raw_event_map { - struct tracepoint *tp; - void *bpf_func; - u32 num_args; + struct tracepoint *tp; + void *bpf_func; + u32 num_args; u32 writable_size; } __aligned(32); diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index 06cb9184a42a..f2ebb94c8a71 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -39,7 +39,17 @@ extern int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data, int prio); extern int +tracepoint_probe_register_prio_may_exist(struct tracepoint *tp, void *probe, void *data, + int prio); +extern int tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data); +static inline int +tracepoint_probe_register_may_exist(struct tracepoint *tp, void *probe, + void *data) +{ + return tracepoint_probe_register_prio_may_exist(tp, probe, data, + TRACEPOINT_DEFAULT_PRIO); +} extern void for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv), void *priv); diff --git a/include/linux/umh.h b/include/linux/umh.h index 244aff638220..0c08de356d0d 100644 --- a/include/linux/umh.h +++ b/include/linux/umh.h @@ -22,8 +22,10 @@ struct subprocess_info { const char *path; char **argv; char **envp; + struct file *file; int wait; int retval; + pid_t pid; int (*init)(struct subprocess_info *info, struct cred *new); void (*cleanup)(struct subprocess_info *info); void *data; @@ -38,6 +40,19 @@ call_usermodehelper_setup(const char *path, char **argv, char **envp, int (*init)(struct subprocess_info *info, struct cred *new), void (*cleanup)(struct subprocess_info *), void *data); +struct subprocess_info *call_usermodehelper_setup_file(struct file *file, + int (*init)(struct subprocess_info *info, struct cred *new), + void (*cleanup)(struct subprocess_info *), void *data); +struct umh_info { + const char *cmdline; + struct file *pipe_to_umh; + struct file *pipe_from_umh; + struct list_head list; + void (*cleanup)(struct umh_info *info); + pid_t pid; +}; +int fork_usermode_blob(void *data, size_t len, struct umh_info *info); + extern int call_usermodehelper_exec(struct subprocess_info *info, int wait); diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 7517dd15f87b..6047058d6703 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -101,7 +101,7 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, * probe and drop if does not match one of the above types. */ if (gso_type && skb->network_header) { - struct flow_keys keys; + struct flow_keys_basic keys; if (!skb->protocol) { __be16 protocol = dev_parse_header_protocol(skb); @@ -114,7 +114,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, skb->protocol = protocol; } retry: - if (!skb_flow_dissect_flow_keys(skb, &keys, 0)) { + if (!skb_flow_dissect_flow_keys_basic(NULL, skb, &keys, + NULL, 0, 0, 0, + 0)) { /* UFO does not specify ipv4 or 6: try both */ if (gso_type & SKB_GSO_UDP && skb->protocol == htons(ETH_P_IP)) { diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index e40bf8b2a83a..53cad7dca9bd 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -23,6 +23,11 @@ struct notifier_block; /* in notifier.h */ #define VM_KASAN 0x00000080 /* has allocated kasan shadow memory */ #define VM_LOWMEM 0x00000100 /* Tracking of direct mapped lowmem */ +/* + * Memory with VM_FLUSH_RESET_PERMS cannot be freed in an interrupt or with + * vfree_atomic(). + */ +#define VM_FLUSH_RESET_PERMS 0x00000120 /* Reset direct map and flush TLB on unmap */ /* bits [20..32] reserved for arch specific ioremap internals */ /* @@ -147,6 +152,13 @@ extern int map_kernel_range_noflush(unsigned long start, unsigned long size, pgprot_t prot, struct page **pages); extern void unmap_kernel_range_noflush(unsigned long addr, unsigned long size); extern void unmap_kernel_range(unsigned long addr, unsigned long size); +static inline void set_vm_flush_reset_perms(void *addr) +{ + struct vm_struct *vm = find_vm_area(addr); + + if (vm) + vm->flags |= VM_FLUSH_RESET_PERMS; +} #else static inline int map_kernel_range_noflush(unsigned long start, unsigned long size, @@ -162,6 +174,9 @@ static inline void unmap_kernel_range(unsigned long addr, unsigned long size) { } +static inline void set_vm_flush_reset_perms(void *addr) +{ +} #endif /* Allocate/destroy a 'vmalloc' VM area. */ diff --git a/include/media/cec.h b/include/media/cec.h index b7339cc6fd3d..ca7ecd199aa7 100644 --- a/include/media/cec.h +++ b/include/media/cec.h @@ -191,11 +191,6 @@ struct cec_adapter { u32 tx_timeouts; -#ifdef CONFIG_MEDIA_CEC_RC - bool rc_repeating; - int rc_last_scancode; - u64 rc_last_keypress; -#endif #ifdef CONFIG_CEC_NOTIFIER struct cec_notifier *notifier; #endif diff --git a/include/media/lirc.h b/include/media/lirc.h deleted file mode 100644 index 554988c860c1..000000000000 --- a/include/media/lirc.h +++ /dev/null @@ -1 +0,0 @@ -#include diff --git a/include/media/lirc_dev.h b/include/media/lirc_dev.h deleted file mode 100644 index 86d15a9b6c01..000000000000 --- a/include/media/lirc_dev.h +++ /dev/null @@ -1,206 +0,0 @@ -/* - * LIRC base driver - * - * by Artur Lipowski - * This code is licensed under GNU GPL - * - */ - -#ifndef _LINUX_LIRC_DEV_H -#define _LINUX_LIRC_DEV_H - -#define MAX_IRCTL_DEVICES 8 -#define BUFLEN 16 - -#include -#include -#include -#include -#include -#include - -struct lirc_buffer { - wait_queue_head_t wait_poll; - spinlock_t fifo_lock; - unsigned int chunk_size; - unsigned int size; /* in chunks */ - /* Using chunks instead of bytes pretends to simplify boundary checking - * And should allow for some performance fine tunning later */ - struct kfifo fifo; -}; - -static inline void lirc_buffer_clear(struct lirc_buffer *buf) -{ - unsigned long flags; - - if (kfifo_initialized(&buf->fifo)) { - spin_lock_irqsave(&buf->fifo_lock, flags); - kfifo_reset(&buf->fifo); - spin_unlock_irqrestore(&buf->fifo_lock, flags); - } else - WARN(1, "calling %s on an uninitialized lirc_buffer\n", - __func__); -} - -static inline int lirc_buffer_init(struct lirc_buffer *buf, - unsigned int chunk_size, - unsigned int size) -{ - int ret; - - init_waitqueue_head(&buf->wait_poll); - spin_lock_init(&buf->fifo_lock); - buf->chunk_size = chunk_size; - buf->size = size; - ret = kfifo_alloc(&buf->fifo, size * chunk_size, GFP_KERNEL); - - return ret; -} - -static inline void lirc_buffer_free(struct lirc_buffer *buf) -{ - if (kfifo_initialized(&buf->fifo)) { - kfifo_free(&buf->fifo); - } else - WARN(1, "calling %s on an uninitialized lirc_buffer\n", - __func__); -} - -static inline int lirc_buffer_len(struct lirc_buffer *buf) -{ - int len; - unsigned long flags; - - spin_lock_irqsave(&buf->fifo_lock, flags); - len = kfifo_len(&buf->fifo); - spin_unlock_irqrestore(&buf->fifo_lock, flags); - - return len; -} - -static inline int lirc_buffer_full(struct lirc_buffer *buf) -{ - return lirc_buffer_len(buf) == buf->size * buf->chunk_size; -} - -static inline int lirc_buffer_empty(struct lirc_buffer *buf) -{ - return !lirc_buffer_len(buf); -} - -static inline unsigned int lirc_buffer_read(struct lirc_buffer *buf, - unsigned char *dest) -{ - unsigned int ret = 0; - - if (lirc_buffer_len(buf) >= buf->chunk_size) - ret = kfifo_out_locked(&buf->fifo, dest, buf->chunk_size, - &buf->fifo_lock); - return ret; - -} - -static inline unsigned int lirc_buffer_write(struct lirc_buffer *buf, - unsigned char *orig) -{ - unsigned int ret; - - ret = kfifo_in_locked(&buf->fifo, orig, buf->chunk_size, - &buf->fifo_lock); - - return ret; -} - -/** - * struct lirc_driver - Defines the parameters on a LIRC driver - * - * @name: this string will be used for logs - * - * @minor: indicates minor device (/dev/lirc) number for - * registered driver if caller fills it with negative - * value, then the first free minor number will be used - * (if available). - * - * @code_length: length of the remote control key code expressed in bits. - * - * @buffer_size: Number of FIFO buffers with @chunk_size size. If zero, - * creates a buffer with BUFLEN size (16 bytes). - * - * @features: lirc compatible hardware features, like LIRC_MODE_RAW, - * LIRC_CAN\_\*, as defined at include/media/lirc.h. - * - * @chunk_size: Size of each FIFO buffer. - * - * @data: it may point to any driver data and this pointer will - * be passed to all callback functions. - * - * @min_timeout: Minimum timeout for record. Valid only if - * LIRC_CAN_SET_REC_TIMEOUT is defined. - * - * @max_timeout: Maximum timeout for record. Valid only if - * LIRC_CAN_SET_REC_TIMEOUT is defined. - * - * @rbuf: if not NULL, it will be used as a read buffer, you will - * have to write to the buffer by other means, like irq's - * (see also lirc_serial.c). - * - * @rdev: Pointed to struct rc_dev associated with the LIRC - * device. - * - * @fops: file_operations for drivers which don't fit the current - * driver model. - * Some ioctl's can be directly handled by lirc_dev if the - * driver's ioctl function is NULL or if it returns - * -ENOIOCTLCMD (see also lirc_serial.c). - * - * @dev: pointer to the struct device associated with the LIRC - * device. - * - * @owner: the module owning this struct - */ -struct lirc_driver { - char name[40]; - int minor; - __u32 code_length; - unsigned int buffer_size; /* in chunks holding one code each */ - __u32 features; - - unsigned int chunk_size; - - void *data; - int min_timeout; - int max_timeout; - struct lirc_buffer *rbuf; - struct rc_dev *rdev; - const struct file_operations *fops; - struct device *dev; - struct module *owner; -}; - -/* following functions can be called ONLY from user context - * - * returns negative value on error or minor number - * of the registered device if success - * contents of the structure pointed by p is copied - */ -extern int lirc_register_driver(struct lirc_driver *d); - -/* returns negative value on error or 0 if success -*/ -extern int lirc_unregister_driver(int minor); - -/* Returns the private data stored in the lirc_driver - * associated with the given device file pointer. - */ -void *lirc_get_pdata(struct file *file); - -/* default file operations - * used by drivers if they override only some operations - */ -int lirc_dev_fop_open(struct inode *inode, struct file *file); -int lirc_dev_fop_close(struct inode *inode, struct file *file); -unsigned int lirc_dev_fop_poll(struct file *file, poll_table *wait); -long lirc_dev_fop_ioctl(struct file *file, unsigned int cmd, unsigned long arg); -ssize_t lirc_dev_fop_read(struct file *file, char __user *buffer, size_t length, - loff_t *ppos); -#endif diff --git a/include/media/rc-core.h b/include/media/rc-core.h index 314a1edb6189..61571773a98d 100644 --- a/include/media/rc-core.h +++ b/include/media/rc-core.h @@ -17,22 +17,16 @@ #define _RC_CORE #include +#include #include #include #include #include -extern int rc_core_debug; -#define IR_dprintk(level, fmt, ...) \ -do { \ - if (rc_core_debug >= level) \ - printk(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__); \ -} while (0) - /** - * enum rc_driver_type - type of the RC output + * enum rc_driver_type - type of the RC driver. * - * @RC_DRIVER_SCANCODE: Driver or hardware generates a scancode + * @RC_DRIVER_SCANCODE: Driver or hardware generates a scancode. * @RC_DRIVER_IR_RAW: Driver or hardware generates pulse/space sequences. * It needs a Infra-Red pulse/space decoder * @RC_DRIVER_IR_RAW_TX: Device transmitter only, @@ -67,6 +61,33 @@ enum rc_filter_type { RC_FILTER_MAX }; +/** + * struct lirc_fh - represents an open lirc file + * @list: list of open file handles + * @rc: rcdev for this lirc chardev + * @carrier_low: when setting the carrier range, first the low end must be + * set with an ioctl and then the high end with another ioctl + * @send_timeout_reports: report timeouts in lirc raw IR. + * @rawir: queue for incoming raw IR + * @scancodes: queue for incoming decoded scancodes + * @wait_poll: poll struct for lirc device + * @send_mode: lirc mode for sending, either LIRC_MODE_SCANCODE or + * LIRC_MODE_PULSE + * @rec_mode: lirc mode for receiving, either LIRC_MODE_SCANCODE or + * LIRC_MODE_MODE2 + */ +struct lirc_fh { + struct list_head list; + struct rc_dev *rc; + int carrier_low; + bool send_timeout_reports; + DECLARE_KFIFO_PTR(rawir, unsigned int); + DECLARE_KFIFO_PTR(scancodes, struct lirc_scancode); + wait_queue_head_t wait_poll; + u8 send_mode; + u8 rec_mode; +}; + /** * struct rc_dev - represents a remote control device * @dev: driver model's view of this device @@ -106,6 +127,8 @@ enum rc_filter_type { * @keypressed: whether a key is currently pressed * @keyup_jiffies: time (in jiffies) when the current keypress should be released * @timer_keyup: timer for releasing a keypress + * @timer_repeat: timer for autorepeat events. This is needed for CEC, which + * has non-standard repeats. * @last_keycode: keycode of last keypress * @last_protocol: protocol of last keypress * @last_scancode: scancode of last keypress @@ -115,6 +138,15 @@ enum rc_filter_type { * @max_timeout: maximum timeout supported by device * @rx_resolution : resolution (in ns) of input sampler * @tx_resolution: resolution (in ns) of output sampler + * @lirc_dev: lirc device + * @lirc_cdev: lirc char cdev + * @gap_start: time when gap starts + * @gap_duration: duration of initial gap + * @gap: true if we're in a gap + * @lirc_fh_lock: protects lirc_fh list + * @lirc_fh: list of open files + * @registered: set to true by rc_register_device(), false by + * rc_unregister_device * @change_protocol: allow changing the protocol used on hardware decoders * @open: callback to allow drivers to enable polling/irq when IR input device * is opened. @@ -165,6 +197,7 @@ struct rc_dev { bool keypressed; unsigned long keyup_jiffies; struct timer_list timer_keyup; + struct timer_list timer_repeat; u32 last_keycode; enum rc_proto last_protocol; u32 last_scancode; @@ -174,6 +207,16 @@ struct rc_dev { u32 max_timeout; u32 rx_resolution; u32 tx_resolution; +#ifdef CONFIG_LIRC + struct device lirc_dev; + struct cdev lirc_cdev; + ktime_t gap_start; + u64 gap_duration; + bool gap; + spinlock_t lirc_fh_lock; + struct list_head lirc_fh; +#endif + bool registered; int (*change_protocol)(struct rc_dev *dev, u64 *rc_proto); int (*open)(struct rc_dev *dev); void (*close)(struct rc_dev *dev); @@ -248,20 +291,6 @@ int devm_rc_register_device(struct device *parent, struct rc_dev *dev); */ void rc_unregister_device(struct rc_dev *dev); -/** - * rc_open - Opens a RC device - * - * @rdev: pointer to struct rc_dev. - */ -int rc_open(struct rc_dev *rdev); - -/** - * rc_close - Closes a RC device - * - * @rdev: pointer to struct rc_dev. - */ -void rc_close(struct rc_dev *rdev); - void rc_repeat(struct rc_dev *dev); void rc_keydown(struct rc_dev *dev, enum rc_proto protocol, u32 scancode, u8 toggle); @@ -305,16 +334,20 @@ void ir_raw_event_handle(struct rc_dev *dev); int ir_raw_event_store(struct rc_dev *dev, struct ir_raw_event *ev); int ir_raw_event_store_edge(struct rc_dev *dev, bool pulse); int ir_raw_event_store_with_filter(struct rc_dev *dev, - struct ir_raw_event *ev); + struct ir_raw_event *ev); +int ir_raw_event_store_with_timeout(struct rc_dev *dev, + struct ir_raw_event *ev); void ir_raw_event_set_idle(struct rc_dev *dev, bool idle); int ir_raw_encode_scancode(enum rc_proto protocol, u32 scancode, struct ir_raw_event *events, unsigned int max); +int ir_raw_encode_carrier(enum rc_proto protocol); static inline void ir_raw_event_reset(struct rc_dev *dev) { struct ir_raw_event ev = { .reset = true }; ir_raw_event_store(dev, &ev); + dev->idle = true; ir_raw_event_handle(dev); } diff --git a/include/media/rc-map.h b/include/media/rc-map.h index 2a160e6e823c..bfa3017cecba 100644 --- a/include/media/rc-map.h +++ b/include/media/rc-map.h @@ -10,59 +10,7 @@ */ #include - -/** - * enum rc_proto - the Remote Controller protocol - * - * @RC_PROTO_UNKNOWN: Protocol not known - * @RC_PROTO_OTHER: Protocol known but proprietary - * @RC_PROTO_RC5: Philips RC5 protocol - * @RC_PROTO_RC5X_20: Philips RC5x 20 bit protocol - * @RC_PROTO_RC5_SZ: StreamZap variant of RC5 - * @RC_PROTO_JVC: JVC protocol - * @RC_PROTO_SONY12: Sony 12 bit protocol - * @RC_PROTO_SONY15: Sony 15 bit protocol - * @RC_PROTO_SONY20: Sony 20 bit protocol - * @RC_PROTO_NEC: NEC protocol - * @RC_PROTO_NECX: Extended NEC protocol - * @RC_PROTO_NEC32: NEC 32 bit protocol - * @RC_PROTO_SANYO: Sanyo protocol - * @RC_PROTO_MCIR2_KBD: RC6-ish MCE keyboard - * @RC_PROTO_MCIR2_MSE: RC6-ish MCE mouse - * @RC_PROTO_RC6_0: Philips RC6-0-16 protocol - * @RC_PROTO_RC6_6A_20: Philips RC6-6A-20 protocol - * @RC_PROTO_RC6_6A_24: Philips RC6-6A-24 protocol - * @RC_PROTO_RC6_6A_32: Philips RC6-6A-32 protocol - * @RC_PROTO_RC6_MCE: MCE (Philips RC6-6A-32 subtype) protocol - * @RC_PROTO_SHARP: Sharp protocol - * @RC_PROTO_XMP: XMP protocol - * @RC_PROTO_CEC: CEC protocol - */ -enum rc_proto { - RC_PROTO_UNKNOWN = 0, - RC_PROTO_OTHER = 1, - RC_PROTO_RC5 = 2, - RC_PROTO_RC5X_20 = 3, - RC_PROTO_RC5_SZ = 4, - RC_PROTO_JVC = 5, - RC_PROTO_SONY12 = 6, - RC_PROTO_SONY15 = 7, - RC_PROTO_SONY20 = 8, - RC_PROTO_NEC = 9, - RC_PROTO_NECX = 10, - RC_PROTO_NEC32 = 11, - RC_PROTO_SANYO = 12, - RC_PROTO_MCIR2_KBD = 13, - RC_PROTO_MCIR2_MSE = 14, - RC_PROTO_RC6_0 = 15, - RC_PROTO_RC6_6A_20 = 16, - RC_PROTO_RC6_6A_24 = 17, - RC_PROTO_RC6_6A_32 = 18, - RC_PROTO_RC6_MCE = 19, - RC_PROTO_SHARP = 20, - RC_PROTO_XMP = 21, - RC_PROTO_CEC = 22, -}; +#include #define RC_PROTO_BIT_NONE 0ULL #define RC_PROTO_BIT_UNKNOWN BIT_ULL(RC_PROTO_UNKNOWN) @@ -88,6 +36,7 @@ enum rc_proto { #define RC_PROTO_BIT_SHARP BIT_ULL(RC_PROTO_SHARP) #define RC_PROTO_BIT_XMP BIT_ULL(RC_PROTO_XMP) #define RC_PROTO_BIT_CEC BIT_ULL(RC_PROTO_CEC) +#define RC_PROTO_BIT_IMON BIT_ULL(RC_PROTO_IMON) #define RC_PROTO_BIT_ALL \ (RC_PROTO_BIT_UNKNOWN | RC_PROTO_BIT_OTHER | \ @@ -101,7 +50,8 @@ enum rc_proto { RC_PROTO_BIT_RC6_0 | RC_PROTO_BIT_RC6_6A_20 | \ RC_PROTO_BIT_RC6_6A_24 | RC_PROTO_BIT_RC6_6A_32 | \ RC_PROTO_BIT_RC6_MCE | RC_PROTO_BIT_SHARP | \ - RC_PROTO_BIT_XMP | RC_PROTO_BIT_CEC) + RC_PROTO_BIT_XMP | RC_PROTO_BIT_CEC | \ + RC_PROTO_BIT_IMON) /* All rc protocols for which we have decoders */ #define RC_PROTO_BIT_ALL_IR_DECODER \ (RC_PROTO_BIT_RC5 | RC_PROTO_BIT_RC5X_20 | \ @@ -114,7 +64,7 @@ enum rc_proto { RC_PROTO_BIT_RC6_0 | RC_PROTO_BIT_RC6_6A_20 | \ RC_PROTO_BIT_RC6_6A_24 | RC_PROTO_BIT_RC6_6A_32 | \ RC_PROTO_BIT_RC6_MCE | RC_PROTO_BIT_SHARP | \ - RC_PROTO_BIT_XMP) + RC_PROTO_BIT_XMP | RC_PROTO_BIT_IMON) #define RC_PROTO_BIT_ALL_IR_ENCODER \ (RC_PROTO_BIT_RC5 | RC_PROTO_BIT_RC5X_20 | \ @@ -127,7 +77,7 @@ enum rc_proto { RC_PROTO_BIT_RC6_0 | RC_PROTO_BIT_RC6_6A_20 | \ RC_PROTO_BIT_RC6_6A_24 | \ RC_PROTO_BIT_RC6_6A_32 | RC_PROTO_BIT_RC6_MCE | \ - RC_PROTO_BIT_SHARP) + RC_PROTO_BIT_SHARP | RC_PROTO_BIT_IMON) #define RC_SCANCODE_UNKNOWN(x) (x) #define RC_SCANCODE_OTHER(x) (x) @@ -211,6 +161,7 @@ struct rc_map *rc_map_get(const char *name); #define RC_MAP_ALINK_DTU_M "rc-alink-dtu-m" #define RC_MAP_ANYSEE "rc-anysee" #define RC_MAP_APAC_VIEWCOMP "rc-apac-viewcomp" +#define RC_MAP_ASTROMETA_T2HYBRID "rc-astrometa-t2hybrid" #define RC_MAP_ASUS_PC39 "rc-asus-pc39" #define RC_MAP_ASUS_PS3_100 "rc-asus-ps3-100" #define RC_MAP_ATI_TV_WONDER_HD_600 "rc-ati-tv-wonder-hd-600" @@ -258,8 +209,11 @@ struct rc_map *rc_map_get(const char *name); #define RC_MAP_GENIUS_TVGO_A11MCE "rc-genius-tvgo-a11mce" #define RC_MAP_GOTVIEW7135 "rc-gotview7135" #define RC_MAP_HAUPPAUGE_NEW "rc-hauppauge" +#define RC_MAP_HISI_POPLAR "rc-hisi-poplar" +#define RC_MAP_HISI_TV_DEMO "rc-hisi-tv-demo" #define RC_MAP_IMON_MCE "rc-imon-mce" #define RC_MAP_IMON_PAD "rc-imon-pad" +#define RC_MAP_IMON_RSC "rc-imon-rsc" #define RC_MAP_IODATA_BCTV7E "rc-iodata-bctv7e" #define RC_MAP_IT913X_V1 "rc-it913x-v1" #define RC_MAP_IT913X_V2 "rc-it913x-v2" @@ -300,6 +254,7 @@ struct rc_map *rc_map_get(const char *name); #define RC_MAP_REDDO "rc-reddo" #define RC_MAP_SNAPSTREAM_FIREFLY "rc-snapstream-firefly" #define RC_MAP_STREAMZAP "rc-streamzap" +#define RC_MAP_TANGO "rc-tango" #define RC_MAP_TBS_NEC "rc-tbs-nec" #define RC_MAP_TECHNISAT_TS35 "rc-technisat-ts35" #define RC_MAP_TECHNISAT_USB2 "rc-technisat-usb2" diff --git a/include/net/addrconf.h b/include/net/addrconf.h index d06ff0183766..ecff0e070291 100644 --- a/include/net/addrconf.h +++ b/include/net/addrconf.h @@ -95,8 +95,9 @@ int __ipv6_get_lladdr(struct inet6_dev *idev, struct in6_addr *addr, u32 banned_flags); int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr, u32 banned_flags); -int inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2, - bool match_wildcard); +bool inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2, + bool match_wildcard); +bool inet_rcv_saddr_any(const struct sock *sk); void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr); void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr); @@ -227,6 +228,22 @@ struct ipv6_stub { const struct sock *sk, struct flowi6 *fl6, const struct in6_addr *final_dst); + + struct fib6_table *(*fib6_get_table)(struct net *net, u32 id); + struct fib6_info *(*fib6_lookup)(struct net *net, int oif, + struct flowi6 *fl6, int flags); + struct fib6_info *(*fib6_table_lookup)(struct net *net, + struct fib6_table *table, + int oif, struct flowi6 *fl6, + int flags); + struct fib6_info *(*fib6_multipath_select)(const struct net *net, + struct fib6_info *f6i, + struct flowi6 *fl6, int oif, + const struct sk_buff *skb, + int strict); + u32 (*ip6_mtu_from_fib6)(struct fib6_info *f6i, struct in6_addr *daddr, + struct in6_addr *saddr); + void (*udpv6_encap_enable)(void); void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr *daddr, const struct in6_addr *solicited_addr, @@ -239,6 +256,11 @@ extern const struct ipv6_stub *ipv6_stub __read_mostly; struct ipv6_bpf_stub { int (*inet6_bind)(struct sock *sk, struct sockaddr *uaddr, int addr_len, bool force_bind_address_no_port, bool with_lock); + struct sock *(*udp6_lib_lookup)(struct net *net, + const struct in6_addr *saddr, __be16 sport, + const struct in6_addr *daddr, __be16 dport, + int dif, int sdif, struct udp_table *tbl, + struct sk_buff *skb); }; extern const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly; @@ -314,6 +336,20 @@ static inline struct inet6_dev *__in6_dev_get(const struct net_device *dev) return rcu_dereference_rtnl(dev->ip6_ptr); } +/** + * __in6_dev_get_safely - get inet6_dev pointer from netdevice + * @dev: network device + * + * This is a safer version of __in6_dev_get + */ +static inline struct inet6_dev *__in6_dev_get_safely(const struct net_device *dev) +{ + if (likely(dev)) + return rcu_dereference_rtnl(dev->ip6_ptr); + else + return NULL; +} + /** * in6_dev_get - get inet6_dev pointer from netdevice * @dev: network device diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h index b9dcb02e756b..8e4f831d2e52 100644 --- a/include/net/bpf_sk_storage.h +++ b/include/net/bpf_sk_storage.h @@ -10,4 +10,14 @@ void bpf_sk_storage_free(struct sock *sk); extern const struct bpf_func_proto bpf_sk_storage_get_proto; extern const struct bpf_func_proto bpf_sk_storage_delete_proto; +#ifdef CONFIG_BPF_SYSCALL +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk); +#else +static inline int bpf_sk_storage_clone(const struct sock *sk, + struct sock *newsk) +{ + return 0; +} +#endif + #endif /* _BPF_SK_STORAGE_H */ diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h index b0f3b530415e..e9b2b46f2fd4 100644 --- a/include/net/cfg80211.h +++ b/include/net/cfg80211.h @@ -1935,7 +1935,7 @@ enum cfg80211_signal_type { * received by the device (not just by the host, in case it was * buffered on the device) and be accurate to about 10ms. * If the frame isn't buffered, just passing the return value of - * ktime_get_boot_ns() is likely appropriate. + * ktime_get_boottime_ns() is likely appropriate. * @parent_tsf: the time at the start of reception of the first octet of the * timestamp field of the frame. The time is the TSF of the BSS specified * by %parent_bssid. diff --git a/include/net/dst.h b/include/net/dst.h index 938a0f02ff1c..67f147c66369 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -34,13 +34,9 @@ struct sk_buff; struct dst_entry { struct net_device *dev; - struct rcu_head rcu_head; - struct dst_entry *child; struct dst_ops *ops; unsigned long _metrics; unsigned long expires; - struct dst_entry *path; - struct dst_entry *from; #ifdef CONFIG_XFRM struct xfrm_state *xfrm; #else @@ -59,8 +55,6 @@ struct dst_entry { #define DST_XFRM_QUEUE 0x0040 #define DST_METADATA 0x0080 - short error; - /* A non-zero value of dst->obsolete forces by-hand validation * of the route entry. Positive values are set by the generic * dst layer to indicate that the entry has been forcefully @@ -76,34 +70,27 @@ struct dst_entry { #define DST_OBSOLETE_KILL -2 unsigned short header_len; /* more space at head required */ unsigned short trailer_len; /* space to reserve at tail */ - unsigned short __pad3; -#ifdef CONFIG_IP_ROUTE_CLASSID - __u32 tclassid; -#else - __u32 __pad2; -#endif - -#ifdef CONFIG_64BIT - /* - * Align __refcnt to a 64 bytes alignment - * (L1_CACHE_SIZE would be too much) - */ - long __pad_to_align_refcnt[2]; -#endif /* * __refcnt wants to be on a different cache line from * input/output/ops or performance tanks badly */ - atomic_t __refcnt; /* client references */ +#ifdef CONFIG_64BIT + atomic_t __refcnt; /* 64-bit offset 64 */ +#endif int __use; unsigned long lastuse; struct lwtunnel_state *lwtstate; + struct rcu_head rcu_head; + short error; + short __pad; + __u32 tclassid; +#ifndef CONFIG_64BIT + atomic_t __refcnt; /* 32-bit offset 64 */ +#endif + union { struct dst_entry *next; - struct rtable __rcu *rt_next; - struct rt6_info *rt6_next; - struct dn_route __rcu *dn_next; }; }; @@ -250,7 +237,7 @@ static inline void dst_hold(struct dst_entry *dst) { /* * If your kernel compilation stops here, please check - * __pad_to_align_refcnt declaration in struct dst_entry + * the placement of __refcnt in struct dst_entry */ BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63); WARN_ON(atomic_inc_not_zero(&dst->__refcnt) == 0); diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h index 177b1aabf95d..55b1181aeba4 100644 --- a/include/net/dst_metadata.h +++ b/include/net/dst_metadata.h @@ -89,6 +89,7 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a, void metadata_dst_free(struct metadata_dst *); struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type, gfp_t flags); +void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst); struct metadata_dst __percpu * metadata_dst_alloc_percpu(u8 optslen, enum metadata_type type, gfp_t flags); diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h index 632086b2f644..3ae2fda29507 100644 --- a/include/net/dst_ops.h +++ b/include/net/dst_ops.h @@ -24,7 +24,7 @@ struct dst_ops { void (*destroy)(struct dst_entry *); void (*ifdown)(struct dst_entry *, struct net_device *dev, int how); - struct dst_entry * (*negative_advice)(struct dst_entry *); + void (*negative_advice)(struct sock *sk, struct dst_entry *); void (*link_failure)(struct sk_buff *); void (*update_pmtu)(struct dst_entry *dst, struct sock *sk, struct sk_buff *skb, u32 mtu, diff --git a/include/net/erspan.h b/include/net/erspan.h index ca94fc86865e..6e758d08c9ee 100644 --- a/include/net/erspan.h +++ b/include/net/erspan.h @@ -58,4 +58,55 @@ struct erspanhdr { struct erspan_metadata md; }; +static inline u8 tos_to_cos(u8 tos) +{ + u8 dscp, cos; + + dscp = tos >> 2; + cos = dscp >> 3; + return cos; +} + +static inline void erspan_build_header(struct sk_buff *skb, + __be32 id, u32 index, + bool truncate, bool is_ipv4) +{ + struct ethhdr *eth = eth_hdr(skb); + enum erspan_encap_type enc_type; + struct erspanhdr *ershdr; + struct qtag_prefix { + __be16 eth_type; + __be16 tci; + } *qp; + u16 vlan_tci = 0; + u8 tos; + + tos = is_ipv4 ? ip_hdr(skb)->tos : + (ipv6_hdr(skb)->priority << 4) + + (ipv6_hdr(skb)->flow_lbl[0] >> 4); + + enc_type = ERSPAN_ENCAP_NOVLAN; + + /* If mirrored packet has vlan tag, extract tci and + * perserve vlan header in the mirrored frame. + */ + if (eth->h_proto == htons(ETH_P_8021Q)) { + qp = (struct qtag_prefix *)(skb->data + 2 * ETH_ALEN); + vlan_tci = ntohs(qp->tci); + enc_type = ERSPAN_ENCAP_INFRAME; + } + + skb_push(skb, sizeof(*ershdr)); + ershdr = (struct erspanhdr *)skb->data; + memset(ershdr, 0, sizeof(*ershdr)); + + ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) | + (ERSPAN_VERSION << VER_OFFSET)); + ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) | + ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) | + (enc_type << EN_OFFSET & EN_MASK) | + ((truncate << T_OFFSET) & T_MASK)); + ershdr->md.index = htonl(index & INDEX_MASK); +} + #endif diff --git a/include/net/fib_notifier.h b/include/net/fib_notifier.h index 669b9716dc7a..4ffcb646ba1f 100644 --- a/include/net/fib_notifier.h +++ b/include/net/fib_notifier.h @@ -9,6 +9,7 @@ struct fib_notifier_info { struct net *net; int family; + struct netlink_ext_ack *extack; }; enum fib_event_type { diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h index b8fd023ba625..c836289e68b8 100644 --- a/include/net/fib_rules.h +++ b/include/net/fib_rules.h @@ -26,7 +26,8 @@ struct fib_rule { u32 table; u8 action; u8 l3mdev; - /* 2 bytes hole, try to use */ + u8 proto; + u8 ip_proto; u32 target; __be64 tun_id; struct fib_rule __rcu *ctarget; @@ -39,11 +40,14 @@ struct fib_rule { char iifname[IFNAMSIZ]; char oifname[IFNAMSIZ]; struct fib_kuid_range uid_range; + struct fib_rule_port_range sport_range; + struct fib_rule_port_range dport_range; struct rcu_head rcu; }; struct fib_lookup_arg { void *lookup_ptr; + const void *lookup_data; void *result; struct fib_rule *rule; u32 table; @@ -109,7 +113,12 @@ struct fib_rule_notifier_info { [FRA_SUPPRESS_IFGROUP] = { .type = NLA_U32 }, \ [FRA_GOTO] = { .type = NLA_U32 }, \ [FRA_L3MDEV] = { .type = NLA_U8 }, \ - [FRA_UID_RANGE] = { .len = sizeof(struct fib_rule_uid_range) } + [FRA_UID_RANGE] = { .len = sizeof(struct fib_rule_uid_range) }, \ + [FRA_PROTOCOL] = { .type = NLA_U8 }, \ + [FRA_IP_PROTO] = { .type = NLA_U8 }, \ + [FRA_SPORT_RANGE] = { .len = sizeof(struct fib_rule_port_range) }, \ + [FRA_DPORT_RANGE] = { .len = sizeof(struct fib_rule_port_range) } + static inline void fib_rule_get(struct fib_rule *rule) { @@ -143,6 +152,38 @@ static inline u32 frh_get_table(struct fib_rule_hdr *frh, struct nlattr **nla) return frh->table; } +static inline bool fib_rule_port_range_set(const struct fib_rule_port_range *range) +{ + return range->start != 0 && range->end != 0; +} + +static inline bool fib_rule_port_inrange(const struct fib_rule_port_range *a, + __be16 port) +{ + return ntohs(port) >= a->start && + ntohs(port) <= a->end; +} + +static inline bool fib_rule_port_range_valid(const struct fib_rule_port_range *a) +{ + return a->start != 0 && a->end != 0 && a->end < 0xffff && + a->start <= a->end; +} + +static inline bool fib_rule_port_range_compare(struct fib_rule_port_range *a, + struct fib_rule_port_range *b) +{ + return a->start == b->start && + a->end == b->end; +} + +static inline bool fib_rule_requires_fldissect(struct fib_rule *rule) +{ + return rule->ip_proto || + fib_rule_port_range_set(&rule->sport_range) || + fib_rule_port_range_set(&rule->dport_range); +} + struct fib_rules_ops *fib_rules_register(const struct fib_rules_ops *, struct net *); void fib_rules_unregister(struct fib_rules_ops *); diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h index ddf916e5e57d..8fbf9608e933 100644 --- a/include/net/flow_dissector.h +++ b/include/net/flow_dissector.h @@ -213,9 +213,8 @@ enum flow_dissector_key_id { }; #define FLOW_DISSECTOR_F_PARSE_1ST_FRAG BIT(0) -#define FLOW_DISSECTOR_F_STOP_AT_L3 BIT(1) -#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL BIT(2) -#define FLOW_DISSECTOR_F_STOP_AT_ENCAP BIT(3) +#define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL BIT(1) +#define FLOW_DISSECTOR_F_STOP_AT_ENCAP BIT(2) struct flow_dissector_key { enum flow_dissector_key_id key_id; @@ -228,6 +227,11 @@ struct flow_dissector { unsigned short int offset[FLOW_DISSECTOR_KEY_MAX]; }; +struct flow_keys_basic { + struct flow_dissector_key_control control; + struct flow_dissector_key_basic basic; +}; + struct flow_keys { struct flow_dissector_key_control control; #define FLOW_KEYS_HASH_START_FIELD basic @@ -246,7 +250,7 @@ __be32 flow_get_u32_src(const struct flow_keys *flow); __be32 flow_get_u32_dst(const struct flow_keys *flow); extern struct flow_dissector flow_keys_dissector; -extern struct flow_dissector flow_keys_buf_dissector; +extern struct flow_dissector flow_keys_basic_dissector; /* struct flow_keys_digest: * @@ -291,4 +295,11 @@ flow_dissector_init_keys(struct flow_dissector_key_control *key_control, memset(key_basic, 0, sizeof(*key_basic)); } +struct bpf_flow_dissector { + struct bpf_flow_keys *flow_keys; + const struct sk_buff *skb; + void *data; + void *data_end; +}; + #endif diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h index c8569e7a9f68..b23bc5d6efd9 100644 --- a/include/net/if_inet6.h +++ b/include/net/if_inet6.h @@ -64,7 +64,7 @@ struct inet6_ifaddr { struct delayed_work dad_work; struct inet6_dev *idev; - struct rt6_info *rt; + struct fib6_info *rt; struct hlist_node addr_lst; struct list_head if_list; @@ -143,8 +143,7 @@ struct ipv6_ac_socklist { struct ifacaddr6 { struct in6_addr aca_addr; - struct inet6_dev *aca_idev; - struct rt6_info *aca_rt; + struct fib6_info *aca_rt; struct ifacaddr6 *aca_next; int aca_users; refcount_t aca_refcnt; diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 3d85113c2276..f48c33be29c9 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -77,6 +77,7 @@ struct inet_connection_sock_af_ops { * @icsk_af_ops Operations which are AF_INET{4,6} specific * @icsk_ulp_ops Pluggable ULP control hook * @icsk_ulp_data ULP private data + * @icsk_listen_portaddr_node hash to the portaddr listener hashtable * @icsk_ca_state: Congestion control state * @icsk_retransmits: Number of unrecovered [RTO] timeouts * @icsk_pending: Scheduled timer event @@ -101,6 +102,7 @@ struct inet_connection_sock { const struct inet_connection_sock_af_ops *icsk_af_ops; const struct tcp_ulp_ops *icsk_ulp_ops; void *icsk_ulp_data; + struct hlist_node icsk_listen_portaddr_node; unsigned int (*icsk_sync_mss)(struct sock *sk, u32 pmtu); __u8 icsk_ca_state:6, icsk_ca_setsockopt:1, diff --git a/include/net/inet_ecn.h b/include/net/inet_ecn.h index 09ed8a48b454..75568f090035 100644 --- a/include/net/inet_ecn.h +++ b/include/net/inet_ecn.h @@ -148,7 +148,7 @@ static inline void ipv6_copy_dscp(unsigned int dscp, struct ipv6hdr *inner) static inline int INET_ECN_set_ce(struct sk_buff *skb) { - switch (skb->protocol) { + switch (skb_protocol(skb, true)) { case cpu_to_be16(ETH_P_IP): if (skb_network_header(skb) + sizeof(struct iphdr) <= skb_tail_pointer(skb)) @@ -215,12 +215,16 @@ static inline int IP_ECN_decapsulate(const struct iphdr *oiph, { __u8 inner; - if (skb->protocol == htons(ETH_P_IP)) + switch (skb_protocol(skb, true)) { + case htons(ETH_P_IP): inner = ip_hdr(skb)->tos; - else if (skb->protocol == htons(ETH_P_IPV6)) + break; + case htons(ETH_P_IPV6): inner = ipv6_get_dsfield(ipv6_hdr(skb)); - else + break; + default: return 0; + } return INET_ECN_decapsulate(skb, oiph->tos, inner); } @@ -230,12 +234,16 @@ static inline int IP6_ECN_decapsulate(const struct ipv6hdr *oipv6h, { __u8 inner; - if (skb->protocol == htons(ETH_P_IP)) + switch (skb_protocol(skb, true)) { + case htons(ETH_P_IP): inner = ip_hdr(skb)->tos; - else if (skb->protocol == htons(ETH_P_IPV6)) + break; + case htons(ETH_P_IPV6): inner = ipv6_get_dsfield(ipv6_hdr(skb)); - else + break; + default: return 0; + } return INET_ECN_decapsulate(skb, ipv6_get_dsfield(oipv6h), inner); } diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index e079478bf5c9..2d04f3e06de1 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -114,6 +114,7 @@ struct inet_bind_hashbucket { #define LISTENING_NULLS_BASE (1U << 29) struct inet_listen_hashbucket { spinlock_t lock; + unsigned int count; union { struct hlist_head head; struct hlist_nulls_head nulls_head; @@ -138,12 +139,13 @@ struct inet_hashinfo { /* Ok, let's try this, I give up, we do need a local binding * TCP hash as well as the others for fast bind/connect. */ - struct inet_bind_hashbucket *bhash; - - unsigned int bhash_size; - /* 4 bytes hole on 64 bit */ - struct kmem_cache *bind_bucket_cachep; + struct inet_bind_hashbucket *bhash; + unsigned int bhash_size; + + /* The 2nd listener table hashed by local port and address */ + unsigned int lhash2_mask; + struct inet_listen_hashbucket *lhash2; /* All the above members are written once at bootup and * never written again _or_ are predominantly read-access. @@ -151,14 +153,25 @@ struct inet_hashinfo { * Now align to a new cache line as all the following members * might be often dirty. */ - /* All sockets in TCP_LISTEN state will be in here. This is the only - * table where wildcard'd TCP sockets can exist. Hash function here - * is just local port number. + /* All sockets in TCP_LISTEN state will be in listening_hash. + * This is the only table where wildcard'd TCP sockets can + * exist. listening_hash is only hashed by local port number. + * If lhash2 is initialized, the same socket will also be hashed + * to lhash2 by port and address. */ struct inet_listen_hashbucket listening_hash[INET_LHTABLE_SIZE] ____cacheline_aligned_in_smp; }; +#define inet_lhash2_for_each_icsk_rcu(__icsk, list) \ + hlist_for_each_entry_rcu(__icsk, list, icsk_listen_portaddr_node) + +static inline struct inet_listen_hashbucket * +inet_lhash2_bucket(struct inet_hashinfo *h, u32 hash) +{ + return &h->lhash2[hash & h->lhash2_mask]; +} + static inline struct inet_ehash_bucket *inet_ehash_bucket( struct inet_hashinfo *hashinfo, unsigned int hash) @@ -214,6 +227,10 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child); void inet_put_port(struct sock *sk); void inet_hashinfo_init(struct inet_hashinfo *h); +void inet_hashinfo2_init(struct inet_hashinfo *h, const char *name, + unsigned long numentries, int scale, + unsigned long low_limit, + unsigned long high_limit); bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk); bool inet_ehash_nolisten(struct sock *sk, struct sock *osk, diff --git a/include/net/ip.h b/include/net/ip.h index fb1f38263266..c8dee1b1d431 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -26,12 +26,14 @@ #include #include #include +#include #include #include #include #include #include +#include #define IPV4_MAX_PMTU 65535U /* RFC 2675, Section 5.1 */ #define IPV4_MIN_MTU 68 /* RFC 791 */ @@ -395,6 +397,9 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk, return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU); } +int ip_metrics_convert(struct net *net, struct nlattr *fc_mx, int fc_mx_len, + u32 *metrics); + u32 ip_idents_reserve(u32 hash, int segs); void __ip_select_ident(struct net *net, struct iphdr *iph, int segs); @@ -538,6 +543,13 @@ static inline unsigned int ipv4_addr_hash(__be32 ip) return (__force unsigned int) ip; } +static inline u32 ipv4_portaddr_hash(const struct net *net, + __be32 saddr, + unsigned int port) +{ + return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port; +} + bool ip_call_ra_chain(struct sk_buff *skb); /* diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h index 63ffb47b0e52..d5a9edc09638 100644 --- a/include/net/ip6_fib.h +++ b/include/net/ip6_fib.h @@ -32,7 +32,16 @@ #define FIB6_TABLE_HASHSZ 1 #endif +#define RT6_DEBUG 2 + +#if RT6_DEBUG >= 3 +#define RT6_TRACE(x...) pr_debug(x) +#else +#define RT6_TRACE(x...) do { ; } while (0) +#endif + struct rt6_info; +struct fib6_info; struct fib6_config { u32 fc_table; @@ -63,32 +72,32 @@ struct fib6_config { }; struct fib6_node { - struct fib6_node *parent; - struct fib6_node *left; - struct fib6_node *right; + struct fib6_node __rcu *parent; + struct fib6_node __rcu *left; + struct fib6_node __rcu *right; #ifdef CONFIG_IPV6_SUBTREES - struct fib6_node *subtree; + struct fib6_node __rcu *subtree; #endif - struct rt6_info *leaf; + struct fib6_info __rcu *leaf; __u16 fn_bit; /* bit key */ __u16 fn_flags; int fn_sernum; - struct rt6_info *rr_ptr; + struct fib6_info __rcu *rr_ptr; struct rcu_head rcu; }; +struct fib6_gc_args { + int timeout; + int more; +}; + #ifndef CONFIG_IPV6_SUBTREES #define FIB6_SUBTREE(fn) NULL #else -#define FIB6_SUBTREE(fn) ((fn)->subtree) +#define FIB6_SUBTREE(fn) (rcu_dereference_protected((fn)->subtree, 1)) #endif -struct mx6_config { - const u32 *mx; - DECLARE_BITMAP(mx_valid, RTAX_MAX); -}; - /* * routing information * @@ -101,78 +110,126 @@ struct rt6key { struct fib6_table; -struct rt6_info { - struct dst_entry dst; +struct rt6_exception_bucket { + struct hlist_head chain; + int depth; +}; - /* - * Tail elements of dst_entry (__refcnt etc.) - * and these elements (rarely used in hot path) are in - * the same cache line. - */ - struct fib6_table *rt6i_table; - struct fib6_node __rcu *rt6i_node; +struct rt6_exception { + struct hlist_node hlist; + struct rt6_info *rt6i; + unsigned long stamp; + struct rcu_head rcu; +}; - struct in6_addr rt6i_gateway; +#define FIB6_EXCEPTION_BUCKET_SIZE_SHIFT 10 +#define FIB6_EXCEPTION_BUCKET_SIZE (1 << FIB6_EXCEPTION_BUCKET_SIZE_SHIFT) +#define FIB6_MAX_DEPTH 5 + +struct fib6_nh { + struct in6_addr nh_gw; + struct net_device *nh_dev; + struct lwtunnel_state *nh_lwtstate; + + unsigned int nh_flags; + atomic_t nh_upper_bound; + int nh_weight; +}; + +struct fib6_info { + struct fib6_table *fib6_table; + struct fib6_info __rcu *fib6_next; + struct fib6_node __rcu *fib6_node; /* Multipath routes: - * siblings is a list of rt6_info that have the the same metric/weight, + * siblings is a list of fib6_info that have the the same metric/weight, * destination, but not the same gateway. nsiblings is just a cache * to speed up lookup. */ - struct list_head rt6i_siblings; - unsigned int rt6i_nsiblings; + struct list_head fib6_siblings; + unsigned int fib6_nsiblings; - atomic_t rt6i_ref; + atomic_t fib6_ref; + unsigned long expires; + struct dst_metrics *fib6_metrics; +#define fib6_pmtu fib6_metrics->metrics[RTAX_MTU-1] - unsigned int rt6i_nh_flags; + struct rt6key fib6_dst; + u32 fib6_flags; + struct rt6key fib6_src; + struct rt6key fib6_prefsrc; - /* These are in a separate cache line. */ - struct rt6key rt6i_dst ____cacheline_aligned_in_smp; - u32 rt6i_flags; + struct rt6_info * __percpu *rt6i_pcpu; + struct rt6_exception_bucket __rcu *rt6i_exception_bucket; + +#ifdef CONFIG_IPV6_ROUTER_PREF + unsigned long last_probe; +#endif + + u32 fib6_metric; + u8 fib6_protocol; + u8 fib6_type; + u8 exception_bucket_flushed:1, + should_flush:1, + dst_nocount:1, + dst_nopolicy:1, + dst_host:1, + fib6_destroying:1, + unused:2; + + struct fib6_nh fib6_nh; + struct rcu_head rcu; +}; + +struct rt6_info { + struct dst_entry dst; + struct fib6_info __rcu *from; + + struct rt6key rt6i_dst; struct rt6key rt6i_src; + struct in6_addr rt6i_gateway; + struct inet6_dev *rt6i_idev; + u32 rt6i_flags; struct rt6key rt6i_prefsrc; struct list_head rt6i_uncached; struct uncached_list *rt6i_uncached_list; - struct inet6_dev *rt6i_idev; - struct rt6_info * __percpu *rt6i_pcpu; - - u32 rt6i_metric; - u32 rt6i_pmtu; /* more non-fragment space at head required */ unsigned short rt6i_nfheader_len; - u8 rt6i_protocol; }; +#define for_each_fib6_node_rt_rcu(fn) \ + for (rt = rcu_dereference((fn)->leaf); rt; \ + rt = rcu_dereference(rt->fib6_next)) + +#define for_each_fib6_walker_rt(w) \ + for (rt = (w)->leaf; rt; \ + rt = rcu_dereference_protected(rt->fib6_next, 1)) + static inline struct inet6_dev *ip6_dst_idev(struct dst_entry *dst) { return ((struct rt6_info *)dst)->rt6i_idev; } -static inline void rt6_clean_expires(struct rt6_info *rt) +static inline void fib6_clean_expires(struct fib6_info *f6i) { - rt->rt6i_flags &= ~RTF_EXPIRES; - rt->dst.expires = 0; + f6i->fib6_flags &= ~RTF_EXPIRES; + f6i->expires = 0; } -static inline void rt6_set_expires(struct rt6_info *rt, unsigned long expires) +static inline void fib6_set_expires(struct fib6_info *f6i, + unsigned long expires) { - rt->dst.expires = expires; - rt->rt6i_flags |= RTF_EXPIRES; + f6i->expires = expires; + f6i->fib6_flags |= RTF_EXPIRES; } -static inline void rt6_update_expires(struct rt6_info *rt0, int timeout) +static inline bool fib6_check_expired(const struct fib6_info *f6i) { - struct rt6_info *rt; - - for (rt = rt0; rt && !(rt->rt6i_flags & RTF_EXPIRES); - rt = (struct rt6_info *)rt->dst.from); - if (rt && rt != rt0) - rt0->dst.expires = rt->dst.expires; - - dst_set_expires(&rt0->dst, timeout); - rt0->rt6i_flags |= RTF_EXPIRES; + if (f6i->fib6_flags & RTF_EXPIRES) + return time_after(jiffies, f6i->expires); + return false; } /* Function to safely get fn->sernum for passed in rt @@ -180,32 +237,36 @@ static inline void rt6_update_expires(struct rt6_info *rt0, int timeout) * Return true if we can get cookie safely * Return false if not */ -static inline bool rt6_get_cookie_safe(const struct rt6_info *rt, - u32 *cookie) +static inline bool fib6_get_cookie_safe(const struct fib6_info *f6i, + u32 *cookie) { struct fib6_node *fn; bool status = false; - rcu_read_lock(); - fn = rcu_dereference(rt->rt6i_node); + fn = rcu_dereference(f6i->fib6_node); if (fn) { - *cookie = fn->fn_sernum; + *cookie = READ_ONCE(fn->fn_sernum); + /* pairs with smp_wmb() in fib6_update_sernum_upto_root() */ + smp_rmb(); status = true; } - rcu_read_unlock(); return status; } static inline u32 rt6_get_cookie(const struct rt6_info *rt) { + struct fib6_info *from; u32 cookie = 0; - if (rt->dst.from) - rt = (struct rt6_info *)(rt->dst.from); + rcu_read_lock(); - rt6_get_cookie_safe(rt, &cookie); + from = rcu_dereference(rt->from); + if (from) + fib6_get_cookie_safe(from, &cookie); + + rcu_read_unlock(); return cookie; } @@ -219,23 +280,23 @@ static inline void ip6_rt_put(struct rt6_info *rt) dst_release(&rt->dst); } -void rt6_free_pcpu(struct rt6_info *non_pcpu_rt); +struct fib6_info *fib6_info_alloc(gfp_t gfp_flags); +void fib6_info_destroy_rcu(struct rcu_head *head); -static inline void rt6_hold(struct rt6_info *rt) +static inline void fib6_info_hold(struct fib6_info *f6i) { - atomic_inc(&rt->rt6i_ref); + atomic_inc(&f6i->fib6_ref); } -static inline void rt6_release(struct rt6_info *rt) +static inline bool fib6_info_hold_safe(struct fib6_info *f6i) { - if (atomic_dec_and_test(&rt->rt6i_ref)) { - if (strstr(rt->dst.dev->name, "rmnet_data")) - net_log("rt6_release(): %s : Prefix: %pI6/%u, GW: %pI6, prot: %u\n", - rt->dst.dev->name, &rt->rt6i_dst.addr, rt->rt6i_dst.plen, &rt->rt6i_gateway, rt->rt6i_protocol); - rt6_free_pcpu(rt); - dst_dev_put(&rt->dst); - dst_release(&rt->dst); - } + return atomic_inc_not_zero(&f6i->fib6_ref); +} + +static inline void fib6_info_release(struct fib6_info *f6i) +{ + if (f6i && atomic_dec_and_test(&f6i->fib6_ref)) + call_rcu(&f6i->rcu, fib6_info_destroy_rcu); } enum fib6_walk_state { @@ -251,9 +312,8 @@ enum fib6_walk_state { struct fib6_walker { struct list_head lh; struct fib6_node *root, *node; - struct rt6_info *leaf; + struct fib6_info *leaf; enum fib6_walk_state state; - bool prune; unsigned int skip; unsigned int count; int (*func)(struct fib6_walker *); @@ -261,12 +321,15 @@ struct fib6_walker { }; struct rt6_statistics { - __u32 fib_nodes; - __u32 fib_route_nodes; - __u32 fib_rt_alloc; /* permanent routes */ - __u32 fib_rt_entries; /* rt entries in table */ - __u32 fib_rt_cache; /* cache routes */ - __u32 fib_discarded_routes; + __u32 fib_nodes; /* all fib6 nodes */ + __u32 fib_route_nodes; /* intermediate nodes */ + __u32 fib_rt_entries; /* rt entries in fib table */ + __u32 fib_rt_cache; /* cached rt entries in exception table */ + __u32 fib_discarded_routes; /* total number of routes delete */ + + /* The following stats are not protected by any lock */ + atomic_t fib_rt_alloc; /* total number of routes alloced */ + atomic_t fib_rt_uncache; /* rt entries in uncached list */ }; #define RTN_TL_ROOT 0x0001 @@ -282,7 +345,7 @@ struct rt6_statistics { struct fib6_table { struct hlist_node tb6_hlist; u32 tb6_id; - rwlock_t tb6_lock; + spinlock_t tb6_lock; struct fib6_node tb6_root; struct inet_peer_base tb6_peers; unsigned int flags; @@ -308,11 +371,12 @@ struct fib6_table { typedef struct rt6_info *(*pol_lookup_t)(struct net *, struct fib6_table *, - struct flowi6 *, int); + struct flowi6 *, + const struct sk_buff *, int); struct fib6_entry_notifier_info { struct fib_notifier_info info; /* must be first */ - struct rt6_info *rt; + struct fib6_info *rt; }; /* @@ -322,25 +386,52 @@ struct fib6_entry_notifier_info { struct fib6_table *fib6_get_table(struct net *net, u32 id); struct fib6_table *fib6_new_table(struct net *net, u32 id); struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6, + const struct sk_buff *skb, int flags, pol_lookup_t lookup); -struct fib6_node *fib6_lookup(struct fib6_node *root, - const struct in6_addr *daddr, - const struct in6_addr *saddr); +/* called with rcu lock held; can return error pointer + * caller needs to select path + */ +struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6, + int flags); + +/* called with rcu lock held; caller needs to select path */ +struct fib6_info *fib6_table_lookup(struct net *net, struct fib6_table *table, + int oif, struct flowi6 *fl6, int strict); + +struct fib6_info *fib6_multipath_select(const struct net *net, + struct fib6_info *match, + struct flowi6 *fl6, int oif, + const struct sk_buff *skb, int strict); + +struct fib6_node *fib6_node_lookup(struct fib6_node *root, + const struct in6_addr *daddr, + const struct in6_addr *saddr); struct fib6_node *fib6_locate(struct fib6_node *root, const struct in6_addr *daddr, int dst_len, - const struct in6_addr *saddr, int src_len); + const struct in6_addr *saddr, int src_len, + bool exact_match); -void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *arg), +void fib6_clean_all(struct net *net, int (*func)(struct fib6_info *, void *arg), void *arg); -int fib6_add(struct fib6_node *root, struct rt6_info *rt, - struct nl_info *info, struct mx6_config *mxc, - struct netlink_ext_ack *extack); -int fib6_del(struct rt6_info *rt, struct nl_info *info); +int fib6_add(struct fib6_node *root, struct fib6_info *rt, + struct nl_info *info, struct netlink_ext_ack *extack); +int fib6_del(struct fib6_info *rt, struct nl_info *info); -void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info, +static inline struct net_device *fib6_info_nh_dev(const struct fib6_info *f6i) +{ + return f6i->fib6_nh.nh_dev; +} + +static inline +struct lwtunnel_state *fib6_info_nh_lwt(const struct fib6_info *f6i) +{ + return f6i->fib6_nh.nh_lwtstate; +} + +void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info, unsigned int flags); void fib6_run_gc(unsigned long expires, struct net *net, bool force); @@ -363,12 +454,39 @@ void __net_exit fib6_notifier_exit(struct net *net); unsigned int fib6_tables_seq_read(struct net *net); int fib6_tables_dump(struct net *net, struct notifier_block *nb); +void fib6_update_sernum(struct net *net, struct fib6_info *rt); +void fib6_update_sernum_upto_root(struct net *net, struct fib6_info *rt); + +void fib6_metric_set(struct fib6_info *f6i, int metric, u32 val); +static inline bool fib6_metric_locked(struct fib6_info *f6i, int metric) +{ + return !!(f6i->fib6_metrics->metrics[RTAX_LOCK - 1] & (1 << metric)); +} + #ifdef CONFIG_IPV6_MULTIPLE_TABLES int fib6_rules_init(void); void fib6_rules_cleanup(void); bool fib6_rule_default(const struct fib_rule *rule); int fib6_rules_dump(struct net *net, struct notifier_block *nb); unsigned int fib6_rules_seq_read(struct net *net); + +static inline bool fib6_rules_early_flow_dissect(struct net *net, + struct sk_buff *skb, + struct flowi6 *fl6, + struct flow_keys *flkeys) +{ + unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP; + + if (!net->ipv6.fib6_rules_require_fldissect) + return false; + + skb_flow_dissect_flow_keys(skb, flkeys, flag); + fl6->fl6_sport = flkeys->ports.src; + fl6->fl6_dport = flkeys->ports.dst; + fl6->flowi6_proto = flkeys->basic.ip_proto; + + return true; +} #else static inline int fib6_rules_init(void) { @@ -390,5 +508,12 @@ static inline unsigned int fib6_rules_seq_read(struct net *net) { return 0; } +static inline bool fib6_rules_early_flow_dissect(struct net *net, + struct sk_buff *skb, + struct flowi6 *fl6, + struct flow_keys *flkeys) +{ + return false; +} #endif #endif diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index 34139bc9a853..670b3fc067ac 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -66,10 +66,17 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr) (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL | IPV6_ADDR_LOOPBACK); } +static inline bool rt6_qualify_for_ecmp(const struct fib6_info *f6i) +{ + return (f6i->fib6_flags & (RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC)) == + RTF_GATEWAY; +} + void ip6_route_input(struct sk_buff *skb); struct dst_entry *ip6_route_input_lookup(struct net *net, struct net_device *dev, - struct flowi6 *fl6, int flags); + struct flowi6 *fl6, + const struct sk_buff *skb, int flags); struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock *sk, struct flowi6 *fl6, int flags); @@ -82,9 +89,10 @@ static inline struct dst_entry *ip6_route_output(struct net *net, } struct dst_entry *ip6_route_lookup(struct net *net, struct flowi6 *fl6, - int flags); + const struct sk_buff *skb, int flags); struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, - int ifindex, struct flowi6 *fl6, int flags); + int ifindex, struct flowi6 *fl6, + const struct sk_buff *skb, int flags); void ip6_route_init_special_entries(void); int ip6_route_init(void); @@ -92,38 +100,46 @@ void ip6_route_cleanup(void); int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg); -int ip6_route_add(struct fib6_config *cfg, struct netlink_ext_ack *extack); -int ip6_ins_rt(struct rt6_info *); -int ip6_del_rt(struct rt6_info *); +int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags, + struct netlink_ext_ack *extack); +int ip6_ins_rt(struct net *net, struct fib6_info *f6i); +int ip6_del_rt(struct net *net, struct fib6_info *f6i); -static inline int ip6_route_get_saddr(struct net *net, struct rt6_info *rt, +void rt6_flush_exceptions(struct fib6_info *f6i); +void rt6_age_exceptions(struct fib6_info *f6i, struct fib6_gc_args *gc_args, + unsigned long now); + +static inline int ip6_route_get_saddr(struct net *net, struct fib6_info *f6i, const struct in6_addr *daddr, unsigned int prefs, struct in6_addr *saddr) { - struct inet6_dev *idev = - rt ? ip6_dst_idev((struct dst_entry *)rt) : NULL; int err = 0; - if (rt && rt->rt6i_prefsrc.plen) - *saddr = rt->rt6i_prefsrc.addr; - else - err = ipv6_dev_get_saddr(net, idev ? idev->dev : NULL, - daddr, prefs, saddr); + if (f6i && f6i->fib6_prefsrc.plen) { + *saddr = f6i->fib6_prefsrc.addr; + } else { + struct net_device *dev = f6i ? fib6_info_nh_dev(f6i) : NULL; + + err = ipv6_dev_get_saddr(net, dev, daddr, prefs, saddr); + } return err; } struct rt6_info *rt6_lookup(struct net *net, const struct in6_addr *daddr, - const struct in6_addr *saddr, int oif, int flags); -u32 rt6_multipath_hash(const struct flowi6 *fl6, const struct sk_buff *skb); + const struct in6_addr *saddr, int oif, + const struct sk_buff *skb, int flags); +u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6, + const struct sk_buff *skb, struct flow_keys *hkeys); struct dst_entry *icmp6_dst_alloc(struct net_device *dev, struct flowi6 *fl6); void fib6_force_start_gc(struct net *net); -struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev, - const struct in6_addr *addr, bool anycast); +struct fib6_info *addrconf_f6i_alloc(struct net *net, struct inet6_dev *idev, + const struct in6_addr *addr, bool anycast, + gfp_t gfp_flags); struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev, int flags); @@ -132,9 +148,11 @@ struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev, * support functions for ND * */ -struct rt6_info *rt6_get_dflt_router(const struct in6_addr *addr, +struct fib6_info *rt6_get_dflt_router(struct net *net, + const struct in6_addr *addr, struct net_device *dev); -struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr, +struct fib6_info *rt6_add_dflt_router(struct net *net, + const struct in6_addr *gwaddr, struct net_device *dev, unsigned int pref); void rt6_purge_dflt_routers(struct net *net); @@ -159,11 +177,14 @@ struct rt6_rtnl_dump_arg { struct net *net; }; -int rt6_dump_route(struct rt6_info *rt, void *p_arg); -void rt6_ifdown(struct net *net, struct net_device *dev); +int rt6_dump_route(struct fib6_info *f6i, void *p_arg); void rt6_mtu_change(struct net_device *dev, unsigned int mtu); void rt6_remove_prefsrc(struct inet6_ifaddr *ifp); void rt6_clean_tohost(struct net *net, struct in6_addr *gateway); +void rt6_sync_up(struct net_device *dev, unsigned int nh_flags); +void rt6_disable_ip(struct net_device *dev, unsigned long event); +void rt6_sync_down_dev(struct net_device *dev, unsigned long event); +void rt6_multipath_rebalance(struct fib6_info *f6i); static inline const struct rt6_info *skb_rt6_info(const struct sk_buff *skb) { @@ -246,11 +267,38 @@ static inline struct in6_addr *rt6_nexthop(struct rt6_info *rt, return daddr; } -static inline bool rt6_duplicate_nexthop(struct rt6_info *a, struct rt6_info *b) +static inline bool rt6_duplicate_nexthop(struct fib6_info *a, struct fib6_info *b) { - return a->dst.dev == b->dst.dev && - a->rt6i_idev == b->rt6i_idev && - ipv6_addr_equal(&a->rt6i_gateway, &b->rt6i_gateway) && - !lwtunnel_cmp_encap(a->dst.lwtstate, b->dst.lwtstate); + return a->fib6_nh.nh_dev == b->fib6_nh.nh_dev && + ipv6_addr_equal(&a->fib6_nh.nh_gw, &b->fib6_nh.nh_gw) && + !lwtunnel_cmp_encap(a->fib6_nh.nh_lwtstate, b->fib6_nh.nh_lwtstate); } + +static inline unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst) +{ + struct inet6_dev *idev; + unsigned int mtu; + + if (dst_metric_locked(dst, RTAX_MTU)) { + mtu = dst_metric_raw(dst, RTAX_MTU); + if (mtu) + return mtu; + } + + mtu = IPV6_MIN_MTU; + rcu_read_lock(); + idev = __in6_dev_get(dst->dev); + if (idev) + mtu = idev->cnf.mtu6; + rcu_read_unlock(); + + return mtu; +} + +u32 ip6_mtu_from_fib6(struct fib6_info *f6i, struct in6_addr *daddr, + struct in6_addr *saddr); + +struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw, + struct net_device *dev, struct sk_buff *skb, + const void *daddr); #endif diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h index 3b0e3cdee1c3..6ca6e5834b22 100644 --- a/include/net/ip6_tunnel.h +++ b/include/net/ip6_tunnel.h @@ -36,6 +36,7 @@ struct __ip6_tnl_parm { __be32 o_key; __u32 fwmark; + __u32 index; /* ERSPAN type II index */ }; /* IPv6 tunnel */ diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index b711317a796c..0e326ddfc39f 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -428,4 +428,6 @@ static inline void fib_proc_exit(struct net *net) } #endif +u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr); + #endif /* _NET_FIB_H */ diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h index 7844e393f905..caa337a39929 100644 --- a/include/net/ip_tunnels.h +++ b/include/net/ip_tunnels.h @@ -486,10 +486,12 @@ static inline void ip_tunnel_info_opts_get(void *to, } static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info, - const void *from, int len) + const void *from, int len, + __be16 flags) { memcpy(ip_tunnel_info_opts(info), from, len); info->options_len = len; + info->key.tun_flags |= flags; } static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate) @@ -531,9 +533,11 @@ static inline void ip_tunnel_info_opts_get(void *to, } static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info, - const void *from, int len) + const void *from, int len, + __be16 flags) { info->options_len = 0; + info->key.tun_flags |= flags; } #endif /* CONFIG_INET */ diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 384e71baa1b0..e4986d7a0fa2 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -22,6 +22,7 @@ #include #include #include +#include #define SIN6_LEN_RFC2133 24 @@ -51,6 +52,46 @@ #define IPV6_DEFAULT_HOPLIMIT 64 #define IPV6_DEFAULT_MCASTHOPS 1 +/* Limits on Hop-by-Hop and Destination options. + * + * Per RFC8200 there is no limit on the maximum number or lengths of options in + * Hop-by-Hop or Destination options other then the packet must fit in an MTU. + * We allow configurable limits in order to mitigate potential denial of + * service attacks. + * + * There are three limits that may be set: + * - Limit the number of options in a Hop-by-Hop or Destination options + * extension header + * - Limit the byte length of a Hop-by-Hop or Destination options extension + * header + * - Disallow unknown options + * + * The limits are expressed in corresponding sysctls: + * + * ipv6.sysctl.max_dst_opts_cnt + * ipv6.sysctl.max_hbh_opts_cnt + * ipv6.sysctl.max_dst_opts_len + * ipv6.sysctl.max_hbh_opts_len + * + * max_*_opts_cnt is the number of TLVs that are allowed for Destination + * options or Hop-by-Hop options. If the number is less than zero then unknown + * TLVs are disallowed and the number of known options that are allowed is the + * absolute value. Setting the value to INT_MAX indicates no limit. + * + * max_*_opts_len is the length limit in bytes of a Destination or + * Hop-by-Hop options extension header. Setting the value to INT_MAX + * indicates no length limit. + * + * If a limit is exceeded when processing an extension header the packet is + * silently discarded. + */ + +/* Default limits for Hop-by-Hop and Destination options */ +#define IP6_DEFAULT_MAX_DST_OPTS_CNT 8 +#define IP6_DEFAULT_MAX_HBH_OPTS_CNT 8 +#define IP6_DEFAULT_MAX_DST_OPTS_LEN INT_MAX /* No limit */ +#define IP6_DEFAULT_MAX_HBH_OPTS_LEN INT_MAX /* No limit */ + /* * Addr type * @@ -573,6 +614,22 @@ static inline bool ipv6_addr_v4mapped(const struct in6_addr *a) cpu_to_be32(0x0000ffff))) == 0UL; } +static inline u32 ipv6_portaddr_hash(const struct net *net, + const struct in6_addr *addr6, + unsigned int port) +{ + unsigned int hash, mix = net_hash_mix(net); + + if (ipv6_addr_any(addr6)) + hash = jhash_1word(0, mix); + else if (ipv6_addr_v4mapped(addr6)) + hash = jhash_1word((__force u32)addr6->s6_addr32[3], mix); + else + hash = jhash2((__force u32 *)addr6->s6_addr32, 4, mix); + + return hash ^ port; +} + /* * Check for a RFC 4843 ORCHID address * (Overlay Routable Cryptographic Hash Identifiers) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index 0ab4647ccc24..6ceb0ff6c353 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -129,7 +129,20 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b); int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb); int lwtunnel_input(struct sk_buff *skb); int lwtunnel_xmit(struct sk_buff *skb); +int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, + bool ingress); +static inline void lwtunnel_set_redirect(struct dst_entry *dst) +{ + if (lwtunnel_output_redirect(dst->lwtstate)) { + dst->lwtstate->orig_output = dst->output; + dst->output = lwtunnel_output; + } + if (lwtunnel_input_redirect(dst->lwtstate)) { + dst->lwtstate->orig_input = dst->input; + dst->input = lwtunnel_input; + } +} #else static inline void lwtstate_free(struct lwtunnel_state *lws) @@ -161,6 +174,10 @@ static inline bool lwtunnel_xmit_redirect(struct lwtunnel_state *lwtstate) return false; } +static inline void lwtunnel_set_redirect(struct dst_entry *dst) +{ +} + static inline unsigned int lwtunnel_headroom(struct lwtunnel_state *lwtstate, unsigned int mtu) { diff --git a/include/net/mpls.h b/include/net/mpls.h index 1dbc669b770e..ccaf238e8ea7 100644 --- a/include/net/mpls.h +++ b/include/net/mpls.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright (c) 2014 Nicira, Inc. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef _NET_MPLS_H diff --git a/include/net/mpls_iptunnel.h b/include/net/mpls_iptunnel.h index 9d22bf67ac86..6b4759eae158 100644 --- a/include/net/mpls_iptunnel.h +++ b/include/net/mpls_iptunnel.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright (c) 2015 Cumulus Networks, Inc. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef _NET_MPLS_IPTUNNEL_H diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 672e7ec3d89b..be308b7fa615 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -44,6 +45,7 @@ struct sock; struct netns_ipvs; struct bpf_prog; + #define NETDEV_HASHBITS 8 #define NETDEV_HASHENTRIES (1 << NETDEV_HASHBITS) @@ -151,6 +153,9 @@ struct net { #endif #if IS_ENABLED(CONFIG_CAN) struct netns_can can; +#endif +#ifdef CONFIG_XDP_SOCKETS + struct netns_xdp xdp; #endif struct sock *diag_nlsk; atomic_t fnhe_genid; diff --git a/include/net/netevent.h b/include/net/netevent.h index f728d9cad170..d9918261701c 100644 --- a/include/net/netevent.h +++ b/include/net/netevent.h @@ -26,6 +26,8 @@ enum netevent_notif_type { NETEVENT_NEIGH_UPDATE = 1, /* arg is struct neighbour ptr */ NETEVENT_REDIRECT, /* arg is struct netevent_redirect ptr */ NETEVENT_DELAY_PROBE_TIME_UPDATE, /* arg is struct neigh_parms ptr */ + NETEVENT_IPV4_MPATH_HASH_UPDATE, /* arg is struct net ptr */ + NETEVENT_IPV6_MPATH_HASH_UPDATE, /* arg is struct net ptr */ }; int register_netevent_notifier(struct notifier_block *nb); diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index ed9c8aaa65f0..7885f3feeec9 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -38,6 +38,8 @@ struct inet_timewait_death_row { int sysctl_max_tw_buckets; }; +struct tcp_fastopen_context; + struct netns_ipv4 { #ifdef CONFIG_SYSCTL struct ctl_table_header *forw_hdr; @@ -132,6 +134,13 @@ struct netns_ipv4 { int sysctl_tcp_default_init_rwnd; struct inet_timewait_death_row tcp_death_row; int sysctl_max_syn_backlog; + int sysctl_tcp_fastopen; + const struct tcp_congestion_ops __rcu *tcp_congestion_control; + struct tcp_fastopen_context __rcu *tcp_fastopen_ctx; + spinlock_t tcp_fastopen_ctx_lock; + unsigned int sysctl_tcp_fastopen_blackhole_timeout; + atomic_t tfo_active_disable_times; + unsigned long tfo_active_disable_stamp; #ifdef CONFIG_NET_L3_MASTER_DEV int sysctl_udp_l3mdev_accept; diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h index 290ca18589ee..a55d82a54ee9 100644 --- a/include/net/netns/ipv6.h +++ b/include/net/netns/ipv6.h @@ -28,6 +28,7 @@ struct netns_sysctl_ipv6 { int ip6_rt_gc_elasticity; int ip6_rt_mtu_expires; int ip6_rt_min_advmss; + int multipath_hash_policy; int flowlabel_consistency; int auto_flowlabels; int icmpv6_time; @@ -38,6 +39,10 @@ struct netns_sysctl_ipv6 { int idgen_delay; int flowlabel_state_ranges; int flowlabel_reflect; + int max_dst_opts_cnt; + int max_hbh_opts_cnt; + int max_dst_opts_len; + int max_hbh_opts_len; }; struct netns_ipv6 { @@ -55,7 +60,8 @@ struct netns_ipv6 { #endif struct xt_table *ip6table_nat; #endif - struct rt6_info *ip6_null_entry; + struct fib6_info *fib6_null_entry; + struct rt6_info *ip6_null_entry; struct rt6_statistics *rt6_stats; struct timer_list ip6_fib_timer; struct hlist_head *fib_table_hash; @@ -67,7 +73,8 @@ struct netns_ipv6 { atomic_t ip6_rt_gc_expire; unsigned long ip6_rt_last_gc; #ifdef CONFIG_IPV6_MULTIPLE_TABLES - bool fib6_has_custom_rules; + unsigned int fib6_rules_require_fldissect; + bool fib6_has_custom_rules; struct rt6_info *ip6_prohibit_entry; struct rt6_info *ip6_blk_hole_entry; struct fib6_table *fib6_local_tbl; diff --git a/include/net/netns/xdp.h b/include/net/netns/xdp.h new file mode 100644 index 000000000000..e5734261ba0a --- /dev/null +++ b/include/net/netns/xdp.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __NETNS_XDP_H__ +#define __NETNS_XDP_H__ + +#include +#include + +struct netns_xdp { + struct mutex lock; + struct hlist_head list; +}; + +#endif /* __NETNS_XDP_H__ */ diff --git a/include/net/page_pool.h b/include/net/page_pool.h new file mode 100644 index 000000000000..e240fac4c5b9 --- /dev/null +++ b/include/net/page_pool.h @@ -0,0 +1,163 @@ +/* SPDX-License-Identifier: GPL-2.0 + * + * page_pool.h + * Author: Jesper Dangaard Brouer + * Copyright (C) 2016 Red Hat, Inc. + */ + +/** + * DOC: page_pool allocator + * + * This page_pool allocator is optimized for the XDP mode that + * uses one-frame-per-page, but have fallbacks that act like the + * regular page allocator APIs. + * + * Basic use involve replacing alloc_pages() calls with the + * page_pool_alloc_pages() call. Drivers should likely use + * page_pool_dev_alloc_pages() replacing dev_alloc_pages(). + * + * If page_pool handles DMA mapping (use page->private), then API user + * is responsible for invoking page_pool_put_page() once. In-case of + * elevated refcnt, the DMA state is released, assuming other users of + * the page will eventually call put_page(). + * + * If no DMA mapping is done, then it can act as shim-layer that + * fall-through to alloc_page. As no state is kept on the page, the + * regular put_page() call is sufficient. + */ +#ifndef _NET_PAGE_POOL_H +#define _NET_PAGE_POOL_H + +#include /* Needed by ptr_ring */ +#include +#include + +#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */ +#define PP_FLAG_ALL PP_FLAG_DMA_MAP + +/* + * Fast allocation side cache array/stack + * + * The cache size and refill watermark is related to the network + * use-case. The NAPI budget is 64 packets. After a NAPI poll the RX + * ring is usually refilled and the max consumed elements will be 64, + * thus a natural max size of objects needed in the cache. + * + * Keeping room for more objects, is due to XDP_DROP use-case. As + * XDP_DROP allows the opportunity to recycle objects directly into + * this array, as it shares the same softirq/NAPI protection. If + * cache is already full (or partly full) then the XDP_DROP recycles + * would have to take a slower code path. + */ +#define PP_ALLOC_CACHE_SIZE 128 +#define PP_ALLOC_CACHE_REFILL 64 +struct pp_alloc_cache { + u32 count; + void *cache[PP_ALLOC_CACHE_SIZE]; +}; + +struct page_pool_params { + unsigned int flags; + unsigned int order; + unsigned int pool_size; + int nid; /* Numa node id to allocate from pages from */ + struct device *dev; /* device, for DMA pre-mapping purposes */ + enum dma_data_direction dma_dir; /* DMA mapping direction */ +}; + +struct page_pool { + struct rcu_head rcu; + struct page_pool_params p; + + /* + * Data structure for allocation side + * + * Drivers allocation side usually already perform some kind + * of resource protection. Piggyback on this protection, and + * require driver to protect allocation side. + * + * For NIC drivers this means, allocate a page_pool per + * RX-queue. As the RX-queue is already protected by + * Softirq/BH scheduling and napi_schedule. NAPI schedule + * guarantee that a single napi_struct will only be scheduled + * on a single CPU (see napi_schedule). + */ + struct pp_alloc_cache alloc ____cacheline_aligned_in_smp; + + /* Data structure for storing recycled pages. + * + * Returning/freeing pages is more complicated synchronization + * wise, because free's can happen on remote CPUs, with no + * association with allocation resource. + * + * Use ptr_ring, as it separates consumer and producer + * effeciently, it a way that doesn't bounce cache-lines. + * + * TODO: Implement bulk return pages into this structure. + */ + struct ptr_ring ring; +}; + +struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp); + +static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool) +{ + gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN); + + return page_pool_alloc_pages(pool, gfp); +} + +struct page_pool *page_pool_create(const struct page_pool_params *params); + +void page_pool_destroy(struct page_pool *pool); + +/* Never call this directly, use helpers below */ +void __page_pool_put_page(struct page_pool *pool, + struct page *page, bool allow_direct); + +static inline void page_pool_put_page(struct page_pool *pool, + struct page *page, bool allow_direct) +{ + /* When page_pool isn't compiled-in, net/core/xdp.c doesn't + * allow registering MEM_TYPE_PAGE_POOL, but shield linker. + */ +#ifdef CONFIG_PAGE_POOL + __page_pool_put_page(pool, page, allow_direct); +#endif +} +/* Very limited use-cases allow recycle direct */ +static inline void page_pool_recycle_direct(struct page_pool *pool, + struct page *page) +{ + __page_pool_put_page(pool, page, true); +} + +/* Disconnects a page (from a page_pool). API users can have a need + * to disconnect a page (from a page_pool), to allow it to be used as + * a regular page (that will eventually be returned to the normal + * page-allocator via put_page). + */ +void page_pool_unmap_page(struct page_pool *pool, struct page *page); +static inline void page_pool_release_page(struct page_pool *pool, + struct page *page) +{ +#ifdef CONFIG_PAGE_POOL + page_pool_unmap_page(pool, page); +#endif +} + +static inline dma_addr_t page_pool_get_dma_addr(struct page *page) +{ + return page->dma_addr; +} + +static inline bool is_page_pool_compiled_in(void) +{ +#ifdef CONFIG_PAGE_POOL + return true; +#else + return false; +#endif +} + +#endif /* _NET_PAGE_POOL_H */ diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h index 8b3f0fccb8a0..02fd755e6627 100644 --- a/include/net/pkt_sched.h +++ b/include/net/pkt_sched.h @@ -115,17 +115,6 @@ static inline void qdisc_run(struct Qdisc *q) __qdisc_run(q); } -static inline __be16 tc_skb_protocol(const struct sk_buff *skb) -{ - /* We need to take extra care in case the skb came via - * vlan accelerated path. In that case, use skb->vlan_proto - * as the original vlan header was already stripped. - */ - if (skb_vlan_tag_present(skb)) - return skb->vlan_proto; - return skb->protocol; -} - extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1]; extern int tc_qdisc_flow_control(struct net_device *dev, u32 tcm_handle, diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 2e5fe6a6eb60..6b43c92a0af8 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -256,13 +256,10 @@ struct tcf_proto { }; struct qdisc_skb_cb { - union { - struct { - unsigned int pkt_len; - u16 slave_dev_queue_mapping; - u16 tc_classid; - }; - struct bpf_flow_keys *flow_keys; + struct { + unsigned int pkt_len; + u16 slave_dev_queue_mapping; + u16 tc_classid; }; #define QDISC_CB_PRIV_LEN 20 unsigned char data[QDISC_CB_PRIV_LEN]; diff --git a/include/net/seg6.h b/include/net/seg6.h index 099bad59dc90..f450bc37d196 100644 --- a/include/net/seg6.h +++ b/include/net/seg6.h @@ -63,5 +63,6 @@ extern bool seg6_validate_srh(struct ipv6_sr_hdr *srh, int len); extern int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto); extern int seg6_do_srh_inline(struct sk_buff *skb, struct ipv6_sr_hdr *osrh); - +extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr, + u32 tbl_id); #endif diff --git a/include/net/seg6_local.h b/include/net/seg6_local.h new file mode 100644 index 000000000000..08359e2d8b35 --- /dev/null +++ b/include/net/seg6_local.h @@ -0,0 +1,34 @@ +/* + * SR-IPv6 implementation + * + * Authors: + * David Lebrun + * eBPF support: Mathieu Xhonneux + * + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#ifndef _NET_SEG6_LOCAL_H +#define _NET_SEG6_LOCAL_H + +#include +#include +#include + +extern int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr, + u32 tbl_id); +extern bool seg6_bpf_has_valid_srh(struct sk_buff *skb); + +struct seg6_bpf_srh_state { + struct ipv6_sr_hdr *srh; + u16 hdrlen; + bool valid; +}; + +DECLARE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states); + +#endif diff --git a/include/net/sock.h b/include/net/sock.h index f35411bab1a7..bf4234303856 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1841,19 +1841,12 @@ sk_dst_get(struct sock *sk) static inline void dst_negative_advice(struct sock *sk) { - struct dst_entry *ndst, *dst = __sk_dst_get(sk); + struct dst_entry *dst = __sk_dst_get(sk); sk_rethink_txhash(sk); - if (dst && dst->ops->negative_advice) { - ndst = dst->ops->negative_advice(dst); - - if (ndst != dst) { - rcu_assign_pointer(sk->sk_dst_cache, ndst); - sk_tx_queue_clear(sk); - WRITE_ONCE(sk->sk_dst_pending_confirm, 0); - } - } + if (dst && dst->ops->negative_advice) + dst->ops->negative_advice(sk, dst); } static inline void @@ -2241,10 +2234,6 @@ static inline struct page_frag *sk_page_frag(struct sock *sk) bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag); -int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg, - int sg_start, int *sg_curr, unsigned int *sg_size, - int first_coalesce); - /* * Default write policy as shown to user space via poll/select/SIGIO */ diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 0054b3a9b923..b0ff88240049 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -5,25 +5,32 @@ #include #include #include +#include #include +extern spinlock_t reuseport_lock; + struct sock_reuseport { struct rcu_head rcu; u16 max_socks; /* length of socks */ u16 num_socks; /* elements in socks */ + /* ID stays the same even after the size of socks[] grows. */ + unsigned int reuseport_id; + bool bind_inany; struct bpf_prog __rcu *prog; /* optional BPF sock selector */ struct sock *socks[0]; /* array of sock pointers */ }; -extern int reuseport_alloc(struct sock *sk); -extern int reuseport_add_sock(struct sock *sk, struct sock *sk2); +extern int reuseport_alloc(struct sock *sk, bool bind_inany); +extern int reuseport_add_sock(struct sock *sk, struct sock *sk2, + bool bind_inany); extern void reuseport_detach_sock(struct sock *sk); extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash, struct sk_buff *skb, int hdr_len); -extern struct bpf_prog *reuseport_attach_prog(struct sock *sk, - struct bpf_prog *prog); +extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog); +int reuseport_get_id(struct sock_reuseport *reuse); #endif /* _SOCK_REUSEPORT_H */ diff --git a/include/net/strparser.h b/include/net/strparser.h index d96b59f45eba..ef0822d110a8 100644 --- a/include/net/strparser.h +++ b/include/net/strparser.h @@ -57,10 +57,24 @@ struct strp_msg { int offset; }; +struct _strp_msg { + /* Internal cb structure. struct strp_msg must be first for passing + * to upper layer. + */ + struct strp_msg strp; + int accum_len; +}; + +struct sk_skb_cb { +#define SK_SKB_CB_PRIV_LEN 20 + unsigned char data[SK_SKB_CB_PRIV_LEN]; + struct _strp_msg strp; +}; + static inline struct strp_msg *strp_msg(struct sk_buff *skb) { return (struct strp_msg *)((void *)skb->cb + - offsetof(struct qdisc_skb_cb, data)); + offsetof(struct sk_skb_cb, strp)); } /* Structure for an attached lower socket */ @@ -90,6 +104,8 @@ static inline void strp_pause(struct strparser *strp) /* May be called without holding lock for attached socket */ void strp_unpause(struct strparser *strp); +/* Must be called with process lock held (lock_sock) */ +void __strp_unpause(struct strparser *strp); static inline void save_strp_stats(struct strparser *strp, struct strp_aggr_stats *agg_stats) diff --git a/include/net/tcp.h b/include/net/tcp.h index 3d18bc1fd737..af16baa4e8f1 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -249,7 +249,6 @@ void tcp_time_wait(struct sock *sk, int state, int timeo); /* sysctl variables for tcp */ -extern int sysctl_tcp_fastopen; extern int sysctl_tcp_retrans_collapse; extern int sysctl_tcp_stdurg; extern int sysctl_tcp_rfc1337; @@ -439,6 +438,7 @@ bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst); void tcp_disable_fack(struct tcp_sock *tp); void tcp_close(struct sock *sk, long timeout); void tcp_init_sock(struct sock *sk); +void tcp_init_transfer(struct sock *sk, int bpf_op); unsigned int tcp_poll(struct file *file, struct socket *sock, struct poll_table_struct *wait); int tcp_getsockopt(struct sock *sk, int level, int optname, @@ -458,6 +458,16 @@ void tcp_parse_options(const struct net *net, const struct sk_buff *skb, int estab, struct tcp_fastopen_cookie *foc); const u8 *tcp_parse_md5sig_option(const struct tcphdr *th); +/* + * BPF SKB-less helpers + */ +u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph, + struct tcphdr *th, u32 *cookie); +u16 tcp_v6_get_syncookie(struct sock *sk, struct ipv6hdr *iph, + struct tcphdr *th, u32 *cookie); +u16 tcp_get_syncookie_mss(struct request_sock_ops *rsk_ops, + const struct tcp_request_sock_ops *af_ops, + struct sock *sk, struct tcphdr *th); /* * TCP v4 functions exported for the inet6 API */ @@ -771,17 +781,7 @@ static inline u32 tcp_time_stamp_raw(void) return div_u64(tcp_clock_ns(), NSEC_PER_SEC / TCP_TS_HZ); } - -/* Refresh 1us clock of a TCP socket, - * ensuring monotically increasing values. - */ -static inline void tcp_mstamp_refresh(struct tcp_sock *tp) -{ - u64 val = tcp_clock_us(); - - if (val > tp->tcp_mstamp) - tp->tcp_mstamp = val; -} +void tcp_mstamp_refresh(struct tcp_sock *tp); static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0) { @@ -1086,8 +1086,8 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *type); void tcp_assign_congestion_control(struct sock *sk); void tcp_init_congestion_control(struct sock *sk); void tcp_cleanup_congestion_control(struct sock *sk); -int tcp_set_default_congestion_control(const char *name); -void tcp_get_default_congestion_control(char *name); +int tcp_set_default_congestion_control(struct net *net, const char *name); +void tcp_get_default_congestion_control(struct net *net, char *name); void tcp_get_available_congestion_control(char *buf, size_t len); void tcp_get_allowed_congestion_control(char *buf, size_t len); int tcp_set_allowed_congestion_control(char *allowed); @@ -1102,7 +1102,7 @@ void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked); extern struct tcp_congestion_ops tcp_reno; struct tcp_congestion_ops *tcp_ca_find_key(u32 key); -u32 tcp_ca_get_key_by_name(const char *name, bool *ecn_ca); +u32 tcp_ca_get_key_by_name(struct net *net, const char *name, bool *ecn_ca); #ifdef CONFIG_INET char *tcp_ca_get_name_by_key(u32 key, char *buffer); #else @@ -1607,8 +1607,7 @@ int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, /* From tcp_fastopen.c */ void tcp_fastopen_cache_get(struct sock *sk, u16 *mss, - struct tcp_fastopen_cookie *cookie, int *syn_loss, - unsigned long *last_syn_loss); + struct tcp_fastopen_cookie *cookie); void tcp_fastopen_cache_set(struct sock *sk, u16 mss, struct tcp_fastopen_cookie *cookie, bool syn_lost, u16 try_exp); @@ -1621,13 +1620,13 @@ struct tcp_fastopen_request { }; void tcp_free_fastopen_req(struct tcp_sock *tp); -extern struct tcp_fastopen_context __rcu *tcp_fastopen_ctx; -int tcp_fastopen_reset_cipher(void *key, unsigned int len); +void tcp_fastopen_ctx_destroy(struct net *net); +int tcp_fastopen_reset_cipher(struct net *net, void *key, unsigned int len); void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb); struct sock *tcp_try_fastopen(struct sock *sk, struct sk_buff *skb, struct request_sock *req, struct tcp_fastopen_cookie *foc); -void tcp_fastopen_init_key_once(bool publish); +void tcp_fastopen_init_key_once(struct net *net); bool tcp_fastopen_cookie_check(struct sock *sk, u16 *mss, struct tcp_fastopen_cookie *cookie); bool tcp_fastopen_defer_connect(struct sock *sk, int *err); @@ -1640,11 +1639,10 @@ struct tcp_fastopen_context { struct rcu_head rcu; }; -extern unsigned int sysctl_tcp_fastopen_blackhole_timeout; void tcp_fastopen_active_disable(struct sock *sk); bool tcp_fastopen_active_should_disable(struct sock *sk); void tcp_fastopen_active_disable_ofo_check(struct sock *sk); -void tcp_fastopen_active_timeout_reset(void); +void tcp_fastopen_active_detect_blackhole(struct sock *sk, bool expired); /* Latencies incurred by various limits for a sender. They are * chronograph-like stats that are mutually exclusive. @@ -2148,7 +2146,6 @@ struct tcp_ulp_ops { int tcp_register_ulp(struct tcp_ulp_ops *type); void tcp_unregister_ulp(struct tcp_ulp_ops *type); int tcp_set_ulp(struct sock *sk, const char *name); -int tcp_set_ulp_id(struct sock *sk, const int ulp); void tcp_get_available_ulp(char *buf, size_t len); void tcp_cleanup_ulp(struct sock *sk); @@ -2161,15 +2158,12 @@ struct sk_psock; int tcp_bpf_init(struct sock *sk); void tcp_bpf_reinit(struct sock *sk); - int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg *msg, u32 bytes, int flags); - int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, int *addr_len); - int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, - struct msghdr *msg, int len); + struct msghdr *msg, int len, int flags); /* Call BPF_SOCK_OPS program that returns an int. If the return value * is < 0, then the BPF op failed (for example if the loaded BPF @@ -2177,17 +2171,21 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, * program loaded). */ #ifdef CONFIG_BPF -static inline int tcp_call_bpf(struct sock *sk, int op) +static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args) { struct bpf_sock_ops_kern sock_ops; int ret; - if (sk_fullsock(sk)) + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); + if (sk_fullsock(sk)) { + sock_ops.is_fullsock = 1; sock_owned_by_me(sk); + } - memset(&sock_ops, 0, sizeof(sock_ops)); sock_ops.sk = sk; sock_ops.op = op; + if (nargs > 0) + memcpy(sock_ops.args, args, nargs * sizeof(*args)); ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); if (ret == 0) @@ -2196,18 +2194,46 @@ static inline int tcp_call_bpf(struct sock *sk, int op) ret = -1; return ret; } + +static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 arg2) +{ + u32 args[2] = {arg1, arg2}; + + return tcp_call_bpf(sk, op, 2, args); +} + +static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2, + u32 arg3) +{ + u32 args[3] = {arg1, arg2, arg3}; + + return tcp_call_bpf(sk, op, 3, args); +} + #else -static inline int tcp_call_bpf(struct sock *sk, int op) +static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args) { return -EPERM; } + +static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 arg2) +{ + return -EPERM; +} + +static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 arg2, + u32 arg3) +{ + return -EPERM; +} + #endif static inline u32 tcp_timeout_init(struct sock *sk) { int timeout; - timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT); + timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL); if (timeout <= 0) timeout = TCP_TIMEOUT_INIT; @@ -2218,7 +2244,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk) { int rwnd; - rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT); + rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL); if (rwnd < 0) rwnd = 0; @@ -2227,6 +2253,15 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk) static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk) { - return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1); + return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1); } + +static inline void tcp_bpf_rtt(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + + if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTT_CB_FLAG)) + tcp_call_bpf(sk, BPF_SOCK_OPS_RTT_CB, 0, NULL); +} + #endif /* _TCP_H */ diff --git a/include/net/tls.h b/include/net/tls.h index 604fd982da19..e195c2d3b11b 100644 --- a/include/net/tls.h +++ b/include/net/tls.h @@ -36,10 +36,14 @@ #include #include +#include #include #include -#include +#include +#include +#include +#include #include @@ -54,31 +58,151 @@ #define TLS_RECORD_TYPE_DATA 0x17 #define TLS_AAD_SPACE_SIZE 13 +#define TLS_DEVICE_NAME_MAX 32 -struct tls_sw_context { - struct crypto_aead *aead_send; - - /* Sending context */ - char aad_space[TLS_AAD_SPACE_SIZE]; - - unsigned int sg_plaintext_size; - int sg_plaintext_num_elem; - struct scatterlist sg_plaintext_data[MAX_SKB_FRAGS]; - - unsigned int sg_encrypted_size; - int sg_encrypted_num_elem; - struct scatterlist sg_encrypted_data[MAX_SKB_FRAGS]; - - /* AAD | sg_plaintext_data | sg_tag */ - struct scatterlist sg_aead_in[2]; - /* AAD | sg_encrypted_data (data contain overhead for hdr&iv&tag) */ - struct scatterlist sg_aead_out[2]; +/* + * This structure defines the routines for Inline TLS driver. + * The following routines are optional and filled with a + * null pointer if not defined. + * + * @name: Its the name of registered Inline tls device + * @dev_list: Inline tls device list + * int (*feature)(struct tls_device *device); + * Called to return Inline TLS driver capability + * + * int (*hash)(struct tls_device *device, struct sock *sk); + * This function sets Inline driver for listen and program + * device specific functioanlity as required + * + * void (*unhash)(struct tls_device *device, struct sock *sk); + * This function cleans listen state set by Inline TLS driver + */ +struct tls_device { + char name[TLS_DEVICE_NAME_MAX]; + struct list_head dev_list; + int (*feature)(struct tls_device *device); + int (*hash)(struct tls_device *device, struct sock *sk); + void (*unhash)(struct tls_device *device, struct sock *sk); }; +enum { + TLS_BASE, + TLS_SW, +#ifdef CONFIG_TLS_DEVICE + TLS_HW, +#endif + TLS_HW_RECORD, + TLS_NUM_CONFIG, +}; + +/* TLS records are maintained in 'struct tls_rec'. It stores the memory pages + * allocated or mapped for each TLS record. After encryption, the records are + * stores in a linked list. + */ +struct tls_rec { + struct list_head list; + int tx_ready; + int tx_flags; + int inplace_crypto; + + struct sk_msg msg_plaintext; + struct sk_msg msg_encrypted; + + /* AAD | msg_plaintext.sg.data | sg_tag */ + struct scatterlist sg_aead_in[2]; + /* AAD | msg_encrypted.sg.data (data contains overhead for hdr & iv & tag) */ + struct scatterlist sg_aead_out[2]; + + char aad_space[TLS_AAD_SPACE_SIZE]; + struct aead_request aead_req; + u8 aead_req_ctx[]; +}; + +struct tx_work { + struct delayed_work work; + struct sock *sk; +}; + +struct tls_sw_context_tx { + struct crypto_aead *aead_send; + struct crypto_wait async_wait; + struct tx_work tx_work; + struct tls_rec *open_rec; + struct list_head tx_list; + atomic_t encrypt_pending; + int async_notify; + +#define BIT_TX_SCHEDULED 0 + unsigned long tx_bitmask; +}; + +struct tls_sw_context_rx { + struct crypto_aead *aead_recv; + struct crypto_wait async_wait; + + struct strparser strp; + void (*saved_data_ready)(struct sock *sk); + + struct sk_buff *recv_pkt; + u8 control; + bool decrypted; +}; + +struct tls_record_info { + struct list_head list; + u32 end_seq; + int len; + int num_frags; + skb_frag_t frags[MAX_SKB_FRAGS]; +}; + +struct tls_offload_context_tx { + struct crypto_aead *aead_send; + spinlock_t lock; /* protects records list */ + struct list_head records_list; + struct tls_record_info *open_record; + struct tls_record_info *retransmit_hint; + u64 hint_record_sn; + u64 unacked_record_sn; + + struct scatterlist sg_tx_data[MAX_SKB_FRAGS]; + void (*sk_destruct)(struct sock *sk); + u8 driver_state[]; + /* The TLS layer reserves room for driver specific state + * Currently the belief is that there is not enough + * driver specific state to justify another layer of indirection + */ +#define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *))) +}; + +#define TLS_OFFLOAD_CONTEXT_SIZE_TX \ + (ALIGN(sizeof(struct tls_offload_context_tx), sizeof(void *)) + \ + TLS_DRIVER_STATE_SIZE) + enum { TLS_PENDING_CLOSED_RECORD }; +enum tls_context_flags { + TLS_RX_SYNC_RUNNING = 0, + /* tls_dev_del was called for the RX side, device state was released, + * but tls_ctx->netdev might still be kept, because TX-side driver + * resources might not be released yet. Used to prevent the second + * tls_dev_del call in tls_device_down if it happens simultaneously. + */ + TLS_RX_DEV_CLOSED = 2, +}; + +struct cipher_context { + u16 prepend_size; + u16 tag_size; + u16 overhead_size; + u16 iv_size; + char *iv; + u16 rec_seq_size; + char *rec_seq; +}; + union tls_crypto_context { struct tls_crypto_info info; struct tls12_crypto_info_aes_gcm_128 aes_gcm_128; @@ -86,28 +210,32 @@ union tls_crypto_context { struct tls_context { union tls_crypto_context crypto_send; + union tls_crypto_context crypto_recv; - void *priv_ctx; + struct list_head list; + struct net_device *netdev; + refcount_t refcount; - u8 tx_conf:2; + void *priv_ctx_tx; + void *priv_ctx_rx; - u16 prepend_size; - u16 tag_size; - u16 overhead_size; - u16 iv_size; - char *iv; - u16 rec_seq_size; - char *rec_seq; + u8 tx_conf:3; + u8 rx_conf:3; + + struct cipher_context tx; + struct cipher_context rx; struct scatterlist *partially_sent_record; u16 partially_sent_offset; + unsigned long flags; bool in_tcp_sendpages; + bool pending_open_record_frags; - u16 pending_open_record_frags; int (*push_pending_record)(struct sock *sk, int flags); void (*sk_write_space)(struct sock *sk); + void (*sk_destruct)(struct sock *sk); void (*sk_proto_close)(struct sock *sk, long timeout); int (*setsockopt)(struct sock *sk, int level, @@ -116,28 +244,76 @@ struct tls_context { int (*getsockopt)(struct sock *sk, int level, int optname, char __user *optval, int __user *optlen); + int (*hash)(struct sock *sk); + void (*unhash)(struct sock *sk); }; +struct tls_offload_context_rx { + /* sw must be the first member of tls_offload_context_rx */ + struct tls_sw_context_rx sw; + atomic64_t resync_req; + u8 driver_state[]; + /* The TLS layer reserves room for driver specific state + * Currently the belief is that there is not enough + * driver specific state to justify another layer of indirection + */ +}; + +#define TLS_OFFLOAD_CONTEXT_SIZE_RX \ + (ALIGN(sizeof(struct tls_offload_context_rx), sizeof(void *)) + \ + TLS_DRIVER_STATE_SIZE) + +void tls_ctx_free(struct tls_context *ctx); int wait_on_pending_writer(struct sock *sk, long *timeo); int tls_sk_query(struct sock *sk, int optname, char __user *optval, int __user *optlen); int tls_sk_attach(struct sock *sk, int optname, char __user *optval, unsigned int optlen); - -int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx); +int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx); int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size); int tls_sw_sendpage(struct sock *sk, struct page *page, int offset, size_t size, int flags); void tls_sw_close(struct sock *sk, long timeout); -void tls_sw_free_tx_resources(struct sock *sk); +void tls_sw_free_resources_tx(struct sock *sk); +void tls_sw_free_resources_rx(struct sock *sk); +void tls_sw_release_resources_rx(struct sock *sk); +int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, + int nonblock, int flags, int *addr_len); +bool tls_sw_stream_read(const struct sock *sk); +ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, + size_t len, unsigned int flags); + +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx); +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size); +int tls_device_sendpage(struct sock *sk, struct page *page, + int offset, size_t size, int flags); +void tls_device_sk_destruct(struct sock *sk); +void tls_device_init(void); +void tls_device_cleanup(void); +int tls_tx_records(struct sock *sk, int flags); + +struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context, + u32 seq, u64 *p_record_sn); + +static inline bool tls_record_is_start_marker(struct tls_record_info *rec) +{ + return rec->len == 0; +} + +static inline u32 tls_record_start_seq(struct tls_record_info *rec) +{ + return rec->end_seq - rec->len; +} void tls_sk_destruct(struct sock *sk, struct tls_context *ctx); -void tls_icsk_clean_acked(struct sock *sk); - int tls_push_sg(struct sock *sk, struct tls_context *ctx, struct scatterlist *sg, u16 first_offset, int flags); +int tls_push_partial_record(struct sock *sk, struct tls_context *ctx, + int flags); + int tls_push_pending_closed_record(struct sock *sk, struct tls_context *ctx, int flags, long *timeo); @@ -171,9 +347,35 @@ static inline bool tls_is_pending_open_record(struct tls_context *tls_ctx) return tls_ctx->pending_open_record_frags; } -static inline void tls_err_abort(struct sock *sk) +static inline bool is_tx_ready(struct tls_sw_context_tx *ctx) { - sk->sk_err = EBADMSG; + struct tls_rec *rec; + + rec = list_first_entry(&ctx->tx_list, struct tls_rec, list); + if (!rec) + return false; + + return READ_ONCE(rec->tx_ready); +} + +struct sk_buff * +tls_validate_xmit_skb(struct sock *sk, struct net_device *dev, + struct sk_buff *skb); + +static inline bool tls_is_sk_tx_device_offloaded(struct sock *sk) +{ +#ifdef CONFIG_SOCK_VALIDATE_XMIT + return sk_fullsock(sk) && + (smp_load_acquire(&sk->sk_validate_xmit_skb) == + &tls_validate_xmit_skb); +#else + return false; +#endif +} + +static inline void tls_err_abort(struct sock *sk, int err) +{ + sk->sk_err = err; sk->sk_error_report(sk); } @@ -191,10 +393,10 @@ static inline bool tls_bigint_increment(unsigned char *seq, int len) } static inline void tls_advance_record_sn(struct sock *sk, - struct tls_context *ctx) + struct cipher_context *ctx) { if (tls_bigint_increment(ctx->rec_seq, ctx->rec_seq_size)) - tls_err_abort(sk); + tls_err_abort(sk, EBADMSG); tls_bigint_increment(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, ctx->iv_size); } @@ -204,9 +406,9 @@ static inline void tls_fill_prepend(struct tls_context *ctx, size_t plaintext_len, unsigned char record_type) { - size_t pkt_len, iv_size = ctx->iv_size; + size_t pkt_len, iv_size = ctx->tx.iv_size; - pkt_len = plaintext_len + iv_size + ctx->tag_size; + pkt_len = plaintext_len + iv_size + ctx->tx.tag_size; /* we cover nonce explicit here as well, so buf should be of * size KTLS_DTLS_HEADER_SIZE + KTLS_DTLS_NONCE_EXPLICIT_SIZE @@ -218,7 +420,22 @@ static inline void tls_fill_prepend(struct tls_context *ctx, buf[3] = pkt_len >> 8; buf[4] = pkt_len & 0xFF; memcpy(buf + TLS_NONCE_OFFSET, - ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv_size); + ctx->tx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv_size); +} + +static inline void tls_make_aad(char *buf, + size_t size, + char *record_sequence, + int record_sequence_size, + unsigned char record_type) +{ + memcpy(buf, record_sequence, record_sequence_size); + + buf[8] = record_type; + buf[9] = TLS_1_2_VERSION_MAJOR; + buf[10] = TLS_1_2_VERSION_MINOR; + buf[11] = size >> 8; + buf[12] = size & 0xFF; } static inline struct tls_context *tls_get_ctx(const struct sock *sk) @@ -228,19 +445,59 @@ static inline struct tls_context *tls_get_ctx(const struct sock *sk) return icsk->icsk_ulp_data; } -static inline struct tls_sw_context *tls_sw_ctx( +static inline struct tls_sw_context_rx *tls_sw_ctx_rx( const struct tls_context *tls_ctx) { - return (struct tls_sw_context *)tls_ctx->priv_ctx; + return (struct tls_sw_context_rx *)tls_ctx->priv_ctx_rx; } -static inline struct tls_offload_context *tls_offload_ctx( +static inline struct tls_sw_context_tx *tls_sw_ctx_tx( const struct tls_context *tls_ctx) { - return (struct tls_offload_context *)tls_ctx->priv_ctx; + return (struct tls_sw_context_tx *)tls_ctx->priv_ctx_tx; } +static inline struct tls_offload_context_tx * +tls_offload_ctx_tx(const struct tls_context *tls_ctx) +{ + return (struct tls_offload_context_tx *)tls_ctx->priv_ctx_tx; +} + +static inline struct tls_offload_context_rx * +tls_offload_ctx_rx(const struct tls_context *tls_ctx) +{ + return (struct tls_offload_context_rx *)tls_ctx->priv_ctx_rx; +} + +/* The TLS context is valid until sk_destruct is called */ +static inline void tls_offload_rx_resync_request(struct sock *sk, __be32 seq) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_offload_context_rx *rx_ctx = tls_offload_ctx_rx(tls_ctx); + + atomic64_set(&rx_ctx->resync_req, ((((uint64_t)seq) << 32) | 1)); +} + + int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg, unsigned char *record_type); +void tls_register_device(struct tls_device *device); +void tls_unregister_device(struct tls_device *device); +int tls_device_decrypted(struct sock *sk, struct sk_buff *skb); +int decrypt_skb(struct sock *sk, struct sk_buff *skb, + struct scatterlist *sgout); + +struct sk_buff *tls_validate_xmit_skb(struct sock *sk, + struct net_device *dev, + struct sk_buff *skb); + +int tls_sw_fallback_init(struct sock *sk, + struct tls_offload_context_tx *offload_ctx, + struct tls_crypto_info *crypto_info); + +int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx); + +void tls_device_offload_cleanup_rx(struct sock *sk); +void handle_device_resync(struct sock *sk, u32 seq, u64 rcd_sn); #endif /* _TLS_OFFLOAD_H */ diff --git a/include/net/xdp.h b/include/net/xdp.h index a5e755868a82..254595f30a70 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -1,7 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* include/net/xdp.h * * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc. - * Released under terms in GPL version 2. See COPYING. */ #ifndef __LINUX_NET_XDP_H__ #define __LINUX_NET_XDP_H__ @@ -33,10 +33,34 @@ * also mandatory during RX-ring setup. */ +enum xdp_mem_type { + MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */ + MEM_TYPE_PAGE_ORDER0, /* Orig XDP full page model */ + MEM_TYPE_PAGE_POOL, + MEM_TYPE_ZERO_COPY, + MEM_TYPE_MAX, +}; + +/* XDP flags for ndo_xdp_xmit */ +#define XDP_XMIT_FLUSH (1U << 0) /* doorbell signal consumer */ +#define XDP_XMIT_FLAGS_MASK XDP_XMIT_FLUSH + +struct xdp_mem_info { + u32 type; /* enum xdp_mem_type, but known size type */ + u32 id; +}; + +struct page_pool; + +struct zero_copy_allocator { + void (*free)(struct zero_copy_allocator *zca, unsigned long handle); +}; + struct xdp_rxq_info { struct net_device *dev; u32 queue_index; u32 reg_state; + struct xdp_mem_info mem; } ____cacheline_aligned; /* perf critical, avoid false-sharing */ struct xdp_buff { @@ -44,13 +68,88 @@ struct xdp_buff { void *data_end; void *data_meta; void *data_hard_start; + unsigned long handle; struct xdp_rxq_info *rxq; }; +struct xdp_frame { + void *data; + u16 len; + u16 headroom; + u16 metasize; + /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time, + * while mem info is valid on remote CPU. + */ + struct xdp_mem_info mem; + struct net_device *dev_rx; /* used by cpumap */ +}; + +/* Clear kernel pointers in xdp_frame */ +static inline void xdp_scrub_frame(struct xdp_frame *frame) +{ + frame->data = NULL; + frame->dev_rx = NULL; +} + +/* Convert xdp_buff to xdp_frame */ +static inline +struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp) +{ + struct xdp_frame *xdp_frame; + int metasize; + int headroom; + + /* TODO: implement clone, copy, use "native" MEM_TYPE */ + if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) + return NULL; + + /* Assure headroom is available for storing info */ + headroom = xdp->data - xdp->data_hard_start; + metasize = xdp->data - xdp->data_meta; + metasize = metasize > 0 ? metasize : 0; + if (unlikely((headroom - metasize) < sizeof(*xdp_frame))) + return NULL; + + /* Store info in top of packet */ + xdp_frame = xdp->data_hard_start; + + xdp_frame->data = xdp->data; + xdp_frame->len = xdp->data_end - xdp->data; + xdp_frame->headroom = headroom - sizeof(*xdp_frame); + xdp_frame->metasize = metasize; + + /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */ + xdp_frame->mem = xdp->rxq->mem; + + return xdp_frame; +} + +void xdp_return_frame(struct xdp_frame *xdpf); +void xdp_return_frame_rx_napi(struct xdp_frame *xdpf); +void xdp_return_buff(struct xdp_buff *xdp); + +/* When sending xdp_frame into the network stack, then there is no + * return point callback, which is needed to release e.g. DMA-mapping + * resources with page_pool. Thus, have explicit function to release + * frame resources. + */ +void __xdp_release_frame(void *data, struct xdp_mem_info *mem); +static inline void xdp_release_frame(struct xdp_frame *xdpf) +{ + struct xdp_mem_info *mem = &xdpf->mem; + + /* Curr only page_pool needs this */ + if (mem->type == MEM_TYPE_PAGE_POOL) + __xdp_release_frame(xdpf->data, mem); +} + int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq, struct net_device *dev, u32 queue_index); void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq); void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq); +bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq); +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, + enum xdp_mem_type type, void *allocator); /* Drivers not supporting XDP metadata can use this helper, which * rejects any room expansion for metadata as a result. @@ -67,4 +166,17 @@ xdp_data_meta_unsupported(const struct xdp_buff *xdp) return unlikely(xdp->data_meta > xdp->data); } +struct xdp_attachment_info { + struct bpf_prog *prog; + u32 flags; +}; + +struct netdev_bpf; +int xdp_attachment_query(struct xdp_attachment_info *info, + struct netdev_bpf *bpf); +bool xdp_attachment_flags_ok(struct xdp_attachment_info *info, + struct netdev_bpf *bpf); +void xdp_attachment_setup(struct xdp_attachment_info *info, + struct netdev_bpf *bpf); + #endif /* __LINUX_NET_XDP_H__ */ diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h new file mode 100644 index 000000000000..5cbd10b3b931 --- /dev/null +++ b/include/net/xdp_sock.h @@ -0,0 +1,237 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* AF_XDP internal functions + * Copyright(c) 2018 Intel Corporation. + */ + +#ifndef _LINUX_XDP_SOCK_H +#define _LINUX_XDP_SOCK_H + +#include +#include +#include +#include +#include +#include + +struct net_device; +struct xsk_queue; + +struct xdp_umem_props { + u64 chunk_mask; + u64 size; +}; + +struct xdp_umem_page { + void *addr; + dma_addr_t dma; +}; + +struct xdp_umem_fq_reuse { + u32 nentries; + u32 length; + u64 handles[]; +}; + +struct xdp_umem { + struct xsk_queue *fq; + struct xsk_queue *cq; + struct xdp_umem_page *pages; + struct xdp_umem_props props; + u32 headroom; + u32 chunk_size_nohr; + struct user_struct *user; + unsigned long address; + refcount_t users; + struct work_struct work; + struct page **pgs; + u32 npgs; + struct net_device *dev; + struct xdp_umem_fq_reuse *fq_reuse; + u16 queue_id; + bool zc; + spinlock_t xsk_list_lock; + struct list_head xsk_list; +}; + +/* Nodes are linked in the struct xdp_sock map_list field, and used to + * track which maps a certain socket reside in. + */ +struct xsk_map; +struct xsk_map_node { + struct list_head node; + struct xsk_map *map; + struct xdp_sock **map_entry; +}; + +struct xdp_sock { + /* struct sock must be the first member of struct xdp_sock */ + struct sock sk; + struct xsk_queue *rx; + struct net_device *dev; + struct xdp_umem *umem; + struct list_head flush_node; + u16 queue_id; + bool zc; + enum { + XSK_READY = 0, + XSK_BOUND, + XSK_UNBOUND, + } state; + /* Protects multiple processes in the control path */ + struct mutex mutex; + struct xsk_queue *tx ____cacheline_aligned_in_smp; + struct list_head list; + /* Mutual exclusion of NAPI TX thread and sendmsg error paths + * in the SKB destructor callback. + */ + spinlock_t tx_completion_lock; + u64 rx_dropped; + struct list_head map_list; + /* Protects map_list */ + spinlock_t map_list_lock; +}; + +struct xdp_buff; +#ifdef CONFIG_XDP_SOCKETS +int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp); +int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp); +void xsk_flush(struct xdp_sock *xs); +bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs); +/* Used from netdev driver */ +u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr); +void xsk_umem_discard_addr(struct xdp_umem *umem); +void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries); +bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len); +void xsk_umem_consume_tx_done(struct xdp_umem *umem); +struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries); +struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem, + struct xdp_umem_fq_reuse *newq); +void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq); + +void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs, + struct xdp_sock **map_entry); +int xsk_map_inc(struct xsk_map *map); +void xsk_map_put(struct xsk_map *map); + +static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr) +{ + return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1)); +} + +static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr) +{ + return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1)); +} + +/* Reuse-queue aware version of FILL queue helpers */ +static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr) +{ + struct xdp_umem_fq_reuse *rq = umem->fq_reuse; + + if (!rq->length) + return xsk_umem_peek_addr(umem, addr); + + *addr = rq->handles[rq->length - 1]; + return addr; +} + +static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem) +{ + struct xdp_umem_fq_reuse *rq = umem->fq_reuse; + + if (!rq->length) + xsk_umem_discard_addr(umem); + else + rq->length--; +} + +static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr) +{ + struct xdp_umem_fq_reuse *rq = umem->fq_reuse; + + rq->handles[rq->length++] = addr; +} +#else +static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) +{ + return -ENOTSUPP; +} + +static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) +{ + return -ENOTSUPP; +} + +static inline void xsk_flush(struct xdp_sock *xs) +{ +} + +static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs) +{ + return false; +} + +static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr) +{ + return NULL; +} + +static inline void xsk_umem_discard_addr(struct xdp_umem *umem) +{ +} + +static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries) +{ +} + +static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, + u32 *len) +{ + return false; +} + +static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem) +{ +} + +static inline struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries) +{ + return NULL; +} + +static inline struct xdp_umem_fq_reuse *xsk_reuseq_swap( + struct xdp_umem *umem, + struct xdp_umem_fq_reuse *newq) +{ + return NULL; +} +static inline void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq) +{ +} + +static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr) +{ + return NULL; +} + +static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr) +{ + return 0; +} + +static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr) +{ + return NULL; +} + +static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem) +{ +} + +static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr) +{ +} + +#endif /* CONFIG_XDP_SOCKETS */ + +#endif /* _LINUX_XDP_SOCK_H */ diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 5096f8774ee2..f1638792aa5c 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -979,7 +979,7 @@ static inline bool xfrm_sec_ctx_match(struct xfrm_sec_ctx *s1, struct xfrm_sec_c /* A struct encoding bundle of transformations to apply to some set of flow. * - * dst->child points to the next element of bundle. + * xdst->child points to the next element of bundle. * dst->xfrm points to an instanse of transformer. * * Due to unfortunate limitations of current routing cache, which we @@ -995,6 +995,8 @@ struct xfrm_dst { struct rt6_info rt6; } u; struct dst_entry *route; + struct dst_entry *child; + struct dst_entry *path; struct xfrm_policy *pols[XFRM_POLICY_TYPE_MAX]; int num_pols, num_xfrms; u32 xfrm_genid; @@ -1005,7 +1007,35 @@ struct xfrm_dst { u32 path_cookie; }; +static inline struct dst_entry *xfrm_dst_path(const struct dst_entry *dst) +{ #ifdef CONFIG_XFRM + if (dst->xfrm) { + const struct xfrm_dst *xdst = (const struct xfrm_dst *) dst; + + return xdst->path; + } +#endif + return (struct dst_entry *) dst; +} + +static inline struct dst_entry *xfrm_dst_child(const struct dst_entry *dst) +{ +#ifdef CONFIG_XFRM + if (dst->xfrm) { + struct xfrm_dst *xdst = (struct xfrm_dst *) dst; + return xdst->child; + } +#endif + return NULL; +} + +#ifdef CONFIG_XFRM +static inline void xfrm_dst_set_child(struct xfrm_dst *xdst, struct dst_entry *child) +{ + xdst->child = child; +} + static inline void xfrm_dst_destroy(struct xfrm_dst *xdst) { xfrm_pols_put(xdst->pols, xdst->num_pols); @@ -1048,7 +1078,6 @@ struct xfrm_offload { #define XFRM_GSO_SEGMENT 16 #define XFRM_GRO 32 #define XFRM_ESP_NO_TRAILER 64 -#define XFRM_DEV_RESUME 128 __u32 status; #define CRYPTO_SUCCESS 1 @@ -1072,6 +1101,15 @@ struct sec_path { struct xfrm_offload ovec[XFRM_MAX_OFFLOAD_DEPTH]; }; +static inline int secpath_exists(struct sk_buff *skb) +{ +#ifdef CONFIG_XFRM + return skb->sp != NULL; +#else + return 0; +#endif +} + static inline struct sec_path * secpath_get(struct sec_path *sp) { @@ -1884,27 +1922,21 @@ static inline struct xfrm_state *xfrm_input_state(struct sk_buff *skb) { return skb->sp->xvec[skb->sp->len - 1]; } -#endif static inline struct xfrm_offload *xfrm_offload(struct sk_buff *skb) { -#ifdef CONFIG_XFRM struct sec_path *sp = skb->sp; if (!sp || !sp->olen || sp->len != sp->olen) return NULL; return &sp->ovec[sp->olen - 1]; -#else - return NULL; -#endif } +#endif void __net_init xfrm_dev_init(void); #ifdef CONFIG_XFRM_OFFLOAD -void xfrm_dev_resume(struct sk_buff *skb); -void xfrm_dev_backlog(struct softnet_data *sd); -struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features, bool *again); +int validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features); int xfrm_dev_state_add(struct net *net, struct xfrm_state *x, struct xfrm_user_offload *xuo); bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x); @@ -1912,12 +1944,14 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x); static inline bool xfrm_dst_offload_ok(struct dst_entry *dst) { struct xfrm_state *x = dst->xfrm; + struct xfrm_dst *xdst; if (!x || !x->type_offload) return false; - if (x->xso.offload_handle && (x->xso.dev == dst->path->dev) && - !dst->child->xfrm) + xdst = (struct xfrm_dst *) dst; + if (x->xso.offload_handle && (x->xso.dev == xfrm_dst_path(dst)->dev) && + !xdst->child->xfrm) return true; return false; @@ -1943,17 +1977,9 @@ static inline void xfrm_dev_state_free(struct xfrm_state *x) } } #else -static inline void xfrm_dev_resume(struct sk_buff *skb) +static inline int validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features) { -} - -static inline void xfrm_dev_backlog(struct softnet_data *sd) -{ -} - -static inline struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features, bool *again) -{ - return skb; + return 0; } static inline int xfrm_dev_state_add(struct net *net, struct xfrm_state *x, struct xfrm_user_offload *xuo) diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h index 0a6e052e8da1..d6e556c0a085 100644 --- a/include/trace/bpf_probe.h +++ b/include/trace/bpf_probe.h @@ -84,6 +84,7 @@ __bpf_trace_tp_map_##call = { \ }; #define FIRST(x, ...) x + #undef DEFINE_EVENT_WRITABLE #define DEFINE_EVENT_WRITABLE(template, call, proto, args, size) \ static inline void bpf_test_buffer_##call(void) \ @@ -110,4 +111,5 @@ __DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), size) #undef DEFINE_EVENT_WRITABLE #undef __DEFINE_EVENT #undef FIRST + #endif /* CONFIG_BPF_EVENTS */ diff --git a/include/trace/events/bpf.h b/include/trace/events/bpf.h deleted file mode 100644 index cc749d7099fb..000000000000 --- a/include/trace/events/bpf.h +++ /dev/null @@ -1,352 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#undef TRACE_SYSTEM -#define TRACE_SYSTEM bpf - -#if !defined(_TRACE_BPF_H) || defined(TRACE_HEADER_MULTI_READ) -#define _TRACE_BPF_H - -#include -#include -#include -#include - -#define __PROG_TYPE_MAP(FN) \ - FN(SOCKET_FILTER) \ - FN(KPROBE) \ - FN(SCHED_CLS) \ - FN(SCHED_ACT) \ - FN(TRACEPOINT) \ - FN(XDP) \ - FN(PERF_EVENT) \ - FN(CGROUP_SKB) \ - FN(CGROUP_SOCK) \ - FN(LWT_IN) \ - FN(LWT_OUT) \ - FN(LWT_XMIT) - -#define __MAP_TYPE_MAP(FN) \ - FN(HASH) \ - FN(ARRAY) \ - FN(PROG_ARRAY) \ - FN(PERF_EVENT_ARRAY) \ - FN(PERCPU_HASH) \ - FN(PERCPU_ARRAY) \ - FN(STACK_TRACE) \ - FN(CGROUP_ARRAY) \ - FN(LRU_HASH) \ - FN(LRU_PERCPU_HASH) \ - FN(LPM_TRIE) - -#define __PROG_TYPE_TP_FN(x) \ - TRACE_DEFINE_ENUM(BPF_PROG_TYPE_##x); -#define __PROG_TYPE_SYM_FN(x) \ - { BPF_PROG_TYPE_##x, #x }, -#define __PROG_TYPE_SYM_TAB \ - __PROG_TYPE_MAP(__PROG_TYPE_SYM_FN) { -1, 0 } -__PROG_TYPE_MAP(__PROG_TYPE_TP_FN) - -#define __MAP_TYPE_TP_FN(x) \ - TRACE_DEFINE_ENUM(BPF_MAP_TYPE_##x); -#define __MAP_TYPE_SYM_FN(x) \ - { BPF_MAP_TYPE_##x, #x }, -#define __MAP_TYPE_SYM_TAB \ - __MAP_TYPE_MAP(__MAP_TYPE_SYM_FN) { -1, 0 } -__MAP_TYPE_MAP(__MAP_TYPE_TP_FN) - -DECLARE_EVENT_CLASS(bpf_prog_event, - - TP_PROTO(const struct bpf_prog *prg), - - TP_ARGS(prg), - - TP_STRUCT__entry( - __array(u8, prog_tag, 8) - __field(u32, type) - ), - - TP_fast_assign( - BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(prg->tag)); - memcpy(__entry->prog_tag, prg->tag, sizeof(prg->tag)); - __entry->type = prg->type; - ), - - TP_printk("prog=%s type=%s", - __print_hex_str(__entry->prog_tag, 8), - __print_symbolic(__entry->type, __PROG_TYPE_SYM_TAB)) -); - -DEFINE_EVENT(bpf_prog_event, bpf_prog_get_type, - - TP_PROTO(const struct bpf_prog *prg), - - TP_ARGS(prg) -); - -DEFINE_EVENT(bpf_prog_event, bpf_prog_put_rcu, - - TP_PROTO(const struct bpf_prog *prg), - - TP_ARGS(prg) -); - -TRACE_EVENT(bpf_prog_load, - - TP_PROTO(const struct bpf_prog *prg, int ufd), - - TP_ARGS(prg, ufd), - - TP_STRUCT__entry( - __array(u8, prog_tag, 8) - __field(u32, type) - __field(int, ufd) - ), - - TP_fast_assign( - BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(prg->tag)); - memcpy(__entry->prog_tag, prg->tag, sizeof(prg->tag)); - __entry->type = prg->type; - __entry->ufd = ufd; - ), - - TP_printk("prog=%s type=%s ufd=%d", - __print_hex_str(__entry->prog_tag, 8), - __print_symbolic(__entry->type, __PROG_TYPE_SYM_TAB), - __entry->ufd) -); - -TRACE_EVENT(bpf_map_create, - - TP_PROTO(const struct bpf_map *map, int ufd), - - TP_ARGS(map, ufd), - - TP_STRUCT__entry( - __field(u32, type) - __field(u32, size_key) - __field(u32, size_value) - __field(u32, max_entries) - __field(u32, flags) - __field(int, ufd) - ), - - TP_fast_assign( - __entry->type = map->map_type; - __entry->size_key = map->key_size; - __entry->size_value = map->value_size; - __entry->max_entries = map->max_entries; - __entry->flags = map->map_flags; - __entry->ufd = ufd; - ), - - TP_printk("map type=%s ufd=%d key=%u val=%u max=%u flags=%x", - __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB), - __entry->ufd, __entry->size_key, __entry->size_value, - __entry->max_entries, __entry->flags) -); - -DECLARE_EVENT_CLASS(bpf_obj_prog, - - TP_PROTO(const struct bpf_prog *prg, int ufd, - const struct filename *pname), - - TP_ARGS(prg, ufd, pname), - - TP_STRUCT__entry( - __array(u8, prog_tag, 8) - __field(int, ufd) - __string(path, pname->name) - ), - - TP_fast_assign( - BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(prg->tag)); - memcpy(__entry->prog_tag, prg->tag, sizeof(prg->tag)); - __assign_str(path, pname->name); - __entry->ufd = ufd; - ), - - TP_printk("prog=%s path=%s ufd=%d", - __print_hex_str(__entry->prog_tag, 8), - __get_str(path), __entry->ufd) -); - -DEFINE_EVENT(bpf_obj_prog, bpf_obj_pin_prog, - - TP_PROTO(const struct bpf_prog *prg, int ufd, - const struct filename *pname), - - TP_ARGS(prg, ufd, pname) -); - -DEFINE_EVENT(bpf_obj_prog, bpf_obj_get_prog, - - TP_PROTO(const struct bpf_prog *prg, int ufd, - const struct filename *pname), - - TP_ARGS(prg, ufd, pname) -); - -DECLARE_EVENT_CLASS(bpf_obj_map, - - TP_PROTO(const struct bpf_map *map, int ufd, - const struct filename *pname), - - TP_ARGS(map, ufd, pname), - - TP_STRUCT__entry( - __field(u32, type) - __field(int, ufd) - __string(path, pname->name) - ), - - TP_fast_assign( - __assign_str(path, pname->name); - __entry->type = map->map_type; - __entry->ufd = ufd; - ), - - TP_printk("map type=%s ufd=%d path=%s", - __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB), - __entry->ufd, __get_str(path)) -); - -DEFINE_EVENT(bpf_obj_map, bpf_obj_pin_map, - - TP_PROTO(const struct bpf_map *map, int ufd, - const struct filename *pname), - - TP_ARGS(map, ufd, pname) -); - -DEFINE_EVENT(bpf_obj_map, bpf_obj_get_map, - - TP_PROTO(const struct bpf_map *map, int ufd, - const struct filename *pname), - - TP_ARGS(map, ufd, pname) -); - -DECLARE_EVENT_CLASS(bpf_map_keyval, - - TP_PROTO(const struct bpf_map *map, int ufd, - const void *key, const void *val), - - TP_ARGS(map, ufd, key, val), - - TP_STRUCT__entry( - __field(u32, type) - __field(u32, key_len) - __dynamic_array(u8, key, map->key_size) - __field(bool, key_trunc) - __field(u32, val_len) - __dynamic_array(u8, val, map->value_size) - __field(bool, val_trunc) - __field(int, ufd) - ), - - TP_fast_assign( - memcpy(__get_dynamic_array(key), key, map->key_size); - memcpy(__get_dynamic_array(val), val, map->value_size); - __entry->type = map->map_type; - __entry->key_len = min(map->key_size, 16U); - __entry->key_trunc = map->key_size != __entry->key_len; - __entry->val_len = min(map->value_size, 16U); - __entry->val_trunc = map->value_size != __entry->val_len; - __entry->ufd = ufd; - ), - - TP_printk("map type=%s ufd=%d key=[%s%s] val=[%s%s]", - __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB), - __entry->ufd, - __print_hex(__get_dynamic_array(key), __entry->key_len), - __entry->key_trunc ? " ..." : "", - __print_hex(__get_dynamic_array(val), __entry->val_len), - __entry->val_trunc ? " ..." : "") -); - -DEFINE_EVENT(bpf_map_keyval, bpf_map_lookup_elem, - - TP_PROTO(const struct bpf_map *map, int ufd, - const void *key, const void *val), - - TP_ARGS(map, ufd, key, val) -); - -DEFINE_EVENT(bpf_map_keyval, bpf_map_update_elem, - - TP_PROTO(const struct bpf_map *map, int ufd, - const void *key, const void *val), - - TP_ARGS(map, ufd, key, val) -); - -TRACE_EVENT(bpf_map_delete_elem, - - TP_PROTO(const struct bpf_map *map, int ufd, - const void *key), - - TP_ARGS(map, ufd, key), - - TP_STRUCT__entry( - __field(u32, type) - __field(u32, key_len) - __dynamic_array(u8, key, map->key_size) - __field(bool, key_trunc) - __field(int, ufd) - ), - - TP_fast_assign( - memcpy(__get_dynamic_array(key), key, map->key_size); - __entry->type = map->map_type; - __entry->key_len = min(map->key_size, 16U); - __entry->key_trunc = map->key_size != __entry->key_len; - __entry->ufd = ufd; - ), - - TP_printk("map type=%s ufd=%d key=[%s%s]", - __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB), - __entry->ufd, - __print_hex(__get_dynamic_array(key), __entry->key_len), - __entry->key_trunc ? " ..." : "") -); - -TRACE_EVENT(bpf_map_next_key, - - TP_PROTO(const struct bpf_map *map, int ufd, - const void *key, const void *key_next), - - TP_ARGS(map, ufd, key, key_next), - - TP_STRUCT__entry( - __field(u32, type) - __field(u32, key_len) - __dynamic_array(u8, key, map->key_size) - __dynamic_array(u8, nxt, map->key_size) - __field(bool, key_trunc) - __field(bool, key_null) - __field(int, ufd) - ), - - TP_fast_assign( - if (key) - memcpy(__get_dynamic_array(key), key, map->key_size); - __entry->key_null = !key; - memcpy(__get_dynamic_array(nxt), key_next, map->key_size); - __entry->type = map->map_type; - __entry->key_len = min(map->key_size, 16U); - __entry->key_trunc = map->key_size != __entry->key_len; - __entry->ufd = ufd; - ), - - TP_printk("map type=%s ufd=%d key=[%s%s] next=[%s%s]", - __print_symbolic(__entry->type, __MAP_TYPE_SYM_TAB), - __entry->ufd, - __entry->key_null ? "NULL" : __print_hex(__get_dynamic_array(key), - __entry->key_len), - __entry->key_trunc && !__entry->key_null ? " ..." : "", - __print_hex(__get_dynamic_array(nxt), __entry->key_len), - __entry->key_trunc ? " ..." : "") -); - -#endif /* _TRACE_BPF_H */ - -#include diff --git a/include/trace/events/bpf_test_run.h b/include/trace/events/bpf_test_run.h new file mode 100644 index 000000000000..265447e3f71a --- /dev/null +++ b/include/trace/events/bpf_test_run.h @@ -0,0 +1,50 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM bpf_test_run + +#if !defined(_TRACE_BPF_TEST_RUN_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_BPF_TEST_RUN_H + +#include + +DECLARE_EVENT_CLASS(bpf_test_finish, + + TP_PROTO(int *err), + + TP_ARGS(err), + + TP_STRUCT__entry( + __field(int, err) + ), + + TP_fast_assign( + __entry->err = *err; + ), + + TP_printk("bpf_test_finish with err=%d", __entry->err) +); + +#ifdef DEFINE_EVENT_WRITABLE +#undef BPF_TEST_RUN_DEFINE_EVENT +#define BPF_TEST_RUN_DEFINE_EVENT(template, call, proto, args, size) \ + DEFINE_EVENT_WRITABLE(template, call, PARAMS(proto), \ + PARAMS(args), size) +#else +#undef BPF_TEST_RUN_DEFINE_EVENT +#define BPF_TEST_RUN_DEFINE_EVENT(template, call, proto, args, size) \ + DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args)) +#endif + +BPF_TEST_RUN_DEFINE_EVENT(bpf_test_finish, bpf_test_finish, + + TP_PROTO(int *err), + + TP_ARGS(err), + + sizeof(int) +); + +#endif + +/* This part must be outside protection */ +#include diff --git a/include/trace/events/fib6.h b/include/trace/events/fib6.h index d46e24702765..7e8d48a81b91 100644 --- a/include/trace/events/fib6.h +++ b/include/trace/events/fib6.h @@ -13,9 +13,9 @@ TRACE_EVENT(fib6_table_lookup, TP_PROTO(const struct net *net, const struct rt6_info *rt, - u32 tb_id, const struct flowi6 *flp), + struct fib6_table *table, const struct flowi6 *flp), - TP_ARGS(net, rt, tb_id, flp), + TP_ARGS(net, rt, table, flp), TP_STRUCT__entry( __field( u32, tb_id ) @@ -35,7 +35,7 @@ TRACE_EVENT(fib6_table_lookup, TP_fast_assign( struct in6_addr *in6; - __entry->tb_id = tb_id; + __entry->tb_id = table->tb6_id; __entry->oif = flp->flowi6_oif; __entry->iif = flp->flowi6_iif; __entry->tos = ip6_tclass(flp->flowlabel); diff --git a/include/trace/events/net.h b/include/trace/events/net.h index a9869b69e96d..cfac7d3a471d 100644 --- a/include/trace/events/net.h +++ b/include/trace/events/net.h @@ -59,7 +59,7 @@ TRACE_EVENT(net_dev_start_xmit, __entry->gso_size = skb_shinfo(skb)->gso_size; __entry->gso_segs = skb_shinfo(skb)->gso_segs; __entry->gso_type = skb_shinfo(skb)->gso_type; - __entry->utctime = ktime_get_tai_ns(); + __entry->utctime = ktime_get_clocktai_ns(); ), @@ -93,7 +93,7 @@ TRACE_EVENT(net_receive_skb_exit, TP_fast_assign( __entry->skbaddr = skb; - __entry->utctime = ktime_get_tai_ns(); + __entry->utctime = ktime_get_clocktai_ns(); ), @@ -122,7 +122,7 @@ TRACE_EVENT(net_dev_xmit, __entry->len = skb_len; __entry->rc = rc; __assign_str(name, dev->name); - __entry->utctime = ktime_get_tai_ns(); + __entry->utctime = ktime_get_clocktai_ns(); ), TP_printk("dev=%s skbaddr=%pK len=%u rc=%d UTC: %ld", @@ -147,7 +147,7 @@ DECLARE_EVENT_CLASS(net_dev_template, __entry->skbaddr = skb; __entry->len = skb->len; __assign_str(name, skb->dev->name); - __entry->utctime = ktime_get_tai_ns(); + __entry->utctime = ktime_get_clocktai_ns(); ), TP_printk("dev=%s skbaddr=%pK len=%u UTC: %ld", @@ -223,7 +223,7 @@ DECLARE_EVENT_CLASS(net_dev_rx_verbose_template, __entry->nr_frags = skb_shinfo(skb)->nr_frags; __entry->gso_size = skb_shinfo(skb)->gso_size; __entry->gso_type = skb_shinfo(skb)->gso_type; - __entry->utctime = ktime_get_tai_ns(); + __entry->utctime = ktime_get_clocktai_ns(); ), diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h index f7c73ae62b7a..5233814e506f 100644 --- a/include/trace/events/xdp.h +++ b/include/trace/events/xdp.h @@ -50,6 +50,35 @@ TRACE_EVENT(xdp_exception, __entry->ifindex) ); +TRACE_EVENT(xdp_bulk_tx, + + TP_PROTO(const struct net_device *dev, + int sent, int drops, int err), + + TP_ARGS(dev, sent, drops, err), + + TP_STRUCT__entry( + __field(int, ifindex) + __field(u32, act) + __field(int, drops) + __field(int, sent) + __field(int, err) + ), + + TP_fast_assign( + __entry->ifindex = dev->ifindex; + __entry->act = XDP_TX; + __entry->drops = drops; + __entry->sent = sent; + __entry->err = err; + ), + + TP_printk("ifindex=%d action=%s sent=%d drops=%d err=%d", + __entry->ifindex, + __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB), + __entry->sent, __entry->drops, __entry->err) +); + DECLARE_EVENT_CLASS(xdp_redirect_template, TP_PROTO(const struct net_device *dev, @@ -138,14 +167,137 @@ DEFINE_EVENT_PRINT(xdp_redirect_template, xdp_redirect_map_err, __entry->map_id, __entry->map_index) ); +#ifndef __DEVMAP_OBJ_TYPE +#define __DEVMAP_OBJ_TYPE +struct _bpf_dtab_netdev { + struct net_device *dev; +}; +#endif /* __DEVMAP_OBJ_TYPE */ + +#define devmap_ifindex(fwd, map) \ + ((map->map_type == BPF_MAP_TYPE_DEVMAP || \ + map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) ? \ + ((struct _bpf_dtab_netdev *)fwd)->dev->ifindex : 0) + #define _trace_xdp_redirect_map(dev, xdp, fwd, map, idx) \ - trace_xdp_redirect_map(dev, xdp, fwd ? fwd->ifindex : 0, \ + trace_xdp_redirect_map(dev, xdp, devmap_ifindex(fwd, map), \ 0, map, idx) #define _trace_xdp_redirect_map_err(dev, xdp, fwd, map, idx, err) \ - trace_xdp_redirect_map_err(dev, xdp, fwd ? fwd->ifindex : 0, \ + trace_xdp_redirect_map_err(dev, xdp, devmap_ifindex(fwd, map), \ err, map, idx) +TRACE_EVENT(xdp_cpumap_kthread, + + TP_PROTO(int map_id, unsigned int processed, unsigned int drops, + int sched), + + TP_ARGS(map_id, processed, drops, sched), + + TP_STRUCT__entry( + __field(int, map_id) + __field(u32, act) + __field(int, cpu) + __field(unsigned int, drops) + __field(unsigned int, processed) + __field(int, sched) + ), + + TP_fast_assign( + __entry->map_id = map_id; + __entry->act = XDP_REDIRECT; + __entry->cpu = smp_processor_id(); + __entry->drops = drops; + __entry->processed = processed; + __entry->sched = sched; + ), + + TP_printk("kthread" + " cpu=%d map_id=%d action=%s" + " processed=%u drops=%u" + " sched=%d", + __entry->cpu, __entry->map_id, + __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB), + __entry->processed, __entry->drops, + __entry->sched) +); + +TRACE_EVENT(xdp_cpumap_enqueue, + + TP_PROTO(int map_id, unsigned int processed, unsigned int drops, + int to_cpu), + + TP_ARGS(map_id, processed, drops, to_cpu), + + TP_STRUCT__entry( + __field(int, map_id) + __field(u32, act) + __field(int, cpu) + __field(unsigned int, drops) + __field(unsigned int, processed) + __field(int, to_cpu) + ), + + TP_fast_assign( + __entry->map_id = map_id; + __entry->act = XDP_REDIRECT; + __entry->cpu = smp_processor_id(); + __entry->drops = drops; + __entry->processed = processed; + __entry->to_cpu = to_cpu; + ), + + TP_printk("enqueue" + " cpu=%d map_id=%d action=%s" + " processed=%u drops=%u" + " to_cpu=%d", + __entry->cpu, __entry->map_id, + __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB), + __entry->processed, __entry->drops, + __entry->to_cpu) +); + +TRACE_EVENT(xdp_devmap_xmit, + + TP_PROTO(const struct bpf_map *map, u32 map_index, + int sent, int drops, + const struct net_device *from_dev, + const struct net_device *to_dev, int err), + + TP_ARGS(map, map_index, sent, drops, from_dev, to_dev, err), + + TP_STRUCT__entry( + __field(int, map_id) + __field(u32, act) + __field(u32, map_index) + __field(int, drops) + __field(int, sent) + __field(int, from_ifindex) + __field(int, to_ifindex) + __field(int, err) + ), + + TP_fast_assign( + __entry->map_id = map->id; + __entry->act = XDP_REDIRECT; + __entry->map_index = map_index; + __entry->drops = drops; + __entry->sent = sent; + __entry->from_ifindex = from_dev->ifindex; + __entry->to_ifindex = to_dev->ifindex; + __entry->err = err; + ), + + TP_printk("ndo_xdp_xmit" + " map_id=%d map_index=%d action=%s" + " sent=%d drops=%d" + " from_ifindex=%d to_ifindex=%d err=%d", + __entry->map_id, __entry->map_index, + __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB), + __entry->sent, __entry->drops, + __entry->from_ifindex, __entry->to_ifindex, __entry->err) +); + #endif /* _TRACE_XDP_H */ #include diff --git a/include/uapi/asm-generic/posix_types.h b/include/uapi/asm-generic/posix_types.h index 5e6ea22bd525..f0733a26ebfc 100644 --- a/include/uapi/asm-generic/posix_types.h +++ b/include/uapi/asm-generic/posix_types.h @@ -87,6 +87,7 @@ typedef struct { typedef __kernel_long_t __kernel_off_t; typedef long long __kernel_loff_t; typedef __kernel_long_t __kernel_time_t; +typedef long long __kernel_time64_t; typedef __kernel_long_t __kernel_clock_t; typedef int __kernel_timer_t; typedef int __kernel_clockid_t; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 95c8350cf4fe..d4fd6970a7ca 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -14,10 +14,11 @@ /* Extended instruction set based on top of classic BPF */ /* instruction classes */ +#define BPF_JMP32 0x06 /* jmp mode in word width */ #define BPF_ALU64 0x07 /* alu mode in double word width */ /* ld/ldx fields */ -#define BPF_DW 0x18 /* double word */ +#define BPF_DW 0x18 /* double word (64-bit) */ #define BPF_XADD 0xc0 /* exclusive add */ /* alu/jmp fields */ @@ -100,8 +101,12 @@ enum bpf_cmd { BPF_OBJ_GET_INFO_BY_FD, BPF_PROG_QUERY, BPF_RAW_TRACEPOINT_OPEN, - BPF_BTF_LOAD = 18, - BPF_BTF_GET_FD_BY_ID = 19, + BPF_BTF_LOAD, + BPF_BTF_GET_FD_BY_ID, + BPF_TASK_FD_QUERY, + BPF_MAP_LOOKUP_AND_DELETE_ELEM, + BPF_MAP_FREEZE, + BPF_BTF_GET_NEXT_ID, }; enum bpf_map_type { @@ -122,13 +127,25 @@ enum bpf_map_type { BPF_MAP_TYPE_DEVMAP, BPF_MAP_TYPE_SOCKMAP, BPF_MAP_TYPE_CPUMAP, - BPF_MAP_TYPE_SOCKHASH = 18, - BPF_MAP_TYPE_CGROUP_STORAGE = 19, - BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE = 21, - BPF_MAP_TYPE_SK_STORAGE = 24, - BPF_MAP_TYPE_DEVMAP_HASH = 25, + BPF_MAP_TYPE_XSKMAP, + BPF_MAP_TYPE_SOCKHASH, + BPF_MAP_TYPE_CGROUP_STORAGE, + BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, + BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, + BPF_MAP_TYPE_QUEUE, + BPF_MAP_TYPE_STACK, + BPF_MAP_TYPE_SK_STORAGE, + BPF_MAP_TYPE_DEVMAP_HASH, }; +/* Note that tracing related programs such as + * BPF_PROG_TYPE_{KPROBE,TRACEPOINT,PERF_EVENT,RAW_TRACEPOINT} + * are not subject to a stable API since kernel internal data + * structures can change from release to release and may + * therefore break existing tracing BPF programs. Tracing BPF + * programs correspond to /a/ specific kernel which is to be + * analyzed, and not /a/ specific kernel /and/ all future ones. + */ enum bpf_prog_type { BPF_PROG_TYPE_UNSPEC, BPF_PROG_TYPE_SOCKET_FILTER, @@ -146,13 +163,16 @@ enum bpf_prog_type { BPF_PROG_TYPE_SOCK_OPS, BPF_PROG_TYPE_SK_SKB, BPF_PROG_TYPE_CGROUP_DEVICE, - BPF_PROG_TYPE_SK_MSG = 16, - BPF_PROG_TYPE_RAW_TRACEPOINT = 17, - BPF_PROG_TYPE_CGROUP_SOCK_ADDR = 18, - BPF_PROG_TYPE_FLOW_DISSECTOR = 22, - BPF_PROG_TYPE_CGROUP_SYSCTL = 23, - BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE = 24, - BPF_PROG_TYPE_CGROUP_SOCKOPT = 25, + BPF_PROG_TYPE_SK_MSG, + BPF_PROG_TYPE_RAW_TRACEPOINT, + BPF_PROG_TYPE_CGROUP_SOCK_ADDR, + BPF_PROG_TYPE_LWT_SEG6LOCAL, + BPF_PROG_TYPE_LIRC_MODE2, + BPF_PROG_TYPE_SK_REUSEPORT, + BPF_PROG_TYPE_FLOW_DISSECTOR, + BPF_PROG_TYPE_CGROUP_SYSCTL, + BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, + BPF_PROG_TYPE_CGROUP_SOCKOPT, }; enum bpf_attach_type { @@ -163,21 +183,22 @@ enum bpf_attach_type { BPF_SK_SKB_STREAM_PARSER, BPF_SK_SKB_STREAM_VERDICT, BPF_CGROUP_DEVICE, - BPF_SK_MSG_VERDICT = 7, - BPF_CGROUP_INET4_BIND = 8, - BPF_CGROUP_INET6_BIND = 9, + BPF_SK_MSG_VERDICT, + BPF_CGROUP_INET4_BIND, + BPF_CGROUP_INET6_BIND, BPF_CGROUP_INET4_CONNECT, BPF_CGROUP_INET6_CONNECT, BPF_CGROUP_INET4_POST_BIND, BPF_CGROUP_INET6_POST_BIND, BPF_CGROUP_UDP4_SENDMSG, BPF_CGROUP_UDP6_SENDMSG, - BPF_FLOW_DISSECTOR = 17, - BPF_CGROUP_SYSCTL = 18, - BPF_CGROUP_UDP4_RECVMSG = 19, - BPF_CGROUP_UDP6_RECVMSG = 20, - BPF_CGROUP_GETSOCKOPT = 21, - BPF_CGROUP_SETSOCKOPT = 22, + BPF_LIRC_MODE2, + BPF_FLOW_DISSECTOR, + BPF_CGROUP_SYSCTL, + BPF_CGROUP_UDP4_RECVMSG, + BPF_CGROUP_UDP6_RECVMSG, + BPF_CGROUP_GETSOCKOPT, + BPF_CGROUP_SETSOCKOPT, __MAX_BPF_ATTACH_TYPE }; @@ -232,6 +253,41 @@ enum bpf_attach_type { */ #define BPF_F_STRICT_ALIGNMENT (1U << 0) +/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the + * verifier will allow any alignment whatsoever. On platforms + * with strict alignment requirements for loads ands stores (such + * as sparc and mips) the verifier validates that all loads and + * stores provably follow this requirement. This flag turns that + * checking and enforcement off. + * + * It is mostly used for testing when we want to validate the + * context and memory access aspects of the verifier, but because + * of an unaligned access the alignment check would trigger before + * the one we are interested in. + */ +#define BPF_F_ANY_ALIGNMENT (1U << 1) + +/* BPF_F_TEST_RND_HI32 is used in BPF_PROG_LOAD command for testing purpose. + * Verifier does sub-register def/use analysis and identifies instructions whose + * def only matters for low 32-bit, high 32-bit is never referenced later + * through implicit zero extension. Therefore verifier notifies JIT back-ends + * that it is safe to ignore clearing high 32-bit for these instructions. This + * saves some back-ends a lot of code-gen. However such optimization is not + * necessary on some arches, for example x86_64, arm64 etc, whose JIT back-ends + * hence hasn't used verifier's analysis result. But, we really want to have a + * way to be able to verify the correctness of the described optimization on + * x86_64 on which testsuites are frequently exercised. + * + * So, this flag is introduced. Once it is set, verifier will randomize high + * 32-bit for those instructions who has been identified as safe to ignore them. + * Then, if verifier is not doing correct analysis, such randomization will + * regress tests to expose bugs. + */ +#define BPF_F_TEST_RND_HI32 (1U << 2) + +/* The verifier internal test flag. Behavior is undefined */ +#define BPF_F_TEST_STATE_FREQ (1U << 3) + /* When BPF ldimm64's insn[0].src_reg != 0 then this can have * two extensions: * @@ -269,17 +325,46 @@ enum bpf_attach_type { /* Specify numa node during map creation */ #define BPF_F_NUMA_NODE (1U << 2) -/* flags for BPF_PROG_QUERY */ -#define BPF_F_QUERY_EFFECTIVE (1U << 0) - #define BPF_OBJ_NAME_LEN 16U -/* Flags for accessing BPF object */ +/* Flags for accessing BPF object from syscall side. */ #define BPF_F_RDONLY (1U << 3) #define BPF_F_WRONLY (1U << 4) +/* Flag for stack_map, store build_id+offset instead of pointer */ +#define BPF_F_STACK_BUILD_ID (1U << 5) + +/* Zero-initialize hash function seed. This should only be used for testing. */ +#define BPF_F_ZERO_SEED (1U << 6) + /* Flags for accessing BPF object from program side. */ #define BPF_F_RDONLY_PROG (1U << 7) +#define BPF_F_WRONLY_PROG (1U << 8) + +/* Clone map from listener for newly accepted socket */ +#define BPF_F_CLONE (1U << 9) + +/* flags for BPF_PROG_QUERY */ +#define BPF_F_QUERY_EFFECTIVE (1U << 0) + +enum bpf_stack_build_id_status { + /* user space need an empty entry to identify end of a trace */ + BPF_STACK_BUILD_ID_EMPTY = 0, + /* with valid build_id and offset */ + BPF_STACK_BUILD_ID_VALID = 1, + /* couldn't get build_id, fallback to ip */ + BPF_STACK_BUILD_ID_IP = 2, +}; + +#define BPF_BUILD_ID_SIZE 20 +struct bpf_stack_build_id { + __s32 status; + unsigned char build_id[BPF_BUILD_ID_SIZE]; + union { + __u64 offset; + __u64 ip; + }; +}; union bpf_attr { struct { /* anonymous struct used by BPF_MAP_CREATE command */ @@ -319,7 +404,7 @@ union bpf_attr { __u32 log_level; /* verbosity level of verifier */ __u32 log_size; /* size of user buffer */ __aligned_u64 log_buf; /* user supplied buffer */ - __u32 kern_version; /* checked when prog_type=kprobe */ + __u32 kern_version; /* not used */ __u32 prog_flags; char prog_name[BPF_OBJ_NAME_LEN]; __u32 prog_ifindex; /* ifindex of netdev to prep for */ @@ -353,12 +438,22 @@ union bpf_attr { struct { /* anonymous struct used by BPF_PROG_TEST_RUN command */ __u32 prog_fd; __u32 retval; - __u32 data_size_in; - __u32 data_size_out; + __u32 data_size_in; /* input: len of data_in */ + __u32 data_size_out; /* input/output: len of data_out + * returns ENOSPC if data_out + * is too small. + */ __aligned_u64 data_in; __aligned_u64 data_out; __u32 repeat; __u32 duration; + __u32 ctx_size_in; /* input: len of ctx_in */ + __u32 ctx_size_out; /* input/output: len of ctx_out + * returns ENOSPC if ctx_out + * is too small. + */ + __aligned_u64 ctx_in; + __aligned_u64 ctx_out; } test; struct { /* anonymous struct used by BPF_*_GET_*_ID */ @@ -399,341 +494,1517 @@ union bpf_attr { __u32 btf_log_size; __u32 btf_log_level; }; + + struct { + __u32 pid; /* input: pid */ + __u32 fd; /* input: fd */ + __u32 flags; /* input: flags */ + __u32 buf_len; /* input/output: buf len */ + __aligned_u64 buf; /* input/output: + * tp_name for tracepoint + * symbol for kprobe + * filename for uprobe + */ + __u32 prog_id; /* output: prod_id */ + __u32 fd_type; /* output: BPF_FD_TYPE_* */ + __u64 probe_offset; /* output: probe_offset */ + __u64 probe_addr; /* output: probe_addr */ + } task_fd_query; } __attribute__((aligned(8))); -/* BPF helper function descriptions: +/* The description below is an attempt at providing documentation to eBPF + * developers about the multiple available eBPF helper functions. It can be + * parsed and used to produce a manual page. The workflow is the following, + * and requires the rst2man utility: * - * void *bpf_map_lookup_elem(&map, &key) - * Return: Map value or NULL + * $ ./scripts/bpf_helpers_doc.py \ + * --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst + * $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7 + * $ man /tmp/bpf-helpers.7 * - * int bpf_map_update_elem(&map, &key, &value, flags) - * Return: 0 on success or negative error + * Note that in order to produce this external documentation, some RST + * formatting is used in the descriptions to get "bold" and "italics" in + * manual pages. Also note that the few trailing white spaces are + * intentional, removing them would break paragraphs for rst2man. * - * int bpf_map_delete_elem(&map, &key) - * Return: 0 on success or negative error + * Start of BPF helper function descriptions: * - * int bpf_probe_read(void *dst, int size, void *src) - * Return: 0 on success or negative error + * void *bpf_map_lookup_elem(struct bpf_map *map, const void *key) + * Description + * Perform a lookup in *map* for an entry associated to *key*. + * Return + * Map value associated to *key*, or **NULL** if no entry was + * found. + * + * int bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags) + * Description + * Add or update the value of the entry associated to *key* in + * *map* with *value*. *flags* is one of: + * + * **BPF_NOEXIST** + * The entry for *key* must not exist in the map. + * **BPF_EXIST** + * The entry for *key* must already exist in the map. + * **BPF_ANY** + * No condition on the existence of the entry for *key*. + * + * Flag value **BPF_NOEXIST** cannot be used for maps of types + * **BPF_MAP_TYPE_ARRAY** or **BPF_MAP_TYPE_PERCPU_ARRAY** (all + * elements always exist), the helper would return an error. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_map_delete_elem(struct bpf_map *map, const void *key) + * Description + * Delete entry with *key* from *map*. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags) + * Description + * Push an element *value* in *map*. *flags* is one of: + * + * **BPF_EXIST** + * If the queue/stack is full, the oldest element is removed to + * make room for this. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_map_pop_elem(struct bpf_map *map, void *value) + * Description + * Pop an element from *map*. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_map_peek_elem(struct bpf_map *map, void *value) + * Description + * Get an element from *map* without removing it. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_probe_read(void *dst, u32 size, const void *src) + * Description + * For tracing programs, safely attempt to read *size* bytes from + * address *src* and store the data in *dst*. + * Return + * 0 on success, or a negative error in case of failure. * * u64 bpf_ktime_get_ns(void) - * Return: current ktime + * Description + * Return the time elapsed since system boot, in nanoseconds. + * Does not include time the system was suspended. + * See: clock_gettime(CLOCK_MONOTONIC) + * Return + * Current *ktime*. * - * int bpf_trace_printk(const char *fmt, int fmt_size, ...) - * Return: length of buffer written or negative error + * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...) + * Description + * This helper is a "printk()-like" facility for debugging. It + * prints a message defined by format *fmt* (of size *fmt_size*) + * to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if + * available. It can take up to three additional **u64** + * arguments (as an eBPF helpers, the total number of arguments is + * limited to five). * - * u32 bpf_prandom_u32(void) - * Return: random value + * Each time the helper is called, it appends a line to the trace. + * The format of the trace is customizable, and the exact output + * one will get depends on the options set in + * *\/sys/kernel/debug/tracing/trace_options* (see also the + * *README* file under the same directory). However, it usually + * defaults to something like: * - * u32 bpf_raw_smp_processor_id(void) - * Return: SMP processor ID + * :: * - * int bpf_skb_store_bytes(skb, offset, from, len, flags) - * store bytes into packet - * @skb: pointer to skb - * @offset: offset within packet from skb->mac_header - * @from: pointer where to copy bytes from - * @len: number of bytes to store into packet - * @flags: bit 0 - if true, recompute skb->csum - * other bits - reserved - * Return: 0 on success or negative error + * telnet-470 [001] .N.. 419421.045894: 0x00000001: * - * int bpf_l3_csum_replace(skb, offset, from, to, flags) - * recompute IP checksum - * @skb: pointer to skb - * @offset: offset within packet where IP checksum is located - * @from: old value of header field - * @to: new value of header field - * @flags: bits 0-3 - size of header field - * other bits - reserved - * Return: 0 on success or negative error + * In the above: * - * int bpf_l4_csum_replace(skb, offset, from, to, flags) - * recompute TCP/UDP checksum - * @skb: pointer to skb - * @offset: offset within packet where TCP/UDP checksum is located - * @from: old value of header field - * @to: new value of header field - * @flags: bits 0-3 - size of header field - * bit 4 - is pseudo header - * other bits - reserved - * Return: 0 on success or negative error + * * ``telnet`` is the name of the current task. + * * ``470`` is the PID of the current task. + * * ``001`` is the CPU number on which the task is + * running. + * * In ``.N..``, each character refers to a set of + * options (whether irqs are enabled, scheduling + * options, whether hard/softirqs are running, level of + * preempt_disabled respectively). **N** means that + * **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED** + * are set. + * * ``419421.045894`` is a timestamp. + * * ``0x00000001`` is a fake value used by BPF for the + * instruction pointer register. + * * ```` is the message formatted with + * *fmt*. * - * int bpf_tail_call(ctx, prog_array_map, index) - * jump into another BPF program - * @ctx: context pointer passed to next program - * @prog_array_map: pointer to map which type is BPF_MAP_TYPE_PROG_ARRAY - * @index: 32-bit index inside array that selects specific program to run - * Return: 0 on success or negative error + * The conversion specifiers supported by *fmt* are similar, but + * more limited than for printk(). They are **%d**, **%i**, + * **%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**, + * **%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size + * of field, padding with zeroes, etc.) is available, and the + * helper will return **-EINVAL** (but print nothing) if it + * encounters an unknown specifier. * - * int bpf_clone_redirect(skb, ifindex, flags) - * redirect to another netdev - * @skb: pointer to skb - * @ifindex: ifindex of the net device - * @flags: bit 0 - if set, redirect to ingress instead of egress - * other bits - reserved - * Return: 0 on success or negative error + * Also, note that **bpf_trace_printk**\ () is slow, and should + * only be used for debugging purposes. For this reason, a notice + * bloc (spanning several lines) is printed to kernel logs and + * states that the helper should not be used "for production use" + * the first time this helper is used (or more precisely, when + * **trace_printk**\ () buffers are allocated). For passing values + * to user space, perf events should be preferred. + * Return + * The number of bytes written to the buffer, or a negative error + * in case of failure. + * + * u32 bpf_get_prandom_u32(void) + * Description + * Get a pseudo-random number. + * + * From a security point of view, this helper uses its own + * pseudo-random internal state, and cannot be used to infer the + * seed of other random functions in the kernel. However, it is + * essential to note that the generator used by the helper is not + * cryptographically secure. + * Return + * A random 32-bit unsigned value. + * + * u32 bpf_get_smp_processor_id(void) + * Description + * Get the SMP (symmetric multiprocessing) processor id. Note that + * all programs run with preemption disabled, which means that the + * SMP processor id is stable during all the execution of the + * program. + * Return + * The SMP id of the processor running the program. + * + * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags) + * Description + * Store *len* bytes from address *from* into the packet + * associated to *skb*, at *offset*. *flags* are a combination of + * **BPF_F_RECOMPUTE_CSUM** (automatically recompute the + * checksum for the packet after storing the bytes) and + * **BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\ + * **->swhash** and *skb*\ **->l4hash** to 0). + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size) + * Description + * Recompute the layer 3 (e.g. IP) checksum for the packet + * associated to *skb*. Computation is incremental, so the helper + * must know the former value of the header field that was + * modified (*from*), the new value of this field (*to*), and the + * number of bytes (2 or 4) for this field, stored in *size*. + * Alternatively, it is possible to store the difference between + * the previous and the new values of the header field in *to*, by + * setting *from* and *size* to 0. For both methods, *offset* + * indicates the location of the IP checksum within the packet. + * + * This helper works in combination with **bpf_csum_diff**\ (), + * which does not update the checksum in-place, but offers more + * flexibility and can handle sizes larger than 2 or 4 for the + * checksum to update. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags) + * Description + * Recompute the layer 4 (e.g. TCP, UDP or ICMP) checksum for the + * packet associated to *skb*. Computation is incremental, so the + * helper must know the former value of the header field that was + * modified (*from*), the new value of this field (*to*), and the + * number of bytes (2 or 4) for this field, stored on the lowest + * four bits of *flags*. Alternatively, it is possible to store + * the difference between the previous and the new values of the + * header field in *to*, by setting *from* and the four lowest + * bits of *flags* to 0. For both methods, *offset* indicates the + * location of the IP checksum within the packet. In addition to + * the size of the field, *flags* can be added (bitwise OR) actual + * flags. With **BPF_F_MARK_MANGLED_0**, a null checksum is left + * untouched (unless **BPF_F_MARK_ENFORCE** is added as well), and + * for updates resulting in a null checksum the value is set to + * **CSUM_MANGLED_0** instead. Flag **BPF_F_PSEUDO_HDR** indicates + * the checksum is to be computed against a pseudo-header. + * + * This helper works in combination with **bpf_csum_diff**\ (), + * which does not update the checksum in-place, but offers more + * flexibility and can handle sizes larger than 2 or 4 for the + * checksum to update. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index) + * Description + * This special helper is used to trigger a "tail call", or in + * other words, to jump into another eBPF program. The same stack + * frame is used (but values on stack and in registers for the + * caller are not accessible to the callee). This mechanism allows + * for program chaining, either for raising the maximum number of + * available eBPF instructions, or to execute given programs in + * conditional blocks. For security reasons, there is an upper + * limit to the number of successive tail calls that can be + * performed. + * + * Upon call of this helper, the program attempts to jump into a + * program referenced at index *index* in *prog_array_map*, a + * special map of type **BPF_MAP_TYPE_PROG_ARRAY**, and passes + * *ctx*, a pointer to the context. + * + * If the call succeeds, the kernel immediately runs the first + * instruction of the new program. This is not a function call, + * and it never returns to the previous program. If the call + * fails, then the helper has no effect, and the caller continues + * to run its subsequent instructions. A call can fail if the + * destination program for the jump does not exist (i.e. *index* + * is superior to the number of entries in *prog_array_map*), or + * if the maximum number of tail calls has been reached for this + * chain of programs. This limit is defined in the kernel by the + * macro **MAX_TAIL_CALL_CNT** (not accessible to user space), + * which is currently set to 32. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags) + * Description + * Clone and redirect the packet associated to *skb* to another + * net device of index *ifindex*. Both ingress and egress + * interfaces can be used for redirection. The **BPF_F_INGRESS** + * value in *flags* is used to make the distinction (ingress path + * is selected if the flag is present, egress path otherwise). + * This is the only flag supported for now. + * + * In comparison with **bpf_redirect**\ () helper, + * **bpf_clone_redirect**\ () has the associated cost of + * duplicating the packet buffer, but this can be executed out of + * the eBPF program. Conversely, **bpf_redirect**\ () is more + * efficient, but it is handled through an action code where the + * redirection happens only after the eBPF program has returned. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. Positive + * error indicates a potential drop or congestion in the target + * device. The particular positive error codes are not defined. * * u64 bpf_get_current_pid_tgid(void) - * Return: current->tgid << 32 | current->pid + * Return + * A 64-bit integer containing the current tgid and pid, and + * created as such: + * *current_task*\ **->tgid << 32 \|** + * *current_task*\ **->pid**. * * u64 bpf_get_current_uid_gid(void) - * Return: current_gid << 32 | current_uid + * Return + * A 64-bit integer containing the current GID and UID, and + * created as such: *current_gid* **<< 32 \|** *current_uid*. * - * int bpf_get_current_comm(char *buf, int size_of_buf) - * stores current->comm into buf - * Return: 0 on success or negative error + * int bpf_get_current_comm(char *buf, u32 size_of_buf) + * Description + * Copy the **comm** attribute of the current task into *buf* of + * *size_of_buf*. The **comm** attribute contains the name of + * the executable (excluding the path) for the current task. The + * *size_of_buf* must be strictly positive. On success, the + * helper makes sure that the *buf* is NUL-terminated. On failure, + * it is filled with zeroes. + * Return + * 0 on success, or a negative error in case of failure. * - * u32 bpf_get_cgroup_classid(skb) - * retrieve a proc's classid - * @skb: pointer to skb - * Return: classid if != 0 + * u32 bpf_get_cgroup_classid(struct sk_buff *skb) + * Description + * Retrieve the classid for the current task, i.e. for the net_cls + * cgroup to which *skb* belongs. * - * int bpf_skb_vlan_push(skb, vlan_proto, vlan_tci) - * Return: 0 on success or negative error + * This helper can be used on TC egress path, but not on ingress. * - * int bpf_skb_vlan_pop(skb) - * Return: 0 on success or negative error + * The net_cls cgroup provides an interface to tag network packets + * based on a user-provided identifier for all traffic coming from + * the tasks belonging to the related cgroup. See also the related + * kernel documentation, available from the Linux sources in file + * *Documentation/cgroup-v1/net_cls.txt*. * - * int bpf_skb_get_tunnel_key(skb, key, size, flags) - * int bpf_skb_set_tunnel_key(skb, key, size, flags) - * retrieve or populate tunnel metadata - * @skb: pointer to skb - * @key: pointer to 'struct bpf_tunnel_key' - * @size: size of 'struct bpf_tunnel_key' - * @flags: room for future extensions - * Return: 0 on success or negative error + * The Linux kernel has two versions for cgroups: there are + * cgroups v1 and cgroups v2. Both are available to users, who can + * use a mixture of them, but note that the net_cls cgroup is for + * cgroup v1 only. This makes it incompatible with BPF programs + * run on cgroups, which is a cgroup-v2-only feature (a socket can + * only hold data for one version of cgroups at a time). * - * u64 bpf_perf_event_read(map, flags) - * read perf event counter value - * @map: pointer to perf_event_array map - * @flags: index of event in the map or bitmask flags - * Return: value of perf event counter read or error code + * This helper is only available is the kernel was compiled with + * the **CONFIG_CGROUP_NET_CLASSID** configuration option set to + * "**y**" or to "**m**". + * Return + * The classid, or 0 for the default unconfigured classid. * - * int bpf_redirect(ifindex, flags) - * redirect to another netdev - * @ifindex: ifindex of the net device - * @flags: - * cls_bpf: - * bit 0 - if set, redirect to ingress instead of egress - * other bits - reserved - * xdp_bpf: - * all bits - reserved - * Return: cls_bpf: TC_ACT_REDIRECT on success or TC_ACT_SHOT on error - * xdp_bfp: XDP_REDIRECT on success or XDP_ABORT on error - * int bpf_redirect_map(map, key, flags) - * redirect to endpoint in map - * @map: pointer to dev map - * @key: index in map to lookup - * @flags: -- - * Return: XDP_REDIRECT on success or XDP_ABORT on error + * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci) + * Description + * Push a *vlan_tci* (VLAN tag control information) of protocol + * *vlan_proto* to the packet associated to *skb*, then update + * the checksum. Note that if *vlan_proto* is different from + * **ETH_P_8021Q** and **ETH_P_8021AD**, it is considered to + * be **ETH_P_8021Q**. * - * u32 bpf_get_route_realm(skb) - * retrieve a dst's tclassid - * @skb: pointer to skb - * Return: realm if != 0 + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. * - * int bpf_perf_event_output(ctx, map, flags, data, size) - * output perf raw sample - * @ctx: struct pt_regs* - * @map: pointer to perf_event_array map - * @flags: index of event in the map or bitmask flags - * @data: data on stack to be output as raw data - * @size: size of data - * Return: 0 on success or negative error + * int bpf_skb_vlan_pop(struct sk_buff *skb) + * Description + * Pop a VLAN header from the packet associated to *skb*. * - * int bpf_get_stackid(ctx, map, flags) - * walk user or kernel stack and return id - * @ctx: struct pt_regs* - * @map: pointer to stack_trace map - * @flags: bits 0-7 - numer of stack frames to skip - * bit 8 - collect user stack instead of kernel - * bit 9 - compare stacks by hash only - * bit 10 - if two different stacks hash into the same stackid - * discard old - * other bits - reserved - * Return: >= 0 stackid on success or negative error + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. * - * s64 bpf_csum_diff(from, from_size, to, to_size, seed) - * calculate csum diff - * @from: raw from buffer - * @from_size: length of from buffer - * @to: raw to buffer - * @to_size: length of to buffer - * @seed: optional seed - * Return: csum result or negative error code + * int bpf_skb_get_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags) + * Description + * Get tunnel metadata. This helper takes a pointer *key* to an + * empty **struct bpf_tunnel_key** of **size**, that will be + * filled with tunnel metadata for the packet associated to *skb*. + * The *flags* can be set to **BPF_F_TUNINFO_IPV6**, which + * indicates that the tunnel is based on IPv6 protocol instead of + * IPv4. * - * int bpf_skb_get_tunnel_opt(skb, opt, size) - * retrieve tunnel options metadata - * @skb: pointer to skb - * @opt: pointer to raw tunnel option data - * @size: size of @opt - * Return: option size + * The **struct bpf_tunnel_key** is an object that generalizes the + * principal parameters used by various tunneling protocols into a + * single struct. This way, it can be used to easily make a + * decision based on the contents of the encapsulation header, + * "summarized" in this struct. In particular, it holds the IP + * address of the remote end (IPv4 or IPv6, depending on the case) + * in *key*\ **->remote_ipv4** or *key*\ **->remote_ipv6**. Also, + * this struct exposes the *key*\ **->tunnel_id**, which is + * generally mapped to a VNI (Virtual Network Identifier), making + * it programmable together with the **bpf_skb_set_tunnel_key**\ + * () helper. * - * int bpf_skb_set_tunnel_opt(skb, opt, size) - * populate tunnel options metadata - * @skb: pointer to skb - * @opt: pointer to raw tunnel option data - * @size: size of @opt - * Return: 0 on success or negative error + * Let's imagine that the following code is part of a program + * attached to the TC ingress interface, on one end of a GRE + * tunnel, and is supposed to filter out all messages coming from + * remote ends with IPv4 address other than 10.0.0.1: * - * int bpf_skb_change_proto(skb, proto, flags) - * Change protocol of the skb. Currently supported is v4 -> v6, - * v6 -> v4 transitions. The helper will also resize the skb. eBPF - * program is expected to fill the new headers via skb_store_bytes - * and lX_csum_replace. - * @skb: pointer to skb - * @proto: new skb->protocol type - * @flags: reserved - * Return: 0 on success or negative error + * :: * - * int bpf_skb_change_type(skb, type) - * Change packet type of skb. - * @skb: pointer to skb - * @type: new skb->pkt_type type - * Return: 0 on success or negative error + * int ret; + * struct bpf_tunnel_key key = {}; + * + * ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0); + * if (ret < 0) + * return TC_ACT_SHOT; // drop packet + * + * if (key.remote_ipv4 != 0x0a000001) + * return TC_ACT_SHOT; // drop packet + * + * return TC_ACT_OK; // accept packet * - * int bpf_skb_under_cgroup(skb, map, index) - * Check cgroup2 membership of skb - * @skb: pointer to skb - * @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type - * @index: index of the cgroup in the bpf_map - * Return: - * == 0 skb failed the cgroup2 descendant test - * == 1 skb succeeded the cgroup2 descendant test - * < 0 error + * This interface can also be used with all encapsulation devices + * that can operate in "collect metadata" mode: instead of having + * one network device per specific configuration, the "collect + * metadata" mode only requires a single device where the + * configuration can be extracted from this helper. * - * u32 bpf_get_hash_recalc(skb) - * Retrieve and possibly recalculate skb->hash. - * @skb: pointer to skb - * Return: hash + * This can be used together with various tunnels such as VXLan, + * Geneve, GRE or IP in IP (IPIP). + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_skb_set_tunnel_key(struct sk_buff *skb, struct bpf_tunnel_key *key, u32 size, u64 flags) + * Description + * Populate tunnel metadata for packet associated to *skb.* The + * tunnel metadata is set to the contents of *key*, of *size*. The + * *flags* can be set to a combination of the following values: + * + * **BPF_F_TUNINFO_IPV6** + * Indicate that the tunnel is based on IPv6 protocol + * instead of IPv4. + * **BPF_F_ZERO_CSUM_TX** + * For IPv4 packets, add a flag to tunnel metadata + * indicating that checksum computation should be skipped + * and checksum set to zeroes. + * **BPF_F_DONT_FRAGMENT** + * Add a flag to tunnel metadata indicating that the + * packet should not be fragmented. + * **BPF_F_SEQ_NUMBER** + * Add a flag to tunnel metadata indicating that a + * sequence number should be added to tunnel header before + * sending the packet. This flag was added for GRE + * encapsulation, but might be used with other protocols + * as well in the future. + * + * Here is a typical usage on the transmit path: + * + * :: + * + * struct bpf_tunnel_key key; + * populate key ... + * bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 0); + * bpf_clone_redirect(skb, vxlan_dev_ifindex, 0); + * + * See also the description of the **bpf_skb_get_tunnel_key**\ () + * helper for additional information. + * Return + * 0 on success, or a negative error in case of failure. + * + * u64 bpf_perf_event_read(struct bpf_map *map, u64 flags) + * Description + * Read the value of a perf event counter. This helper relies on a + * *map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of + * the perf event counter is selected when *map* is updated with + * perf event file descriptors. The *map* is an array whose size + * is the number of available CPUs, and each cell contains a value + * relative to one CPU. The value to retrieve is indicated by + * *flags*, that contains the index of the CPU to look up, masked + * with **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to + * **BPF_F_CURRENT_CPU** to indicate that the value for the + * current CPU should be retrieved. + * + * Note that before Linux 4.13, only hardware perf event can be + * retrieved. + * + * Also, be aware that the newer helper + * **bpf_perf_event_read_value**\ () is recommended over + * **bpf_perf_event_read**\ () in general. The latter has some ABI + * quirks where error and counter value are used as a return code + * (which is wrong to do since ranges may overlap). This issue is + * fixed with **bpf_perf_event_read_value**\ (), which at the same + * time provides more features over the **bpf_perf_event_read**\ + * () interface. Please refer to the description of + * **bpf_perf_event_read_value**\ () for details. + * Return + * The value of the perf event counter read from the map, or a + * negative error code in case of failure. + * + * int bpf_redirect(u32 ifindex, u64 flags) + * Description + * Redirect the packet to another net device of index *ifindex*. + * This helper is somewhat similar to **bpf_clone_redirect**\ + * (), except that the packet is not cloned, which provides + * increased performance. + * + * Except for XDP, both ingress and egress interfaces can be used + * for redirection. The **BPF_F_INGRESS** value in *flags* is used + * to make the distinction (ingress path is selected if the flag + * is present, egress path otherwise). Currently, XDP only + * supports redirection to the egress interface, and accepts no + * flag at all. + * + * The same effect can be attained with the more generic + * **bpf_redirect_map**\ (), which requires specific maps to be + * used but offers better performance. + * Return + * For XDP, the helper returns **XDP_REDIRECT** on success or + * **XDP_ABORTED** on error. For other program types, the values + * are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on + * error. + * + * u32 bpf_get_route_realm(struct sk_buff *skb) + * Description + * Retrieve the realm or the route, that is to say the + * **tclassid** field of the destination for the *skb*. The + * indentifier retrieved is a user-provided tag, similar to the + * one used with the net_cls cgroup (see description for + * **bpf_get_cgroup_classid**\ () helper), but here this tag is + * held by a route (a destination entry), not by a task. + * + * Retrieving this identifier works with the clsact TC egress hook + * (see also **tc-bpf(8)**), or alternatively on conventional + * classful egress qdiscs, but not on TC ingress path. In case of + * clsact TC egress hook, this has the advantage that, internally, + * the destination entry has not been dropped yet in the transmit + * path. Therefore, the destination entry does not need to be + * artificially held via **netif_keep_dst**\ () for a classful + * qdisc until the *skb* is freed. + * + * This helper is available only if the kernel was compiled with + * **CONFIG_IP_ROUTE_CLASSID** configuration option. + * Return + * The realm of the route for the packet associated to *skb*, or 0 + * if none was found. + * + * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size) + * Description + * Write raw *data* blob into a special BPF perf event held by + * *map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf + * event must have the following attributes: **PERF_SAMPLE_RAW** + * as **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and + * **PERF_COUNT_SW_BPF_OUTPUT** as **config**. + * + * The *flags* are used to indicate the index in *map* for which + * the value must be put, masked with **BPF_F_INDEX_MASK**. + * Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU** + * to indicate that the index of the current CPU core should be + * used. + * + * The value to write, of *size*, is passed through eBPF stack and + * pointed by *data*. + * + * The context of the program *ctx* needs also be passed to the + * helper. + * + * On user space, a program willing to read the values needs to + * call **perf_event_open**\ () on the perf event (either for + * one or for all CPUs) and to store the file descriptor into the + * *map*. This must be done before the eBPF program can send data + * into it. An example is available in file + * *samples/bpf/trace_output_user.c* in the Linux kernel source + * tree (the eBPF program counterpart is in + * *samples/bpf/trace_output_kern.c*). + * + * **bpf_perf_event_output**\ () achieves better performance + * than **bpf_trace_printk**\ () for sharing data with user + * space, and is much better suitable for streaming data from eBPF + * programs. + * + * Note that this helper is not restricted to tracing use cases + * and can be used with programs attached to TC or XDP as well, + * where it allows for passing data to user space listeners. Data + * can be: + * + * * Only custom structs, + * * Only the packet payload, or + * * A combination of both. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len) + * Description + * This helper was provided as an easy way to load data from a + * packet. It can be used to load *len* bytes from *offset* from + * the packet associated to *skb*, into the buffer pointed by + * *to*. + * + * Since Linux 4.7, usage of this helper has mostly been replaced + * by "direct packet access", enabling packet data to be + * manipulated with *skb*\ **->data** and *skb*\ **->data_end** + * pointing respectively to the first byte of packet data and to + * the byte after the last byte of packet data. However, it + * remains useful if one wishes to read large quantities of data + * at once from a packet into the eBPF stack. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags) + * Description + * Walk a user or a kernel stack and return its id. To achieve + * this, the helper needs *ctx*, which is a pointer to the context + * on which the tracing program is executed, and a pointer to a + * *map* of type **BPF_MAP_TYPE_STACK_TRACE**. + * + * The last argument, *flags*, holds the number of stack frames to + * skip (from 0 to 255), masked with + * **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set + * a combination of the following flags: + * + * **BPF_F_USER_STACK** + * Collect a user space stack instead of a kernel stack. + * **BPF_F_FAST_STACK_CMP** + * Compare stacks by hash only. + * **BPF_F_REUSE_STACKID** + * If two different stacks hash into the same *stackid*, + * discard the old one. + * + * The stack id retrieved is a 32 bit long integer handle which + * can be further combined with other data (including other stack + * ids) and used as a key into maps. This can be useful for + * generating a variety of graphs (such as flame graphs or off-cpu + * graphs). + * + * For walking a stack, this helper is an improvement over + * **bpf_probe_read**\ (), which can be used with unrolled loops + * but is not efficient and consumes a lot of eBPF instructions. + * Instead, **bpf_get_stackid**\ () can collect up to + * **PERF_MAX_STACK_DEPTH** both kernel and user frames. Note that + * this limit can be controlled with the **sysctl** program, and + * that it should be manually increased in order to profile long + * user stacks (such as stacks for Java programs). To do so, use: + * + * :: + * + * # sysctl kernel.perf_event_max_stack= + * Return + * The positive or null stack id on success, or a negative error + * in case of failure. + * + * s64 bpf_csum_diff(__be32 *from, u32 from_size, __be32 *to, u32 to_size, __wsum seed) + * Description + * Compute a checksum difference, from the raw buffer pointed by + * *from*, of length *from_size* (that must be a multiple of 4), + * towards the raw buffer pointed by *to*, of size *to_size* + * (same remark). An optional *seed* can be added to the value + * (this can be cascaded, the seed may come from a previous call + * to the helper). + * + * This is flexible enough to be used in several ways: + * + * * With *from_size* == 0, *to_size* > 0 and *seed* set to + * checksum, it can be used when pushing new data. + * * With *from_size* > 0, *to_size* == 0 and *seed* set to + * checksum, it can be used when removing data from a packet. + * * With *from_size* > 0, *to_size* > 0 and *seed* set to 0, it + * can be used to compute a diff. Note that *from_size* and + * *to_size* do not need to be equal. + * + * This helper can be used in combination with + * **bpf_l3_csum_replace**\ () and **bpf_l4_csum_replace**\ (), to + * which one can feed in the difference computed with + * **bpf_csum_diff**\ (). + * Return + * The checksum result, or a negative error code in case of + * failure. + * + * int bpf_skb_get_tunnel_opt(struct sk_buff *skb, u8 *opt, u32 size) + * Description + * Retrieve tunnel options metadata for the packet associated to + * *skb*, and store the raw tunnel option data to the buffer *opt* + * of *size*. + * + * This helper can be used with encapsulation devices that can + * operate in "collect metadata" mode (please refer to the related + * note in the description of **bpf_skb_get_tunnel_key**\ () for + * more details). A particular example where this can be used is + * in combination with the Geneve encapsulation protocol, where it + * allows for pushing (with **bpf_skb_get_tunnel_opt**\ () helper) + * and retrieving arbitrary TLVs (Type-Length-Value headers) from + * the eBPF program. This allows for full customization of these + * headers. + * Return + * The size of the option data retrieved. + * + * int bpf_skb_set_tunnel_opt(struct sk_buff *skb, u8 *opt, u32 size) + * Description + * Set tunnel options metadata for the packet associated to *skb* + * to the option data contained in the raw buffer *opt* of *size*. + * + * See also the description of the **bpf_skb_get_tunnel_opt**\ () + * helper for additional information. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_skb_change_proto(struct sk_buff *skb, __be16 proto, u64 flags) + * Description + * Change the protocol of the *skb* to *proto*. Currently + * supported are transition from IPv4 to IPv6, and from IPv6 to + * IPv4. The helper takes care of the groundwork for the + * transition, including resizing the socket buffer. The eBPF + * program is expected to fill the new headers, if any, via + * **skb_store_bytes**\ () and to recompute the checksums with + * **bpf_l3_csum_replace**\ () and **bpf_l4_csum_replace**\ + * (). The main case for this helper is to perform NAT64 + * operations out of an eBPF program. + * + * Internally, the GSO type is marked as dodgy so that headers are + * checked and segments are recalculated by the GSO/GRO engine. + * The size for GSO target is adapted as well. + * + * All values for *flags* are reserved for future usage, and must + * be left at zero. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_skb_change_type(struct sk_buff *skb, u32 type) + * Description + * Change the packet type for the packet associated to *skb*. This + * comes down to setting *skb*\ **->pkt_type** to *type*, except + * the eBPF program does not have a write access to *skb*\ + * **->pkt_type** beside this helper. Using a helper here allows + * for graceful handling of errors. + * + * The major use case is to change incoming *skb*s to + * **PACKET_HOST** in a programmatic way instead of having to + * recirculate via **redirect**\ (..., **BPF_F_INGRESS**), for + * example. + * + * Note that *type* only allows certain values. At this time, they + * are: + * + * **PACKET_HOST** + * Packet is for us. + * **PACKET_BROADCAST** + * Send packet to all. + * **PACKET_MULTICAST** + * Send packet to group. + * **PACKET_OTHERHOST** + * Send packet to someone else. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_skb_under_cgroup(struct sk_buff *skb, struct bpf_map *map, u32 index) + * Description + * Check whether *skb* is a descendant of the cgroup2 held by + * *map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*. + * Return + * The return value depends on the result of the test, and can be: + * + * * 0, if the *skb* failed the cgroup2 descendant test. + * * 1, if the *skb* succeeded the cgroup2 descendant test. + * * A negative error code, if an error occurred. + * + * u32 bpf_get_hash_recalc(struct sk_buff *skb) + * Description + * Retrieve the hash of the packet, *skb*\ **->hash**. If it is + * not set, in particular if the hash was cleared due to mangling, + * recompute this hash. Later accesses to the hash can be done + * directly with *skb*\ **->hash**. + * + * Calling **bpf_set_hash_invalid**\ (), changing a packet + * prototype with **bpf_skb_change_proto**\ (), or calling + * **bpf_skb_store_bytes**\ () with the + * **BPF_F_INVALIDATE_HASH** are actions susceptible to clear + * the hash and to trigger a new computation for the next call to + * **bpf_get_hash_recalc**\ (). + * Return + * The 32-bit hash. * * u64 bpf_get_current_task(void) - * Returns current task_struct - * Return: current + * Return + * A pointer to the current task struct. * - * int bpf_probe_write_user(void *dst, void *src, int len) - * safely attempt to write to a location - * @dst: destination address in userspace - * @src: source address on stack - * @len: number of bytes to copy - * Return: 0 on success or negative error + * int bpf_probe_write_user(void *dst, const void *src, u32 len) + * Description + * Attempt in a safe way to write *len* bytes from the buffer + * *src* to *dst* in memory. It only works for threads that are in + * user context, and *dst* must be a valid user space address. * - * int bpf_current_task_under_cgroup(map, index) - * Check cgroup2 membership of current task - * @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type - * @index: index of the cgroup in the bpf_map - * Return: - * == 0 current failed the cgroup2 descendant test - * == 1 current succeeded the cgroup2 descendant test - * < 0 error + * This helper should not be used to implement any kind of + * security mechanism because of TOC-TOU attacks, but rather to + * debug, divert, and manipulate execution of semi-cooperative + * processes. * - * int bpf_skb_change_tail(skb, len, flags) - * The helper will resize the skb to the given new size, to be used f.e. - * with control messages. - * @skb: pointer to skb - * @len: new skb length - * @flags: reserved - * Return: 0 on success or negative error + * Keep in mind that this feature is meant for experiments, and it + * has a risk of crashing the system and running programs. + * Therefore, when an eBPF program using this helper is attached, + * a warning including PID and process name is printed to kernel + * logs. + * Return + * 0 on success, or a negative error in case of failure. * - * int bpf_skb_pull_data(skb, len) - * The helper will pull in non-linear data in case the skb is non-linear - * and not all of len are part of the linear section. Only needed for - * read/write with direct packet access. - * @skb: pointer to skb - * @len: len to make read/writeable - * Return: 0 on success or negative error + * int bpf_current_task_under_cgroup(struct bpf_map *map, u32 index) + * Description + * Check whether the probe is being run is the context of a given + * subset of the cgroup2 hierarchy. The cgroup2 to test is held by + * *map* of type **BPF_MAP_TYPE_CGROUP_ARRAY**, at *index*. + * Return + * The return value depends on the result of the test, and can be: * - * s64 bpf_csum_update(skb, csum) - * Adds csum into skb->csum in case of CHECKSUM_COMPLETE. - * @skb: pointer to skb - * @csum: csum to add - * Return: csum on success or negative error + * * 1, if current task belongs to the cgroup2. + * * 0, if current task does not belong to the cgroup2. + * * A negative error code, if an error occurred. * - * void bpf_set_hash_invalid(skb) - * Invalidate current skb->hash. - * @skb: pointer to skb + * int bpf_skb_change_tail(struct sk_buff *skb, u32 len, u64 flags) + * Description + * Resize (trim or grow) the packet associated to *skb* to the + * new *len*. The *flags* are reserved for future usage, and must + * be left at zero. * - * int bpf_get_numa_node_id() - * Return: Id of current NUMA node. + * The basic idea is that the helper performs the needed work to + * change the size of the packet, then the eBPF program rewrites + * the rest via helpers like **bpf_skb_store_bytes**\ (), + * **bpf_l3_csum_replace**\ (), **bpf_l3_csum_replace**\ () + * and others. This helper is a slow path utility intended for + * replies with control messages. And because it is targeted for + * slow path, the helper itself can afford to be slow: it + * implicitly linearizes, unclones and drops offloads from the + * *skb*. * - * int bpf_skb_change_head() - * Grows headroom of skb and adjusts MAC header offset accordingly. - * Will extends/reallocae as required automatically. - * May change skb data pointer and will thus invalidate any check - * performed for direct packet access. - * @skb: pointer to skb - * @len: length of header to be pushed in front - * @flags: Flags (unused for now) - * Return: 0 on success or negative error + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. * - * int bpf_xdp_adjust_head(xdp_md, delta) - * Adjust the xdp_md.data by delta - * @xdp_md: pointer to xdp_md - * @delta: An positive/negative integer to be added to xdp_md.data - * Return: 0 on success or negative on error + * int bpf_skb_pull_data(struct sk_buff *skb, u32 len) + * Description + * Pull in non-linear data in case the *skb* is non-linear and not + * all of *len* are part of the linear section. Make *len* bytes + * from *skb* readable and writable. If a zero value is passed for + * *len*, then the whole length of the *skb* is pulled. + * + * This helper is only needed for reading and writing with direct + * packet access. + * + * For direct packet access, testing that offsets to access + * are within packet boundaries (test on *skb*\ **->data_end**) is + * susceptible to fail if offsets are invalid, or if the requested + * data is in non-linear parts of the *skb*. On failure the + * program can just bail out, or in the case of a non-linear + * buffer, use a helper to make the data available. The + * **bpf_skb_load_bytes**\ () helper is a first solution to access + * the data. Another one consists in using **bpf_skb_pull_data** + * to pull in once the non-linear parts, then retesting and + * eventually access the data. + * + * At the same time, this also makes sure the *skb* is uncloned, + * which is a necessary condition for direct write. As this needs + * to be an invariant for the write part only, the verifier + * detects writes and adds a prologue that is calling + * **bpf_skb_pull_data()** to effectively unclone the *skb* from + * the very beginning in case it is indeed cloned. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * s64 bpf_csum_update(struct sk_buff *skb, __wsum csum) + * Description + * Add the checksum *csum* into *skb*\ **->csum** in case the + * driver has supplied a checksum for the entire packet into that + * field. Return an error otherwise. This helper is intended to be + * used in combination with **bpf_csum_diff**\ (), in particular + * when the checksum needs to be updated after data has been + * written into the packet through direct packet access. + * Return + * The checksum on success, or a negative error code in case of + * failure. + * + * void bpf_set_hash_invalid(struct sk_buff *skb) + * Description + * Invalidate the current *skb*\ **->hash**. It can be used after + * mangling on headers through direct packet access, in order to + * indicate that the hash is outdated and to trigger a + * recalculation the next time the kernel tries to access this + * hash or when the **bpf_get_hash_recalc**\ () helper is called. + * + * int bpf_get_numa_node_id(void) + * Description + * Return the id of the current NUMA node. The primary use case + * for this helper is the selection of sockets for the local NUMA + * node, when the program is attached to sockets using the + * **SO_ATTACH_REUSEPORT_EBPF** option (see also **socket(7)**), + * but the helper is also available to other eBPF program types, + * similarly to **bpf_get_smp_processor_id**\ (). + * Return + * The id of current NUMA node. + * + * int bpf_skb_change_head(struct sk_buff *skb, u32 len, u64 flags) + * Description + * Grows headroom of packet associated to *skb* and adjusts the + * offset of the MAC header accordingly, adding *len* bytes of + * space. It automatically extends and reallocates memory as + * required. + * + * This helper can be used on a layer 3 *skb* to push a MAC header + * for redirection into a layer 2 device. + * + * All values for *flags* are reserved for future usage, and must + * be left at zero. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_xdp_adjust_head(struct xdp_buff *xdp_md, int delta) + * Description + * Adjust (move) *xdp_md*\ **->data** by *delta* bytes. Note that + * it is possible to use a negative value for *delta*. This helper + * can be used to prepare the packet for pushing or popping + * headers. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. * * int bpf_probe_read_str(void *dst, int size, const void *unsafe_ptr) - * Copy a NUL terminated string from unsafe address. In case the string - * length is smaller than size, the target is not padded with further NUL - * bytes. In case the string length is larger than size, just count-1 - * bytes are copied and the last byte is set to NUL. - * @dst: destination address - * @size: maximum number of bytes to copy, including the trailing NUL - * @unsafe_ptr: unsafe address - * Return: - * > 0 length of the string including the trailing NUL on success - * < 0 error + * Description + * Copy a NUL terminated string from an unsafe address + * *unsafe_ptr* to *dst*. The *size* should include the + * terminating NUL byte. In case the string length is smaller than + * *size*, the target is not padded with further NUL bytes. If the + * string length is larger than *size*, just *size*-1 bytes are + * copied and the last byte is set to NUL. * - * u64 bpf_get_socket_cookie(skb) - * Get the cookie for the socket stored inside sk_buff. - * @skb: pointer to skb - * Return: 8 Bytes non-decreasing number on success or 0 if the socket - * field is missing inside sk_buff + * On success, the length of the copied string is returned. This + * makes this helper useful in tracing programs for reading + * strings, and more importantly to get its length at runtime. See + * the following snippet: * - * u32 bpf_get_socket_uid(skb) - * Get the owner uid of the socket stored inside sk_buff. - * @skb: pointer to skb - * Return: uid of the socket owner on success or overflowuid if failed. + * :: * - * u32 bpf_set_hash(skb, hash) - * Set full skb->hash. - * @skb: pointer to skb - * @hash: hash to set + * SEC("kprobe/sys_open") + * void bpf_sys_open(struct pt_regs *ctx) + * { + * char buf[PATHLEN]; // PATHLEN is defined to 256 + * int res = bpf_probe_read_str(buf, sizeof(buf), + * ctx->di); * - * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen) - * Calls setsockopt. Not all opts are available, only those with - * integer optvals plus TCP_CONGESTION. - * Supported levels: SOL_SOCKET and IPROTO_TCP - * @bpf_socket: pointer to bpf_socket - * @level: SOL_SOCKET or IPROTO_TCP - * @optname: option name - * @optval: pointer to option value - * @optlen: length of optval in byes - * Return: 0 or negative error + * // Consume buf, for example push it to + * // userspace via bpf_perf_event_output(); we + * // can use res (the string length) as event + * // size, after checking its boundaries. + * } * - * int bpf_skb_adjust_room(skb, len_diff, mode, flags) - * Grow or shrink room in sk_buff. - * @skb: pointer to skb - * @len_diff: (signed) amount of room to grow/shrink - * @mode: operation mode (enum bpf_adj_room_mode) - * @flags: reserved for future use - * Return: 0 on success or negative error code + * In comparison, using **bpf_probe_read()** helper here instead + * to read the string would require to estimate the length at + * compile time, and would often result in copying more memory + * than necessary. * - * int bpf_sk_redirect_map(map, key, flags) - * Redirect skb to a sock in map using key as a lookup key for the - * sock in map. - * @map: pointer to sockmap - * @key: key to lookup sock in map - * @flags: reserved for future use - * Return: SK_PASS + * Another useful use case is when parsing individual process + * arguments or individual environment variables navigating + * *current*\ **->mm->arg_start** and *current*\ + * **->mm->env_start**: using this helper and the return value, + * one can quickly iterate at the right offset of the memory area. + * Return + * On success, the strictly positive length of the string, + * including the trailing NUL character. On error, a negative + * value. * - * int bpf_sock_map_update(skops, map, key, flags) - * @skops: pointer to bpf_sock_ops - * @map: pointer to sockmap to update - * @key: key to insert/update sock in map - * @flags: same flags as map update elem + * u64 bpf_get_socket_cookie(struct sk_buff *skb) + * Description + * If the **struct sk_buff** pointed by *skb* has a known socket, + * retrieve the cookie (generated by the kernel) of this socket. + * If no cookie has been set yet, generate a new cookie. Once + * generated, the socket cookie remains stable for the life of the + * socket. This helper can be useful for monitoring per socket + * networking traffic statistics as it provides a unique socket + * identifier per namespace. + * Return + * A 8-byte long non-decreasing number on success, or 0 if the + * socket field is missing inside *skb*. * - * int skb_load_bytes_relative(const struct sk_buff *skb, u32 offset, void *to, u32 len, u32 start_header) + * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx) + * Description + * Equivalent to bpf_get_socket_cookie() helper that accepts + * *skb*, but gets socket from **struct bpf_sock_addr** contex. + * Return + * A 8-byte long non-decreasing number. + * + * u64 bpf_get_socket_cookie(struct bpf_sock_ops *ctx) + * Description + * Equivalent to bpf_get_socket_cookie() helper that accepts + * *skb*, but gets socket from **struct bpf_sock_ops** contex. + * Return + * A 8-byte long non-decreasing number. + * + * u32 bpf_get_socket_uid(struct sk_buff *skb) + * Return + * The owner UID of the socket associated to *skb*. If the socket + * is **NULL**, or if it is not a full socket (i.e. if it is a + * time-wait or a request socket instead), **overflowuid** value + * is returned (note that **overflowuid** might also be the actual + * UID value for the socket). + * + * u32 bpf_set_hash(struct sk_buff *skb, u32 hash) + * Description + * Set the full hash for *skb* (set the field *skb*\ **->hash**) + * to value *hash*. + * Return + * 0 + * + * int bpf_setsockopt(struct bpf_sock_ops *bpf_socket, int level, int optname, char *optval, int optlen) + * Description + * Emulate a call to **setsockopt()** on the socket associated to + * *bpf_socket*, which must be a full socket. The *level* at + * which the option resides and the name *optname* of the option + * must be specified, see **setsockopt(2)** for more information. + * The option value of length *optlen* is pointed by *optval*. + * + * This helper actually implements a subset of **setsockopt()**. + * It supports the following *level*\ s: + * + * * **SOL_SOCKET**, which supports the following *optname*\ s: + * **SO_RCVBUF**, **SO_SNDBUF**, **SO_MAX_PACING_RATE**, + * **SO_PRIORITY**, **SO_RCVLOWAT**, **SO_MARK**. + * * **IPPROTO_TCP**, which supports the following *optname*\ s: + * **TCP_CONGESTION**, **TCP_BPF_IW**, + * **TCP_BPF_SNDCWND_CLAMP**. + * * **IPPROTO_IP**, which supports *optname* **IP_TOS**. + * * **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_skb_adjust_room(struct sk_buff *skb, u32 len_diff, u32 mode, u64 flags) + * Description + * Grow or shrink the room for data in the packet associated to + * *skb* by *len_diff*, and according to the selected *mode*. + * + * There are two supported modes at this time: + * + * * **BPF_ADJ_ROOM_MAC**: Adjust room at the mac layer + * (room space is added or removed below the layer 2 header). + * + * * **BPF_ADJ_ROOM_NET**: Adjust room at the network layer + * (room space is added or removed below the layer 3 header). + * + * The following flags are supported at this time: + * + * * **BPF_F_ADJ_ROOM_FIXED_GSO**: Do not adjust gso_size. + * Adjusting mss in this way is not allowed for datagrams. + * + * * **BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 **: + * * **BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 **: + * Any new space is reserved to hold a tunnel header. + * Configure skb offsets and other fields accordingly. + * + * * **BPF_F_ADJ_ROOM_ENCAP_L4_GRE **: + * * **BPF_F_ADJ_ROOM_ENCAP_L4_UDP **: + * Use with ENCAP_L3 flags to further specify the tunnel type. + * + * * **BPF_F_ADJ_ROOM_ENCAP_L2(len) **: + * Use with ENCAP_L3/L4 flags to further specify the tunnel + * type; **len** is the length of the inner MAC header. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags) + * Description + * Redirect the packet to the endpoint referenced by *map* at + * index *key*. Depending on its type, this *map* can contain + * references to net devices (for forwarding packets through other + * ports), or to CPUs (for redirecting XDP frames to another CPU; + * but this is only implemented for native XDP (with driver + * support) as of this writing). + * + * The lower two bits of *flags* are used as the return code if + * the map lookup fails. This is so that the return value can be + * one of the XDP program return codes up to XDP_TX, as chosen by + * the caller. Any higher bits in the *flags* argument must be + * unset. + * + * When used to redirect packets to net devices, this helper + * provides a high performance increase over **bpf_redirect**\ (). + * This is due to various implementation details of the underlying + * mechanisms, one of which is the fact that **bpf_redirect_map**\ + * () tries to send packet as a "bulk" to the device. + * Return + * **XDP_REDIRECT** on success, or **XDP_ABORTED** on error. + * + * int bpf_sk_redirect_map(struct bpf_map *map, u32 key, u64 flags) + * Description + * Redirect the packet to the socket referenced by *map* (of type + * **BPF_MAP_TYPE_SOCKMAP**) at index *key*. Both ingress and + * egress interfaces can be used for redirection. The + * **BPF_F_INGRESS** value in *flags* is used to make the + * distinction (ingress path is selected if the flag is present, + * egress path otherwise). This is the only flag supported for now. + * Return + * **SK_PASS** on success, or **SK_DROP** on error. + * + * int bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags) + * Description + * Add an entry to, or update a *map* referencing sockets. The + * *skops* is used as a new value for the entry associated to + * *key*. *flags* is one of: + * + * **BPF_NOEXIST** + * The entry for *key* must not exist in the map. + * **BPF_EXIST** + * The entry for *key* must already exist in the map. + * **BPF_ANY** + * No condition on the existence of the entry for *key*. + * + * If the *map* has eBPF programs (parser and verdict), those will + * be inherited by the socket being added. If the socket is + * already attached to eBPF programs, this results in an error. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_xdp_adjust_meta(struct xdp_buff *xdp_md, int delta) + * Description + * Adjust the address pointed by *xdp_md*\ **->data_meta** by + * *delta* (which can be positive or negative). Note that this + * operation modifies the address stored in *xdp_md*\ **->data**, + * so the latter must be loaded only after the helper has been + * called. + * + * The use of *xdp_md*\ **->data_meta** is optional and programs + * are not required to use it. The rationale is that when the + * packet is processed with XDP (e.g. as DoS filter), it is + * possible to push further meta data along with it before passing + * to the stack, and to give the guarantee that an ingress eBPF + * program attached as a TC classifier on the same device can pick + * this up for further post-processing. Since TC works with socket + * buffers, it remains possible to set from XDP the **mark** or + * **priority** pointers, or other pointers for the socket buffer. + * Having this scratch space generic and programmable allows for + * more flexibility as the user is free to store whatever meta + * data they need. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_perf_event_read_value(struct bpf_map *map, u64 flags, struct bpf_perf_event_value *buf, u32 buf_size) + * Description + * Read the value of a perf event counter, and store it into *buf* + * of size *buf_size*. This helper relies on a *map* of type + * **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. The nature of the perf event + * counter is selected when *map* is updated with perf event file + * descriptors. The *map* is an array whose size is the number of + * available CPUs, and each cell contains a value relative to one + * CPU. The value to retrieve is indicated by *flags*, that + * contains the index of the CPU to look up, masked with + * **BPF_F_INDEX_MASK**. Alternatively, *flags* can be set to + * **BPF_F_CURRENT_CPU** to indicate that the value for the + * current CPU should be retrieved. + * + * This helper behaves in a way close to + * **bpf_perf_event_read**\ () helper, save that instead of + * just returning the value observed, it fills the *buf* + * structure. This allows for additional data to be retrieved: in + * particular, the enabled and running times (in *buf*\ + * **->enabled** and *buf*\ **->running**, respectively) are + * copied. In general, **bpf_perf_event_read_value**\ () is + * recommended over **bpf_perf_event_read**\ (), which has some + * ABI issues and provides fewer functionalities. + * + * These values are interesting, because hardware PMU (Performance + * Monitoring Unit) counters are limited resources. When there are + * more PMU based perf events opened than available counters, + * kernel will multiplex these events so each event gets certain + * percentage (but not all) of the PMU time. In case that + * multiplexing happens, the number of samples or counter value + * will not reflect the case compared to when no multiplexing + * occurs. This makes comparison between different runs difficult. + * Typically, the counter value should be normalized before + * comparing to other experiments. The usual normalization is done + * as follows. + * + * :: + * + * normalized_counter = counter * t_enabled / t_running + * + * Where t_enabled is the time enabled for event and t_running is + * the time running for event since last normalization. The + * enabled and running times are accumulated since the perf event + * open. To achieve scaling factor between two invocations of an + * eBPF program, users can can use CPU id as the key (which is + * typical for perf array usage model) to remember the previous + * value and do the calculation inside the eBPF program. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_perf_prog_read_value(struct bpf_perf_event_data *ctx, struct bpf_perf_event_value *buf, u32 buf_size) + * Description + * For en eBPF program attached to a perf event, retrieve the + * value of the event counter associated to *ctx* and store it in + * the structure pointed by *buf* and of size *buf_size*. Enabled + * and running times are also stored in the structure (see + * description of helper **bpf_perf_event_read_value**\ () for + * more details). + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_getsockopt(struct bpf_sock_ops *bpf_socket, int level, int optname, char *optval, int optlen) + * Description + * Emulate a call to **getsockopt()** on the socket associated to + * *bpf_socket*, which must be a full socket. The *level* at + * which the option resides and the name *optname* of the option + * must be specified, see **getsockopt(2)** for more information. + * The retrieved value is stored in the structure pointed by + * *opval* and of length *optlen*. + * + * This helper actually implements a subset of **getsockopt()**. + * It supports the following *level*\ s: + * + * * **IPPROTO_TCP**, which supports *optname* + * **TCP_CONGESTION**. + * * **IPPROTO_IP**, which supports *optname* **IP_TOS**. + * * **IPPROTO_IPV6**, which supports *optname* **IPV6_TCLASS**. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_override_return(struct pt_reg *regs, u64 rc) + * Description + * Used for error injection, this helper uses kprobes to override + * the return value of the probed function, and to set it to *rc*. + * The first argument is the context *regs* on which the kprobe + * works. + * + * This helper works by setting setting the PC (program counter) + * to an override function which is run in place of the original + * probed function. This means the probed function is not run at + * all. The replacement function just returns with the required + * value. + * + * This helper has security implications, and thus is subject to + * restrictions. It is only available if the kernel was compiled + * with the **CONFIG_BPF_KPROBE_OVERRIDE** configuration + * option, and in this case it only works on functions tagged with + * **ALLOW_ERROR_INJECTION** in the kernel code. + * + * Also, the helper is only available for the architectures having + * the CONFIG_FUNCTION_ERROR_INJECTION option. As of this writing, + * x86 architecture is the only one to support this feature. + * Return + * 0 + * + * int bpf_sock_ops_cb_flags_set(struct bpf_sock_ops *bpf_sock, int argval) + * Description + * Attempt to set the value of the **bpf_sock_ops_cb_flags** field + * for the full TCP socket associated to *bpf_sock_ops* to + * *argval*. + * + * The primary use of this field is to determine if there should + * be calls to eBPF programs of type + * **BPF_PROG_TYPE_SOCK_OPS** at various points in the TCP + * code. A program of the same type can change its value, per + * connection and as necessary, when the connection is + * established. This field is directly accessible for reading, but + * this helper must be used for updates in order to return an + * error if an eBPF program tries to set a callback that is not + * supported in the current kernel. + * + * *argval* is a flag array which can combine these flags: + * + * * **BPF_SOCK_OPS_RTO_CB_FLAG** (retransmission time out) + * * **BPF_SOCK_OPS_RETRANS_CB_FLAG** (retransmission) + * * **BPF_SOCK_OPS_STATE_CB_FLAG** (TCP state change) + * * **BPF_SOCK_OPS_RTT_CB_FLAG** (every RTT) + * + * Therefore, this function can be used to clear a callback flag by + * setting the appropriate bit to zero. e.g. to disable the RTO + * callback: + * + * **bpf_sock_ops_cb_flags_set(bpf_sock,** + * **bpf_sock->bpf_sock_ops_cb_flags & ~BPF_SOCK_OPS_RTO_CB_FLAG)** + * + * Here are some examples of where one could call such eBPF + * program: + * + * * When RTO fires. + * * When a packet is retransmitted. + * * When the connection terminates. + * * When a packet is sent. + * * When a packet is received. + * Return + * Code **-EINVAL** if the socket is not a full TCP socket; + * otherwise, a positive number containing the bits that could not + * be set is returned (which comes down to 0 if all bits were set + * as required). + * + * int bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags) + * Description + * This helper is used in programs implementing policies at the + * socket level. If the message *msg* is allowed to pass (i.e. if + * the verdict eBPF program returns **SK_PASS**), redirect it to + * the socket referenced by *map* (of type + * **BPF_MAP_TYPE_SOCKMAP**) at index *key*. Both ingress and + * egress interfaces can be used for redirection. The + * **BPF_F_INGRESS** value in *flags* is used to make the + * distinction (ingress path is selected if the flag is present, + * egress path otherwise). This is the only flag supported for now. + * Return + * **SK_PASS** on success, or **SK_DROP** on error. + * + * int bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes) + * Description + * For socket policies, apply the verdict of the eBPF program to + * the next *bytes* (number of bytes) of message *msg*. + * + * For example, this helper can be used in the following cases: + * + * * A single **sendmsg**\ () or **sendfile**\ () system call + * contains multiple logical messages that the eBPF program is + * supposed to read and for which it should apply a verdict. + * * An eBPF program only cares to read the first *bytes* of a + * *msg*. If the message has a large payload, then setting up + * and calling the eBPF program repeatedly for all bytes, even + * though the verdict is already known, would create unnecessary + * overhead. + * + * When called from within an eBPF program, the helper sets a + * counter internal to the BPF infrastructure, that is used to + * apply the last verdict to the next *bytes*. If *bytes* is + * smaller than the current data being processed from a + * **sendmsg**\ () or **sendfile**\ () system call, the first + * *bytes* will be sent and the eBPF program will be re-run with + * the pointer for start of data pointing to byte number *bytes* + * **+ 1**. If *bytes* is larger than the current data being + * processed, then the eBPF verdict will be applied to multiple + * **sendmsg**\ () or **sendfile**\ () calls until *bytes* are + * consumed. + * + * Note that if a socket closes with the internal counter holding + * a non-zero value, this is not a problem because data is not + * being buffered for *bytes* and is sent as it is received. + * Return + * 0 + * + * int bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes) + * Description + * For socket policies, prevent the execution of the verdict eBPF + * program for message *msg* until *bytes* (byte number) have been + * accumulated. + * + * This can be used when one needs a specific number of bytes + * before a verdict can be assigned, even if the data spans + * multiple **sendmsg**\ () or **sendfile**\ () calls. The extreme + * case would be a user calling **sendmsg**\ () repeatedly with + * 1-byte long message segments. Obviously, this is bad for + * performance, but it is still valid. If the eBPF program needs + * *bytes* bytes to validate a header, this helper can be used to + * prevent the eBPF program to be called again until *bytes* have + * been accumulated. + * Return + * 0 + * + * int bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags) + * Description + * For socket policies, pull in non-linear data from user space + * for *msg* and set pointers *msg*\ **->data** and *msg*\ + * **->data_end** to *start* and *end* bytes offsets into *msg*, + * respectively. + * + * If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a + * *msg* it can only parse data that the (**data**, **data_end**) + * pointers have already consumed. For **sendmsg**\ () hooks this + * is likely the first scatterlist element. But for calls relying + * on the **sendpage** handler (e.g. **sendfile**\ ()) this will + * be the range (**0**, **0**) because the data is shared with + * user space and by default the objective is to avoid allowing + * user space to modify data while (or after) eBPF verdict is + * being decided. This helper can be used to pull in data and to + * set the start and end pointer to given values. Data will be + * copied if necessary (i.e. if data was not linear and if start + * and end pointers do not point to the same chunk). + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * + * All values for *flags* are reserved for future usage, and must + * be left at zero. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_bind(struct bpf_sock_addr *ctx, struct sockaddr *addr, int addr_len) + * Description + * Bind the socket associated to *ctx* to the address pointed by + * *addr*, of length *addr_len*. This allows for making outgoing + * connection from the desired IP address, which can be useful for + * example when all processes inside a cgroup should use one + * single IP address on a host that has multiple IP configured. + * + * This helper works for IPv4 and IPv6, TCP and UDP sockets. The + * domain (*addr*\ **->sa_family**) must be **AF_INET** (or + * **AF_INET6**). Looking for a free port to bind to can be + * expensive, therefore binding to port is not permitted by the + * helper: *addr*\ **->sin_port** (or **sin6_port**, respectively) + * must be set to zero. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_xdp_adjust_tail(struct xdp_buff *xdp_md, int delta) + * Description + * Adjust (move) *xdp_md*\ **->data_end** by *delta* bytes. It is + * only possible to shrink the packet as of this writing, + * therefore *delta* must be a negative integer. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_skb_get_xfrm_state(struct sk_buff *skb, u32 index, struct bpf_xfrm_state *xfrm_state, u32 size, u64 flags) + * Description + * Retrieve the XFRM state (IP transform framework, see also + * **ip-xfrm(8)**) at *index* in XFRM "security path" for *skb*. + * + * The retrieved value is stored in the **struct bpf_xfrm_state** + * pointed by *xfrm_state* and of length *size*. + * + * All values for *flags* are reserved for future usage, and must + * be left at zero. + * + * This helper is available only if the kernel was compiled with + * **CONFIG_XFRM** configuration option. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_get_stack(struct pt_regs *regs, void *buf, u32 size, u64 flags) + * Description + * Return a user or a kernel stack in bpf program provided buffer. + * To achieve this, the helper needs *ctx*, which is a pointer + * to the context on which the tracing program is executed. + * To store the stacktrace, the bpf program provides *buf* with + * a nonnegative *size*. + * + * The last argument, *flags*, holds the number of stack frames to + * skip (from 0 to 255), masked with + * **BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set + * the following flags: + * + * **BPF_F_USER_STACK** + * Collect a user space stack instead of a kernel stack. + * **BPF_F_USER_BUILD_ID** + * Collect buildid+offset instead of ips for user stack, + * only valid if **BPF_F_USER_STACK** is also specified. + * + * **bpf_get_stack**\ () can collect up to + * **PERF_MAX_STACK_DEPTH** both kernel and user frames, subject + * to sufficient large buffer size. Note that + * this limit can be controlled with the **sysctl** program, and + * that it should be manually increased in order to profile long + * user stacks (such as stacks for Java programs). To do so, use: + * + * :: + * + * # sysctl kernel.perf_event_max_stack= + * Return + * A non-negative value equal to or less than *size* on success, + * or a negative error in case of failure. + * + * int bpf_skb_load_bytes_relative(const struct sk_buff *skb, u32 offset, void *to, u32 len, u32 start_header) * Description * This helper is similar to **bpf_skb_load_bytes**\ () in that * it provides an easy way to load *len* bytes from *offset* @@ -752,174 +2023,39 @@ union bpf_attr { * in socket filters where *skb*\ **->data** does not always point * to the start of the mac header and where "direct packet access" * is not available. - * * Return * 0 on success, or a negative error in case of failure. * - * int bpf_bind(ctx, addr, addr_len) - * Bind socket to address. Only binding to IP is supported, no port can be - * set in addr. - * @ctx: pointer to context of type bpf_sock_addr - * @addr: pointer to struct sockaddr to bind socket to - * @addr_len: length of sockaddr structure - * Return: 0 on success or negative error code - * - * void* get_local_storage(void *map, u64 flags) + * int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags) * Description - * Get the pointer to the local storage area. - * The type and the size of the local storage is defined - * by the *map* argument. - * The *flags* meaning is specific for each map type, - * and has to be 0 for cgroup local storage. + * Do FIB lookup in kernel tables using parameters in *params*. + * If lookup is successful and result shows packet is to be + * forwarded, the neighbor tables are searched for the nexthop. + * If successful (ie., FIB lookup shows forwarding and nexthop + * is resolved), the nexthop address is returned in ipv4_dst + * or ipv6_dst based on family, smac is set to mac address of + * egress device, dmac is set to nexthop mac address, rt_metric + * is set to metric from route (IPv4/IPv6 only), and ifindex + * is set to the device index of the nexthop from the FIB lookup. * - * Depending on the bpf program type, a local storage area - * can be shared between multiple instances of the bpf program, - * running simultaneously. + * *plen* argument is the size of the passed in struct. + * *flags* argument can be a combination of one or more of the + * following values: * - * A user should care about the synchronization by himself. - * For example, by using the BPF_STX_XADD instruction to alter - * the shared data. - * Return - * Pointer to the local storage area. + * **BPF_FIB_LOOKUP_DIRECT** + * Do a direct table lookup vs full lookup using FIB + * rules. + * **BPF_FIB_LOOKUP_OUTPUT** + * Perform lookup from an egress perspective (default is + * ingress). * - * u64 bpf_get_current_cgroup_id(void) - * Return - * A 64-bit integer containing the current cgroup id based - * on the cgroup within which the current task is running. - * - * int bpf_xdp_adjust_meta(xdp_md, delta) - * Adjust the xdp_md.data_meta by delta - * @xdp_md: pointer to xdp_md - * @delta: An positive/negative integer to be added to xdp_md.data_meta - * Return: 0 on success or negative on error - * - * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk) - * Description - * This helper gets a **struct bpf_sock** pointer such - * that all the fields in bpf_sock can be accessed. - * Return - * A **struct bpf_sock** pointer on success, or NULL in - * case of failure. - * - * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags) - * Description - * Look for TCP socket matching *tuple*, optionally in a child - * network namespace *netns*. The return value must be checked, - * and if non-NULL, released via **bpf_sk_release**\ (). - * - * The *ctx* should point to the context of the program, such as - * the skb or socket (depending on the hook in use). This is used - * to determine the base network namespace for the lookup. - * - * *tuple_size* must be one of: - * - * **sizeof**\ (*tuple*\ **->ipv4**) - * Look for an IPv4 socket. - * **sizeof**\ (*tuple*\ **->ipv6**) - * Look for an IPv6 socket. - * - * If the *netns* is zero, then the socket lookup table in the - * netns associated with the *ctx* will be used. For the TC hooks, - * this in the netns of the device in the skb. For socket hooks, - * this in the netns of the socket. If *netns* is non-zero, then - * it specifies the ID of the netns relative to the netns - * associated with the *ctx*. - * - * All values for *flags* are reserved for future usage, and must - * be left at zero. - * - * This helper is available only if the kernel was compiled with - * **CONFIG_NET** configuration option. - * Return - * Pointer to *struct bpf_sock*, or NULL in case of failure. - * For sockets with reuseport option, *struct bpf_sock* - * return is from reuse->socks[] using hash of the packet. - * - * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u32 netns, u64 flags) - * Description - * Look for UDP socket matching *tuple*, optionally in a child - * network namespace *netns*. The return value must be checked, - * and if non-NULL, released via **bpf_sk_release**\ (). - * - * The *ctx* should point to the context of the program, such as - * the skb or socket (depending on the hook in use). This is used - * to determine the base network namespace for the lookup. - * - * *tuple_size* must be one of: - * - * **sizeof**\ (*tuple*\ **->ipv4**) - * Look for an IPv4 socket. - * **sizeof**\ (*tuple*\ **->ipv6**) - * Look for an IPv6 socket. - * - * If the *netns* is zero, then the socket lookup table in the - * netns associated with the *ctx* will be used. For the TC hooks, - * this in the netns of the device in the skb. For socket hooks, - * this in the netns of the socket. If *netns* is non-zero, then - * it specifies the ID of the netns relative to the netns - * associated with the *ctx*. - * - * All values for *flags* are reserved for future usage, and must - * be left at zero. - * - * This helper is available only if the kernel was compiled with - * **CONFIG_NET** configuration option. - * Return - * Pointer to *struct bpf_sock*, or NULL in case of failure. - * For sockets with reuseport option, *struct bpf_sock* - * return is from reuse->socks[] using hash of the packet. - * - * int bpf_sk_release(struct bpf_sock *sk) - * Description - * Release the reference held by *sock*. *sock* must be a non-NULL - * pointer that was returned from bpf_sk_lookup_xxx\ (). - * Return - * 0 on success, or a negative error in case of failure. - * - * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk) - * Description - * This helper gets a **struct bpf_tcp_sock** pointer from a - * **struct bpf_sock** pointer. - * - * Return - * A **struct bpf_tcp_sock** pointer on success, or NULL in - * case of failure. - * - * void *bpf_sk_storage_get(struct bpf_map *map, struct bpf_sock *sk, void *value, u64 flags) - * Description - * Get a bpf-local-storage from a sk. - * - * Logically, it could be thought of getting the value from - * a *map* with *sk* as the **key**. From this - * perspective, the usage is not much different from - * **bpf_map_lookup_elem(map, &sk)** except this - * helper enforces the key must be a **bpf_fullsock()** - * and the map must be a BPF_MAP_TYPE_SK_STORAGE also. - * - * Underneath, the value is stored locally at *sk* instead of - * the map. The *map* is used as the bpf-local-storage **type**. - * The bpf-local-storage **type** (i.e. the *map*) is searched - * against all bpf-local-storages residing at sk. - * - * An optional *flags* (BPF_SK_STORAGE_GET_F_CREATE) can be - * used such that a new bpf-local-storage will be - * created if one does not exist. *value* can be used - * together with BPF_SK_STORAGE_GET_F_CREATE to specify - * the initial value of a bpf-local-storage. If *value* is - * NULL, the new bpf-local-storage will be zero initialized. - * Return - * A bpf-local-storage pointer is returned on success. - * - * **NULL** if not found or there was an error in adding - * a new bpf-local-storage. - * - * int bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk) - * Description - * Delete a bpf-local-storage from a sk. - * Return - * 0 on success. - * - * **-ENOENT** if the bpf-local-storage cannot be found. + * *ctx* is either **struct xdp_md** for XDP programs or + * **struct sk_buff** tc cls_act programs. + * Return + * * < 0 if any input argument is invalid + * * 0 on success (packet is forwarded, nexthop neighbor exists) + * * > 0 one of **BPF_FIB_LKUP_RET_** codes explaining why the + * packet is not forwarded or needs assist from full stack * * int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags) * Description @@ -967,6 +2103,573 @@ union bpf_attr { * egress otherwise). This is the only flag supported for now. * Return * **SK_PASS** on success, or **SK_DROP** on error. + * + * int bpf_lwt_push_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len) + * Description + * Encapsulate the packet associated to *skb* within a Layer 3 + * protocol header. This header is provided in the buffer at + * address *hdr*, with *len* its size in bytes. *type* indicates + * the protocol of the header and can be one of: + * + * **BPF_LWT_ENCAP_SEG6** + * IPv6 encapsulation with Segment Routing Header + * (**struct ipv6_sr_hdr**). *hdr* only contains the SRH, + * the IPv6 header is computed by the kernel. + * **BPF_LWT_ENCAP_SEG6_INLINE** + * Only works if *skb* contains an IPv6 packet. Insert a + * Segment Routing Header (**struct ipv6_sr_hdr**) inside + * the IPv6 header. + * **BPF_LWT_ENCAP_IP** + * IP encapsulation (GRE/GUE/IPIP/etc). The outer header + * must be IPv4 or IPv6, followed by zero or more + * additional headers, up to LWT_BPF_MAX_HEADROOM total + * bytes in all prepended headers. Please note that + * if skb_is_gso(skb) is true, no more than two headers + * can be prepended, and the inner header, if present, + * should be either GRE or UDP/GUE. + * + * BPF_LWT_ENCAP_SEG6*** types can be called by bpf programs of + * type BPF_PROG_TYPE_LWT_IN; BPF_LWT_ENCAP_IP type can be called + * by bpf programs of types BPF_PROG_TYPE_LWT_IN and + * BPF_PROG_TYPE_LWT_XMIT. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_lwt_seg6_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len) + * Description + * Store *len* bytes from address *from* into the packet + * associated to *skb*, at *offset*. Only the flags, tag and TLVs + * inside the outermost IPv6 Segment Routing Header can be + * modified through this helper. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_lwt_seg6_adjust_srh(struct sk_buff *skb, u32 offset, s32 delta) + * Description + * Adjust the size allocated to TLVs in the outermost IPv6 + * Segment Routing Header contained in the packet associated to + * *skb*, at position *offset* by *delta* bytes. Only offsets + * after the segments are accepted. *delta* can be as well + * positive (growing) as negative (shrinking). + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_lwt_seg6_action(struct sk_buff *skb, u32 action, void *param, u32 param_len) + * Description + * Apply an IPv6 Segment Routing action of type *action* to the + * packet associated to *skb*. Each action takes a parameter + * contained at address *param*, and of length *param_len* bytes. + * *action* can be one of: + * + * **SEG6_LOCAL_ACTION_END_X** + * End.X action: Endpoint with Layer-3 cross-connect. + * Type of *param*: **struct in6_addr**. + * **SEG6_LOCAL_ACTION_END_T** + * End.T action: Endpoint with specific IPv6 table lookup. + * Type of *param*: **int**. + * **SEG6_LOCAL_ACTION_END_B6** + * End.B6 action: Endpoint bound to an SRv6 policy. + * Type of param: **struct ipv6_sr_hdr**. + * **SEG6_LOCAL_ACTION_END_B6_ENCAP** + * End.B6.Encap action: Endpoint bound to an SRv6 + * encapsulation policy. + * Type of param: **struct ipv6_sr_hdr**. + * + * A call to this helper is susceptible to change the underlaying + * packet buffer. Therefore, at load time, all checks on pointers + * previously done by the verifier are invalidated and must be + * performed again, if the helper is used in combination with + * direct packet access. + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_rc_keydown(void *ctx, u32 protocol, u64 scancode, u32 toggle) + * Description + * This helper is used in programs implementing IR decoding, to + * report a successfully decoded key press with *scancode*, + * *toggle* value in the given *protocol*. The scancode will be + * translated to a keycode using the rc keymap, and reported as + * an input key down event. After a period a key up event is + * generated. This period can be extended by calling either + * **bpf_rc_keydown** () again with the same values, or calling + * **bpf_rc_repeat** (). + * + * Some protocols include a toggle bit, in case the button was + * released and pressed again between consecutive scancodes. + * + * The *ctx* should point to the lirc sample as passed into + * the program. + * + * The *protocol* is the decoded protocol number (see + * **enum rc_proto** for some predefined values). + * + * This helper is only available is the kernel was compiled with + * the **CONFIG_BPF_LIRC_MODE2** configuration option set to + * "**y**". + * Return + * 0 + * + * int bpf_rc_repeat(void *ctx) + * Description + * This helper is used in programs implementing IR decoding, to + * report a successfully decoded repeat key message. This delays + * the generation of a key up event for previously generated + * key down event. + * + * Some IR protocols like NEC have a special IR message for + * repeating last button, for when a button is held down. + * + * The *ctx* should point to the lirc sample as passed into + * the program. + * + * This helper is only available is the kernel was compiled with + * the **CONFIG_BPF_LIRC_MODE2** configuration option set to + * "**y**". + * Return + * 0 + * + * uint64_t bpf_skb_cgroup_id(struct sk_buff *skb) + * Description + * Return the cgroup v2 id of the socket associated with the *skb*. + * This is roughly similar to the **bpf_get_cgroup_classid**\ () + * helper for cgroup v1 by providing a tag resp. identifier that + * can be matched on or used for map lookups e.g. to implement + * policy. The cgroup v2 id of a given path in the hierarchy is + * exposed in user space through the f_handle API in order to get + * to the same 64-bit id. + * + * This helper can be used on TC egress path, but not on ingress, + * and is available only if the kernel was compiled with the + * **CONFIG_SOCK_CGROUP_DATA** configuration option. + * Return + * The id is returned or 0 in case the id could not be retrieved. + * + * u64 bpf_skb_ancestor_cgroup_id(struct sk_buff *skb, int ancestor_level) + * Description + * Return id of cgroup v2 that is ancestor of cgroup associated + * with the *skb* at the *ancestor_level*. The root cgroup is at + * *ancestor_level* zero and each step down the hierarchy + * increments the level. If *ancestor_level* == level of cgroup + * associated with *skb*, then return value will be same as that + * of **bpf_skb_cgroup_id**\ (). + * + * The helper is useful to implement policies based on cgroups + * that are upper in hierarchy than immediate cgroup associated + * with *skb*. + * + * The format of returned id and helper limitations are same as in + * **bpf_skb_cgroup_id**\ (). + * Return + * The id is returned or 0 in case the id could not be retrieved. + * + * u64 bpf_get_current_cgroup_id(void) + * Return + * A 64-bit integer containing the current cgroup id based + * on the cgroup within which the current task is running. + * + * void* get_local_storage(void *map, u64 flags) + * Description + * Get the pointer to the local storage area. + * The type and the size of the local storage is defined + * by the *map* argument. + * The *flags* meaning is specific for each map type, + * and has to be 0 for cgroup local storage. + * + * Depending on the bpf program type, a local storage area + * can be shared between multiple instances of the bpf program, + * running simultaneously. + * + * A user should care about the synchronization by himself. + * For example, by using the BPF_STX_XADD instruction to alter + * the shared data. + * Return + * Pointer to the local storage area. + * + * int bpf_sk_select_reuseport(struct sk_reuseport_md *reuse, struct bpf_map *map, void *key, u64 flags) + * Description + * Select a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY map + * It checks the selected sk is matching the incoming + * request in the skb. + * Return + * 0 on success, or a negative error in case of failure. + * + * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags) + * Description + * Look for TCP socket matching *tuple*, optionally in a child + * network namespace *netns*. The return value must be checked, + * and if non-NULL, released via **bpf_sk_release**\ (). + * + * The *ctx* should point to the context of the program, such as + * the skb or socket (depending on the hook in use). This is used + * to determine the base network namespace for the lookup. + * + * *tuple_size* must be one of: + * + * **sizeof**\ (*tuple*\ **->ipv4**) + * Look for an IPv4 socket. + * **sizeof**\ (*tuple*\ **->ipv6**) + * Look for an IPv6 socket. + * + * If the *netns* is a negative signed 32-bit integer, then the + * socket lookup table in the netns associated with the *ctx* will + * will be used. For the TC hooks, this is the netns of the device + * in the skb. For socket hooks, this is the netns of the socket. + * If *netns* is any other signed 32-bit value greater than or + * equal to zero then it specifies the ID of the netns relative to + * the netns associated with the *ctx*. *netns* values beyond the + * range of 32-bit integers are reserved for future use. + * + * All values for *flags* are reserved for future usage, and must + * be left at zero. + * + * This helper is available only if the kernel was compiled with + * **CONFIG_NET** configuration option. + * Return + * Pointer to *struct bpf_sock*, or NULL in case of failure. + * For sockets with reuseport option, *struct bpf_sock* + * return is from reuse->socks[] using hash of the packet. + * + * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags) + * Description + * Look for UDP socket matching *tuple*, optionally in a child + * network namespace *netns*. The return value must be checked, + * and if non-NULL, released via **bpf_sk_release**\ (). + * + * The *ctx* should point to the context of the program, such as + * the skb or socket (depending on the hook in use). This is used + * to determine the base network namespace for the lookup. + * + * *tuple_size* must be one of: + * + * **sizeof**\ (*tuple*\ **->ipv4**) + * Look for an IPv4 socket. + * **sizeof**\ (*tuple*\ **->ipv6**) + * Look for an IPv6 socket. + * + * If the *netns* is a negative signed 32-bit integer, then the + * socket lookup table in the netns associated with the *ctx* will + * will be used. For the TC hooks, this is the netns of the device + * in the skb. For socket hooks, this is the netns of the socket. + * If *netns* is any other signed 32-bit value greater than or + * equal to zero then it specifies the ID of the netns relative to + * the netns associated with the *ctx*. *netns* values beyond the + * range of 32-bit integers are reserved for future use. + * + * All values for *flags* are reserved for future usage, and must + * be left at zero. + * + * This helper is available only if the kernel was compiled with + * **CONFIG_NET** configuration option. + * Return + * Pointer to *struct bpf_sock*, or NULL in case of failure. + * For sockets with reuseport option, *struct bpf_sock* + * return is from reuse->socks[] using hash of the packet. + * + * int bpf_sk_release(struct bpf_sock *sk) + * Description + * Release the reference held by *sock*. *sock* must be a non-NULL + * pointer that was returned from bpf_sk_lookup_xxx\ (). + * Return + * 0 on success, or a negative error in case of failure. + * + * u64 bpf_ktime_get_boot_ns(void) + * Description + * Return the time elapsed since system boot, in nanoseconds. + * Does include the time the system was suspended. + * See: clock_gettime(CLOCK_BOOTTIME) + * Return + * Current *ktime*. + * + * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk) + * Description + * This helper gets a **struct bpf_sock** pointer such + * that all the fields in bpf_sock can be accessed. + * Return + * A **struct bpf_sock** pointer on success, or NULL in + * case of failure. + * + * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk) + * Description + * This helper gets a **struct bpf_tcp_sock** pointer from a + * **struct bpf_sock** pointer. + * + * Return + * A **struct bpf_tcp_sock** pointer on success, or NULL in + * case of failure. + * + * int bpf_skb_ecn_set_ce(struct sk_buf *skb) + * Description + * Sets ECN of IP header to ce (congestion encountered) if + * current value is ect (ECN capable). Works with IPv6 and IPv4. + * Return + * 1 if set, 0 if not set. + * + * struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk) + * Description + * Return a **struct bpf_sock** pointer in TCP_LISTEN state. + * bpf_sk_release() is unnecessary and not allowed. + * Return + * A **struct bpf_sock** pointer on success, or NULL in + * case of failure. + * + * struct bpf_sock *bpf_skc_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags) + * Description + * Look for TCP socket matching *tuple*, optionally in a child + * network namespace *netns*. The return value must be checked, + * and if non-**NULL**, released via **bpf_sk_release**\ (). + * + * This function is identical to bpf_sk_lookup_tcp, except that it + * also returns timewait or request sockets. Use bpf_sk_fullsock + * or bpf_tcp_socket to access the full structure. + * + * This helper is available only if the kernel was compiled with + * **CONFIG_NET** configuration option. + * Return + * Pointer to **struct bpf_sock**, or **NULL** in case of failure. + * For sockets with reuseport option, the **struct bpf_sock** + * result is from **reuse->socks**\ [] using the hash of the tuple. + * + * int bpf_sysctl_get_name(struct bpf_sysctl *ctx, char *buf, size_t buf_len, u64 flags) + * Description + * Get name of sysctl in /proc/sys/ and copy it into provided by + * program buffer *buf* of size *buf_len*. + * + * The buffer is always NUL terminated, unless it's zero-sized. + * + * If *flags* is zero, full name (e.g. "net/ipv4/tcp_mem") is + * copied. Use **BPF_F_SYSCTL_BASE_NAME** flag to copy base name + * only (e.g. "tcp_mem"). + * Return + * Number of character copied (not including the trailing NUL). + * + * **-E2BIG** if the buffer wasn't big enough (*buf* will contain + * truncated name in this case). + * + * int bpf_sysctl_get_current_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len) + * Description + * Get current value of sysctl as it is presented in /proc/sys + * (incl. newline, etc), and copy it as a string into provided + * by program buffer *buf* of size *buf_len*. + * + * The whole value is copied, no matter what file position user + * space issued e.g. sys_read at. + * + * The buffer is always NUL terminated, unless it's zero-sized. + * Return + * Number of character copied (not including the trailing NUL). + * + * **-E2BIG** if the buffer wasn't big enough (*buf* will contain + * truncated name in this case). + * + * **-EINVAL** if current value was unavailable, e.g. because + * sysctl is uninitialized and read returns -EIO for it. + * + * int bpf_sysctl_get_new_value(struct bpf_sysctl *ctx, char *buf, size_t buf_len) + * Description + * Get new value being written by user space to sysctl (before + * the actual write happens) and copy it as a string into + * provided by program buffer *buf* of size *buf_len*. + * + * User space may write new value at file position > 0. + * + * The buffer is always NUL terminated, unless it's zero-sized. + * Return + * Number of character copied (not including the trailing NUL). + * + * **-E2BIG** if the buffer wasn't big enough (*buf* will contain + * truncated name in this case). + * + * **-EINVAL** if sysctl is being read. + * + * int bpf_sysctl_set_new_value(struct bpf_sysctl *ctx, const char *buf, size_t buf_len) + * Description + * Override new value being written by user space to sysctl with + * value provided by program in buffer *buf* of size *buf_len*. + * + * *buf* should contain a string in same form as provided by user + * space on sysctl write. + * + * User space may write new value at file position > 0. To override + * the whole sysctl value file position should be set to zero. + * Return + * 0 on success. + * + * **-E2BIG** if the *buf_len* is too big. + * + * **-EINVAL** if sysctl is being read. + * + * int bpf_strtol(const char *buf, size_t buf_len, u64 flags, long *res) + * Description + * Convert the initial part of the string from buffer *buf* of + * size *buf_len* to a long integer according to the given base + * and save the result in *res*. + * + * The string may begin with an arbitrary amount of white space + * (as determined by isspace(3)) followed by a single optional '-' + * sign. + * + * Five least significant bits of *flags* encode base, other bits + * are currently unused. + * + * Base must be either 8, 10, 16 or 0 to detect it automatically + * similar to user space strtol(3). + * Return + * Number of characters consumed on success. Must be positive but + * no more than buf_len. + * + * **-EINVAL** if no valid digits were found or unsupported base + * was provided. + * + * **-ERANGE** if resulting value was out of range. + * + * int bpf_strtoul(const char *buf, size_t buf_len, u64 flags, unsigned long *res) + * Description + * Convert the initial part of the string from buffer *buf* of + * size *buf_len* to an unsigned long integer according to the + * given base and save the result in *res*. + * + * The string may begin with an arbitrary amount of white space + * (as determined by isspace(3)). + * + * Five least significant bits of *flags* encode base, other bits + * are currently unused. + * + * Base must be either 8, 10, 16 or 0 to detect it automatically + * similar to user space strtoul(3). + * Return + * Number of characters consumed on success. Must be positive but + * no more than buf_len. + * + * **-EINVAL** if no valid digits were found or unsupported base + * was provided. + * + * **-ERANGE** if resulting value was out of range. + * + * void *bpf_sk_storage_get(struct bpf_map *map, struct bpf_sock *sk, void *value, u64 flags) + * Description + * Get a bpf-local-storage from a sk. + * + * Logically, it could be thought of getting the value from + * a *map* with *sk* as the **key**. From this + * perspective, the usage is not much different from + * **bpf_map_lookup_elem(map, &sk)** except this + * helper enforces the key must be a **bpf_fullsock()** + * and the map must be a BPF_MAP_TYPE_SK_STORAGE also. + * + * Underneath, the value is stored locally at *sk* instead of + * the map. The *map* is used as the bpf-local-storage **type**. + * The bpf-local-storage **type** (i.e. the *map*) is searched + * against all bpf-local-storages residing at sk. + * + * An optional *flags* (BPF_SK_STORAGE_GET_F_CREATE) can be + * used such that a new bpf-local-storage will be + * created if one does not exist. *value* can be used + * together with BPF_SK_STORAGE_GET_F_CREATE to specify + * the initial value of a bpf-local-storage. If *value* is + * NULL, the new bpf-local-storage will be zero initialized. + * Return + * A bpf-local-storage pointer is returned on success. + * + * **NULL** if not found or there was an error in adding + * a new bpf-local-storage. + * + * int bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk) + * Description + * Delete a bpf-local-storage from a sk. + * Return + * 0 on success. + * + * **-ENOENT** if the bpf-local-storage cannot be found. + * + * int bpf_msg_push_data(struct sk_buff *skb, u32 start, u32 len, u64 flags) + * Description + * For socket policies, insert *len* bytes into msg at offset + * *start*. + * + * If a program of type **BPF_PROG_TYPE_SK_MSG** is run on a + * *msg* it may want to insert metadata or options into the msg. + * This can later be read and used by any of the lower layer BPF + * hooks. + * + * This helper may fail if under memory pressure (a malloc + * fails) in these cases BPF programs will get an appropriate + * error and BPF programs will need to handle them. + * + * Return + * 0 on success, or a negative error in case of failure. + * + * int bpf_msg_pop_data(struct sk_msg_buff *msg, u32 start, u32 pop, u64 flags) + * Description + * Will remove *pop* bytes from a *msg* starting at byte *start*. + * This may result in **ENOMEM** errors under certain situations if + * an allocation and copy are required due to a full ring buffer. + * However, the helper will try to avoid doing the allocation + * if possible. Other errors can occur if input parameters are + * invalid either due to *start* byte not being valid part of msg + * payload and/or *pop* value being to large. + * + * Return + * 0 on success, or a negative erro in case of failure. + * + * int bpf_tcp_check_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len) + * Description + * Check whether iph and th contain a valid SYN cookie ACK for + * the listening socket in sk. + * + * iph points to the start of the IPv4 or IPv6 header, while + * iph_len contains sizeof(struct iphdr) or sizeof(struct ip6hdr). + * + * th points to the start of the TCP header, while th_len contains + * sizeof(struct tcphdr). + * + * Return + * 0 if iph and th are a valid SYN cookie ACK, or a negative error + * otherwise. + * + * s64 bpf_tcp_gen_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len) + * Description + * Try to issue a SYN cookie for the packet with corresponding + * IP/TCP headers, *iph* and *th*, on the listening socket in *sk*. + * + * *iph* points to the start of the IPv4 or IPv6 header, while + * *iph_len* contains **sizeof**\ (**struct iphdr**) or + * **sizeof**\ (**struct ip6hdr**). + * + * *th* points to the start of the TCP header, while *th_len* + * contains the length of the TCP header. + * + * Return + * On success, lower 32 bits hold the generated SYN cookie in + * followed by 16 bits which hold the MSS value for that cookie, + * and the top 16 bits are unused. + * + * On failure, the returned value is one of the following: + * + * **-EINVAL** SYN cookie cannot be issued due to error + * + * **-ENOENT** SYN cookie should not be issued (no SYN flood) + * + * **-EOPNOTSUPP** kernel configuration does not enable SYN cookies + * + * **-EPROTONOSUPPORT** IP packet version is not 4 or 6 */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -1128,28 +2831,55 @@ enum bpf_func_id { /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */ #define BPF_F_TUNINFO_IPV6 (1ULL << 0) -/* BPF_FUNC_get_stackid flags. */ +/* flags for both BPF_FUNC_get_stackid and BPF_FUNC_get_stack. */ #define BPF_F_SKIP_FIELD_MASK 0xffULL #define BPF_F_USER_STACK (1ULL << 8) +/* flags used by BPF_FUNC_get_stackid only. */ #define BPF_F_FAST_STACK_CMP (1ULL << 9) #define BPF_F_REUSE_STACKID (1ULL << 10) +/* flags used by BPF_FUNC_get_stack only. */ +#define BPF_F_USER_BUILD_ID (1ULL << 11) /* BPF_FUNC_skb_set_tunnel_key flags. */ #define BPF_F_ZERO_CSUM_TX (1ULL << 1) #define BPF_F_DONT_FRAGMENT (1ULL << 2) +#define BPF_F_SEQ_NUMBER (1ULL << 3) -/* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */ +/* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and + * BPF_FUNC_perf_event_read_value flags. + */ #define BPF_F_INDEX_MASK 0xffffffffULL #define BPF_F_CURRENT_CPU BPF_F_INDEX_MASK /* BPF_FUNC_perf_event_output for sk_buff input context. */ #define BPF_F_CTXLEN_MASK (0xfffffULL << 32) +/* Current network namespace */ +#define BPF_F_CURRENT_NETNS (-1L) + +/* BPF_FUNC_skb_adjust_room flags. */ +#define BPF_F_ADJ_ROOM_FIXED_GSO (1ULL << 0) + +#define BPF_ADJ_ROOM_ENCAP_L2_MASK 0xff +#define BPF_ADJ_ROOM_ENCAP_L2_SHIFT 56 + +#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 (1ULL << 1) +#define BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 (1ULL << 2) +#define BPF_F_ADJ_ROOM_ENCAP_L4_GRE (1ULL << 3) +#define BPF_F_ADJ_ROOM_ENCAP_L4_UDP (1ULL << 4) +#define BPF_F_ADJ_ROOM_ENCAP_L2(len) (((__u64)len & \ + BPF_ADJ_ROOM_ENCAP_L2_MASK) \ + << BPF_ADJ_ROOM_ENCAP_L2_SHIFT) + +/* BPF_FUNC_sysctl_get_name flags. */ +#define BPF_F_SYSCTL_BASE_NAME (1ULL << 0) + /* BPF_FUNC_sk_storage_get flags */ #define BPF_SK_STORAGE_GET_F_CREATE (1ULL << 0) /* Mode for BPF_FUNC_skb_adjust_room helper. */ enum bpf_adj_room_mode { BPF_ADJ_ROOM_NET, + BPF_ADJ_ROOM_MAC, }; /* Mode for BPF_FUNC_skb_load_bytes_relative helper. */ @@ -1158,6 +2888,13 @@ enum bpf_hdr_start_off { BPF_HDR_START_NET, }; +/* Encapsulation type for BPF_FUNC_lwt_push_encap helper. */ +enum bpf_lwt_encap_mode { + BPF_LWT_ENCAP_SEG6, + BPF_LWT_ENCAP_SEG6_INLINE, + BPF_LWT_ENCAP_IP, +}; + #define __bpf_md_ptr(type, name) \ union { \ type name; \ @@ -1195,12 +2932,15 @@ struct __sk_buff { __u32 local_ip6[4]; /* Stored in network byte order */ __u32 remote_port; /* Stored in network byte order */ __u32 local_port; /* stored in host byte order */ - - __bpf_md_ptr(struct bpf_flow_keys *, flow_keys); /* ... here. */ + __u32 data_meta; - __bpf_md_ptr(struct bpf_sock *, sk); + __bpf_md_ptr(struct bpf_flow_keys *, flow_keys); __u64 tstamp; + __u32 wire_len; + __u32 gso_segs; + __bpf_md_ptr(struct bpf_sock *, sk); + __u32 gso_size; }; struct bpf_tunnel_key { @@ -1211,10 +2951,24 @@ struct bpf_tunnel_key { }; __u8 tunnel_tos; __u8 tunnel_ttl; - __u16 tunnel_ext; + __u16 tunnel_ext; /* Padding, future use. */ __u32 tunnel_label; }; +/* user accessible mirror of in-kernel xfrm_state. + * new fields can only be added to the end of this structure + */ +struct bpf_xfrm_state { + __u32 reqid; + __u32 spi; /* Stored in network byte order */ + __u16 family; + __u16 ext; /* Padding, future use. */ + union { + __u32 remote_ipv4; /* Stored in network byte order */ + __u32 remote_ipv6[4]; /* Stored in network byte order */ + }; +}; + /* Generic BPF return codes which all BPF program types may support. * The values are binary compatible with their TC_ACT_* counter-part to * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT @@ -1228,7 +2982,15 @@ enum bpf_ret_code { BPF_DROP = 2, /* 3-6 reserved */ BPF_REDIRECT = 7, - /* >127 are reserved for prog type specific return codes */ + /* >127 are reserved for prog type specific return codes. + * + * BPF_LWT_REROUTE: used by BPF_PROG_TYPE_LWT_IN and + * BPF_PROG_TYPE_LWT_XMIT to indicate that skb had been + * changed and should be routed based on its new L3 header. + * (This is an L3 redirect, as opposed to L2 redirect + * represented by BPF_REDIRECT above). + */ + BPF_LWT_REROUTE = 128, }; struct bpf_sock { @@ -1238,15 +3000,15 @@ struct bpf_sock { __u32 protocol; __u32 mark; __u32 priority; - __u32 src_ip4; /* Allows 1,2,4-byte read. - * Stored in network byte order. - */ - __u32 src_ip6[4]; /* Allows 1,2,4-byte read. - * Stored in network byte order. - */ - __u32 src_port; /* Allows 4-byte read. - * Stored in host byte order - */ + /* IP address also allows 1 and 2 bytes access */ + __u32 src_ip4; + __u32 src_ip6[4]; + __u32 src_port; /* host byte order */ + __be16 dst_port; /* network byte order */ + __u16 :16; /* zero padding */ + __u32 dst_ip4; + __u32 dst_ip6[4]; + __u32 state; }; struct bpf_tcp_sock { @@ -1286,6 +3048,12 @@ struct bpf_tcp_sock { * sum(delta(snd_una)), or how many bytes * were acked. */ + __u32 dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups + * total number of DSACK blocks received + */ + __u32 delivered; /* Total data packets delivered incl. rexmits */ + __u32 delivered_ce; /* Like the above but only ECE marked packets */ + __u32 icsk_retransmits; /* Number of unrecovered [RTO] timeouts */ }; struct bpf_sock_tuple { @@ -1305,6 +3073,10 @@ struct bpf_sock_tuple { }; }; +struct bpf_xdp_sock { + __u32 queue_id; +}; + #define XDP_PACKET_HEADROOM 256 /* User return codes for XDP prog type. @@ -1327,6 +3099,9 @@ struct xdp_md { __u32 data; __u32 data_end; __u32 data_meta; + /* Below access go through struct xdp_rxq_info */ + __u32 ingress_ifindex; /* rxq->dev->ifindex */ + __u32 rx_queue_index; /* rxq->queue_index */ }; enum sk_action { @@ -1340,6 +3115,40 @@ enum sk_action { struct sk_msg_md { __bpf_md_ptr(void *, data); __bpf_md_ptr(void *, data_end); + + __u32 family; + __u32 remote_ip4; /* Stored in network byte order */ + __u32 local_ip4; /* Stored in network byte order */ + __u32 remote_ip6[4]; /* Stored in network byte order */ + __u32 local_ip6[4]; /* Stored in network byte order */ + __u32 remote_port; /* Stored in network byte order */ + __u32 local_port; /* stored in host byte order */ + __u32 size; /* Total size of sk_msg */ +}; + +struct sk_reuseport_md { + /* + * Start of directly accessible data. It begins from + * the tcp/udp header. + */ + __bpf_md_ptr(void *, data); + /* End of directly accessible data */ + __bpf_md_ptr(void *, data_end); + /* + * Total length of packet (starting from the tcp/udp header). + * Note that the directly accessible bytes (data_end - data) + * could be less than this "len". Those bytes could be + * indirectly read by a helper "bpf_skb_load_bytes()". + */ + __u32 len; + /* + * Eth protocol in the mac header (network byte order). e.g. + * ETH_P_IP(0x0800) and ETH_P_IPV6(0x86DD) + */ + __u32 eth_protocol; + __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ + __u32 bind_inany; /* Is sock bound to an INANY address? */ + __u32 hash; /* A hash of the packet 4 tuples */ }; #define BPF_TAG_SIZE 8 @@ -1359,6 +3168,7 @@ struct bpf_prog_info { char name[BPF_OBJ_NAME_LEN]; __u32 ifindex; __u32 gpl_compatible:1; + __u32 :31; /* alignment pad */ __u64 netns_dev; __u64 netns_ino; __u32 nr_jited_ksyms; @@ -1368,13 +3178,17 @@ struct bpf_prog_info { __u32 btf_id; __u32 func_info_rec_size; __aligned_u64 func_info; - __u32 func_info_cnt; - __u32 line_info_cnt; + __u32 nr_func_info; + __u32 nr_line_info; __aligned_u64 line_info; __aligned_u64 jited_line_info; - __u32 jited_line_info_cnt; + __u32 nr_jited_line_info; __u32 line_info_rec_size; __u32 jited_line_info_rec_size; + __u32 nr_prog_tags; + __aligned_u64 prog_tags; + __u64 run_time_ns; + __u64 run_cnt; } __attribute__((aligned(8))); struct bpf_map_info { @@ -1386,6 +3200,7 @@ struct bpf_map_info { __u32 map_flags; char name[BPF_OBJ_NAME_LEN]; __u32 ifindex; + __u32 :32; __u64 netns_dev; __u64 netns_ino; __u32 btf_id; @@ -1393,6 +3208,12 @@ struct bpf_map_info { __u32 btf_value_type_id; } __attribute__((aligned(8))); +struct bpf_btf_info { + __aligned_u64 btf; + __u32 btf_size; + __u32 id; +} __attribute__((aligned(8))); + /* User bpf_sock_addr struct to access socket fields and sockaddr struct passed * by user and intended to be used by socket (e.g. to bind to, depends on * attach attach type). @@ -1402,7 +3223,7 @@ struct bpf_sock_addr { __u32 user_ip4; /* Allows 1,2,4-byte read and 4-byte write. * Stored in network byte order. */ - __u32 user_ip6[4]; /* Allows 1,2,4-byte read an 4-byte write. + __u32 user_ip6[4]; /* Allows 1,2,4,8-byte read and 4,8-byte write. * Stored in network byte order. */ __u32 user_port; /* Allows 4-byte read and write. @@ -1411,12 +3232,13 @@ struct bpf_sock_addr { __u32 family; /* Allows 4-byte read, but no write */ __u32 type; /* Allows 4-byte read, but no write */ __u32 protocol; /* Allows 4-byte read, but no write */ - __u32 msg_src_ip4; /* Allows 1,2,4-byte read an 4-byte write. + __u32 msg_src_ip4; /* Allows 1,2,4-byte read and 4-byte write. * Stored in network byte order. */ - __u32 msg_src_ip6[4]; /* Allows 1,2,4-byte read an 4-byte write. + __u32 msg_src_ip6[4]; /* Allows 1,2,4,8-byte read and 4,8-byte write. * Stored in network byte order. */ + __bpf_md_ptr(struct bpf_sock *, sk); }; /* User bpf_sock_ops struct to access socket values and specify request ops @@ -1428,8 +3250,9 @@ struct bpf_sock_addr { struct bpf_sock_ops { __u32 op; union { - __u32 reply; - __u32 replylong[4]; + __u32 args[4]; /* Optionally passed to bpf program */ + __u32 reply; /* Returned by bpf program */ + __u32 replylong[4]; /* Optionally returned by bpf prog */ }; __u32 family; __u32 remote_ip4; /* Stored in network byte order */ @@ -1438,8 +3261,47 @@ struct bpf_sock_ops { __u32 local_ip6[4]; /* Stored in network byte order */ __u32 remote_port; /* Stored in network byte order */ __u32 local_port; /* stored in host byte order */ + __u32 is_fullsock; /* Some TCP fields are only valid if + * there is a full socket. If not, the + * fields read as zero. + */ + __u32 snd_cwnd; + __u32 srtt_us; /* Averaged RTT << 3 in usecs */ + __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */ + __u32 state; + __u32 rtt_min; + __u32 snd_ssthresh; + __u32 rcv_nxt; + __u32 snd_nxt; + __u32 snd_una; + __u32 mss_cache; + __u32 ecn_flags; + __u32 rate_delivered; + __u32 rate_interval_us; + __u32 packets_out; + __u32 retrans_out; + __u32 total_retrans; + __u32 segs_in; + __u32 data_segs_in; + __u32 segs_out; + __u32 data_segs_out; + __u32 lost_out; + __u32 sacked_out; + __u32 sk_txhash; + __u64 bytes_received; + __u64 bytes_acked; + __bpf_md_ptr(struct bpf_sock *, sk); }; +/* Definitions for bpf_sock_ops_cb_flags */ +#define BPF_SOCK_OPS_RTO_CB_FLAG (1<<0) +#define BPF_SOCK_OPS_RETRANS_CB_FLAG (1<<1) +#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2) +#define BPF_SOCK_OPS_RTT_CB_FLAG (1<<3) +#define BPF_SOCK_OPS_ALL_CB_FLAGS 0xF /* Mask of all currently + * supported cb flags + */ + /* List of known BPF sock_ops operators. * New entries can only be added at the end */ @@ -1466,20 +3328,76 @@ enum { BPF_SOCK_OPS_NEEDS_ECN, /* If connection's congestion control * needs ECN */ + BPF_SOCK_OPS_BASE_RTT, /* Get base RTT. The correct value is + * based on the path and may be + * dependent on the congestion control + * algorithm. In general it indicates + * a congestion threshold. RTTs above + * this indicate congestion + */ + BPF_SOCK_OPS_RTO_CB, /* Called when an RTO has triggered. + * Arg1: value of icsk_retransmits + * Arg2: value of icsk_rto + * Arg3: whether RTO has expired + */ + BPF_SOCK_OPS_RETRANS_CB, /* Called when skb is retransmitted. + * Arg1: sequence number of 1st byte + * Arg2: # segments + * Arg3: return value of + * tcp_transmit_skb (0 => success) + */ + BPF_SOCK_OPS_STATE_CB, /* Called when TCP changes state. + * Arg1: old_state + * Arg2: new_state + */ + BPF_SOCK_OPS_TCP_LISTEN_CB, /* Called on listen(2), right after + * socket transition to LISTEN state. + */ + BPF_SOCK_OPS_RTT_CB, /* Called on every RTT. + */ +}; + +/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect + * changes between the TCP and BPF versions. Ideally this should never happen. + * If it does, we need to add code to convert them before calling + * the BPF sock_ops function. + */ +enum { + BPF_TCP_ESTABLISHED = 1, + BPF_TCP_SYN_SENT, + BPF_TCP_SYN_RECV, + BPF_TCP_FIN_WAIT1, + BPF_TCP_FIN_WAIT2, + BPF_TCP_TIME_WAIT, + BPF_TCP_CLOSE, + BPF_TCP_CLOSE_WAIT, + BPF_TCP_LAST_ACK, + BPF_TCP_LISTEN, + BPF_TCP_CLOSING, /* Now a valid state */ + BPF_TCP_NEW_SYN_RECV, + + BPF_TCP_MAX_STATES /* Leave at the end! */ }; #define TCP_BPF_IW 1001 /* Set TCP initial congestion window */ #define TCP_BPF_SNDCWND_CLAMP 1002 /* Set sndcwnd_clamp */ -#define BPF_DEVCG_ACC_MKNOD (1ULL << 0) -#define BPF_DEVCG_ACC_READ (1ULL << 1) -#define BPF_DEVCG_ACC_WRITE (1ULL << 2) +struct bpf_perf_event_value { + __u64 counter; + __u64 enabled; + __u64 running; +}; -#define BPF_DEVCG_DEV_BLOCK (1ULL << 0) -#define BPF_DEVCG_DEV_CHAR (1ULL << 1) +#define BPF_DEVCG_ACC_MKNOD (1ULL << 0) +#define BPF_DEVCG_ACC_READ (1ULL << 1) +#define BPF_DEVCG_ACC_WRITE (1ULL << 2) + +#define BPF_DEVCG_DEV_BLOCK (1ULL << 0) +#define BPF_DEVCG_DEV_CHAR (1ULL << 1) struct bpf_cgroup_dev_ctx { - __u32 access_type; /* (access << 16) | type */ + /* access_type encoded as (BPF_DEVCG_ACC_* << 16) | BPF_DEVCG_DEV_* */ + __u32 access_type; __u32 major; __u32 minor; }; @@ -1488,6 +3406,86 @@ struct bpf_raw_tracepoint_args { __u64 args[0]; }; +/* DIRECT: Skip the FIB rules and go to FIB table associated with device + * OUTPUT: Do lookup from egress perspective; default is ingress + */ +#define BPF_FIB_LOOKUP_DIRECT (1U << 0) +#define BPF_FIB_LOOKUP_OUTPUT (1U << 1) + +enum { + BPF_FIB_LKUP_RET_SUCCESS, /* lookup successful */ + BPF_FIB_LKUP_RET_BLACKHOLE, /* dest is blackholed; can be dropped */ + BPF_FIB_LKUP_RET_UNREACHABLE, /* dest is unreachable; can be dropped */ + BPF_FIB_LKUP_RET_PROHIBIT, /* dest not allowed; can be dropped */ + BPF_FIB_LKUP_RET_NOT_FWDED, /* packet is not forwarded */ + BPF_FIB_LKUP_RET_FWD_DISABLED, /* fwding is not enabled on ingress */ + BPF_FIB_LKUP_RET_UNSUPP_LWT, /* fwd requires encapsulation */ + BPF_FIB_LKUP_RET_NO_NEIGH, /* no neighbor entry for nh */ + BPF_FIB_LKUP_RET_FRAG_NEEDED, /* fragmentation required to fwd */ +}; + +struct bpf_fib_lookup { + /* input: network family for lookup (AF_INET, AF_INET6) + * output: network family of egress nexthop + */ + __u8 family; + + /* set if lookup is to consider L4 data - e.g., FIB rules */ + __u8 l4_protocol; + __be16 sport; + __be16 dport; + + /* total length of packet from network header - used for MTU check */ + __u16 tot_len; + + /* input: L3 device index for lookup + * output: device index from FIB lookup + */ + __u32 ifindex; + + union { + /* inputs to lookup */ + __u8 tos; /* AF_INET */ + __be32 flowinfo; /* AF_INET6, flow_label + priority */ + + /* output: metric of fib result (IPv4/IPv6 only) */ + __u32 rt_metric; + }; + + union { + __be32 ipv4_src; + __u32 ipv6_src[4]; /* in6_addr; network order */ + }; + + /* input to bpf_fib_lookup, ipv{4,6}_dst is destination address in + * network header. output: bpf_fib_lookup sets to gateway address + * if FIB lookup returns gateway route + */ + union { + __be32 ipv4_dst; + __u32 ipv6_dst[4]; /* in6_addr; network order */ + }; + + /* output */ + __be16 h_vlan_proto; + __be16 h_vlan_TCI; + __u8 smac[6]; /* ETH_ALEN */ + __u8 dmac[6]; /* ETH_ALEN */ +}; + +enum bpf_task_fd_type { + BPF_FD_TYPE_RAW_TRACEPOINT, /* tp name */ + BPF_FD_TYPE_TRACEPOINT, /* tp name */ + BPF_FD_TYPE_KPROBE, /* (symbol + offset) or addr */ + BPF_FD_TYPE_KRETPROBE, /* (symbol + offset) or addr */ + BPF_FD_TYPE_UPROBE, /* filename + offset */ + BPF_FD_TYPE_URETPROBE, /* filename + offset */ +}; + +#define BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG (1U << 0) +#define BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL (1U << 1) +#define BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP (1U << 2) + struct bpf_flow_keys { __u16 nhoff; __u16 thoff; @@ -1509,21 +3507,18 @@ struct bpf_flow_keys { __u32 ipv6_dst[4]; /* in6_addr; network order */ }; }; -}; - -struct bpf_sysctl { - __u32 write; /* Sysctl is being read (= 0) or written (= 1). - * Allows 1,2,4-byte read, but no write. - */ + __u32 flags; + __be32 flow_label; }; struct bpf_func_info { - __u32 insn_offset; + __u32 insn_off; __u32 type_id; }; #define BPF_LINE_INFO_LINE_NUM(line_col) ((line_col) >> 10) #define BPF_LINE_INFO_LINE_COL(line_col) ((line_col) & 0x3ff) + struct bpf_line_info { __u32 insn_off; __u32 file_name_off; @@ -1535,10 +3530,20 @@ struct bpf_spin_lock { __u32 val; }; +struct bpf_sysctl { + __u32 write; /* Sysctl is being read (= 0) or written (= 1). + * Allows 1,2,4-byte read, but no write. + */ + __u32 file_pos; /* Sysctl file position to read from, write to. + * Allows 1,2,4-byte read an 4-byte write. + */ +}; + struct bpf_sockopt { __bpf_md_ptr(struct bpf_sock *, sk); __bpf_md_ptr(void *, optval); __bpf_md_ptr(void *, optval_end); + __s32 level; __s32 optname; __s32 optlen; diff --git a/include/uapi/linux/bpf_common.h b/include/uapi/linux/bpf_common.h index 18be90725ab0..ee97668bdadb 100644 --- a/include/uapi/linux/bpf_common.h +++ b/include/uapi/linux/bpf_common.h @@ -15,9 +15,10 @@ /* ld/ldx fields */ #define BPF_SIZE(code) ((code) & 0x18) -#define BPF_W 0x00 -#define BPF_H 0x08 -#define BPF_B 0x10 +#define BPF_W 0x00 /* 32-bit */ +#define BPF_H 0x08 /* 16-bit */ +#define BPF_B 0x10 /* 8-bit */ +/* eBPF BPF_DW 0x18 64-bit */ #define BPF_MODE(code) ((code) & 0xe0) #define BPF_IMM 0x00 #define BPF_ABS 0x20 diff --git a/include/uapi/linux/bpfilter.h b/include/uapi/linux/bpfilter.h new file mode 100644 index 000000000000..2ec3cc99ea4c --- /dev/null +++ b/include/uapi/linux/bpfilter.h @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _UAPI_LINUX_BPFILTER_H +#define _UAPI_LINUX_BPFILTER_H + +#include + +enum { + BPFILTER_IPT_SO_SET_REPLACE = 64, + BPFILTER_IPT_SO_SET_ADD_COUNTERS = 65, + BPFILTER_IPT_SET_MAX, +}; + +enum { + BPFILTER_IPT_SO_GET_INFO = 64, + BPFILTER_IPT_SO_GET_ENTRIES = 65, + BPFILTER_IPT_SO_GET_REVISION_MATCH = 66, + BPFILTER_IPT_SO_GET_REVISION_TARGET = 67, + BPFILTER_IPT_GET_MAX, +}; + +#endif /* _UAPI_LINUX_BPFILTER_H */ diff --git a/include/uapi/linux/btf.h b/include/uapi/linux/btf.h index 11b0cd6bbc42..9310652ca4f9 100644 --- a/include/uapi/linux/btf.h +++ b/include/uapi/linux/btf.h @@ -22,9 +22,9 @@ struct btf_header { }; /* Max # of type identifier */ -#define BTF_MAX_TYPE 0x000fffff +#define BTF_MAX_TYPE 0x0000ffff /* Max offset into the string section */ -#define BTF_MAX_NAME_OFFSET 0x00ffffff +#define BTF_MAX_NAME_OFFSET 0x0000ffff /* Max # of struct/union/enum members or func args */ #define BTF_MAX_VLEN 0xffff @@ -39,11 +39,11 @@ struct btf_type { * struct, union and fwd */ __u32 info; - /* "size" is used by INT, ENUM, STRUCT and UNION. + /* "size" is used by INT, ENUM, STRUCT, UNION and DATASEC. * "size" tells the size of the type it is describing. * * "type" is used by PTR, TYPEDEF, VOLATILE, CONST, RESTRICT, - * FUNC and FUNC_PROTO. + * FUNC, FUNC_PROTO and VAR. * "type" is a type_id referring to another type. */ union { @@ -70,8 +70,10 @@ struct btf_type { #define BTF_KIND_RESTRICT 11 /* Restrict */ #define BTF_KIND_FUNC 12 /* Function */ #define BTF_KIND_FUNC_PROTO 13 /* Function Proto */ -#define BTF_KIND_MAX 13 -#define NR_BTF_KINDS 14 +#define BTF_KIND_VAR 14 /* Variable */ +#define BTF_KIND_DATASEC 15 /* Section */ +#define BTF_KIND_MAX BTF_KIND_DATASEC +#define NR_BTF_KINDS (BTF_KIND_MAX + 1) /* For some specific BTF_KIND, "struct btf_type" is immediately * followed by extra data. @@ -137,4 +139,27 @@ struct btf_param { __u32 name_off; __u32 type; }; + +enum { + BTF_VAR_STATIC = 0, + BTF_VAR_GLOBAL_ALLOCATED, +}; + +/* BTF_KIND_VAR is followed by a single "struct btf_var" to describe + * additional information related to the variable such as its linkage. + */ +struct btf_var { + __u32 linkage; +}; + +/* BTF_KIND_DATASEC is followed by multiple "struct btf_var_secinfo" + * to describe all BTF_KIND_VAR types it contains along with it's + * in-section offset as well as size. + */ +struct btf_var_secinfo { + __u32 type; + __u32 offset; + __u32 size; +}; + #endif /* _UAPI__LINUX_BTF_H__ */ diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h index 2b642bf9b5a0..232df14e1287 100644 --- a/include/uapi/linux/fib_rules.h +++ b/include/uapi/linux/fib_rules.h @@ -23,7 +23,7 @@ struct fib_rule_hdr { __u8 tos; __u8 table; - __u8 res1; /* reserved */ + __u8 res1; /* reserved */ __u8 res2; /* reserved */ __u8 action; @@ -35,6 +35,11 @@ struct fib_rule_uid_range { __u32 end; }; +struct fib_rule_port_range { + __u16 start; + __u16 end; +}; + enum { FRA_UNSPEC, FRA_DST, /* destination address */ @@ -58,6 +63,10 @@ enum { FRA_PAD, FRA_L3MDEV, /* iif or oif is l3mdev goto its table */ FRA_UID_RANGE, /* UID range */ + FRA_PROTOCOL, /* Originator of the rule */ + FRA_IP_PROTO, /* ip proto */ + FRA_SPORT_RANGE, /* sport */ + FRA_DPORT_RANGE, /* dport */ __FRA_MAX }; diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 6603c3a867c8..49a04cad14b2 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -915,6 +915,7 @@ enum { XDP_ATTACHED_DRV, XDP_ATTACHED_SKB, XDP_ATTACHED_HW, + XDP_ATTACHED_MULTI, }; enum { @@ -923,6 +924,9 @@ enum { IFLA_XDP_ATTACHED, IFLA_XDP_FLAGS, IFLA_XDP_PROG_ID, + IFLA_XDP_DRV_PROG_ID, + IFLA_XDP_SKB_PROG_ID, + IFLA_XDP_HW_PROG_ID, __IFLA_XDP_MAX, }; diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h new file mode 100644 index 000000000000..c3b61f4481f3 --- /dev/null +++ b/include/uapi/linux/if_xdp.h @@ -0,0 +1,78 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * if_xdp: XDP socket user-space interface + * Copyright(c) 2018 Intel Corporation. + * + * Author(s): Björn Töpel + * Magnus Karlsson + */ + +#ifndef _LINUX_IF_XDP_H +#define _LINUX_IF_XDP_H + +#include + +/* Options for the sxdp_flags field */ +#define XDP_SHARED_UMEM (1 << 0) +#define XDP_COPY (1 << 1) /* Force copy-mode */ +#define XDP_ZEROCOPY (1 << 2) /* Force zero-copy mode */ + +struct sockaddr_xdp { + __u16 sxdp_family; + __u32 sxdp_ifindex; + __u32 sxdp_queue_id; + __u32 sxdp_shared_umem_fd; + __u16 sxdp_flags; +}; + +struct xdp_ring_offset { + __u64 producer; + __u64 consumer; + __u64 desc; +}; + +struct xdp_mmap_offsets { + struct xdp_ring_offset rx; + struct xdp_ring_offset tx; + struct xdp_ring_offset fr; /* Fill */ + struct xdp_ring_offset cr; /* Completion */ +}; + +/* XDP socket options */ +#define XDP_MMAP_OFFSETS 1 +#define XDP_RX_RING 2 +#define XDP_TX_RING 3 +#define XDP_UMEM_REG 4 +#define XDP_UMEM_FILL_RING 5 +#define XDP_UMEM_COMPLETION_RING 6 +#define XDP_STATISTICS 7 + +struct xdp_umem_reg { + __u64 addr; /* Start of packet data area */ + __u64 len; /* Length of packet data area */ + __u32 chunk_size; + __u32 headroom; +}; + +struct xdp_statistics { + __u64 rx_dropped; /* Dropped for reasons other than invalid desc */ + __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */ + __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ +}; + +/* Pgoff for mmaping the rings */ +#define XDP_PGOFF_RX_RING 0 +#define XDP_PGOFF_TX_RING 0x80000000 +#define XDP_UMEM_PGOFF_FILL_RING 0x100000000ULL +#define XDP_UMEM_PGOFF_COMPLETION_RING 0x180000000ULL + +/* Rx/Tx descriptor */ +struct xdp_desc { + __u64 addr; + __u32 len; + __u32 options; +}; + +/* UMEM descriptor is __u64 */ + +#endif /* _LINUX_IF_XDP_H */ diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h index a96eb17ad6fc..593800a18799 100644 --- a/include/uapi/linux/ipv6_route.h +++ b/include/uapi/linux/ipv6_route.h @@ -29,7 +29,7 @@ #define RTF_ROUTEINFO 0x00800000 /* route information - RA */ -#define RTF_CACHE 0x01000000 /* cache entry */ +#define RTF_CACHE 0x01000000 /* read-only: can not be set by user */ #define RTF_FLOW 0x02000000 /* flow significant route */ #define RTF_POLICY 0x04000000 /* policy route */ diff --git a/include/uapi/linux/lirc.h b/include/uapi/linux/lirc.h index c3aef4316fbf..7db6063fa6a2 100644 --- a/include/uapi/linux/lirc.h +++ b/include/uapi/linux/lirc.h @@ -47,12 +47,14 @@ #define LIRC_MODE_RAW 0x00000001 #define LIRC_MODE_PULSE 0x00000002 #define LIRC_MODE_MODE2 0x00000004 +#define LIRC_MODE_SCANCODE 0x00000008 #define LIRC_MODE_LIRCCODE 0x00000010 #define LIRC_CAN_SEND_RAW LIRC_MODE2SEND(LIRC_MODE_RAW) #define LIRC_CAN_SEND_PULSE LIRC_MODE2SEND(LIRC_MODE_PULSE) #define LIRC_CAN_SEND_MODE2 LIRC_MODE2SEND(LIRC_MODE_MODE2) +#define LIRC_CAN_SEND_SCANCODE LIRC_MODE2SEND(LIRC_MODE_SCANCODE) #define LIRC_CAN_SEND_LIRCCODE LIRC_MODE2SEND(LIRC_MODE_LIRCCODE) #define LIRC_CAN_SEND_MASK 0x0000003f @@ -64,6 +66,7 @@ #define LIRC_CAN_REC_RAW LIRC_MODE2REC(LIRC_MODE_RAW) #define LIRC_CAN_REC_PULSE LIRC_MODE2REC(LIRC_MODE_PULSE) #define LIRC_CAN_REC_MODE2 LIRC_MODE2REC(LIRC_MODE_MODE2) +#define LIRC_CAN_REC_SCANCODE LIRC_MODE2REC(LIRC_MODE_SCANCODE) #define LIRC_CAN_REC_LIRCCODE LIRC_MODE2REC(LIRC_MODE_LIRCCODE) #define LIRC_CAN_REC_MASK LIRC_MODE2REC(LIRC_CAN_SEND_MASK) @@ -131,4 +134,91 @@ #define LIRC_SET_WIDEBAND_RECEIVER _IOW('i', 0x00000023, __u32) +/* + * Return the recording timeout, which is either set by + * the ioctl LIRC_SET_REC_TIMEOUT or by the kernel after setting the protocols. + */ +#define LIRC_GET_REC_TIMEOUT _IOR('i', 0x00000024, __u32) + +/* + * struct lirc_scancode - decoded scancode with protocol for use with + * LIRC_MODE_SCANCODE + * + * @timestamp: Timestamp in nanoseconds using CLOCK_MONOTONIC when IR + * was decoded. + * @flags: should be 0 for transmit. When receiving scancodes, + * LIRC_SCANCODE_FLAG_TOGGLE or LIRC_SCANCODE_FLAG_REPEAT can be set + * depending on the protocol + * @rc_proto: see enum rc_proto + * @keycode: the translated keycode. Set to 0 for transmit. + * @scancode: the scancode received or to be sent + */ +struct lirc_scancode { + __u64 timestamp; + __u16 flags; + __u16 rc_proto; + __u32 keycode; + __u64 scancode; +}; + +/* Set if the toggle bit of rc-5 or rc-6 is enabled */ +#define LIRC_SCANCODE_FLAG_TOGGLE 1 +/* Set if this is a nec or sanyo repeat */ +#define LIRC_SCANCODE_FLAG_REPEAT 2 + +/** + * enum rc_proto - the Remote Controller protocol + * + * @RC_PROTO_UNKNOWN: Protocol not known + * @RC_PROTO_OTHER: Protocol known but proprietary + * @RC_PROTO_RC5: Philips RC5 protocol + * @RC_PROTO_RC5X_20: Philips RC5x 20 bit protocol + * @RC_PROTO_RC5_SZ: StreamZap variant of RC5 + * @RC_PROTO_JVC: JVC protocol + * @RC_PROTO_SONY12: Sony 12 bit protocol + * @RC_PROTO_SONY15: Sony 15 bit protocol + * @RC_PROTO_SONY20: Sony 20 bit protocol + * @RC_PROTO_NEC: NEC protocol + * @RC_PROTO_NECX: Extended NEC protocol + * @RC_PROTO_NEC32: NEC 32 bit protocol + * @RC_PROTO_SANYO: Sanyo protocol + * @RC_PROTO_MCIR2_KBD: RC6-ish MCE keyboard + * @RC_PROTO_MCIR2_MSE: RC6-ish MCE mouse + * @RC_PROTO_RC6_0: Philips RC6-0-16 protocol + * @RC_PROTO_RC6_6A_20: Philips RC6-6A-20 protocol + * @RC_PROTO_RC6_6A_24: Philips RC6-6A-24 protocol + * @RC_PROTO_RC6_6A_32: Philips RC6-6A-32 protocol + * @RC_PROTO_RC6_MCE: MCE (Philips RC6-6A-32 subtype) protocol + * @RC_PROTO_SHARP: Sharp protocol + * @RC_PROTO_XMP: XMP protocol + * @RC_PROTO_CEC: CEC protocol + * @RC_PROTO_IMON: iMon Pad protocol + */ +enum rc_proto { + RC_PROTO_UNKNOWN = 0, + RC_PROTO_OTHER = 1, + RC_PROTO_RC5 = 2, + RC_PROTO_RC5X_20 = 3, + RC_PROTO_RC5_SZ = 4, + RC_PROTO_JVC = 5, + RC_PROTO_SONY12 = 6, + RC_PROTO_SONY15 = 7, + RC_PROTO_SONY20 = 8, + RC_PROTO_NEC = 9, + RC_PROTO_NECX = 10, + RC_PROTO_NEC32 = 11, + RC_PROTO_SANYO = 12, + RC_PROTO_MCIR2_KBD = 13, + RC_PROTO_MCIR2_MSE = 14, + RC_PROTO_RC6_0 = 15, + RC_PROTO_RC6_6A_20 = 16, + RC_PROTO_RC6_6A_24 = 17, + RC_PROTO_RC6_6A_32 = 18, + RC_PROTO_RC6_MCE = 19, + RC_PROTO_SHARP = 20, + RC_PROTO_XMP = 21, + RC_PROTO_CEC = 22, + RC_PROTO_IMON = 23, +}; + #endif diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h index 36d0b161e066..9cb40ab76aa8 100644 --- a/include/uapi/linux/openvswitch.h +++ b/include/uapi/linux/openvswitch.h @@ -360,6 +360,7 @@ enum ovs_tunnel_key_attr { OVS_TUNNEL_KEY_ATTR_IPV6_SRC, /* struct in6_addr src IPv6 address. */ OVS_TUNNEL_KEY_ATTR_IPV6_DST, /* struct in6_addr dst IPv6 address. */ OVS_TUNNEL_KEY_ATTR_PAD, + OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS, /* struct erspan_metadata */ __OVS_TUNNEL_KEY_ATTR_MAX }; diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 64a797500d4e..6c87b0a42e64 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -370,7 +370,9 @@ struct perf_event_attr { context_switch : 1, /* context switch data */ write_backward : 1, /* Write ring buffer from end to beginning */ namespaces : 1, /* include namespaces data */ - __reserved_1 : 35; + ksymbol : 1, /* include ksymbol events */ + bpf_event : 1, /* include bpf events */ + __reserved_1 : 33; union { __u32 wakeup_events; /* wakeup every n events */ @@ -380,10 +382,14 @@ struct perf_event_attr { __u32 bp_type; union { __u64 bp_addr; + __u64 kprobe_func; /* for perf_kprobe */ + __u64 uprobe_path; /* for perf_uprobe */ __u64 config1; /* extension of config */ }; union { __u64 bp_len; + __u64 kprobe_addr; /* when kprobe_func == NULL */ + __u64 probe_offset; /* for perf_[k,u]probe */ __u64 config2; /* extension of config1 */ }; __u64 branch_sample_type; /* enum perf_branch_sample_type */ @@ -425,21 +431,20 @@ struct perf_event_attr { */ struct perf_event_query_bpf { /* - * The below ids array length - */ - __u32 ids_len; + * The below ids array length + */ + __u32 ids_len; /* - * Set by the kernel to indicate the number of - * available programs - */ - __u32 prog_cnt; + * Set by the kernel to indicate the number of + * available programs + */ + __u32 prog_cnt; /* - * User provided buffer to store program ids - */ - __u32 ids[0]; + * User provided buffer to store program ids + */ + __u32 ids[0]; }; - #define perf_flags(attr) (*(&(attr)->read_format + 1)) /* @@ -455,7 +460,7 @@ struct perf_event_query_bpf { #define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *) #define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32) #define PERF_EVENT_IOC_PAUSE_OUTPUT _IOW('$', 9, __u32) -#define PERF_EVENT_IOC_QUERY_BPF _IOWR('$', 10, struct perf_event_query_bpf *) +#define PERF_EVENT_IOC_QUERY_BPF _IOWR('$', 10, struct perf_event_query_bpf *) enum perf_event_ioc_flags { PERF_IOC_FLAG_GROUP = 1U << 0, @@ -941,9 +946,58 @@ enum perf_event_type { */ PERF_RECORD_NAMESPACES = 16, + /* + * Record ksymbol register/unregister events: + * + * struct { + * struct perf_event_header header; + * u64 addr; + * u32 len; + * u16 ksym_type; + * u16 flags; + * char name[]; + * struct sample_id sample_id; + * }; + */ + PERF_RECORD_KSYMBOL = 17, + + /* + * Record bpf events: + * enum perf_bpf_event_type { + * PERF_BPF_EVENT_UNKNOWN = 0, + * PERF_BPF_EVENT_PROG_LOAD = 1, + * PERF_BPF_EVENT_PROG_UNLOAD = 2, + * }; + * + * struct { + * struct perf_event_header header; + * u16 type; + * u16 flags; + * u32 id; + * u8 tag[BPF_TAG_SIZE]; + * struct sample_id sample_id; + * }; + */ + PERF_RECORD_BPF_EVENT = 18, + PERF_RECORD_MAX, /* non-ABI */ }; +enum perf_record_ksymbol_type { + PERF_RECORD_KSYMBOL_TYPE_UNKNOWN = 0, + PERF_RECORD_KSYMBOL_TYPE_BPF = 1, + PERF_RECORD_KSYMBOL_TYPE_MAX /* non-ABI */ +}; + +#define PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER (1 << 0) + +enum perf_bpf_event_type { + PERF_BPF_EVENT_UNKNOWN = 0, + PERF_BPF_EVENT_PROG_LOAD = 1, + PERF_BPF_EVENT_PROG_UNLOAD = 2, + PERF_BPF_EVENT_MAX, /* non-ABI */ +}; + #define PERF_MAX_STACK_DEPTH 127 #define PERF_MAX_CONTEXTS_PER_STACK 8 diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h index ef2d8c3e76c1..edc138bdc56d 100644 --- a/include/uapi/linux/seg6_local.h +++ b/include/uapi/linux/seg6_local.h @@ -25,6 +25,7 @@ enum { SEG6_LOCAL_NH6, SEG6_LOCAL_IIF, SEG6_LOCAL_OIF, + SEG6_LOCAL_BPF, __SEG6_LOCAL_MAX, }; #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1) @@ -59,10 +60,21 @@ enum { SEG6_LOCAL_ACTION_END_AS = 13, /* forward to SR-unaware VNF with masquerading */ SEG6_LOCAL_ACTION_END_AM = 14, + /* custom BPF action */ + SEG6_LOCAL_ACTION_END_BPF = 15, __SEG6_LOCAL_ACTION_MAX, }; #define SEG6_LOCAL_ACTION_MAX (__SEG6_LOCAL_ACTION_MAX - 1) +enum { + SEG6_LOCAL_BPF_PROG_UNSPEC, + SEG6_LOCAL_BPF_PROG, + SEG6_LOCAL_BPF_PROG_NAME, + __SEG6_LOCAL_BPF_PROG_MAX, +}; + +#define SEG6_LOCAL_BPF_PROG_MAX (__SEG6_LOCAL_BPF_PROG_MAX - 1) + #endif diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h index bf31965355c6..9cc679ab52b4 100644 --- a/include/uapi/linux/snmp.h +++ b/include/uapi/linux/snmp.h @@ -278,6 +278,8 @@ enum LINUX_MIB_TCPKEEPALIVE, /* TCPKeepAlive */ LINUX_MIB_TCPMTUPFAIL, /* TCPMTUPFail */ LINUX_MIB_TCPMTUPSUCCESS, /* TCPMTUPSuccess */ + LINUX_MIB_TCPDELIVERED, /* TCPDelivered */ + LINUX_MIB_TCPDELIVEREDCE, /* TCPDeliveredCE */ LINUX_MIB_TCPWQUEUETOOBIG, /* TCPWqueueTooBig */ __LINUX_MIB_MAX }; diff --git a/include/uapi/linux/tc_act/tc_skbedit.h b/include/uapi/linux/tc_act/tc_skbedit.h index fbcfe27a4e6c..6de6071ebed6 100644 --- a/include/uapi/linux/tc_act/tc_skbedit.h +++ b/include/uapi/linux/tc_act/tc_skbedit.h @@ -30,6 +30,7 @@ #define SKBEDIT_F_MARK 0x4 #define SKBEDIT_F_PTYPE 0x8 #define SKBEDIT_F_MASK 0x10 +#define SKBEDIT_F_INHERITDSFIELD 0x20 struct tc_skbedit { tc_gen; @@ -45,6 +46,7 @@ enum { TCA_SKBEDIT_PAD, TCA_SKBEDIT_PTYPE, TCA_SKBEDIT_MASK, + TCA_SKBEDIT_FLAGS, __TCA_SKBEDIT_MAX }; #define TCA_SKBEDIT_MAX (__TCA_SKBEDIT_MAX - 1) diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index 6a64beeecfad..d6a35b4e0151 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -222,6 +222,13 @@ struct tcp_info { __u64 tcpi_busy_time; /* Time (usec) busy sending data */ __u64 tcpi_rwnd_limited; /* Time (usec) limited by receive window */ __u64 tcpi_sndbuf_limited; /* Time (usec) limited by send buffer */ + + __u32 tcpi_delivered; + __u32 tcpi_delivered_ce; + + __u64 tcpi_bytes_sent; /* RFC4898 tcpEStatsPerfHCDataOctetsOut */ + __u64 tcpi_bytes_retrans; /* RFC4898 tcpEStatsPerfOctetsRetrans */ + __u32 tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */ }; /* netlink attributes types for SCM_TIMESTAMPING_OPT_STATS */ @@ -239,7 +246,14 @@ enum { TCP_NLA_MIN_RTT, /* minimum RTT */ TCP_NLA_RECUR_RETRANS, /* Recurring retransmits for the current pkt */ TCP_NLA_DELIVERY_RATE_APP_LMT, /* delivery rate application limited ? */ - + TCP_NLA_SNDQ_SIZE, /* Data (bytes) pending in send queue */ + TCP_NLA_CA_STATE, /* ca_state of socket */ + TCP_NLA_SND_SSTHRESH, /* Slow start size threshold */ + TCP_NLA_DELIVERED, /* Data pkts delivered incl. out-of-order */ + TCP_NLA_DELIVERED_CE, /* Like above but only ones w/ CE marks */ + TCP_NLA_BYTES_SENT, /* Data bytes sent including retransmission */ + TCP_NLA_BYTES_RETRANS, /* Data bytes retransmitted */ + TCP_NLA_DSACK_DUPS, /* DSACK blocks received */ }; /* for TCP_MD5SIG socket option */ diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h index 53f8dd84beb5..fcf936656493 100644 --- a/include/uapi/linux/time.h +++ b/include/uapi/linux/time.h @@ -42,6 +42,25 @@ struct itimerval { struct timeval it_value; /* current value */ }; +#ifndef __kernel_timespec +struct __kernel_timespec { + __kernel_time64_t tv_sec; /* seconds */ + long long tv_nsec; /* nanoseconds */ +}; +#endif + +/* + * legacy timeval structure, only embedded in structures that + * traditionally used 'timeval' to pass time intervals (not absolute + * times). Do not add new users. If user space fails to compile + * here, this is probably because it is not y2038 safe and needs to + * be changed to use another interface. + */ +struct __kernel_old_timeval { + __kernel_long_t tv_sec; + __kernel_long_t tv_usec; +}; + /* * The IDs of the various system clocks (for POSIX.1b interval timers): */ diff --git a/include/uapi/linux/tls.h b/include/uapi/linux/tls.h index 293b2cdad88d..c6633e97eca4 100644 --- a/include/uapi/linux/tls.h +++ b/include/uapi/linux/tls.h @@ -38,6 +38,7 @@ /* TLS socket options */ #define TLS_TX 1 /* Set transmit parameters */ +#define TLS_RX 2 /* Set receive parameters */ /* Supported versions */ #define TLS_VERSION_MINOR(ver) ((ver) & 0xFF) @@ -59,6 +60,7 @@ #define TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE 8 #define TLS_SET_RECORD_TYPE 1 +#define TLS_GET_RECORD_TYPE 2 struct tls_crypto_info { __u16 version; diff --git a/init/Kconfig b/init/Kconfig index f9efd266f4be..82d5912872c6 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1501,6 +1501,7 @@ config EVENTFD config BPF_SYSCALL bool "Enable bpf() system call" select BPF + select IRQ_WORK default n help Enable the bpf() system call that allows to manipulate eBPF diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index c06336e079e4..b0d78bc0b197 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -1,18 +1,31 @@ # SPDX-License-Identifier: GPL-2.0 obj-y := core.o -CFLAGS_core.o += $(call cc-disable-warning, override-init) +ifneq ($(CONFIG_BPF_JIT_ALWAYS_ON),y) +# ___bpf_prog_run() needs GCSE disabled on x86; see 3193c0836f203 for details +cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse +endif +CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy) obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o -obj-$(CONFIG_BPF_SYSCALL) += local_storage.o -obj-$(CONFIG_BPF_SYSCALL) += btf.o +obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o obj-$(CONFIG_BPF_SYSCALL) += disasm.o +obj-$(CONFIG_BPF_SYSCALL) += btf.o ifeq ($(CONFIG_NET),y) obj-$(CONFIG_BPF_SYSCALL) += devmap.o obj-$(CONFIG_BPF_SYSCALL) += cpumap.o +ifeq ($(CONFIG_XDP_SOCKETS),y) +obj-$(CONFIG_BPF_SYSCALL) += xskmap.o +endif obj-$(CONFIG_BPF_SYSCALL) += offload.o endif ifeq ($(CONFIG_PERF_EVENTS),y) obj-$(CONFIG_BPF_SYSCALL) += stackmap.o endif obj-$(CONFIG_CGROUP_BPF) += cgroup.o +ifeq ($(CONFIG_INET),y) +obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o +endif +ifeq ($(CONFIG_SYSFS),y) +obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o +endif diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index b294a5510529..fa46e7a9bf3b 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com * Copyright (c) 2016,2017 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include @@ -22,7 +14,7 @@ #include "map_in_map.h" #define ARRAY_CREATE_FLAG_MASK \ - (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) + (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK) static void bpf_array_free_percpu(struct bpf_array *array) { @@ -54,6 +46,31 @@ static int bpf_array_alloc_percpu(struct bpf_array *array) } /* Called from syscall */ +int array_map_alloc_check(union bpf_attr *attr) +{ + bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; + int numa_node = bpf_map_attr_numa_node(attr); + + /* check sanity of attributes */ + if (attr->max_entries == 0 || attr->key_size != 4 || + attr->value_size == 0 || + attr->map_flags & ~ARRAY_CREATE_FLAG_MASK || + !bpf_map_flags_access_ok(attr->map_flags) || + (percpu && numa_node != NUMA_NO_NODE)) + return -EINVAL; + + if (attr->value_size > KMALLOC_MAX_SIZE) + /* if value_size is bigger, the user space won't be able to + * access the elements. + */ + return -E2BIG; + /* percpu map value size is bound by PCPU_MIN_UNIT_SIZE */ + if (percpu && round_up(attr->value_size, 8) > PCPU_MIN_UNIT_SIZE) + return -E2BIG; + + return 0; +} + static struct bpf_map *array_map_alloc(union bpf_attr *attr) { bool percpu = attr->map_type == BPF_MAP_TYPE_PERCPU_ARRAY; @@ -61,21 +78,9 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) u32 elem_size, index_mask, max_entries; bool unpriv = !capable(CAP_SYS_ADMIN); u64 cost, array_size, mask64; + struct bpf_map_memory mem; struct bpf_array *array; - /* check sanity of attributes */ - if (attr->max_entries == 0 || attr->key_size != 4 || - attr->value_size == 0 || - attr->map_flags & ~ARRAY_CREATE_FLAG_MASK || - (percpu && numa_node != NUMA_NO_NODE)) - return ERR_PTR(-EINVAL); - - if (attr->value_size > KMALLOC_MAX_SIZE) - /* if value_size is bigger, the user space won't be able to - * access the elements. - */ - return ERR_PTR(-E2BIG); - elem_size = round_up(attr->value_size, 8); max_entries = attr->max_entries; @@ -107,37 +112,29 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr) /* make sure there is no u32 overflow later in round_up() */ cost = array_size; - if (cost >= U32_MAX - PAGE_SIZE) - return ERR_PTR(-ENOMEM); - if (percpu) { + if (percpu) cost += (u64)attr->max_entries * elem_size * num_possible_cpus(); - if (cost >= U32_MAX - PAGE_SIZE) - return ERR_PTR(-ENOMEM); - } - cost = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; - ret = bpf_map_precharge_memlock(cost); + ret = bpf_map_charge_init(&mem, cost); if (ret < 0) return ERR_PTR(ret); /* allocate all map elements and zero-initialize them */ array = bpf_map_area_alloc(array_size, numa_node); - if (!array) + if (!array) { + bpf_map_charge_finish(&mem); return ERR_PTR(-ENOMEM); + } array->index_mask = index_mask; array->map.unpriv_array = unpriv; /* copy mandatory map attributes */ - array->map.map_type = attr->map_type; - array->map.key_size = attr->key_size; - array->map.value_size = attr->value_size; - array->map.max_entries = attr->max_entries; - array->map.map_flags = attr->map_flags; - array->map.numa_node = numa_node; - array->map.pages = cost; + bpf_map_init_from_attr(&array->map, attr); + bpf_map_charge_move(&array->map.memory, &mem); array->elem_size = elem_size; if (percpu && bpf_array_alloc_percpu(array)) { + bpf_map_charge_finish(&array->map.memory); bpf_map_area_free(array); return ERR_PTR(-ENOMEM); } @@ -164,14 +161,13 @@ static int array_map_direct_value_addr(const struct bpf_map *map, u64 *imm, if (map->max_entries != 1) return -ENOTSUPP; - if (off >= map->value_size) return -EINVAL; *imm = (unsigned long)array->value; - return 0; } + static int array_map_direct_value_meta(const struct bpf_map *map, u64 imm, u32 *off) { @@ -181,12 +177,10 @@ static int array_map_direct_value_meta(const struct bpf_map *map, u64 imm, if (map->max_entries != 1) return -ENOTSUPP; - if (imm < base || imm >= base + range) return -ENOENT; *off = imm - base; - return 0; } @@ -312,7 +306,6 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value, else copy_map_value(map, val, value); } - return 0; } @@ -382,15 +375,43 @@ static void array_map_seq_show_elem(struct bpf_map *map, void *key, struct seq_file *m) { void *value; + rcu_read_lock(); + value = array_map_lookup_elem(map, key); if (!value) { rcu_read_unlock(); return; } - seq_printf(m, "%u: ", *(u32 *)key); + + if (map->btf_key_type_id) + seq_printf(m, "%u: ", *(u32 *)key); btf_type_seq_show(map->btf, map->btf_value_type_id, value, m); seq_puts(m, "\n"); + + rcu_read_unlock(); +} + +static void percpu_array_map_seq_show_elem(struct bpf_map *map, void *key, + struct seq_file *m) +{ + struct bpf_array *array = container_of(map, struct bpf_array, map); + u32 index = *(u32 *)key; + void __percpu *pptr; + int cpu; + + rcu_read_lock(); + + seq_printf(m, "%u: {\n", *(u32 *)key); + pptr = array->pptrs[index & array->index_mask]; + for_each_possible_cpu(cpu) { + seq_printf(m, "\tcpu%d: ", cpu); + btf_type_seq_show(map->btf, map->btf_value_type_id, + per_cpu_ptr(pptr, cpu), m); + seq_puts(m, "\n"); + } + seq_puts(m, "}\n"); + rcu_read_unlock(); } @@ -401,19 +422,33 @@ static int array_map_check_btf(const struct bpf_map *map, { u32 int_data; + /* One exception for keyless BTF: .bss/.data/.rodata map */ + if (btf_type_is_void(key_type)) { + if (map->map_type != BPF_MAP_TYPE_ARRAY || + map->max_entries != 1) + return -EINVAL; + + if (BTF_INFO_KIND(value_type->info) != BTF_KIND_DATASEC) + return -EINVAL; + + return 0; + } + if (BTF_INFO_KIND(key_type->info) != BTF_KIND_INT) return -EINVAL; + int_data = *(u32 *)(key_type + 1); /* bpf array can only take a u32 key. This check makes sure * that the btf matches the attr used during map_create. */ - if (BTF_INT_BITS(int_data) != 32 || BTF_INT_OFFSET(int_data)) return -EINVAL; + return 0; } const struct bpf_map_ops array_map_ops = { + .map_alloc_check = array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = array_map_free, .map_get_next_key = array_map_get_next_key, @@ -428,21 +463,26 @@ const struct bpf_map_ops array_map_ops = { }; const struct bpf_map_ops percpu_array_map_ops = { + .map_alloc_check = array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = array_map_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = percpu_array_map_lookup_elem, .map_update_elem = array_map_update_elem, .map_delete_elem = array_map_delete_elem, + .map_seq_show_elem = percpu_array_map_seq_show_elem, .map_check_btf = array_map_check_btf, }; -static struct bpf_map *fd_array_map_alloc(union bpf_attr *attr) +static int fd_array_map_alloc_check(union bpf_attr *attr) { /* only file descriptors can be stored in this type of map */ if (attr->value_size != sizeof(u32)) - return ERR_PTR(-EINVAL); - return array_map_alloc(attr); + return -EINVAL; + /* Program read-only/write-only not supported for special maps yet. */ + if (attr->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) + return -EINVAL; + return array_map_alloc_check(attr); } static void fd_array_map_free(struct bpf_map *map) @@ -461,7 +501,7 @@ static void fd_array_map_free(struct bpf_map *map) static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key) { - return NULL; + return ERR_PTR(-EOPNOTSUPP); } /* only called from syscall */ @@ -505,7 +545,7 @@ int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file, old_ptr = xchg(array->ptrs + index, new_ptr); if (old_ptr) - map->ops->map_fd_put_ptr(map, old_ptr, true); + map->ops->map_fd_put_ptr(old_ptr); return 0; } @@ -521,7 +561,7 @@ static int fd_array_map_delete_elem(struct bpf_map *map, void *key) old_ptr = xchg(array->ptrs + index, NULL); if (old_ptr) { - map->ops->map_fd_put_ptr(map, old_ptr, true); + map->ops->map_fd_put_ptr(old_ptr); return 0; } else { return -ENOENT; @@ -545,9 +585,8 @@ static void *prog_fd_array_get_ptr(struct bpf_map *map, return prog; } -static void prog_fd_array_put_ptr(struct bpf_map *map, void *ptr, bool need_defer) +static void prog_fd_array_put_ptr(void *ptr) { - /* bpf_prog is freed after one RCU or tasks trace grace period */ bpf_prog_put(ptr); } @@ -566,8 +605,32 @@ static void bpf_fd_array_map_clear(struct bpf_map *map) fd_array_map_delete_elem(map, &i); } +static void prog_array_map_seq_show_elem(struct bpf_map *map, void *key, + struct seq_file *m) +{ + void **elem, *ptr; + u32 prog_id; + + rcu_read_lock(); + + elem = array_map_lookup_elem(map, key); + if (elem) { + ptr = READ_ONCE(*elem); + if (ptr) { + seq_printf(m, "%u: ", *(u32 *)key); + prog_id = prog_fd_array_sys_lookup_elem(ptr); + btf_type_seq_show(map->btf, map->btf_value_type_id, + &prog_id, m); + seq_puts(m, "\n"); + } + } + + rcu_read_unlock(); +} + const struct bpf_map_ops prog_array_map_ops = { - .map_alloc = fd_array_map_alloc, + .map_alloc_check = fd_array_map_alloc_check, + .map_alloc = array_map_alloc, .map_free = fd_array_map_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = fd_array_map_lookup_elem, @@ -576,7 +639,7 @@ const struct bpf_map_ops prog_array_map_ops = { .map_fd_put_ptr = prog_fd_array_put_ptr, .map_fd_sys_lookup_elem = prog_fd_array_sys_lookup_elem, .map_release_uref = bpf_fd_array_map_clear, - .map_check_btf = map_check_no_btf, + .map_seq_show_elem = prog_array_map_seq_show_elem, }; static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file, @@ -622,7 +685,7 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map *map, ee = ERR_PTR(-EOPNOTSUPP); event = perf_file->private_data; - if (perf_event_read_local(event, &value) == -EOPNOTSUPP) + if (perf_event_read_local(event, &value, NULL, NULL) == -EOPNOTSUPP) goto err_out; ee = bpf_event_entry_gen(perf_file, map_file); @@ -634,9 +697,8 @@ err_out: return ee; } -static void perf_event_fd_array_put_ptr(struct bpf_map *map, void *ptr, bool need_defer) +static void perf_event_fd_array_put_ptr(void *ptr) { - /* bpf_perf_event is freed after one RCU grace period */ bpf_event_entry_free_rcu(ptr); } @@ -657,7 +719,8 @@ static void perf_event_fd_array_release(struct bpf_map *map, } const struct bpf_map_ops perf_event_array_map_ops = { - .map_alloc = fd_array_map_alloc, + .map_alloc_check = fd_array_map_alloc_check, + .map_alloc = array_map_alloc, .map_free = fd_array_map_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = fd_array_map_lookup_elem, @@ -676,7 +739,7 @@ static void *cgroup_fd_array_get_ptr(struct bpf_map *map, return cgroup_get_from_fd(fd); } -static void cgroup_fd_array_put_ptr(struct bpf_map *map, void *ptr, bool need_defer) +static void cgroup_fd_array_put_ptr(void *ptr) { /* cgroup_put free cgrp after a rcu grace period */ cgroup_put(ptr); @@ -689,7 +752,8 @@ static void cgroup_fd_array_free(struct bpf_map *map) } const struct bpf_map_ops cgroup_array_map_ops = { - .map_alloc = fd_array_map_alloc, + .map_alloc_check = fd_array_map_alloc_check, + .map_alloc = array_map_alloc, .map_free = cgroup_fd_array_free, .map_get_next_key = array_map_get_next_key, .map_lookup_elem = fd_array_map_lookup_elem, @@ -708,7 +772,7 @@ static struct bpf_map *array_of_map_alloc(union bpf_attr *attr) if (IS_ERR(inner_map_meta)) return inner_map_meta; - map = fd_array_map_alloc(attr); + map = array_map_alloc(attr); if (IS_ERR(map)) { bpf_map_meta_free(inner_map_meta); return map; @@ -771,6 +835,7 @@ static u32 array_of_map_gen_lookup(struct bpf_map *map, } const struct bpf_map_ops array_of_maps_map_ops = { + .map_alloc_check = fd_array_map_alloc_check, .map_alloc = array_of_map_alloc, .map_free = array_of_map_free, .map_get_next_key = array_map_get_next_key, diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c index 39a0e768adc3..3dabdd137d10 100644 --- a/kernel/bpf/bpf_lru_list.c +++ b/kernel/bpf/bpf_lru_list.c @@ -1,8 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2016 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. */ #include #include diff --git a/kernel/bpf/bpf_lru_list.h b/kernel/bpf/bpf_lru_list.h index 08da78b59f0b..41f8fea530c8 100644 --- a/kernel/bpf/bpf_lru_list.h +++ b/kernel/bpf/bpf_lru_list.h @@ -1,8 +1,5 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* Copyright (c) 2016 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. */ #ifndef __BPF_LRU_LIST_H_ #define __BPF_LRU_LIST_H_ diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index cb8b05bdaeb0..5189bc5ebd89 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -157,7 +157,7 @@ * */ -#define BITS_PER_U64 (sizeof(u64) * BITS_PER_BYTE) +#define BITS_PER_U128 (sizeof(u64) * BITS_PER_BYTE * 2) #define BITS_PER_BYTE_MASK (BITS_PER_BYTE - 1) #define BITS_PER_BYTE_MASKED(bits) ((bits) & BITS_PER_BYTE_MASK) #define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3) @@ -185,8 +185,18 @@ i < btf_type_vlen(struct_type); \ i++, member++) -static DEFINE_IDR(btf_idr); -static DEFINE_SPINLOCK(btf_idr_lock); +#define for_each_vsi(i, struct_type, member) \ + for (i = 0, member = btf_type_var_secinfo(struct_type); \ + i < btf_type_vlen(struct_type); \ + i++, member++) + +#define for_each_vsi_from(i, from, struct_type, member) \ + for (i = from, member = btf_type_var_secinfo(struct_type) + from; \ + i < btf_type_vlen(struct_type); \ + i++, member++) + +DEFINE_IDR(btf_idr); +DEFINE_SPINLOCK(btf_idr_lock); struct btf { void *data; @@ -262,6 +272,8 @@ static const char * const btf_kind_str[NR_BTF_KINDS] = { [BTF_KIND_RESTRICT] = "RESTRICT", [BTF_KIND_FUNC] = "FUNC", [BTF_KIND_FUNC_PROTO] = "FUNC_PROTO", + [BTF_KIND_VAR] = "VAR", + [BTF_KIND_DATASEC] = "DATASEC", }; struct btf_kind_operations { @@ -314,11 +326,11 @@ static bool btf_type_is_modifier(const struct btf_type *t) return false; } -static bool btf_type_is_void(const struct btf_type *t) +bool btf_type_is_void(const struct btf_type *t) { return t == &btf_void; - } + static bool btf_type_is_fwd(const struct btf_type *t) { return BTF_INFO_KIND(t->info) == BTF_KIND_FWD; @@ -375,13 +387,36 @@ static bool btf_type_is_int(const struct btf_type *t) return BTF_INFO_KIND(t->info) == BTF_KIND_INT; } +static bool btf_type_is_var(const struct btf_type *t) +{ + return BTF_INFO_KIND(t->info) == BTF_KIND_VAR; +} + +static bool btf_type_is_datasec(const struct btf_type *t) +{ + return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC; +} + +/* Types that act only as a source, not sink or intermediate + * type when resolving. + */ +static bool btf_type_is_resolve_source_only(const struct btf_type *t) +{ + return btf_type_is_var(t) || + btf_type_is_datasec(t); +} + /* What types need to be resolved? * * btf_type_is_modifier() is an obvious one. * * btf_type_is_struct() because its member refers to * another type (through member->type). - + * + * btf_type_is_var() because the variable refers to + * another type. btf_type_is_datasec() holds multiple + * btf_type_is_var() types that need resolving. + * * btf_type_is_array() because its element (array->type) * refers to another type. Array can be thought of a * special case of struct while array just has the same @@ -390,9 +425,11 @@ static bool btf_type_is_int(const struct btf_type *t) static bool btf_type_needs_resolve(const struct btf_type *t) { return btf_type_is_modifier(t) || - btf_type_is_ptr(t) || - btf_type_is_struct(t) || - btf_type_is_array(t); + btf_type_is_ptr(t) || + btf_type_is_struct(t) || + btf_type_is_array(t) || + btf_type_is_var(t) || + btf_type_is_datasec(t); } /* t->size can be used */ @@ -403,6 +440,7 @@ static bool btf_type_has_size(const struct btf_type *t) case BTF_KIND_STRUCT: case BTF_KIND_UNION: case BTF_KIND_ENUM: + case BTF_KIND_DATASEC: return true; } @@ -467,38 +505,70 @@ static const struct btf_enum *btf_type_enum(const struct btf_type *t) return (const struct btf_enum *)(t + 1); } +static const struct btf_var *btf_type_var(const struct btf_type *t) +{ + return (const struct btf_var *)(t + 1); +} + +static const struct btf_var_secinfo *btf_type_var_secinfo(const struct btf_type *t) +{ + return (const struct btf_var_secinfo *)(t + 1); +} + static const struct btf_kind_operations *btf_type_ops(const struct btf_type *t) { return kind_ops[BTF_INFO_KIND(t->info)]; } -bool btf_name_offset_valid(const struct btf *btf, u32 offset) +static bool btf_name_offset_valid(const struct btf *btf, u32 offset) { return BTF_STR_OFFSET_VALID(offset) && offset < btf->hdr.str_len; } +static bool __btf_name_char_ok(char c, bool first, bool dot_ok) +{ + if ((first ? !isalpha(c) : + !isalnum(c)) && + c != '_' && + ((c == '.' && !dot_ok) || + c != '.')) + return false; + return true; +} + +static bool __btf_name_valid(const struct btf *btf, u32 offset, bool dot_ok) +{ + /* offset must be valid */ + const char *src = &btf->strings[offset]; + const char *src_limit; + + if (!__btf_name_char_ok(*src, true, dot_ok)) + return false; + + /* set a limit on identifier length */ + src_limit = src + KSYM_NAME_LEN; + src++; + while (*src && src < src_limit) { + if (!__btf_name_char_ok(*src, false, dot_ok)) + return false; + src++; + } + + return !*src; +} + /* Only C-style identifier is permitted. This can be relaxed if * necessary. */ static bool btf_name_valid_identifier(const struct btf *btf, u32 offset) { - /* offset must be valid */ - const char *src = &btf->strings[offset]; - const char *src_limit; + return __btf_name_valid(btf, offset, false); +} - if (!isalpha(*src) && *src != '_') - return false; - - /* set a limit on identifier length */ - src_limit = src + KSYM_NAME_LEN; - src++; - while (*src && src < src_limit) { - if (!isalnum(*src) && *src != '_') - return false; - src++; - } - return !*src; +static bool btf_name_valid_section(const struct btf *btf, u32 offset) +{ + return __btf_name_valid(btf, offset, true); } static const char *__btf_name_by_offset(const struct btf *btf, u32 offset) @@ -529,7 +599,7 @@ const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id) /* * Regular int is not a bit field and it must be either - * u8/u16/u32/u64. + * u8/u16/u32/u64 or __int128. */ static bool btf_type_int_is_regular(const struct btf_type *t) { @@ -542,7 +612,8 @@ static bool btf_type_int_is_regular(const struct btf_type *t) if (BITS_PER_BYTE_MASKED(nr_bits) || BTF_INT_OFFSET(int_data) || (nr_bytes != sizeof(u8) && nr_bytes != sizeof(u16) && - nr_bytes != sizeof(u32) && nr_bytes != sizeof(u64))) { + nr_bytes != sizeof(u32) && nr_bytes != sizeof(u64) && + nr_bytes != (2 * sizeof(u64)))) { return false; } @@ -550,7 +621,7 @@ static bool btf_type_int_is_regular(const struct btf_type *t) } /* -* Check that given struct member is a regular int with expected + * Check that given struct member is a regular int with expected * offset and size. */ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, @@ -571,6 +642,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, if (btf_type_kflag(s)) { u32 bitfield_size = BTF_MEMBER_BITFIELD_SIZE(m->offset); u32 bit_offset = BTF_MEMBER_BIT_OFFSET(m->offset); + /* if kflag set, int should be a regular int and * bit offset should be at byte boundary. */ @@ -578,6 +650,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s, BITS_ROUNDUP_BYTES(bit_offset) == expected_offset && BITS_ROUNDUP_BYTES(nr_bits) == expected_size; } + if (BTF_INT_OFFSET(int_data) || BITS_PER_BYTE_MASKED(m->offset) || BITS_ROUNDUP_BYTES(m->offset) != expected_offset || @@ -693,6 +766,32 @@ static void btf_verifier_log_member(struct btf_verifier_env *env, __btf_verifier_log(log, "\n"); } +__printf(4, 5) +static void btf_verifier_log_vsi(struct btf_verifier_env *env, + const struct btf_type *datasec_type, + const struct btf_var_secinfo *vsi, + const char *fmt, ...) +{ + struct bpf_verifier_log *log = &env->log; + va_list args; + + if (!bpf_verifier_log_needed(log)) + return; + if (env->phase != CHECK_META) + btf_verifier_log_type(env, datasec_type, NULL); + + __btf_verifier_log(log, "\t type_id=%u offset=%u size=%u", + vsi->type, vsi->offset, vsi->size); + if (fmt && *fmt) { + __btf_verifier_log(log, " "); + va_start(args, fmt); + bpf_verifier_vlog(log, fmt, args); + va_end(args); + } + + __btf_verifier_log(log, "\n"); +} + static void btf_verifier_log_hdr(struct btf_verifier_env *env, u32 btf_data_size) { @@ -738,7 +837,7 @@ static int btf_add_type(struct btf_verifier_env *env, struct btf_type *t) new_size = min_t(u32, BTF_MAX_TYPE, btf->types_size + expand_by); - new_types = kvzalloc(new_size * sizeof(*new_types), + new_types = kvcalloc(new_size, sizeof(*new_types), GFP_KERNEL | __GFP_NOWARN); if (!new_types) return -ENOMEM; @@ -762,6 +861,7 @@ static int btf_add_type(struct btf_verifier_env *env, struct btf_type *t) static int btf_alloc_id(struct btf *btf) { int id; + idr_preload(GFP_KERNEL); spin_lock_bh(&btf_idr_lock); id = idr_alloc_cyclic(&btf_idr, btf, 1, INT_MAX, GFP_ATOMIC); @@ -769,13 +869,17 @@ static int btf_alloc_id(struct btf *btf) btf->id = id; spin_unlock_bh(&btf_idr_lock); idr_preload_end(); + if (WARN_ON_ONCE(!id)) return -ENOSPC; + return id > 0 ? 0 : id; } + static void btf_free_id(struct btf *btf) { unsigned long flags; + /* * In map-in-map, calling map_delete_elem() on outer * map will call bpf_map_put on the inner map. @@ -802,6 +906,7 @@ static void btf_free(struct btf *btf) static void btf_free_rcu(struct rcu_head *rcu) { struct btf *btf = container_of(rcu, struct btf, rcu); + btf_free(btf); } @@ -822,17 +927,17 @@ static int env_resolve_init(struct btf_verifier_env *env) u8 *visit_states = NULL; /* +1 for btf_void */ - resolved_sizes = kvzalloc((nr_types + 1) * sizeof(*resolved_sizes), + resolved_sizes = kvcalloc(nr_types + 1, sizeof(*resolved_sizes), GFP_KERNEL | __GFP_NOWARN); if (!resolved_sizes) goto nomem; - resolved_ids = kvzalloc((nr_types + 1) * sizeof(*resolved_ids), + resolved_ids = kvcalloc(nr_types + 1, sizeof(*resolved_ids), GFP_KERNEL | __GFP_NOWARN); if (!resolved_ids) goto nomem; - visit_states = kvzalloc((nr_types + 1) * sizeof(*visit_states), + visit_states = kvcalloc(nr_types + 1, sizeof(*visit_states), GFP_KERNEL | __GFP_NOWARN); if (!visit_states) goto nomem; @@ -964,14 +1069,22 @@ const struct btf_type *btf_type_id_size(const struct btf *btf, } else if (btf_type_is_ptr(size_type)) { size = sizeof(void *); } else { - if (WARN_ON_ONCE(!btf_type_is_modifier(size_type))) + if (WARN_ON_ONCE(!btf_type_is_modifier(size_type) && + !btf_type_is_var(size_type))) return NULL; - size = btf->resolved_sizes[size_type_id]; size_type_id = btf->resolved_ids[size_type_id]; size_type = btf_type_by_id(btf, size_type_id); if (btf_type_nosize_or_null(size_type)) return NULL; + else if (btf_type_has_size(size_type)) + size = size_type->size; + else if (btf_type_is_array(size_type)) + size = btf->resolved_sizes[size_type_id]; + else if (btf_type_is_ptr(size_type)) + size = sizeof(void *); + else + return NULL; } *type_id = size_type_id; @@ -1014,6 +1127,7 @@ static int btf_generic_check_kflag_member(struct btf_verifier_env *env, "Invalid member bitfield_size"); return -EINVAL; } + /* bitfield size is 0, so member->offset represents bit offset only. * It is safe to call non kflag check_member variants. */ @@ -1058,9 +1172,9 @@ static int btf_int_check_member(struct btf_verifier_env *env, nr_copy_bits = BTF_INT_BITS(int_data) + BITS_PER_BYTE_MASKED(struct_bits_off); - if (nr_copy_bits > BITS_PER_U64) { + if (nr_copy_bits > BITS_PER_U128) { btf_verifier_log_member(env, struct_type, member, - "nr_copy_bits exceeds 64"); + "nr_copy_bits exceeds 128"); return -EINVAL; } @@ -1083,12 +1197,14 @@ static int btf_int_check_kflag_member(struct btf_verifier_env *env, u32 int_data = btf_type_int(member_type); u32 struct_size = struct_type->size; u32 nr_copy_bits; + /* a regular int type is required for the kflag int member */ if (!btf_type_int_is_regular(member_type)) { btf_verifier_log_member(env, struct_type, member, "Invalid member base type"); return -EINVAL; } + /* check sanity of bitfield size */ nr_bits = BTF_MEMBER_BITFIELD_SIZE(member->offset); struct_bits_off = BTF_MEMBER_BIT_OFFSET(member->offset); @@ -1102,25 +1218,29 @@ static int btf_int_check_kflag_member(struct btf_verifier_env *env, "Invalid member offset"); return -EINVAL; } + nr_bits = nr_int_data_bits; } else if (nr_bits > nr_int_data_bits) { btf_verifier_log_member(env, struct_type, member, "Invalid member bitfield_size"); return -EINVAL; } + bytes_offset = BITS_ROUNDDOWN_BYTES(struct_bits_off); nr_copy_bits = nr_bits + BITS_PER_BYTE_MASKED(struct_bits_off); - if (nr_copy_bits > BITS_PER_U64) { + if (nr_copy_bits > BITS_PER_U128) { btf_verifier_log_member(env, struct_type, member, - "nr_copy_bits exceeds 64"); + "nr_copy_bits exceeds 128"); return -EINVAL; } + if (struct_size < bytes_offset || struct_size - bytes_offset < BITS_ROUNDUP_BYTES(nr_copy_bits)) { btf_verifier_log_member(env, struct_type, member, "Member exceeds struct_size"); return -EINVAL; } + return 0; } @@ -1157,9 +1277,9 @@ static s32 btf_int_check_meta(struct btf_verifier_env *env, nr_bits = BTF_INT_BITS(int_data) + BTF_INT_OFFSET(int_data); - if (nr_bits > BITS_PER_U64) { + if (nr_bits > BITS_PER_U128) { btf_verifier_log_type(env, t, "nr_bits exceeds %zu", - BITS_PER_U64); + BITS_PER_U128); return -EINVAL; } @@ -1200,35 +1320,96 @@ static void btf_int_log(struct btf_verifier_env *env, btf_int_encoding_str(BTF_INT_ENCODING(int_data))); } +static void btf_int128_print(struct seq_file *m, void *data) +{ + /* data points to a __int128 number. + * Suppose + * int128_num = *(__int128 *)data; + * The below formulas shows what upper_num and lower_num represents: + * upper_num = int128_num >> 64; + * lower_num = int128_num & 0xffffffffFFFFFFFFULL; + */ + u64 upper_num, lower_num; + +#ifdef __BIG_ENDIAN_BITFIELD + upper_num = *(u64 *)data; + lower_num = *(u64 *)(data + 8); +#else + upper_num = *(u64 *)(data + 8); + lower_num = *(u64 *)data; +#endif + if (upper_num == 0) + seq_printf(m, "0x%llx", lower_num); + else + seq_printf(m, "0x%llx%016llx", upper_num, lower_num); +} + +static void btf_int128_shift(u64 *print_num, u16 left_shift_bits, + u16 right_shift_bits) +{ + u64 upper_num, lower_num; + +#ifdef __BIG_ENDIAN_BITFIELD + upper_num = print_num[0]; + lower_num = print_num[1]; +#else + upper_num = print_num[1]; + lower_num = print_num[0]; +#endif + + /* shake out un-needed bits by shift/or operations */ + if (left_shift_bits >= 64) { + upper_num = lower_num << (left_shift_bits - 64); + lower_num = 0; + } else { + upper_num = (upper_num << left_shift_bits) | + (lower_num >> (64 - left_shift_bits)); + lower_num = lower_num << left_shift_bits; + } + + if (right_shift_bits >= 64) { + lower_num = upper_num >> (right_shift_bits - 64); + upper_num = 0; + } else { + lower_num = (lower_num >> right_shift_bits) | + (upper_num << (64 - right_shift_bits)); + upper_num = upper_num >> right_shift_bits; + } + +#ifdef __BIG_ENDIAN_BITFIELD + print_num[0] = upper_num; + print_num[1] = lower_num; +#else + print_num[0] = lower_num; + print_num[1] = upper_num; +#endif +} + static void btf_bitfield_seq_show(void *data, u8 bits_offset, u8 nr_bits, struct seq_file *m) { u16 left_shift_bits, right_shift_bits; u8 nr_copy_bytes; u8 nr_copy_bits; - u64 print_num; + u64 print_num[2] = {}; - data += BITS_ROUNDDOWN_BYTES(bits_offset); - bits_offset = BITS_PER_BYTE_MASKED(bits_offset); nr_copy_bits = nr_bits + bits_offset; nr_copy_bytes = BITS_ROUNDUP_BYTES(nr_copy_bits); - print_num = 0; - memcpy(&print_num, data, nr_copy_bytes); + memcpy(print_num, data, nr_copy_bytes); #ifdef __BIG_ENDIAN_BITFIELD left_shift_bits = bits_offset; #else - left_shift_bits = BITS_PER_U64 - nr_copy_bits; + left_shift_bits = BITS_PER_U128 - nr_copy_bits; #endif - right_shift_bits = BITS_PER_U64 - nr_bits; + right_shift_bits = BITS_PER_U128 - nr_bits; - print_num <<= left_shift_bits; - print_num >>= right_shift_bits; - - seq_printf(m, "0x%llx", print_num); + btf_int128_shift(print_num, left_shift_bits, right_shift_bits); + btf_int128_print(m, print_num); } + static void btf_int_bits_seq_show(const struct btf *btf, const struct btf_type *t, void *data, u8 bits_offset, @@ -1240,10 +1421,12 @@ static void btf_int_bits_seq_show(const struct btf *btf, /* * bits_offset is at most 7. - * BTF_INT_OFFSET() cannot exceed 64 bits. + * BTF_INT_OFFSET() cannot exceed 128 bits. */ total_bits_offset = bits_offset + BTF_INT_OFFSET(int_data); - btf_bitfield_seq_show(data, total_bits_offset, nr_bits, m); + data += BITS_ROUNDDOWN_BYTES(total_bits_offset); + bits_offset = BITS_PER_BYTE_MASKED(total_bits_offset); + btf_bitfield_seq_show(data, bits_offset, nr_bits, m); } static void btf_int_seq_show(const struct btf *btf, const struct btf_type *t, @@ -1262,6 +1445,9 @@ static void btf_int_seq_show(const struct btf *btf, const struct btf_type *t, } switch (nr_bits) { + case 128: + btf_int128_print(m, data); + break; case 64: if (sign) seq_printf(m, "%lld", *(s64 *)data); @@ -1334,14 +1520,17 @@ static int btf_modifier_check_kflag_member(struct btf_verifier_env *env, u32 resolved_type_id = member->type; struct btf_member resolved_member; struct btf *btf = env->btf; + resolved_type = btf_type_id_size(btf, &resolved_type_id, NULL); if (!resolved_type) { btf_verifier_log_member(env, struct_type, member, "Invalid member"); return -EINVAL; } + resolved_member = *member; resolved_member.type = resolved_type_id; + return btf_type_ops(resolved_type)->check_kflag_member(env, struct_type, &resolved_member, resolved_type); @@ -1392,6 +1581,22 @@ static int btf_ref_type_check_meta(struct btf_verifier_env *env, return -EINVAL; } + /* typedef type must have a valid name, and other ref types, + * volatile, const, restrict, should have a null name. + */ + if (BTF_INFO_KIND(t->info) == BTF_KIND_TYPEDEF) { + if (!t->name_off || + !btf_name_valid_identifier(env->btf, t->name_off)) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + } else { + if (t->name_off) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + } + btf_verifier_log_type(env, t, NULL); return 0; @@ -1404,10 +1609,9 @@ static int btf_modifier_resolve(struct btf_verifier_env *env, const struct btf_type *next_type; u32 next_type_id = t->type; struct btf *btf = env->btf; - u32 next_type_size = 0; next_type = btf_type_by_id(btf, next_type_id); - if (!next_type) { + if (!next_type || btf_type_is_resolve_source_only(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } @@ -1422,19 +1626,66 @@ static int btf_modifier_resolve(struct btf_verifier_env *env, * save us a few type-following when we use it later (e.g. in * pretty print). */ - if (!btf_type_id_size(btf, &next_type_id, &next_type_size)) { + if (!btf_type_id_size(btf, &next_type_id, NULL)) { if (env_type_is_resolved(env, next_type_id)) next_type = btf_type_id_resolve(btf, &next_type_id); /* "typedef void new_void", "const void"...etc */ if (!btf_type_is_void(next_type) && - !btf_type_is_fwd(next_type)) { + !btf_type_is_fwd(next_type) && + !btf_type_is_func_proto(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } } - env_stack_pop_resolved(env, next_type_id, next_type_size); + env_stack_pop_resolved(env, next_type_id, 0); + + return 0; +} + +static int btf_var_resolve(struct btf_verifier_env *env, + const struct resolve_vertex *v) +{ + const struct btf_type *next_type; + const struct btf_type *t = v->t; + u32 next_type_id = t->type; + struct btf *btf = env->btf; + + next_type = btf_type_by_id(btf, next_type_id); + if (!next_type || btf_type_is_resolve_source_only(next_type)) { + btf_verifier_log_type(env, v->t, "Invalid type_id"); + return -EINVAL; + } + + if (!env_type_is_resolve_sink(env, next_type) && + !env_type_is_resolved(env, next_type_id)) + return env_stack_push(env, next_type, next_type_id); + + if (btf_type_is_modifier(next_type)) { + const struct btf_type *resolved_type; + u32 resolved_type_id; + + resolved_type_id = next_type_id; + resolved_type = btf_type_id_resolve(btf, &resolved_type_id); + + if (btf_type_is_ptr(resolved_type) && + !env_type_is_resolve_sink(env, resolved_type) && + !env_type_is_resolved(env, resolved_type_id)) + return env_stack_push(env, resolved_type, + resolved_type_id); + } + + /* We must resolve to something concrete at this point, no + * forward types or similar that would resolve to size of + * zero is allowed. + */ + if (!btf_type_id_size(btf, &next_type_id, NULL)) { + btf_verifier_log_type(env, v->t, "Invalid type_id"); + return -EINVAL; + } + + env_stack_pop_resolved(env, next_type_id, 0); return 0; } @@ -1448,7 +1699,7 @@ static int btf_ptr_resolve(struct btf_verifier_env *env, struct btf *btf = env->btf; next_type = btf_type_by_id(btf, next_type_id); - if (!next_type) { + if (!next_type || btf_type_is_resolve_source_only(next_type)) { btf_verifier_log_type(env, v->t, "Invalid type_id"); return -EINVAL; } @@ -1506,6 +1757,15 @@ static void btf_modifier_seq_show(const struct btf *btf, btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m); } +static void btf_var_seq_show(const struct btf *btf, const struct btf_type *t, + u32 type_id, void *data, u8 bits_offset, + struct seq_file *m) +{ + t = btf_type_id_resolve(btf, &type_id); + + btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m); +} + static void btf_ptr_seq_show(const struct btf *btf, const struct btf_type *t, u32 type_id, void *data, u8 bits_offset, struct seq_file *m) @@ -1552,17 +1812,30 @@ static s32 btf_fwd_check_meta(struct btf_verifier_env *env, return -EINVAL; } + /* fwd type must have a valid name */ + if (!t->name_off || + !btf_name_valid_identifier(env->btf, t->name_off)) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + btf_verifier_log_type(env, t, NULL); return 0; } +static void btf_fwd_type_log(struct btf_verifier_env *env, + const struct btf_type *t) +{ + btf_verifier_log(env, "%s", btf_type_kflag(t) ? "union" : "struct"); +} + static struct btf_kind_operations fwd_ops = { .check_meta = btf_fwd_check_meta, .resolve = btf_df_resolve, .check_member = btf_df_check_member, .check_kflag_member = btf_df_check_kflag_member, - .log_details = btf_ref_type_log, + .log_details = btf_fwd_type_log, .seq_show = btf_df_seq_show, }; @@ -1609,6 +1882,12 @@ static s32 btf_array_check_meta(struct btf_verifier_env *env, return -EINVAL; } + /* array type should not have a name */ + if (t->name_off) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + if (btf_type_vlen(t)) { btf_verifier_log_type(env, t, "vlen != 0"); return -EINVAL; @@ -1654,7 +1933,8 @@ static int btf_array_resolve(struct btf_verifier_env *env, /* Check array->index_type */ index_type_id = array->index_type; index_type = btf_type_by_id(btf, index_type_id); - if (btf_type_nosize_or_null(index_type)) { + if (btf_type_nosize_or_null(index_type) || + btf_type_is_resolve_source_only(index_type)) { btf_verifier_log_type(env, v->t, "Invalid index"); return -EINVAL; } @@ -1673,7 +1953,8 @@ static int btf_array_resolve(struct btf_verifier_env *env, /* Check array->type */ elem_type_id = array->type; elem_type = btf_type_by_id(btf, elem_type_id); - if (btf_type_nosize_or_null(elem_type)) { + if (btf_type_nosize_or_null(elem_type) || + btf_type_is_resolve_source_only(elem_type)) { btf_verifier_log_type(env, v->t, "Invalid elem"); return -EINVAL; @@ -1792,6 +2073,13 @@ static s32 btf_struct_check_meta(struct btf_verifier_env *env, return -EINVAL; } + /* struct type either no name or a valid one */ + if (t->name_off && + !btf_name_valid_identifier(env->btf, t->name_off)) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + btf_verifier_log_type(env, t, NULL); last_offset = 0; @@ -1803,6 +2091,12 @@ static s32 btf_struct_check_meta(struct btf_verifier_env *env, return -EINVAL; } + /* struct member either no name or a valid one */ + if (member->name_off && + !btf_name_valid_identifier(btf, member->name_off)) { + btf_verifier_log_member(env, t, member, "Invalid name"); + return -EINVAL; + } /* A member cannot be in type void */ if (!member->type || !BTF_TYPE_ID_VALID(member->type)) { btf_verifier_log_member(env, t, member, @@ -1829,7 +2123,7 @@ static s32 btf_struct_check_meta(struct btf_verifier_env *env, if (BITS_ROUNDUP_BYTES(offset) > struct_size) { btf_verifier_log_member(env, t, member, - "Memmber bits_offset exceeds its struct size"); + "Member bits_offset exceeds its struct size"); return -EINVAL; } @@ -1881,7 +2175,8 @@ static int btf_struct_resolve(struct btf_verifier_env *env, const struct btf_type *member_type = btf_type_by_id(env->btf, member_type_id); - if (btf_type_nosize_or_null(member_type)) { + if (btf_type_nosize_or_null(member_type) || + btf_type_is_resolve_source_only(member_type)) { btf_verifier_log_member(env, v->t, member, "Invalid member"); return -EINVAL; @@ -1924,8 +2219,10 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t) { const struct btf_member *member; u32 i, off = -ENOENT; + if (!__btf_type_is_struct(t)) return -EINVAL; + for_each_member(i, t, member) { const struct btf_type *member_type = btf_type_by_id(btf, member->type); @@ -1973,12 +2270,12 @@ static void btf_struct_seq_show(const struct btf *btf, const struct btf_type *t, member_offset = btf_member_bit_offset(t, member); bitfield_size = btf_member_bitfield_size(t, member); + bytes_offset = BITS_ROUNDDOWN_BYTES(member_offset); + bits8_offset = BITS_PER_BYTE_MASKED(member_offset); if (bitfield_size) { - btf_bitfield_seq_show(data, member_offset, + btf_bitfield_seq_show(data + bytes_offset, bits8_offset, bitfield_size, m); } else { - bytes_offset = BITS_ROUNDDOWN_BYTES(member_offset); - bits8_offset = BITS_PER_BYTE_MASKED(member_offset); ops = btf_type_ops(member_type); ops->seq_show(btf, member_type, member->type, data + bytes_offset, bits8_offset, m); @@ -2028,20 +2325,23 @@ static int btf_enum_check_kflag_member(struct btf_verifier_env *env, { u32 struct_bits_off, nr_bits, bytes_end, struct_size; u32 int_bitsize = sizeof(int) * BITS_PER_BYTE; + struct_bits_off = BTF_MEMBER_BIT_OFFSET(member->offset); nr_bits = BTF_MEMBER_BITFIELD_SIZE(member->offset); if (!nr_bits) { if (BITS_PER_BYTE_MASKED(struct_bits_off)) { btf_verifier_log_member(env, struct_type, member, "Member is not byte aligned"); - return -EINVAL; + return -EINVAL; } + nr_bits = int_bitsize; } else if (nr_bits > int_bitsize) { btf_verifier_log_member(env, struct_type, member, "Invalid member bitfield_size"); return -EINVAL; } + struct_size = struct_type->size; bytes_end = BITS_ROUNDUP_BYTES(struct_bits_off + nr_bits); if (struct_size < bytes_end) { @@ -2049,6 +2349,7 @@ static int btf_enum_check_kflag_member(struct btf_verifier_env *env, "Member exceeds struct_size"); return -EINVAL; } + return 0; } @@ -2076,9 +2377,15 @@ static s32 btf_enum_check_meta(struct btf_verifier_env *env, return -EINVAL; } - if (t->size != sizeof(int)) { - btf_verifier_log_type(env, t, "Expected size:%zu", - sizeof(int)); + if (t->size > 8 || !is_power_of_2(t->size)) { + btf_verifier_log_type(env, t, "Unexpected size"); + return -EINVAL; + } + + /* enum type either no name or a valid one */ + if (t->name_off && + !btf_name_valid_identifier(env->btf, t->name_off)) { + btf_verifier_log_type(env, t, "Invalid name"); return -EINVAL; } @@ -2091,6 +2398,14 @@ static s32 btf_enum_check_meta(struct btf_verifier_env *env, return -EINVAL; } + /* enum member must have a valid name */ + if (!enums[i].name_off || + !btf_name_valid_identifier(btf, enums[i].name_off)) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + + btf_verifier_log(env, "\t%s val=%d\n", __btf_name_by_offset(btf, enums[i].name_off), enums[i].val); @@ -2116,7 +2431,8 @@ static void btf_enum_seq_show(const struct btf *btf, const struct btf_type *t, for (i = 0; i < nr_enums; i++) { if (v == enums[i].val) { seq_printf(m, "%s", - __btf_name_by_offset(btf, enums[i].name_off)); + __btf_name_by_offset(btf, + enums[i].name_off)); return; } } @@ -2157,6 +2473,7 @@ static s32 btf_func_proto_check_meta(struct btf_verifier_env *env, } btf_verifier_log_type(env, t, NULL); + return meta_needed; } @@ -2165,8 +2482,8 @@ static void btf_func_proto_log(struct btf_verifier_env *env, { const struct btf_param *args = (const struct btf_param *)(t + 1); u16 nr_args = btf_type_vlen(t), i; - btf_verifier_log(env, "return=%u args=(", t->type); + btf_verifier_log(env, "return=%u args=(", t->type); if (!nr_args) { btf_verifier_log(env, "void"); goto done; @@ -2180,20 +2497,23 @@ static void btf_func_proto_log(struct btf_verifier_env *env, btf_verifier_log(env, "%u %s", args[0].type, __btf_name_by_offset(env->btf, - args[0].name_off)); + args[0].name_off)); for (i = 1; i < nr_args - 1; i++) btf_verifier_log(env, ", %u %s", args[i].type, __btf_name_by_offset(env->btf, - args[i].name_off)); + args[i].name_off)); + if (nr_args > 1) { const struct btf_param *last_arg = &args[nr_args - 1]; + if (last_arg->type) btf_verifier_log(env, ", %u %s", last_arg->type, __btf_name_by_offset(env->btf, - last_arg->name_off)); + last_arg->name_off)); else btf_verifier_log(env, ", vararg"); } + done: btf_verifier_log(env, ")"); } @@ -2237,6 +2557,7 @@ static s32 btf_func_check_meta(struct btf_verifier_env *env, } btf_verifier_log_type(env, t, NULL); + return 0; } @@ -2249,6 +2570,223 @@ static struct btf_kind_operations func_ops = { .seq_show = btf_df_seq_show, }; +static s32 btf_var_check_meta(struct btf_verifier_env *env, + const struct btf_type *t, + u32 meta_left) +{ + const struct btf_var *var; + u32 meta_needed = sizeof(*var); + + if (meta_left < meta_needed) { + btf_verifier_log_basic(env, t, + "meta_left:%u meta_needed:%u", + meta_left, meta_needed); + return -EINVAL; + } + + if (btf_type_vlen(t)) { + btf_verifier_log_type(env, t, "vlen != 0"); + return -EINVAL; + } + + if (btf_type_kflag(t)) { + btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); + return -EINVAL; + } + + if (!t->name_off || + !__btf_name_valid(env->btf, t->name_off, true)) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + + /* A var cannot be in type void */ + if (!t->type || !BTF_TYPE_ID_VALID(t->type)) { + btf_verifier_log_type(env, t, "Invalid type_id"); + return -EINVAL; + } + + var = btf_type_var(t); + if (var->linkage != BTF_VAR_STATIC && + var->linkage != BTF_VAR_GLOBAL_ALLOCATED) { + btf_verifier_log_type(env, t, "Linkage not supported"); + return -EINVAL; + } + + btf_verifier_log_type(env, t, NULL); + + return meta_needed; +} + +static void btf_var_log(struct btf_verifier_env *env, const struct btf_type *t) +{ + const struct btf_var *var = btf_type_var(t); + + btf_verifier_log(env, "type_id=%u linkage=%u", t->type, var->linkage); +} + +static const struct btf_kind_operations var_ops = { + .check_meta = btf_var_check_meta, + .resolve = btf_var_resolve, + .check_member = btf_df_check_member, + .check_kflag_member = btf_df_check_kflag_member, + .log_details = btf_var_log, + .seq_show = btf_var_seq_show, +}; + +static s32 btf_datasec_check_meta(struct btf_verifier_env *env, + const struct btf_type *t, + u32 meta_left) +{ + const struct btf_var_secinfo *vsi; + u64 last_vsi_end_off = 0, sum = 0; + u32 i, meta_needed; + + meta_needed = btf_type_vlen(t) * sizeof(*vsi); + if (meta_left < meta_needed) { + btf_verifier_log_basic(env, t, + "meta_left:%u meta_needed:%u", + meta_left, meta_needed); + return -EINVAL; + } + + if (!btf_type_vlen(t)) { + btf_verifier_log_type(env, t, "vlen == 0"); + return -EINVAL; + } + + if (!t->size) { + btf_verifier_log_type(env, t, "size == 0"); + return -EINVAL; + } + + if (btf_type_kflag(t)) { + btf_verifier_log_type(env, t, "Invalid btf_info kind_flag"); + return -EINVAL; + } + + if (!t->name_off || + !btf_name_valid_section(env->btf, t->name_off)) { + btf_verifier_log_type(env, t, "Invalid name"); + return -EINVAL; + } + + btf_verifier_log_type(env, t, NULL); + + for_each_vsi(i, t, vsi) { + /* A var cannot be in type void */ + if (!vsi->type || !BTF_TYPE_ID_VALID(vsi->type)) { + btf_verifier_log_vsi(env, t, vsi, + "Invalid type_id"); + return -EINVAL; + } + + if (vsi->offset < last_vsi_end_off || vsi->offset >= t->size) { + btf_verifier_log_vsi(env, t, vsi, + "Invalid offset"); + return -EINVAL; + } + + if (!vsi->size || vsi->size > t->size) { + btf_verifier_log_vsi(env, t, vsi, + "Invalid size"); + return -EINVAL; + } + + last_vsi_end_off = vsi->offset + vsi->size; + if (last_vsi_end_off > t->size) { + btf_verifier_log_vsi(env, t, vsi, + "Invalid offset+size"); + return -EINVAL; + } + + btf_verifier_log_vsi(env, t, vsi, NULL); + sum += vsi->size; + } + + if (t->size < sum) { + btf_verifier_log_type(env, t, "Invalid btf_info size"); + return -EINVAL; + } + + return meta_needed; +} + +static int btf_datasec_resolve(struct btf_verifier_env *env, + const struct resolve_vertex *v) +{ + const struct btf_var_secinfo *vsi; + struct btf *btf = env->btf; + u16 i; + + env->resolve_mode = RESOLVE_TBD; + for_each_vsi_from(i, v->next_member, v->t, vsi) { + u32 var_type_id = vsi->type, type_id, type_size = 0; + const struct btf_type *var_type = btf_type_by_id(env->btf, + var_type_id); + if (!var_type || !btf_type_is_var(var_type)) { + btf_verifier_log_vsi(env, v->t, vsi, + "Not a VAR kind member"); + return -EINVAL; + } + + if (!env_type_is_resolve_sink(env, var_type) && + !env_type_is_resolved(env, var_type_id)) { + env_stack_set_next_member(env, i + 1); + return env_stack_push(env, var_type, var_type_id); + } + + type_id = var_type->type; + if (!btf_type_id_size(btf, &type_id, &type_size)) { + btf_verifier_log_vsi(env, v->t, vsi, "Invalid type"); + return -EINVAL; + } + + if (vsi->size < type_size) { + btf_verifier_log_vsi(env, v->t, vsi, "Invalid size"); + return -EINVAL; + } + } + + env_stack_pop_resolved(env, 0, 0); + return 0; +} + +static void btf_datasec_log(struct btf_verifier_env *env, + const struct btf_type *t) +{ + btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t)); +} + +static void btf_datasec_seq_show(const struct btf *btf, + const struct btf_type *t, u32 type_id, + void *data, u8 bits_offset, + struct seq_file *m) +{ + const struct btf_var_secinfo *vsi; + const struct btf_type *var; + u32 i; + + seq_printf(m, "section (\"%s\") = {", __btf_name_by_offset(btf, t->name_off)); + for_each_vsi(i, t, vsi) { + var = btf_type_by_id(btf, vsi->type); + if (i) + seq_puts(m, ","); + btf_type_ops(var)->seq_show(btf, var, vsi->type, + data + vsi->offset, bits_offset, m); + } + seq_puts(m, "}"); +} + +static const struct btf_kind_operations datasec_ops = { + .check_meta = btf_datasec_check_meta, + .resolve = btf_datasec_resolve, + .check_member = btf_df_check_member, + .check_kflag_member = btf_df_check_kflag_member, + .log_details = btf_datasec_log, + .seq_show = btf_datasec_seq_show, +}; + static int btf_func_proto_check(struct btf_verifier_env *env, const struct btf_type *t) { @@ -2257,6 +2795,7 @@ static int btf_func_proto_check(struct btf_verifier_env *env, const struct btf *btf; u16 nr_args, i; int err; + btf = env->btf; args = (const struct btf_param *)(t + 1); nr_args = btf_type_vlen(t); @@ -2264,17 +2803,20 @@ static int btf_func_proto_check(struct btf_verifier_env *env, /* Check func return type which could be "void" (t->type == 0) */ if (t->type) { u32 ret_type_id = t->type; + ret_type = btf_type_by_id(btf, ret_type_id); if (!ret_type) { btf_verifier_log_type(env, t, "Invalid return type"); return -EINVAL; } + if (btf_type_needs_resolve(ret_type) && !env_type_is_resolved(env, ret_type_id)) { err = btf_resolve(env, ret_type, ret_type_id); if (err) return err; } + /* Ensure the return type is a type that has a size */ if (!btf_type_id_size(btf, &ret_type_id, NULL)) { btf_verifier_log_type(env, t, "Invalid return type"); @@ -2299,6 +2841,7 @@ static int btf_func_proto_check(struct btf_verifier_env *env, for (i = 0; i < nr_args; i++) { const struct btf_type *arg_type; u32 arg_type_id; + arg_type_id = args[i].type; arg_type = btf_type_by_id(btf, arg_type_id); if (!arg_type) { @@ -2307,6 +2850,11 @@ static int btf_func_proto_check(struct btf_verifier_env *env, break; } + if (btf_type_is_resolve_source_only(arg_type)) { + btf_verifier_log_type(env, t, "Invalid arg#%u", i + 1); + return -EINVAL; + } + if (args[i].name_off && (!btf_name_offset_valid(btf, args[i].name_off) || !btf_name_valid_identifier(btf, args[i].name_off))) { @@ -2329,8 +2877,10 @@ static int btf_func_proto_check(struct btf_verifier_env *env, break; } } + return err; } + static int btf_func_check(struct btf_verifier_env *env, const struct btf_type *t) { @@ -2338,6 +2888,7 @@ static int btf_func_check(struct btf_verifier_env *env, const struct btf_param *args; const struct btf *btf; u16 nr_args, i; + btf = env->btf; proto_type = btf_type_by_id(btf, t->type); @@ -2354,6 +2905,7 @@ static int btf_func_check(struct btf_verifier_env *env, return -EINVAL; } } + return 0; } @@ -2371,6 +2923,8 @@ static const struct btf_kind_operations * const kind_ops[NR_BTF_KINDS] = { [BTF_KIND_RESTRICT] = &modifier_ops, [BTF_KIND_FUNC] = &func_ops, [BTF_KIND_FUNC_PROTO] = &func_proto_ops, + [BTF_KIND_VAR] = &var_ops, + [BTF_KIND_DATASEC] = &datasec_ops, }; static s32 btf_check_meta(struct btf_verifier_env *env, @@ -2451,13 +3005,17 @@ static bool btf_resolve_valid(struct btf_verifier_env *env, if (!env_type_is_resolved(env, type_id)) return false; - if (btf_type_is_struct(t)) + if (btf_type_is_struct(t) || btf_type_is_datasec(t)) return !btf->resolved_ids[type_id] && - !btf->resolved_sizes[type_id]; + !btf->resolved_sizes[type_id]; - if (btf_type_is_modifier(t) || btf_type_is_ptr(t)) { + if (btf_type_is_modifier(t) || btf_type_is_ptr(t) || + btf_type_is_var(t)) { t = btf_type_id_resolve(btf, &type_id); - return t && !btf_type_is_modifier(t); + return t && + !btf_type_is_modifier(t) && + !btf_type_is_var(t) && + !btf_type_is_datasec(t); } if (btf_type_is_array(t)) { @@ -2484,7 +3042,6 @@ static int btf_resolve(struct btf_verifier_env *env, env->resolve_mode = RESOLVE_TBD; env_stack_push(env, t, type_id); - while (!err && (v = env_stack_peak(env))) { env->log_type_id = v->type_id; err = btf_type_ops(v->t)->resolve(env, v); @@ -2703,9 +3260,6 @@ static int btf_parse_hdr(struct btf_verifier_env *env) hdr = &btf->hdr; - if (hdr->hdr_len != hdr_len) - return -EINVAL; - btf_verifier_log_hdr(env, btf_data_size); if (hdr->magic != BTF_MAGIC) { @@ -2827,6 +3381,15 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj, btf_type_ops(t)->seq_show(btf, t, type_id, obj, 0, m); } +#ifdef CONFIG_PROC_FS +static void bpf_btf_show_fdinfo(struct seq_file *m, struct file *filp) +{ + const struct btf *btf = filp->private_data; + + seq_printf(m, "btf_id:\t%u\n", btf->id); +} +#endif + static int btf_release(struct inode *inode, struct file *filp) { btf_put(filp->private_data); @@ -2834,6 +3397,9 @@ static int btf_release(struct inode *inode, struct file *filp) } const struct file_operations btf_fops = { +#ifdef CONFIG_PROC_FS + .show_fdinfo = bpf_btf_show_fdinfo, +#endif .release = btf_release, }; @@ -2859,11 +3425,13 @@ int btf_new_fd(const union bpf_attr *attr) btf_free(btf); return ret; } + /* * The BTF ID is published to the userspace. * All BTF free must go through call_rcu() from * now on (i.e. free by calling btf_put()). */ + ret = __btf_new_fd(btf); if (ret < 0) btf_put(btf); @@ -2897,12 +3465,29 @@ int btf_get_info_by_fd(const struct btf *btf, const union bpf_attr *attr, union bpf_attr __user *uattr) { - void __user *udata = u64_to_user_ptr(attr->info.info); - u32 copy_len = min_t(u32, btf->data_size, - attr->info.info_len); + struct bpf_btf_info __user *uinfo; + struct bpf_btf_info info; + u32 info_copy, btf_copy; + void __user *ubtf; + u32 uinfo_len; - if (copy_to_user(udata, btf->data, copy_len) || - put_user(btf->data_size, &uattr->info.info_len)) + uinfo = u64_to_user_ptr(attr->info.info); + uinfo_len = attr->info.info_len; + + info_copy = min_t(u32, uinfo_len, sizeof(info)); + memset(&info, 0, sizeof(info)); + if (copy_from_user(&info, uinfo, info_copy)) + return -EFAULT; + + info.id = btf->id; + ubtf = u64_to_user_ptr(info.btf); + btf_copy = min_t(u32, btf->data_size, info.btf_size); + if (copy_to_user(ubtf, btf->data, btf_copy)) + return -EFAULT; + info.btf_size = btf->data_size; + + if (copy_to_user(uinfo, &info, info_copy) || + put_user(info_copy, &uattr->info.info_len)) return -EFAULT; return 0; @@ -2912,18 +3497,23 @@ int btf_get_fd_by_id(u32 id) { struct btf *btf; int fd; + rcu_read_lock(); btf = idr_find(&btf_idr, id); if (!btf || !refcount_inc_not_zero(&btf->refcnt)) btf = ERR_PTR(-ENOENT); rcu_read_unlock(); + if (IS_ERR(btf)) return PTR_ERR(btf); + fd = __btf_new_fd(btf); if (fd < 0) btf_put(btf); + return fd; } + u32 btf_id(const struct btf *btf) { return btf->id; diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 6e084cfa03b2..08c1246c758e 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -1,11 +1,8 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Functions to manage eBPF programs attached to cgroups * * Copyright (c) 2016 Daniel Mack - * - * This file is subject to the terms and conditions of version 2 of the GNU - * General Public License. See the file COPYING in the main directory of the - * Linux distribution for more details. */ #include @@ -14,23 +11,38 @@ #include #include #include +#include #include #include #include #include +#include "../cgroup/cgroup-internal.h" + DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key); EXPORT_SYMBOL(cgroup_bpf_enabled_key); -/** - * cgroup_bpf_put() - put references of all bpf programs - * @cgrp: the cgroup to modify - */ -void cgroup_bpf_put(struct cgroup *cgrp) +void cgroup_bpf_offline(struct cgroup *cgrp) { + cgroup_get(cgrp); + percpu_ref_kill(&cgrp->bpf.refcnt); +} + +/** + * cgroup_bpf_release() - put references of all bpf programs and + * release all cgroup bpf data + * @work: work structure embedded into the cgroup to modify + */ +static void cgroup_bpf_release(struct work_struct *work) +{ + struct cgroup *p, *cgrp = container_of(work, struct cgroup, + bpf.release_work); enum bpf_cgroup_storage_type stype; + struct bpf_prog_array *old_array; unsigned int type; + mutex_lock(&cgroup_mutex); + for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) { struct list_head *progs = &cgrp->bpf.progs[type]; struct bpf_prog_list *pl, *tmp; @@ -45,8 +57,32 @@ void cgroup_bpf_put(struct cgroup *cgrp) kfree(pl); static_branch_dec(&cgroup_bpf_enabled_key); } - bpf_prog_array_free(cgrp->bpf.effective[type]); + old_array = rcu_dereference_protected( + cgrp->bpf.effective[type], + lockdep_is_held(&cgroup_mutex)); + bpf_prog_array_free(old_array); } + + mutex_unlock(&cgroup_mutex); + + for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p)) + cgroup_bpf_put(p); + + percpu_ref_exit(&cgrp->bpf.refcnt); + cgroup_put(cgrp); +} + +/** + * cgroup_bpf_release_fn() - callback used to schedule releasing + * of bpf cgroup data + * @ref: percpu ref counter structure + */ +static void cgroup_bpf_release_fn(struct percpu_ref *ref) +{ + struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt); + + INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release); + queue_work(system_wq, &cgrp->bpf.release_work); } /* count number of elements in the list. @@ -101,7 +137,7 @@ static bool hierarchy_allows_attach(struct cgroup *cgrp, */ static int compute_effective_progs(struct cgroup *cgrp, enum bpf_attach_type type, - struct bpf_prog_array __rcu **array) + struct bpf_prog_array **array) { enum bpf_cgroup_storage_type stype; struct bpf_prog_array *progs; @@ -127,10 +163,10 @@ static int compute_effective_progs(struct cgroup *cgrp, if (cnt > 0 && !(p->bpf.flags[type] & BPF_F_ALLOW_MULTI)) continue; - list_for_each_entry(pl, - &p->bpf.progs[type], node) { + list_for_each_entry(pl, &p->bpf.progs[type], node) { if (!pl->prog) continue; + progs->items[cnt].prog = pl->prog; for_each_cgroup_storage_type(stype) progs->items[cnt].cgroup_storage[stype] = @@ -139,17 +175,16 @@ static int compute_effective_progs(struct cgroup *cgrp, } } while ((p = cgroup_parent(p))); - rcu_assign_pointer(*array, progs); + *array = progs; return 0; } static void activate_effective_progs(struct cgroup *cgrp, enum bpf_attach_type type, - struct bpf_prog_array __rcu *array) + struct bpf_prog_array *old_array) { - struct bpf_prog_array __rcu *old_array; - - old_array = xchg(&cgrp->bpf.effective[type], array); + rcu_swap_protected(cgrp->bpf.effective[type], old_array, + lockdep_is_held(&cgroup_mutex)); /* free prog array after grace period, since __cgroup_bpf_run_*() * might be still walking the array */ @@ -166,8 +201,17 @@ int cgroup_bpf_inherit(struct cgroup *cgrp) * that array below is variable length */ #define NR ARRAY_SIZE(cgrp->bpf.effective) - struct bpf_prog_array __rcu *arrays[NR] = {}; - int i; + struct bpf_prog_array *arrays[NR] = {}; + struct cgroup *p; + int ret, i; + + ret = percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn, 0, + GFP_KERNEL); + if (ret) + return ret; + + for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p)) + cgroup_bpf_get(p); for (i = 0; i < NR; i++) INIT_LIST_HEAD(&cgrp->bpf.progs[i]); @@ -183,9 +227,65 @@ int cgroup_bpf_inherit(struct cgroup *cgrp) cleanup: for (i = 0; i < NR; i++) bpf_prog_array_free(arrays[i]); + + for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p)) + cgroup_bpf_put(p); + + percpu_ref_exit(&cgrp->bpf.refcnt); + return -ENOMEM; } +static int update_effective_progs(struct cgroup *cgrp, + enum bpf_attach_type type) +{ + struct cgroup_subsys_state *css; + int err; + + /* allocate and recompute effective prog arrays */ + css_for_each_descendant_pre(css, &cgrp->self) { + struct cgroup *desc = container_of(css, struct cgroup, self); + + if (percpu_ref_is_zero(&desc->bpf.refcnt)) + continue; + + err = compute_effective_progs(desc, type, &desc->bpf.inactive); + if (err) + goto cleanup; + } + + /* all allocations were successful. Activate all prog arrays */ + css_for_each_descendant_pre(css, &cgrp->self) { + struct cgroup *desc = container_of(css, struct cgroup, self); + + if (percpu_ref_is_zero(&desc->bpf.refcnt)) { + if (unlikely(desc->bpf.inactive)) { + bpf_prog_array_free(desc->bpf.inactive); + desc->bpf.inactive = NULL; + } + continue; + } + + activate_effective_progs(desc, type, desc->bpf.inactive); + desc->bpf.inactive = NULL; + } + + return 0; + +cleanup: + /* oom while computing effective. Free all computed effective arrays + * since they were not activated + */ + css_for_each_descendant_pre(css, &cgrp->self) { + struct cgroup *desc = container_of(css, struct cgroup, self); + + bpf_prog_array_free(desc->bpf.inactive); + desc->bpf.inactive = NULL; + } + + return err; +} + #define BPF_CGROUP_MAX_PROGS 64 /** @@ -194,6 +294,7 @@ cleanup: * @cgrp: The cgroup which descendants to traverse * @prog: A program to attach * @type: Type of attach operation + * @flags: Option flags * * Must be called with cgroup_mutex held. */ @@ -202,13 +303,11 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, { struct list_head *progs = &cgrp->bpf.progs[type]; struct bpf_prog *old_prog = NULL; - struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE], - *old_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {NULL}; + struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {}; + struct bpf_cgroup_storage *old_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {}; enum bpf_cgroup_storage_type stype; - struct cgroup_subsys_state *css; struct bpf_prog_list *pl; bool pl_was_allocated; - u32 old_flags; int err; if ((flags & BPF_F_ALLOW_OVERRIDE) && (flags & BPF_F_ALLOW_MULTI)) @@ -254,6 +353,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, bpf_cgroup_storage_free(storage[stype]); return -ENOMEM; } + pl_was_allocated = true; pl->prog = prog; for_each_cgroup_storage_type(stype) @@ -283,25 +383,11 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, pl->storage[stype] = storage[stype]; } - old_flags = cgrp->bpf.flags[type]; cgrp->bpf.flags[type] = flags; - /* allocate and recompute effective prog arrays */ - css_for_each_descendant_pre(css, &cgrp->self) { - struct cgroup *desc = container_of(css, struct cgroup, self); - - err = compute_effective_progs(desc, type, &desc->bpf.inactive); - if (err) - goto cleanup; - } - - /* all allocations were successful. Activate all prog arrays */ - css_for_each_descendant_pre(css, &cgrp->self) { - struct cgroup *desc = container_of(css, struct cgroup, self); - - activate_effective_progs(desc, type, desc->bpf.inactive); - desc->bpf.inactive = NULL; - } + err = update_effective_progs(cgrp, type); + if (err) + goto cleanup; static_branch_inc(&cgroup_bpf_enabled_key); for_each_cgroup_storage_type(stype) { @@ -318,28 +404,18 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, return 0; cleanup: - /* oom while computing effective. Free all computed effective arrays - * since they were not activated - */ - css_for_each_descendant_pre(css, &cgrp->self) { - struct cgroup *desc = container_of(css, struct cgroup, self); - - bpf_prog_array_free(desc->bpf.inactive); - desc->bpf.inactive = NULL; - } - - /* and cleanup the prog list */ - pl->prog = old_prog; + /* and cleanup the prog list */ + pl->prog = old_prog; for_each_cgroup_storage_type(stype) { bpf_cgroup_storage_free(pl->storage[stype]); pl->storage[stype] = old_storage[stype]; bpf_cgroup_storage_link(old_storage[stype], cgrp, type); } - if (pl_was_allocated) { - list_del(&pl->node); - kfree(pl); - } - return err; + if (pl_was_allocated) { + list_del(&pl->node); + kfree(pl); + } + return err; } /** @@ -352,13 +428,12 @@ cleanup: * Must be called with cgroup_mutex held. */ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, - enum bpf_attach_type type, u32 unused_flags) + enum bpf_attach_type type) { struct list_head *progs = &cgrp->bpf.progs[type]; enum bpf_cgroup_storage_type stype; u32 flags = cgrp->bpf.flags[type]; struct bpf_prog *old_prog = NULL; - struct cgroup_subsys_state *css; struct bpf_prog_list *pl; int err; @@ -397,22 +472,9 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, pl->prog = NULL; } - /* allocate and recompute effective prog arrays */ - css_for_each_descendant_pre(css, &cgrp->self) { - struct cgroup *desc = container_of(css, struct cgroup, self); - - err = compute_effective_progs(desc, type, &desc->bpf.inactive); - if (err) - goto cleanup; - } - - /* all allocations were successful. Activate all prog arrays */ - css_for_each_descendant_pre(css, &cgrp->self) { - struct cgroup *desc = container_of(css, struct cgroup, self); - - activate_effective_progs(desc, type, desc->bpf.inactive); - desc->bpf.inactive = NULL; - } + err = update_effective_progs(cgrp, type); + if (err) + goto cleanup; /* now can actually delete it from this cgroup list */ list_del(&pl->node); @@ -430,16 +492,6 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, return 0; cleanup: - /* oom while computing effective. Free all computed effective arrays - * since they were not activated - */ - css_for_each_descendant_pre(css, &cgrp->self) { - struct cgroup *desc = container_of(css, struct cgroup, self); - - bpf_prog_array_free(desc->bpf.inactive); - desc->bpf.inactive = NULL; - } - /* and restore back old_prog */ pl->prog = old_prog; return err; @@ -453,10 +505,14 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, enum bpf_attach_type type = attr->query.attach_type; struct list_head *progs = &cgrp->bpf.progs[type]; u32 flags = cgrp->bpf.flags[type]; + struct bpf_prog_array *effective; int cnt, ret = 0, i; + effective = rcu_dereference_protected(cgrp->bpf.effective[type], + lockdep_is_held(&cgroup_mutex)); + if (attr->query.query_flags & BPF_F_QUERY_EFFECTIVE) - cnt = bpf_prog_array_length(cgrp->bpf.effective[type]); + cnt = bpf_prog_array_length(effective); else cnt = prog_list_length(progs); @@ -473,8 +529,7 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, } if (attr->query.query_flags & BPF_F_QUERY_EFFECTIVE) { - return bpf_prog_array_copy_to_user(cgrp->bpf.effective[type], - prog_ids, cnt); + return bpf_prog_array_copy_to_user(effective, prog_ids, cnt); } else { struct bpf_prog_list *pl; u32 id; @@ -491,6 +546,60 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, return ret; } +int cgroup_bpf_prog_attach(const union bpf_attr *attr, + enum bpf_prog_type ptype, struct bpf_prog *prog) +{ + struct cgroup *cgrp; + int ret; + + cgrp = cgroup_get_from_fd(attr->target_fd); + if (IS_ERR(cgrp)) + return PTR_ERR(cgrp); + + ret = cgroup_bpf_attach(cgrp, prog, attr->attach_type, + attr->attach_flags); + cgroup_put(cgrp); + return ret; +} + +int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype) +{ + struct bpf_prog *prog; + struct cgroup *cgrp; + int ret; + + cgrp = cgroup_get_from_fd(attr->target_fd); + if (IS_ERR(cgrp)) + return PTR_ERR(cgrp); + + prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype); + if (IS_ERR(prog)) + prog = NULL; + + ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, 0); + if (prog) + bpf_prog_put(prog); + + cgroup_put(cgrp); + return ret; +} + +int cgroup_bpf_prog_query(const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + struct cgroup *cgrp; + int ret; + + cgrp = cgroup_get_from_fd(attr->query.target_fd); + if (IS_ERR(cgrp)) + return PTR_ERR(cgrp); + + ret = cgroup_bpf_query(cgrp, attr, uattr); + + cgroup_put(cgrp); + return ret; +} + /** * __cgroup_bpf_run_filter_skb() - Run a program for packet filtering * @sk: The socket sending or receiving traffic @@ -503,8 +612,16 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, * The program type passed in via @type must be suitable for network * filtering. No further check is performed to assert that. * - * This function will return %-EPERM if any if an attached program was found - * and if it returned != 1 during execution. In all other cases, 0 is returned. + * For egress packets, this function can return: + * NET_XMIT_SUCCESS (0) - continue with packet output + * NET_XMIT_DROP (1) - drop packet and notify TCP to call cwr + * NET_XMIT_CN (2) - continue with packet output and notify TCP + * to call cwr + * -EPERM - drop packet + * + * For ingress packets, this function will return -EPERM if any + * attached program was found and if it returned != 1 during execution. + * Otherwise 0 is returned. */ int __cgroup_bpf_run_filter_skb(struct sock *sk, struct sk_buff *skb, @@ -512,6 +629,7 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk, { unsigned int offset = skb->data - skb_network_header(skb); struct sock *save_sk; + void *saved_data_end; struct cgroup *cgrp; int ret; @@ -525,11 +643,23 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk, save_sk = skb->sk; skb->sk = sk; __skb_push(skb, offset); - ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb, - bpf_prog_run_save_cb); + + /* compute pointers for the bpf prog */ + bpf_compute_and_save_data_end(skb, &saved_data_end); + + if (type == BPF_CGROUP_INET_EGRESS) { + ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY( + cgrp->bpf.effective[type], skb, __bpf_prog_run_save_cb); + } else { + ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb, + __bpf_prog_run_save_cb); + ret = (ret == 1 ? 0 : -EPERM); + } + bpf_restore_data_end(skb, saved_data_end); __skb_pull(skb, offset); skb->sk = save_sk; - return ret == 1 ? 0 : -EPERM; + + return ret; } EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb); @@ -662,6 +792,12 @@ cgroup_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_map_update_elem_proto; case BPF_FUNC_map_delete_elem: return &bpf_map_delete_elem_proto; + case BPF_FUNC_map_push_elem: + return &bpf_map_push_elem_proto; + case BPF_FUNC_map_pop_elem: + return &bpf_map_pop_elem_proto; + case BPF_FUNC_map_peek_elem: + return &bpf_map_peek_elem_proto; case BPF_FUNC_get_current_uid_gid: return &bpf_get_current_uid_gid_proto; case BPF_FUNC_get_local_storage: @@ -671,6 +807,7 @@ cgroup_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_trace_printk: if (capable(CAP_SYS_ADMIN)) return bpf_get_trace_printk_proto(); + /* fall through */ default: return NULL; } @@ -682,10 +819,13 @@ cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return cgroup_base_func_proto(func_id, prog); } -static bool cgroup_dev_is_valid_access(int off, int size, enum bpf_access_type type, +static bool cgroup_dev_is_valid_access(int off, int size, + enum bpf_access_type type, const struct bpf_prog *prog, struct bpf_insn_access_aux *info) { + const int size_default = sizeof(__u32); + if (type == BPF_WRITE) return false; @@ -694,8 +834,17 @@ static bool cgroup_dev_is_valid_access(int off, int size, enum bpf_access_type t /* The verifier guarantees that size > 0. */ if (off % size != 0) return false; - if (size != sizeof(__u32)) - return false; + + switch (off) { + case bpf_ctx_range(struct bpf_cgroup_dev_ctx, access_type): + bpf_ctx_record_field_size(info, size_default); + if (!bpf_ctx_narrow_access_ok(off, size, size_default)) + return false; + break; + default: + if (size != size_default) + return false; + } return true; } @@ -708,13 +857,22 @@ const struct bpf_verifier_ops cg_dev_verifier_ops = { .is_valid_access = cgroup_dev_is_valid_access, }; - /** * __cgroup_bpf_run_filter_sysctl - Run a program on sysctl * * @head: sysctl table header * @table: sysctl table * @write: sysctl is being read (= 0) or written (= 1) + * @buf: pointer to buffer passed by user space + * @pcount: value-result argument: value is size of buffer pointed to by @buf, + * result is size of @new_buf if program set new value, initial value + * otherwise + * @ppos: value-result argument: value is position at which read from or write + * to sysctl is happening, result is new position if program overrode it, + * initial value otherwise + * @new_buf: pointer to pointer to new buffer that will be allocated if program + * overrides new value provided by user space on sysctl write + * NOTE: it's caller responsibility to free *new_buf if it was set * @type: type of program to be executed * * Program is run when sysctl is being accessed, either read or written, and @@ -725,26 +883,73 @@ const struct bpf_verifier_ops cg_dev_verifier_ops = { */ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head, struct ctl_table *table, int write, + void __user *buf, size_t *pcount, + loff_t *ppos, void **new_buf, enum bpf_attach_type type) { struct bpf_sysctl_kern ctx = { .head = head, .table = table, .write = write, + .ppos = ppos, + .cur_val = NULL, + .cur_len = PAGE_SIZE, + .new_val = NULL, + .new_len = 0, + .new_updated = 0, }; struct cgroup *cgrp; int ret; + ctx.cur_val = kmalloc_track_caller(ctx.cur_len, GFP_KERNEL); + if (ctx.cur_val) { + mm_segment_t old_fs; + loff_t pos = 0; + + old_fs = get_fs(); + set_fs(KERNEL_DS); + if (table->proc_handler(table, 0, (void __user *)ctx.cur_val, + &ctx.cur_len, &pos)) { + /* Let BPF program decide how to proceed. */ + ctx.cur_len = 0; + } + set_fs(old_fs); + } else { + /* Let BPF program decide how to proceed. */ + ctx.cur_len = 0; + } + + if (write && buf && *pcount) { + /* BPF program should be able to override new value with a + * buffer bigger than provided by user. + */ + ctx.new_val = kmalloc_track_caller(PAGE_SIZE, GFP_KERNEL); + ctx.new_len = min_t(size_t, PAGE_SIZE, *pcount); + if (!ctx.new_val || + copy_from_user(ctx.new_val, buf, ctx.new_len)) + /* Let BPF program decide how to proceed. */ + ctx.new_len = 0; + } + rcu_read_lock(); cgrp = task_dfl_cgroup(current); ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx, BPF_PROG_RUN); rcu_read_unlock(); + kfree(ctx.cur_val); + + if (ret == 1 && ctx.new_updated) { + *new_buf = ctx.new_val; + *pcount = ctx.new_len; + } else { + kfree(ctx.new_val); + } + return ret == 1 ? 0 : -EPERM; } - EXPORT_SYMBOL(__cgroup_bpf_run_filter_sysctl); +#ifdef CONFIG_NET static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp, enum bpf_attach_type attach_type) { @@ -755,6 +960,7 @@ static bool __cgroup_bpf_prog_array_is_empty(struct cgroup *cgrp, prog_array = rcu_dereference(cgrp->bpf.effective[attach_type]); empty = bpf_prog_array_is_empty(prog_array); rcu_read_unlock(); + return empty; } @@ -775,6 +981,7 @@ static int sockopt_alloc_buf(struct bpf_sockopt_kern *ctx, int max_optlen) return -ENOMEM; ctx->optval_end = ctx->optval + max_optlen; + return max_optlen; } @@ -824,10 +1031,12 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level, ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_SETSOCKOPT], &ctx, BPF_PROG_RUN); release_sock(sk); + if (!ret) { ret = -EPERM; goto out; } + if (ctx.optlen == -1) { /* optlen set to -1, bypass kernel */ ret = 1; @@ -837,6 +1046,7 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level, } else { /* optlen within bounds, run kernel handler */ ret = 0; + /* export any potential modifications */ *level = ctx.level; *optname = ctx.optname; @@ -847,15 +1057,17 @@ int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level, if (ctx.optlen != 0) { *optlen = ctx.optlen; *kernel_optval = ctx.optval; + /* export and don't free sockopt buf */ + return 0; } } + out: - if (ret) - sockopt_free_buf(&ctx); + sockopt_free_buf(&ctx); return ret; } - EXPORT_SYMBOL(__cgroup_bpf_run_filter_setsockopt); + int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, int optname, char __user *optval, int __user *optlen, int max_optlen, @@ -891,11 +1103,17 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, * one that kernel returned as well to let * BPF programs inspect the value. */ + if (get_user(ctx.optlen, optlen)) { ret = -EFAULT; goto out; } + if (ctx.optlen < 0) { + ret = -EFAULT; + goto out; + } + if (copy_from_user(ctx.optval, optval, min(ctx.optlen, max_optlen)) != 0) { ret = -EFAULT; @@ -907,12 +1125,13 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[BPF_CGROUP_GETSOCKOPT], &ctx, BPF_PROG_RUN); release_sock(sk); + if (!ret) { ret = -EPERM; goto out; } - if (ctx.optlen > max_optlen) { + if (optval && (ctx.optlen > max_optlen || ctx.optlen < 0)) { ret = -EFAULT; goto out; } @@ -926,24 +1145,192 @@ int __cgroup_bpf_run_filter_getsockopt(struct sock *sk, int level, } if (ctx.optlen != 0) { - if (copy_to_user(optval, ctx.optval, ctx.optlen) || - put_user(ctx.optlen, optlen)) { + if (optval && copy_to_user(optval, ctx.optval, ctx.optlen)) { + ret = -EFAULT; + goto out; + } + if (put_user(ctx.optlen, optlen)) { ret = -EFAULT; goto out; } } ret = ctx.retval; + out: sockopt_free_buf(&ctx); return ret; } EXPORT_SYMBOL(__cgroup_bpf_run_filter_getsockopt); +#endif + +static ssize_t sysctl_cpy_dir(const struct ctl_dir *dir, char **bufp, + size_t *lenp) +{ + ssize_t tmp_ret = 0, ret; + + if (dir->header.parent) { + tmp_ret = sysctl_cpy_dir(dir->header.parent, bufp, lenp); + if (tmp_ret < 0) + return tmp_ret; + } + + ret = strscpy(*bufp, dir->header.ctl_table[0].procname, *lenp); + if (ret < 0) + return ret; + *bufp += ret; + *lenp -= ret; + ret += tmp_ret; + + /* Avoid leading slash. */ + if (!ret) + return ret; + + tmp_ret = strscpy(*bufp, "/", *lenp); + if (tmp_ret < 0) + return tmp_ret; + *bufp += tmp_ret; + *lenp -= tmp_ret; + + return ret + tmp_ret; +} + +BPF_CALL_4(bpf_sysctl_get_name, struct bpf_sysctl_kern *, ctx, char *, buf, + size_t, buf_len, u64, flags) +{ + ssize_t tmp_ret = 0, ret; + + if (!buf) + return -EINVAL; + + if (!(flags & BPF_F_SYSCTL_BASE_NAME)) { + if (!ctx->head) + return -EINVAL; + tmp_ret = sysctl_cpy_dir(ctx->head->parent, &buf, &buf_len); + if (tmp_ret < 0) + return tmp_ret; + } + + ret = strscpy(buf, ctx->table->procname, buf_len); + + return ret < 0 ? ret : tmp_ret + ret; +} + +static const struct bpf_func_proto bpf_sysctl_get_name_proto = { + .func = bpf_sysctl_get_name, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, +}; + +static int copy_sysctl_value(char *dst, size_t dst_len, char *src, + size_t src_len) +{ + if (!dst) + return -EINVAL; + + if (!dst_len) + return -E2BIG; + + if (!src || !src_len) { + memset(dst, 0, dst_len); + return -EINVAL; + } + + memcpy(dst, src, min(dst_len, src_len)); + + if (dst_len > src_len) { + memset(dst + src_len, '\0', dst_len - src_len); + return src_len; + } + + dst[dst_len - 1] = '\0'; + + return -E2BIG; +} + +BPF_CALL_3(bpf_sysctl_get_current_value, struct bpf_sysctl_kern *, ctx, + char *, buf, size_t, buf_len) +{ + return copy_sysctl_value(buf, buf_len, ctx->cur_val, ctx->cur_len); +} + +static const struct bpf_func_proto bpf_sysctl_get_current_value_proto = { + .func = bpf_sysctl_get_current_value, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_UNINIT_MEM, + .arg3_type = ARG_CONST_SIZE, +}; + +BPF_CALL_3(bpf_sysctl_get_new_value, struct bpf_sysctl_kern *, ctx, char *, buf, + size_t, buf_len) +{ + if (!ctx->write) { + if (buf && buf_len) + memset(buf, '\0', buf_len); + return -EINVAL; + } + return copy_sysctl_value(buf, buf_len, ctx->new_val, ctx->new_len); +} + +static const struct bpf_func_proto bpf_sysctl_get_new_value_proto = { + .func = bpf_sysctl_get_new_value, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_UNINIT_MEM, + .arg3_type = ARG_CONST_SIZE, +}; + +BPF_CALL_3(bpf_sysctl_set_new_value, struct bpf_sysctl_kern *, ctx, + const char *, buf, size_t, buf_len) +{ + if (!ctx->write || !ctx->new_val || !ctx->new_len || !buf || !buf_len) + return -EINVAL; + + if (buf_len > PAGE_SIZE - 1) + return -E2BIG; + + memcpy(ctx->new_val, buf, buf_len); + ctx->new_len = buf_len; + ctx->new_updated = 1; + + return 0; +} + +static const struct bpf_func_proto bpf_sysctl_set_new_value_proto = { + .func = bpf_sysctl_set_new_value, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, +}; static const struct bpf_func_proto * sysctl_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { - return cgroup_dev_func_proto(func_id, prog); + switch (func_id) { + case BPF_FUNC_strtol: + return &bpf_strtol_proto; + case BPF_FUNC_strtoul: + return &bpf_strtoul_proto; + case BPF_FUNC_sysctl_get_name: + return &bpf_sysctl_get_name_proto; + case BPF_FUNC_sysctl_get_current_value: + return &bpf_sysctl_get_current_value_proto; + case BPF_FUNC_sysctl_get_new_value: + return &bpf_sysctl_get_new_value_proto; + case BPF_FUNC_sysctl_set_new_value: + return &bpf_sysctl_set_new_value_proto; + default: + return cgroup_base_func_proto(func_id, prog); + } } static bool sysctl_is_valid_access(int off, int size, enum bpf_access_type type, @@ -952,14 +1339,22 @@ static bool sysctl_is_valid_access(int off, int size, enum bpf_access_type type, { const int size_default = sizeof(__u32); - if (off < 0 || off + size > sizeof(struct bpf_sysctl) || - off % size || type != BPF_READ) + if (off < 0 || off + size > sizeof(struct bpf_sysctl) || off % size) return false; switch (off) { - case offsetof(struct bpf_sysctl, write): + case bpf_ctx_range(struct bpf_sysctl, write): + if (type != BPF_READ) + return false; bpf_ctx_record_field_size(info, size_default); return bpf_ctx_narrow_access_ok(off, size, size_default); + case bpf_ctx_range(struct bpf_sysctl, file_pos): + if (type == BPF_READ) { + bpf_ctx_record_field_size(info, size_default); + return bpf_ctx_narrow_access_ok(off, size, size_default); + } else { + return size == size_default; + } default: return false; } @@ -971,6 +1366,7 @@ static u32 sysctl_convert_ctx_access(enum bpf_access_type type, struct bpf_prog *prog, u32 *target_size) { struct bpf_insn *insn = insn_buf; + u32 read_size; switch (si->off) { case offsetof(struct bpf_sysctl, write): @@ -981,6 +1377,46 @@ static u32 sysctl_convert_ctx_access(enum bpf_access_type type, write), target_size)); break; + case offsetof(struct bpf_sysctl, file_pos): + /* ppos is a pointer so it should be accessed via indirect + * loads and stores. Also for stores additional temporary + * register is used since neither src_reg nor dst_reg can be + * overridden. + */ + if (type == BPF_WRITE) { + int treg = BPF_REG_9; + + if (si->src_reg == treg || si->dst_reg == treg) + --treg; + if (si->src_reg == treg || si->dst_reg == treg) + --treg; + *insn++ = BPF_STX_MEM( + BPF_DW, si->dst_reg, treg, + offsetof(struct bpf_sysctl_kern, tmp_reg)); + *insn++ = BPF_LDX_MEM( + BPF_FIELD_SIZEOF(struct bpf_sysctl_kern, ppos), + treg, si->dst_reg, + offsetof(struct bpf_sysctl_kern, ppos)); + *insn++ = BPF_STX_MEM( + BPF_SIZEOF(u32), treg, si->src_reg, + bpf_ctx_narrow_access_offset( + 0, sizeof(u32), sizeof(loff_t))); + *insn++ = BPF_LDX_MEM( + BPF_DW, treg, si->dst_reg, + offsetof(struct bpf_sysctl_kern, tmp_reg)); + } else { + *insn++ = BPF_LDX_MEM( + BPF_FIELD_SIZEOF(struct bpf_sysctl_kern, ppos), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sysctl_kern, ppos)); + read_size = bpf_size_to_bytes(BPF_SIZE(si->code)); + *insn++ = BPF_LDX_MEM( + BPF_SIZE(si->code), si->dst_reg, si->dst_reg, + bpf_ctx_narrow_access_offset( + 0, read_size, sizeof(loff_t))); + } + *target_size = sizeof(u32); + break; } return insn - insn_buf; @@ -999,10 +1435,12 @@ static const struct bpf_func_proto * cg_sockopt_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { switch (func_id) { +#ifdef CONFIG_NET case BPF_FUNC_sk_storage_get: return &bpf_sk_storage_get_proto; case BPF_FUNC_sk_storage_delete: return &bpf_sk_storage_delete_proto; +#endif #ifdef CONFIG_INET case BPF_FUNC_tcp_sock: return &bpf_tcp_sock_proto; @@ -1018,10 +1456,13 @@ static bool cg_sockopt_is_valid_access(int off, int size, struct bpf_insn_access_aux *info) { const int size_default = sizeof(__u32); + if (off < 0 || off >= sizeof(struct bpf_sockopt)) return false; + if (off % size != 0) return false; + if (type == BPF_WRITE) { switch (off) { case offsetof(struct bpf_sockopt, retval): @@ -1042,6 +1483,7 @@ static bool cg_sockopt_is_valid_access(int off, int size, return false; } } + switch (off) { case offsetof(struct bpf_sockopt, sk): if (size != sizeof(__u64)) @@ -1067,7 +1509,6 @@ static bool cg_sockopt_is_valid_access(int off, int size, return false; break; } - return true; } @@ -1075,6 +1516,7 @@ static bool cg_sockopt_is_valid_access(int off, int size, T(BPF_FIELD_SIZEOF(struct bpf_sockopt_kern, F), \ si->dst_reg, si->src_reg, \ offsetof(struct bpf_sockopt_kern, F)) + static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type, const struct bpf_insn *si, struct bpf_insn *insn_buf, @@ -1082,6 +1524,7 @@ static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type, u32 *target_size) { struct bpf_insn *insn = insn_buf; + switch (si->off) { case offsetof(struct bpf_sockopt, sk): *insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, sk); @@ -1117,6 +1560,7 @@ static u32 cg_sockopt_convert_ctx_access(enum bpf_access_type type, *insn++ = CG_SOCKOPT_ACCESS_FIELD(BPF_LDX_MEM, optval_end); break; } + return insn - insn_buf; } diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index b312af3dcd35..c244122355e1 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-or-later /* * Linux Socket Filter - Kernel level socket filtering * @@ -12,11 +13,6 @@ * Alexei Starovoitov * Daniel Borkmann * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * * Andi Kleen - Fix a few bad bugs and races. * Kris Katterjohn - Added many additional checks in bpf_check_classic() */ @@ -33,10 +29,14 @@ #include #include #include +#include +#include + #ifdef CONFIG_RKP_MODULE_SUPPORT #include #endif +#include #include /* Registers */ @@ -82,7 +82,7 @@ void *bpf_internal_load_pointer_neg_helper(const struct sk_buff *skb, int k, uns return NULL; } -struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) +struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flags) { gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags; struct bpf_prog_aux *aux; @@ -108,17 +108,45 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) return fp; } + +struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | gfp_extra_flags; + struct bpf_prog *prog; + int cpu; + + prog = bpf_prog_alloc_no_stats(size, gfp_extra_flags); + if (!prog) + return NULL; + + prog->aux->stats = alloc_percpu_gfp(struct bpf_prog_stats, gfp_flags); + if (!prog->aux->stats) { + kfree(prog->aux); + vfree(prog); + return NULL; + } + + for_each_possible_cpu(cpu) { + struct bpf_prog_stats *pstats; + + pstats = per_cpu_ptr(prog->aux->stats, cpu); + u64_stats_init(&pstats->syncp); + } + return prog; +} EXPORT_SYMBOL_GPL(bpf_prog_alloc); int bpf_prog_alloc_jited_linfo(struct bpf_prog *prog) { if (!prog->aux->nr_linfo || !prog->jit_requested) return 0; + prog->aux->jited_linfo = kcalloc(prog->aux->nr_linfo, sizeof(*prog->aux->jited_linfo), GFP_KERNEL | __GFP_NOWARN); if (!prog->aux->jited_linfo) return -ENOMEM; + return 0; } @@ -164,16 +192,21 @@ void bpf_prog_fill_jited_linfo(struct bpf_prog *prog, u32 linfo_idx, insn_start, insn_end, nr_linfo, i; const struct bpf_line_info *linfo; void **jited_linfo; + if (!prog->aux->jited_linfo) /* Userspace did not provide linfo */ return; + linfo_idx = prog->aux->linfo_idx; linfo = &prog->aux->linfo[linfo_idx]; insn_start = linfo[0].insn_off; insn_end = insn_start + prog->len; + jited_linfo = &prog->aux->jited_linfo[linfo_idx]; jited_linfo[0] = prog->bpf_func; + nr_linfo = prog->aux->nr_linfo - linfo_idx; + for (i = 1; i < nr_linfo && linfo[i].insn_off < insn_end; i++) /* The verifier ensures that linfo[i].insn_off is * strictly increasing @@ -228,7 +261,10 @@ struct bpf_prog *bpf_prog_realloc(struct bpf_prog *fp_old, unsigned int size, void __bpf_prog_free(struct bpf_prog *fp) { - kfree(fp->aux); + if (fp->aux) { + free_percpu(fp->aux->stats); + kfree(fp->aux); + } vfree(fp); } @@ -305,51 +341,75 @@ int bpf_prog_calc_tag(struct bpf_prog *fp) return 0; } -static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta, - const bool probe_pass) +static int bpf_adj_delta_to_imm(struct bpf_insn *insn, u32 pos, s32 end_old, + s32 end_new, s32 curr, const bool probe_pass) { - u32 i, insn_cnt = prog->len + (probe_pass ? delta : 0); + const s64 imm_min = S32_MIN, imm_max = S32_MAX; + s32 delta = end_new - end_old; + s64 imm = insn->imm; + + if (curr < pos && curr + imm + 1 >= end_old) + imm += delta; + else if (curr >= end_new && curr + imm + 1 < end_new) + imm -= delta; + if (imm < imm_min || imm > imm_max) + return -ERANGE; + if (!probe_pass) + insn->imm = imm; + return 0; +} + +static int bpf_adj_delta_to_off(struct bpf_insn *insn, u32 pos, s32 end_old, + s32 end_new, s32 curr, const bool probe_pass) +{ + const s32 off_min = S16_MIN, off_max = S16_MAX; + s32 delta = end_new - end_old; + s32 off = insn->off; + + if (curr < pos && curr + off + 1 >= end_old) + off += delta; + else if (curr >= end_new && curr + off + 1 < end_new) + off -= delta; + if (off < off_min || off > off_max) + return -ERANGE; + if (!probe_pass) + insn->off = off; + return 0; +} + +static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, s32 end_old, + s32 end_new, const bool probe_pass) +{ + u32 i, insn_cnt = prog->len + (probe_pass ? end_new - end_old : 0); struct bpf_insn *insn = prog->insnsi; int ret = 0; - bool pseudo_call; - u8 code; - int off; for (i = 0; i < insn_cnt; i++, insn++) { + u8 code; + /* In the probing pass we still operate on the original, * unpatched image in order to check overflows before we * do any other adjustments. Therefore skip the patchlet. */ if (probe_pass && i == pos) { - i += delta + 1; - insn++; + i = end_new; + insn = prog->insnsi + end_old; } - code = insn->code; - if (BPF_CLASS(code) != BPF_JMP) + if ((BPF_CLASS(code) != BPF_JMP && + BPF_CLASS(code) != BPF_JMP32) || + BPF_OP(code) == BPF_EXIT) continue; - if (BPF_OP(code) == BPF_EXIT) - continue; - if (BPF_OP(code) == BPF_CALL) { - if (insn->src_reg == BPF_PSEUDO_CALL) - pseudo_call = true; - else - continue; - } else { - pseudo_call = false; - } - off = pseudo_call ? insn->imm : insn->off; - /* Adjust offset of jmps if we cross patch boundaries. */ - if (i < pos && i + off + 1 > pos) - off += delta; - else if (i > pos + delta && i + off + 1 <= pos + delta) - off -= delta; - if (pseudo_call) - insn->imm = off; - else - insn->off = off; - + if (BPF_OP(code) == BPF_CALL) { + if (insn->src_reg != BPF_PSEUDO_CALL) + continue; + ret = bpf_adj_delta_to_imm(insn, pos, end_old, + end_new, i, probe_pass); + } else { + ret = bpf_adj_delta_to_off(insn, pos, end_old, + end_new, i, probe_pass); + } if (ret) break; } @@ -361,13 +421,17 @@ static void bpf_adj_linfo(struct bpf_prog *prog, u32 off, u32 delta) { struct bpf_line_info *linfo; u32 i, nr_linfo; + nr_linfo = prog->aux->nr_linfo; if (!nr_linfo || !delta) return; + linfo = prog->aux->linfo; + for (i = 0; i < nr_linfo; i++) if (off < linfo[i].insn_off) break; + /* Push all off < linfo[i].insn_off by delta */ for (; i < nr_linfo; i++) linfo[i].insn_off += delta; @@ -379,6 +443,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, u32 insn_adj_cnt, insn_rest, insn_delta = len - 1; const u32 cnt_max = S16_MAX; struct bpf_prog *prog_adj; + int err; /* Since our patchlet doesn't expand the image, we're done. */ if (insn_delta == 0) { @@ -394,8 +459,8 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, * we afterwards may not fail anymore. */ if (insn_adj_cnt > cnt_max && - bpf_adj_branches(prog, off, insn_delta, true)) - return NULL; + (err = bpf_adj_branches(prog, off, off + 1, off + len, true))) + return ERR_PTR(err); /* Several new instructions need to be inserted. Make room * for them. Likely, there's no need for a new allocation as @@ -404,7 +469,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, prog_adj = bpf_prog_realloc(prog, bpf_prog_size(insn_adj_cnt), GFP_USER); if (!prog_adj) - return NULL; + return ERR_PTR(-ENOMEM); prog_adj->len = insn_adj_cnt; @@ -426,13 +491,43 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off, * the ship has sailed to reverse to the original state. An * overflow cannot happen at this point. */ - BUG_ON(bpf_adj_branches(prog_adj, off, insn_delta, false)); + BUG_ON(bpf_adj_branches(prog_adj, off, off + 1, off + len, false)); bpf_adj_linfo(prog_adj, off, insn_delta); return prog_adj; } +int bpf_remove_insns(struct bpf_prog *prog, u32 off, u32 cnt) +{ + int err; + + /* Branch offsets can't overflow when program is shrinking, no need + * to call bpf_adj_branches(..., true) here + */ + memmove(prog->insnsi + off, prog->insnsi + off + cnt, + sizeof(struct bpf_insn) * (prog->len - off - cnt)); + prog->len -= cnt; + + err = bpf_adj_branches(prog, off, off + cnt, off, false); + WARN_ON_ONCE(err); + return err; +} + +static void bpf_prog_kallsyms_del_subprogs(struct bpf_prog *fp) +{ + int i; + + for (i = 0; i < fp->aux->func_cnt; i++) + bpf_prog_kallsyms_del(fp->aux->func[i]); +} + +void bpf_prog_kallsyms_del_all(struct bpf_prog *fp) +{ + bpf_prog_kallsyms_del_subprogs(fp); + bpf_prog_kallsyms_del(fp); +} + #ifdef CONFIG_BPF_JIT /* All BPF JIT sysctl knobs here. */ int bpf_jit_enable __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON); @@ -455,7 +550,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog, *symbol_end = addr + hdr->pages * PAGE_SIZE; } -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) +void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) { const char *end = sym + KSYM_NAME_LEN; const struct btf_type *type; @@ -476,7 +571,7 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) sym = bin2hex(sym, prog->tag, sizeof(prog->tag)); /* prog->aux->name will be ignored if full btf name is available */ - if (prog->aux->btf) { + if (prog->aux->func_info_cnt) { type = btf_type_by_id(prog->aux->btf, prog->aux->func_info[prog->aux->func_idx].type_id); func_name = btf_name_by_offset(prog->aux->btf, type->name_off); @@ -633,7 +728,6 @@ bool is_bpf_text_address(unsigned long addr) int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type, char *sym) { - unsigned long symbol_start, symbol_end; struct bpf_prog_aux *aux; unsigned int it = 0; int ret = -ERANGE; @@ -646,10 +740,9 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type, if (it++ != symnum) continue; - bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end); bpf_get_prog_name(aux->prog, sym); - *value = symbol_start; + *value = (unsigned long)aux->prog->bpf_func; *type = BPF_SYM_ELF_TYPE; ret = 0; @@ -718,7 +811,7 @@ bool __weak arch_bpf_jit_check_func(const struct bpf_prog *prog) { return true; } -EXPORT_SYMBOL(arch_bpf_jit_check_func); +EXPORT_SYMBOL_GPL(arch_bpf_jit_check_func); #endif struct bpf_binary_header * @@ -749,7 +842,6 @@ bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr, bpf_jit_set_header_magic(hdr); hdr->pages = pages; - hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)), PAGE_SIZE - sizeof(*hdr)); start = (get_random_int() % hole) & ~(alignment - 1); @@ -780,7 +872,6 @@ void __weak bpf_jit_free(struct bpf_prog *fp) if (fp->jited) { struct bpf_binary_header *hdr = bpf_jit_binary_hdr(fp); - bpf_jit_binary_unlock_ro(hdr); bpf_jit_binary_free(hdr); WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(fp)); @@ -789,9 +880,44 @@ void __weak bpf_jit_free(struct bpf_prog *fp) bpf_prog_unlock_free(fp); } +int bpf_jit_get_func_addr(const struct bpf_prog *prog, + const struct bpf_insn *insn, bool extra_pass, + u64 *func_addr, bool *func_addr_fixed) +{ + s16 off = insn->off; + s32 imm = insn->imm; + u8 *addr; + + *func_addr_fixed = insn->src_reg != BPF_PSEUDO_CALL; + if (!*func_addr_fixed) { + /* Place-holder address till the last pass has collected + * all addresses for JITed subprograms in which case we + * can pick them up from prog->aux. + */ + if (!extra_pass) + addr = NULL; + else if (prog->aux->func && + off >= 0 && off < prog->aux->func_cnt) + addr = (u8 *)prog->aux->func[off]->bpf_func; + else + return -EINVAL; + } else { + /* Address of a BPF helper call. Since part of the core + * kernel, it's always at a fixed location. __bpf_call_base + * and the helper with imm relative to it are both in core + * kernel. + */ + addr = (u8 *)__bpf_call_base + imm; + } + + *func_addr = (unsigned long)addr; + return 0; +} + static int bpf_jit_blind_insn(const struct bpf_insn *from, const struct bpf_insn *aux, - struct bpf_insn *to_buff) + struct bpf_insn *to_buff, + bool emit_zext) { struct bpf_insn *to = to_buff; u32 imm_rnd = get_random_int(); @@ -810,6 +936,9 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from, * below. * * Constant blinding is only used by JITs, not in the interpreter. + * The interpreter uses AX in some occasions as a local temporary + * register e.g. in DIV or MOD instructions. + * * In restricted circumstances, the verifier can also use the AX * register for rewrites as long as they do not interfere with * the above cases! @@ -873,21 +1002,25 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from, *to++ = BPF_JMP_REG(from->code, from->dst_reg, BPF_REG_AX, off); break; - case BPF_LD | BPF_ABS | BPF_W: - case BPF_LD | BPF_ABS | BPF_H: - case BPF_LD | BPF_ABS | BPF_B: - *to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm); - *to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd); - *to++ = BPF_LD_IND(from->code, BPF_REG_AX, 0); - break; - - case BPF_LD | BPF_IND | BPF_W: - case BPF_LD | BPF_IND | BPF_H: - case BPF_LD | BPF_IND | BPF_B: - *to++ = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm); - *to++ = BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, imm_rnd); - *to++ = BPF_ALU32_REG(BPF_ADD, BPF_REG_AX, from->src_reg); - *to++ = BPF_LD_IND(from->code, BPF_REG_AX, 0); + case BPF_JMP32 | BPF_JEQ | BPF_K: + case BPF_JMP32 | BPF_JNE | BPF_K: + case BPF_JMP32 | BPF_JGT | BPF_K: + case BPF_JMP32 | BPF_JLT | BPF_K: + case BPF_JMP32 | BPF_JGE | BPF_K: + case BPF_JMP32 | BPF_JLE | BPF_K: + case BPF_JMP32 | BPF_JSGT | BPF_K: + case BPF_JMP32 | BPF_JSLT | BPF_K: + case BPF_JMP32 | BPF_JSGE | BPF_K: + case BPF_JMP32 | BPF_JSLE | BPF_K: + case BPF_JMP32 | BPF_JSET | BPF_K: + /* Accommodate for extra offset in case of a backjump. */ + off = from->off; + if (off < 0) + off -= 2; + *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm); + *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd); + *to++ = BPF_JMP32_REG(from->code, from->dst_reg, BPF_REG_AX, + off); break; case BPF_LD | BPF_IMM | BPF_DW: @@ -899,6 +1032,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from, case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */ *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ aux[0].imm); *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd); + if (emit_zext) + *to++ = BPF_ZEXT_REG(BPF_REG_AX); *to++ = BPF_ALU64_REG(BPF_OR, aux[0].dst_reg, BPF_REG_AX); break; @@ -982,18 +1117,19 @@ struct bpf_prog *bpf_jit_blind_constants(struct bpf_prog *prog) insn[1].code == 0) memcpy(aux, insn, sizeof(aux)); - rewritten = bpf_jit_blind_insn(insn, aux, insn_buff); + rewritten = bpf_jit_blind_insn(insn, aux, insn_buff, + clone->aux->verifier_zext); if (!rewritten) continue; tmp = bpf_patch_insn_single(clone, i, insn_buff, rewritten); - if (!tmp) { + if (IS_ERR(tmp)) { /* Patching may have repointed aux->prog during * realloc from the original one, so we need to * fix it up here on error. */ bpf_jit_prog_release_other(prog, clone); - return ERR_PTR(-ENOMEM); + return tmp; } clone = tmp; @@ -1022,129 +1158,189 @@ noinline u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) } EXPORT_SYMBOL_GPL(__bpf_call_base); +/* All UAPI available opcodes. */ +#define BPF_INSN_MAP(INSN_2, INSN_3) \ + /* 32 bit ALU operations. */ \ + /* Register based. */ \ + INSN_3(ALU, ADD, X), \ + INSN_3(ALU, SUB, X), \ + INSN_3(ALU, AND, X), \ + INSN_3(ALU, OR, X), \ + INSN_3(ALU, LSH, X), \ + INSN_3(ALU, RSH, X), \ + INSN_3(ALU, XOR, X), \ + INSN_3(ALU, MUL, X), \ + INSN_3(ALU, MOV, X), \ + INSN_3(ALU, ARSH, X), \ + INSN_3(ALU, DIV, X), \ + INSN_3(ALU, MOD, X), \ + INSN_2(ALU, NEG), \ + INSN_3(ALU, END, TO_BE), \ + INSN_3(ALU, END, TO_LE), \ + /* Immediate based. */ \ + INSN_3(ALU, ADD, K), \ + INSN_3(ALU, SUB, K), \ + INSN_3(ALU, AND, K), \ + INSN_3(ALU, OR, K), \ + INSN_3(ALU, LSH, K), \ + INSN_3(ALU, RSH, K), \ + INSN_3(ALU, XOR, K), \ + INSN_3(ALU, MUL, K), \ + INSN_3(ALU, MOV, K), \ + INSN_3(ALU, ARSH, K), \ + INSN_3(ALU, DIV, K), \ + INSN_3(ALU, MOD, K), \ + /* 64 bit ALU operations. */ \ + /* Register based. */ \ + INSN_3(ALU64, ADD, X), \ + INSN_3(ALU64, SUB, X), \ + INSN_3(ALU64, AND, X), \ + INSN_3(ALU64, OR, X), \ + INSN_3(ALU64, LSH, X), \ + INSN_3(ALU64, RSH, X), \ + INSN_3(ALU64, XOR, X), \ + INSN_3(ALU64, MUL, X), \ + INSN_3(ALU64, MOV, X), \ + INSN_3(ALU64, ARSH, X), \ + INSN_3(ALU64, DIV, X), \ + INSN_3(ALU64, MOD, X), \ + INSN_2(ALU64, NEG), \ + /* Immediate based. */ \ + INSN_3(ALU64, ADD, K), \ + INSN_3(ALU64, SUB, K), \ + INSN_3(ALU64, AND, K), \ + INSN_3(ALU64, OR, K), \ + INSN_3(ALU64, LSH, K), \ + INSN_3(ALU64, RSH, K), \ + INSN_3(ALU64, XOR, K), \ + INSN_3(ALU64, MUL, K), \ + INSN_3(ALU64, MOV, K), \ + INSN_3(ALU64, ARSH, K), \ + INSN_3(ALU64, DIV, K), \ + INSN_3(ALU64, MOD, K), \ + /* Call instruction. */ \ + INSN_2(JMP, CALL), \ + /* Exit instruction. */ \ + INSN_2(JMP, EXIT), \ + /* 32-bit Jump instructions. */ \ + /* Register based. */ \ + INSN_3(JMP32, JEQ, X), \ + INSN_3(JMP32, JNE, X), \ + INSN_3(JMP32, JGT, X), \ + INSN_3(JMP32, JLT, X), \ + INSN_3(JMP32, JGE, X), \ + INSN_3(JMP32, JLE, X), \ + INSN_3(JMP32, JSGT, X), \ + INSN_3(JMP32, JSLT, X), \ + INSN_3(JMP32, JSGE, X), \ + INSN_3(JMP32, JSLE, X), \ + INSN_3(JMP32, JSET, X), \ + /* Immediate based. */ \ + INSN_3(JMP32, JEQ, K), \ + INSN_3(JMP32, JNE, K), \ + INSN_3(JMP32, JGT, K), \ + INSN_3(JMP32, JLT, K), \ + INSN_3(JMP32, JGE, K), \ + INSN_3(JMP32, JLE, K), \ + INSN_3(JMP32, JSGT, K), \ + INSN_3(JMP32, JSLT, K), \ + INSN_3(JMP32, JSGE, K), \ + INSN_3(JMP32, JSLE, K), \ + INSN_3(JMP32, JSET, K), \ + /* Jump instructions. */ \ + /* Register based. */ \ + INSN_3(JMP, JEQ, X), \ + INSN_3(JMP, JNE, X), \ + INSN_3(JMP, JGT, X), \ + INSN_3(JMP, JLT, X), \ + INSN_3(JMP, JGE, X), \ + INSN_3(JMP, JLE, X), \ + INSN_3(JMP, JSGT, X), \ + INSN_3(JMP, JSLT, X), \ + INSN_3(JMP, JSGE, X), \ + INSN_3(JMP, JSLE, X), \ + INSN_3(JMP, JSET, X), \ + /* Immediate based. */ \ + INSN_3(JMP, JEQ, K), \ + INSN_3(JMP, JNE, K), \ + INSN_3(JMP, JGT, K), \ + INSN_3(JMP, JLT, K), \ + INSN_3(JMP, JGE, K), \ + INSN_3(JMP, JLE, K), \ + INSN_3(JMP, JSGT, K), \ + INSN_3(JMP, JSLT, K), \ + INSN_3(JMP, JSGE, K), \ + INSN_3(JMP, JSLE, K), \ + INSN_3(JMP, JSET, K), \ + INSN_2(JMP, JA), \ + /* Store instructions. */ \ + /* Register based. */ \ + INSN_3(STX, MEM, B), \ + INSN_3(STX, MEM, H), \ + INSN_3(STX, MEM, W), \ + INSN_3(STX, MEM, DW), \ + INSN_3(STX, XADD, W), \ + INSN_3(STX, XADD, DW), \ + /* Immediate based. */ \ + INSN_3(ST, MEM, B), \ + INSN_3(ST, MEM, H), \ + INSN_3(ST, MEM, W), \ + INSN_3(ST, MEM, DW), \ + /* Load instructions. */ \ + /* Register based. */ \ + INSN_3(LDX, MEM, B), \ + INSN_3(LDX, MEM, H), \ + INSN_3(LDX, MEM, W), \ + INSN_3(LDX, MEM, DW), \ + /* Immediate based. */ \ + INSN_3(LD, IMM, DW) + +bool bpf_opcode_in_insntable(u8 code) +{ +#define BPF_INSN_2_TBL(x, y) [BPF_##x | BPF_##y] = true +#define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true + static const bool public_insntable[256] = { + [0 ... 255] = false, + /* Now overwrite non-defaults ... */ + BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL), + /* UAPI exposed, but rewritten opcodes. cBPF carry-over. */ + [BPF_LD | BPF_ABS | BPF_B] = true, + [BPF_LD | BPF_ABS | BPF_H] = true, + [BPF_LD | BPF_ABS | BPF_W] = true, + [BPF_LD | BPF_IND | BPF_B] = true, + [BPF_LD | BPF_IND | BPF_H] = true, + [BPF_LD | BPF_IND | BPF_W] = true, + }; +#undef BPF_INSN_3_TBL +#undef BPF_INSN_2_TBL + return public_insntable[code]; +} + #ifndef CONFIG_BPF_JIT_ALWAYS_ON /** * __bpf_prog_run - run eBPF program on a given context - * @ctx: is the data we are operating on + * @regs: is the array of MAX_BPF_EXT_REG eBPF pseudo-registers * @insn: is the array of eBPF instructions + * @stack: is the eBPF storage stack * * Decode and execute eBPF instructions. */ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack) { - u64 tmp; - static const void *jumptable[256] = { +#define BPF_INSN_2_LBL(x, y) [BPF_##x | BPF_##y] = &&x##_##y +#define BPF_INSN_3_LBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z + static const void * const jumptable[256] __annotate_jump_table = { [0 ... 255] = &&default_label, /* Now overwrite non-defaults ... */ - /* 32 bit ALU operations */ - [BPF_ALU | BPF_ADD | BPF_X] = &&ALU_ADD_X, - [BPF_ALU | BPF_ADD | BPF_K] = &&ALU_ADD_K, - [BPF_ALU | BPF_SUB | BPF_X] = &&ALU_SUB_X, - [BPF_ALU | BPF_SUB | BPF_K] = &&ALU_SUB_K, - [BPF_ALU | BPF_AND | BPF_X] = &&ALU_AND_X, - [BPF_ALU | BPF_AND | BPF_K] = &&ALU_AND_K, - [BPF_ALU | BPF_OR | BPF_X] = &&ALU_OR_X, - [BPF_ALU | BPF_OR | BPF_K] = &&ALU_OR_K, - [BPF_ALU | BPF_LSH | BPF_X] = &&ALU_LSH_X, - [BPF_ALU | BPF_LSH | BPF_K] = &&ALU_LSH_K, - [BPF_ALU | BPF_RSH | BPF_X] = &&ALU_RSH_X, - [BPF_ALU | BPF_RSH | BPF_K] = &&ALU_RSH_K, - [BPF_ALU | BPF_XOR | BPF_X] = &&ALU_XOR_X, - [BPF_ALU | BPF_XOR | BPF_K] = &&ALU_XOR_K, - [BPF_ALU | BPF_MUL | BPF_X] = &&ALU_MUL_X, - [BPF_ALU | BPF_MUL | BPF_K] = &&ALU_MUL_K, - [BPF_ALU | BPF_MOV | BPF_X] = &&ALU_MOV_X, - [BPF_ALU | BPF_MOV | BPF_K] = &&ALU_MOV_K, - [BPF_ALU | BPF_DIV | BPF_X] = &&ALU_DIV_X, - [BPF_ALU | BPF_DIV | BPF_K] = &&ALU_DIV_K, - [BPF_ALU | BPF_MOD | BPF_X] = &&ALU_MOD_X, - [BPF_ALU | BPF_MOD | BPF_K] = &&ALU_MOD_K, - [BPF_ALU | BPF_NEG] = &&ALU_NEG, - [BPF_ALU | BPF_END | BPF_TO_BE] = &&ALU_END_TO_BE, - [BPF_ALU | BPF_END | BPF_TO_LE] = &&ALU_END_TO_LE, - /* 64 bit ALU operations */ - [BPF_ALU64 | BPF_ADD | BPF_X] = &&ALU64_ADD_X, - [BPF_ALU64 | BPF_ADD | BPF_K] = &&ALU64_ADD_K, - [BPF_ALU64 | BPF_SUB | BPF_X] = &&ALU64_SUB_X, - [BPF_ALU64 | BPF_SUB | BPF_K] = &&ALU64_SUB_K, - [BPF_ALU64 | BPF_AND | BPF_X] = &&ALU64_AND_X, - [BPF_ALU64 | BPF_AND | BPF_K] = &&ALU64_AND_K, - [BPF_ALU64 | BPF_OR | BPF_X] = &&ALU64_OR_X, - [BPF_ALU64 | BPF_OR | BPF_K] = &&ALU64_OR_K, - [BPF_ALU64 | BPF_LSH | BPF_X] = &&ALU64_LSH_X, - [BPF_ALU64 | BPF_LSH | BPF_K] = &&ALU64_LSH_K, - [BPF_ALU64 | BPF_RSH | BPF_X] = &&ALU64_RSH_X, - [BPF_ALU64 | BPF_RSH | BPF_K] = &&ALU64_RSH_K, - [BPF_ALU64 | BPF_XOR | BPF_X] = &&ALU64_XOR_X, - [BPF_ALU64 | BPF_XOR | BPF_K] = &&ALU64_XOR_K, - [BPF_ALU64 | BPF_MUL | BPF_X] = &&ALU64_MUL_X, - [BPF_ALU64 | BPF_MUL | BPF_K] = &&ALU64_MUL_K, - [BPF_ALU64 | BPF_MOV | BPF_X] = &&ALU64_MOV_X, - [BPF_ALU64 | BPF_MOV | BPF_K] = &&ALU64_MOV_K, - [BPF_ALU64 | BPF_ARSH | BPF_X] = &&ALU64_ARSH_X, - [BPF_ALU64 | BPF_ARSH | BPF_K] = &&ALU64_ARSH_K, - [BPF_ALU64 | BPF_DIV | BPF_X] = &&ALU64_DIV_X, - [BPF_ALU64 | BPF_DIV | BPF_K] = &&ALU64_DIV_K, - [BPF_ALU64 | BPF_MOD | BPF_X] = &&ALU64_MOD_X, - [BPF_ALU64 | BPF_MOD | BPF_K] = &&ALU64_MOD_K, - [BPF_ALU64 | BPF_NEG] = &&ALU64_NEG, - /* Call instruction */ - [BPF_JMP | BPF_CALL] = &&JMP_CALL, + BPF_INSN_MAP(BPF_INSN_2_LBL, BPF_INSN_3_LBL), + /* Non-UAPI available opcodes. */ [BPF_JMP | BPF_CALL_ARGS] = &&JMP_CALL_ARGS, [BPF_JMP | BPF_TAIL_CALL] = &&JMP_TAIL_CALL, - /* Jumps */ - [BPF_JMP | BPF_JA] = &&JMP_JA, - [BPF_JMP | BPF_JEQ | BPF_X] = &&JMP_JEQ_X, - [BPF_JMP | BPF_JEQ | BPF_K] = &&JMP_JEQ_K, - [BPF_JMP | BPF_JNE | BPF_X] = &&JMP_JNE_X, - [BPF_JMP | BPF_JNE | BPF_K] = &&JMP_JNE_K, - [BPF_JMP | BPF_JGT | BPF_X] = &&JMP_JGT_X, - [BPF_JMP | BPF_JGT | BPF_K] = &&JMP_JGT_K, - [BPF_JMP | BPF_JLT | BPF_X] = &&JMP_JLT_X, - [BPF_JMP | BPF_JLT | BPF_K] = &&JMP_JLT_K, - [BPF_JMP | BPF_JGE | BPF_X] = &&JMP_JGE_X, - [BPF_JMP | BPF_JGE | BPF_K] = &&JMP_JGE_K, - [BPF_JMP | BPF_JLE | BPF_X] = &&JMP_JLE_X, - [BPF_JMP | BPF_JLE | BPF_K] = &&JMP_JLE_K, - [BPF_JMP | BPF_JSGT | BPF_X] = &&JMP_JSGT_X, - [BPF_JMP | BPF_JSGT | BPF_K] = &&JMP_JSGT_K, - [BPF_JMP | BPF_JSLT | BPF_X] = &&JMP_JSLT_X, - [BPF_JMP | BPF_JSLT | BPF_K] = &&JMP_JSLT_K, - [BPF_JMP | BPF_JSGE | BPF_X] = &&JMP_JSGE_X, - [BPF_JMP | BPF_JSGE | BPF_K] = &&JMP_JSGE_K, - [BPF_JMP | BPF_JSLE | BPF_X] = &&JMP_JSLE_X, - [BPF_JMP | BPF_JSLE | BPF_K] = &&JMP_JSLE_K, - [BPF_JMP | BPF_JSET | BPF_X] = &&JMP_JSET_X, - [BPF_JMP | BPF_JSET | BPF_K] = &&JMP_JSET_K, - /* Program return */ - [BPF_JMP | BPF_EXIT] = &&JMP_EXIT, - /* Store instructions */ - [BPF_STX | BPF_MEM | BPF_B] = &&STX_MEM_B, - [BPF_STX | BPF_MEM | BPF_H] = &&STX_MEM_H, - [BPF_STX | BPF_MEM | BPF_W] = &&STX_MEM_W, - [BPF_STX | BPF_MEM | BPF_DW] = &&STX_MEM_DW, - [BPF_STX | BPF_XADD | BPF_W] = &&STX_XADD_W, - [BPF_STX | BPF_XADD | BPF_DW] = &&STX_XADD_DW, - [BPF_ST | BPF_MEM | BPF_B] = &&ST_MEM_B, - [BPF_ST | BPF_MEM | BPF_H] = &&ST_MEM_H, - [BPF_ST | BPF_MEM | BPF_W] = &&ST_MEM_W, - [BPF_ST | BPF_MEM | BPF_DW] = &&ST_MEM_DW, - /* Load instructions */ - [BPF_LDX | BPF_MEM | BPF_B] = &&LDX_MEM_B, - [BPF_LDX | BPF_MEM | BPF_H] = &&LDX_MEM_H, - [BPF_LDX | BPF_MEM | BPF_W] = &&LDX_MEM_W, - [BPF_LDX | BPF_MEM | BPF_DW] = &&LDX_MEM_DW, - [BPF_LD | BPF_ABS | BPF_W] = &&LD_ABS_W, - [BPF_LD | BPF_ABS | BPF_H] = &&LD_ABS_H, - [BPF_LD | BPF_ABS | BPF_B] = &&LD_ABS_B, - [BPF_LD | BPF_IND | BPF_W] = &&LD_IND_W, - [BPF_LD | BPF_IND | BPF_H] = &&LD_IND_H, - [BPF_LD | BPF_IND | BPF_B] = &&LD_IND_B, - [BPF_LD | BPF_IMM | BPF_DW] = &&LD_IMM_DW, + [BPF_ST | BPF_NOSPEC] = &&ST_NOSPEC, }; +#undef BPF_INSN_3_LBL +#undef BPF_INSN_2_LBL u32 tail_call_cnt = 0; - void *ptr; - int off; #define CONT ({ insn++; goto select_insn; }) #define CONT_JMP ({ insn++; goto select_insn; }) @@ -1152,29 +1348,54 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack) select_insn: goto *jumptable[insn->code]; - /* ALU */ -#define ALU(OPCODE, OP) \ - ALU64_##OPCODE##_X: \ - DST = DST OP SRC; \ - CONT; \ - ALU_##OPCODE##_X: \ - DST = (u32) DST OP (u32) SRC; \ - CONT; \ - ALU64_##OPCODE##_K: \ - DST = DST OP IMM; \ - CONT; \ - ALU_##OPCODE##_K: \ - DST = (u32) DST OP (u32) IMM; \ + /* Explicitly mask the register-based shift amounts with 63 or 31 + * to avoid undefined behavior. Normally this won't affect the + * generated code, for example, in case of native 64 bit archs such + * as x86-64 or arm64, the compiler is optimizing the AND away for + * the interpreter. In case of JITs, each of the JIT backends compiles + * the BPF shift operations to machine instructions which produce + * implementation-defined results in such a case; the resulting + * contents of the register may be arbitrary, but program behaviour + * as a whole remains defined. In other words, in case of JIT backends, + * the AND must /not/ be added to the emitted LSH/RSH/ARSH translation. + */ + /* ALU (shifts) */ +#define SHT(OPCODE, OP) \ + ALU64_##OPCODE##_X: \ + DST = DST OP (SRC & 63); \ + CONT; \ + ALU_##OPCODE##_X: \ + DST = (u32) DST OP ((u32) SRC & 31); \ + CONT; \ + ALU64_##OPCODE##_K: \ + DST = DST OP IMM; \ + CONT; \ + ALU_##OPCODE##_K: \ + DST = (u32) DST OP (u32) IMM; \ + CONT; + /* ALU (rest) */ +#define ALU(OPCODE, OP) \ + ALU64_##OPCODE##_X: \ + DST = DST OP SRC; \ + CONT; \ + ALU_##OPCODE##_X: \ + DST = (u32) DST OP (u32) SRC; \ + CONT; \ + ALU64_##OPCODE##_K: \ + DST = DST OP IMM; \ + CONT; \ + ALU_##OPCODE##_K: \ + DST = (u32) DST OP (u32) IMM; \ CONT; - ALU(ADD, +) ALU(SUB, -) ALU(AND, &) ALU(OR, |) - ALU(LSH, <<) - ALU(RSH, >>) ALU(XOR, ^) ALU(MUL, *) + SHT(LSH, <<) + SHT(RSH, >>) +#undef SHT #undef ALU ALU_NEG: DST = (u32) -DST; @@ -1198,43 +1419,49 @@ select_insn: DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32; insn++; CONT; + ALU_ARSH_X: + DST = (u64) (u32) (((s32) DST) >> (SRC & 31)); + CONT; + ALU_ARSH_K: + DST = (u64) (u32) (((s32) DST) >> IMM); + CONT; ALU64_ARSH_X: - (*(s64 *) &DST) >>= SRC; + (*(s64 *) &DST) >>= (SRC & 63); CONT; ALU64_ARSH_K: (*(s64 *) &DST) >>= IMM; CONT; ALU64_MOD_X: - div64_u64_rem(DST, SRC, &tmp); - DST = tmp; + div64_u64_rem(DST, SRC, &AX); + DST = AX; CONT; ALU_MOD_X: - tmp = (u32) DST; - DST = do_div(tmp, (u32) SRC); + AX = (u32) DST; + DST = do_div(AX, (u32) SRC); CONT; ALU64_MOD_K: - div64_u64_rem(DST, IMM, &tmp); - DST = tmp; + div64_u64_rem(DST, IMM, &AX); + DST = AX; CONT; ALU_MOD_K: - tmp = (u32) DST; - DST = do_div(tmp, (u32) IMM); + AX = (u32) DST; + DST = do_div(AX, (u32) IMM); CONT; ALU64_DIV_X: DST = div64_u64(DST, SRC); CONT; ALU_DIV_X: - tmp = (u32) DST; - do_div(tmp, (u32) SRC); - DST = (u32) tmp; + AX = (u32) DST; + do_div(AX, (u32) SRC); + DST = (u32) AX; CONT; ALU64_DIV_K: DST = div64_u64(DST, IMM); CONT; ALU_DIV_K: - tmp = (u32) DST; - do_div(tmp, (u32) IMM); - DST = (u32) tmp; + AX = (u32) DST; + do_div(AX, (u32) IMM); + DST = (u32) AX; CONT; ALU_END_TO_BE: switch (IMM) { @@ -1307,146 +1534,62 @@ select_insn: out: CONT; } - /* JMP */ JMP_JA: insn += insn->off; CONT; - JMP_JEQ_X: - if (DST == SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JEQ_K: - if (DST == IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JNE_X: - if (DST != SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JNE_K: - if (DST != IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JGT_X: - if (DST > SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JGT_K: - if (DST > IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JLT_X: - if (DST < SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JLT_K: - if (DST < IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JGE_X: - if (DST >= SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JGE_K: - if (DST >= IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JLE_X: - if (DST <= SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JLE_K: - if (DST <= IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSGT_X: - if (((s64) DST) > ((s64) SRC)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSGT_K: - if (((s64) DST) > ((s64) IMM)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSLT_X: - if (((s64) DST) < ((s64) SRC)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSLT_K: - if (((s64) DST) < ((s64) IMM)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSGE_X: - if (((s64) DST) >= ((s64) SRC)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSGE_K: - if (((s64) DST) >= ((s64) IMM)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSLE_X: - if (((s64) DST) <= ((s64) SRC)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSLE_K: - if (((s64) DST) <= ((s64) IMM)) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSET_X: - if (DST & SRC) { - insn += insn->off; - CONT_JMP; - } - CONT; - JMP_JSET_K: - if (DST & IMM) { - insn += insn->off; - CONT_JMP; - } - CONT; JMP_EXIT: return BPF_R0; - - /* STX and ST and LDX*/ + /* JMP */ +#define COND_JMP(SIGN, OPCODE, CMP_OP) \ + JMP_##OPCODE##_X: \ + if ((SIGN##64) DST CMP_OP (SIGN##64) SRC) { \ + insn += insn->off; \ + CONT_JMP; \ + } \ + CONT; \ + JMP32_##OPCODE##_X: \ + if ((SIGN##32) DST CMP_OP (SIGN##32) SRC) { \ + insn += insn->off; \ + CONT_JMP; \ + } \ + CONT; \ + JMP_##OPCODE##_K: \ + if ((SIGN##64) DST CMP_OP (SIGN##64) IMM) { \ + insn += insn->off; \ + CONT_JMP; \ + } \ + CONT; \ + JMP32_##OPCODE##_K: \ + if ((SIGN##32) DST CMP_OP (SIGN##32) IMM) { \ + insn += insn->off; \ + CONT_JMP; \ + } \ + CONT; + COND_JMP(u, JEQ, ==) + COND_JMP(u, JNE, !=) + COND_JMP(u, JGT, >) + COND_JMP(u, JLT, <) + COND_JMP(u, JGE, >=) + COND_JMP(u, JLE, <=) + COND_JMP(u, JSET, &) + COND_JMP(s, JSGT, >) + COND_JMP(s, JSLT, <) + COND_JMP(s, JSGE, >=) + COND_JMP(s, JSLE, <=) +#undef COND_JMP + /* ST, STX and LDX*/ + ST_NOSPEC: + /* Speculation barrier for mitigating Speculative Store Bypass. + * In case of arm64, we rely on the firmware mitigation as + * controlled via the ssbd kernel parameter. Whenever the + * mitigation is enabled, it works for all of the kernel code + * with no need to provide any additional instructions here. + * In case of x86, we use 'lfence' insn for mitigation. We + * reuse preexisting logic from Spectre v1 mitigation that + * happens to produce the required code on x86 for v4 as well. + */ + barrier_nospec(); + CONT; #define LDST(SIZEOP, SIZE) \ STX_MEM_##SIZEOP: \ *(SIZE *)(unsigned long) (DST + insn->off) = SRC; \ @@ -1471,74 +1614,18 @@ out: atomic64_add((u64) SRC, (atomic64_t *)(unsigned long) (DST + insn->off)); CONT; - LD_ABS_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + imm32)) */ - off = IMM; -load_word: - /* BPF_LD + BPD_ABS and BPF_LD + BPF_IND insns are only - * appearing in the programs where ctx == skb - * (see may_access_skb() in the verifier). All programs - * keep 'ctx' in regs[BPF_REG_CTX] == BPF_R6, - * bpf_convert_filter() saves it in BPF_R6, internal BPF - * verifier will check that BPF_R6 == ctx. - * - * BPF_ABS and BPF_IND are wrappers of function calls, - * so they scratch BPF_R1-BPF_R5 registers, preserve - * BPF_R6-BPF_R9, and store return value into BPF_R0. - * - * Implicit input: - * ctx == skb == BPF_R6 == CTX - * - * Explicit input: - * SRC == any register - * IMM == 32-bit immediate - * - * Output: - * BPF_R0 - 8/16/32-bit skb data converted to cpu endianness - */ - - ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 4, &tmp); - if (likely(ptr != NULL)) { - BPF_R0 = get_unaligned_be32(ptr); - CONT; - } - - return 0; - LD_ABS_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + imm32)) */ - off = IMM; -load_half: - ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 2, &tmp); - if (likely(ptr != NULL)) { - BPF_R0 = get_unaligned_be16(ptr); - CONT; - } - - return 0; - LD_ABS_B: /* BPF_R0 = *(u8 *) (skb->data + imm32) */ - off = IMM; -load_byte: - ptr = bpf_load_pointer((struct sk_buff *) (unsigned long) CTX, off, 1, &tmp); - if (likely(ptr != NULL)) { - BPF_R0 = *(u8 *)ptr; - CONT; - } - - return 0; - LD_IND_W: /* BPF_R0 = ntohl(*(u32 *) (skb->data + src_reg + imm32)) */ - off = IMM + SRC; - goto load_word; - LD_IND_H: /* BPF_R0 = ntohs(*(u16 *) (skb->data + src_reg + imm32)) */ - off = IMM + SRC; - goto load_half; - LD_IND_B: /* BPF_R0 = *(u8 *) (skb->data + src_reg + imm32) */ - off = IMM + SRC; - goto load_byte; default_label: - /* If we ever reach this, we have a bug somewhere. */ - WARN_RATELIMIT(1, "unknown opcode %02x\n", insn->code); + /* If we ever reach this, we have a bug somewhere. Die hard here + * instead of just returning 0; we could be somewhere in a subprog, + * so execution could continue otherwise which we do /not/ want. + * + * Note, verifier whitelists all opcodes in bpf_opcode_in_insntable(). + */ + pr_warn("BPF interpreter: unknown opcode %02x\n", insn->code); + BUG_ON(1); return 0; } -STACK_FRAME_NON_STANDARD(___bpf_prog_run); /* jump table */ #define PROG_NAME(stack_size) __bpf_prog_run##stack_size #define DEFINE_BPF_PROG_RUN(stack_size) \ @@ -1558,7 +1645,7 @@ static u64 PROG_NAME_ARGS(stack_size)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5, \ const struct bpf_insn *insn) \ { \ u64 stack[stack_size / sizeof(u64)]; \ - u64 regs[MAX_BPF_REG]; \ + u64 regs[MAX_BPF_EXT_REG]; \ \ FP = (u64) (unsigned long) &stack[ARRAY_SIZE(stack)]; \ BPF_R1 = r1; \ @@ -1591,7 +1678,7 @@ static unsigned int (*interpreters[])(const void *ctx, EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192) EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384) EVAL4(PROG_NAME_LIST, 416, 448, 480, 512) - +}; #undef PROG_NAME_LIST #define PROG_NAME_LIST(stack_size) PROG_NAME_ARGS(stack_size), static u64 (*interpreters_args[])(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5, @@ -1601,6 +1688,7 @@ EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384) EVAL4(PROG_NAME_LIST, 416, 448, 480, 512) }; #undef PROG_NAME_LIST + void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth) { stack_depth = max_t(u32, stack_depth, 1); @@ -1610,8 +1698,6 @@ void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth) insn->code = BPF_JMP | BPF_CALL_ARGS; } -}; - #else static unsigned int __bpf_prog_ret0_warn(const void *ctx, const struct bpf_insn *insn) @@ -1627,6 +1713,9 @@ static unsigned int __bpf_prog_ret0_warn(const void *ctx, bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp) { + if (fp->kprobe_override) + return false; + if (!array->owner_prog_type) { /* There's no owner yet where we could check for * compatibility. @@ -1665,6 +1754,7 @@ static void bpf_prog_select_func(struct bpf_prog *fp) { #ifndef CONFIG_BPF_JIT_ALWAYS_ON u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1); + fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1]; #else fp->bpf_func = __bpf_prog_ret0_warn; @@ -1768,22 +1858,42 @@ struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags) return &empty_prog_array.hdr; } -void bpf_prog_array_free(struct bpf_prog_array __rcu *progs) +void bpf_prog_array_free(struct bpf_prog_array *progs) { - if (!progs || - progs == (struct bpf_prog_array __rcu *)&empty_prog_array.hdr) + if (!progs || progs == &empty_prog_array.hdr) return; kfree_rcu(progs, rcu); } -static bool bpf_prog_array_copy_core(struct bpf_prog_array __rcu *array, +int bpf_prog_array_length(struct bpf_prog_array *array) +{ + struct bpf_prog_array_item *item; + u32 cnt = 0; + + for (item = array->items; item->prog; item++) + if (item->prog != &dummy_bpf_prog.prog) + cnt++; + return cnt; +} + +bool bpf_prog_array_is_empty(struct bpf_prog_array *array) +{ + struct bpf_prog_array_item *item; + + for (item = array->items; item->prog; item++) + if (item->prog != &dummy_bpf_prog.prog) + return false; + return true; +} + +static bool bpf_prog_array_copy_core(struct bpf_prog_array *array, u32 *prog_ids, u32 request_cnt) { struct bpf_prog_array_item *item; int i = 0; - item = rcu_dereference(array)->items; - for (; item->prog; item++) { + + for (item = array->items; item->prog; item++) { if (item->prog == &dummy_bpf_prog.prog) continue; prog_ids[i] = item->prog->aux->id; @@ -1792,42 +1902,49 @@ static bool bpf_prog_array_copy_core(struct bpf_prog_array __rcu *array, break; } } + return !!(item->prog); } -int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array, - u32 *prog_ids, u32 request_cnt, - u32 *prog_cnt) +int bpf_prog_array_copy_to_user(struct bpf_prog_array *array, + __u32 __user *prog_ids, u32 cnt) { - u32 cnt = 0; + unsigned long err = 0; + bool nospc; + u32 *ids; - if (array) - cnt = bpf_prog_array_length(array); - - *prog_cnt = cnt; - - /* return early if user requested only program count or nothing to copy */ - if (!request_cnt || !cnt) - return 0; - - /* this function is called under trace/bpf_trace.c: bpf_event_mutex */ - return bpf_prog_array_copy_core(array, prog_ids, request_cnt) ? -ENOSPC - : 0; + /* users of this function are doing: + * cnt = bpf_prog_array_length(); + * if (cnt > 0) + * bpf_prog_array_copy_to_user(..., cnt); + * so below kcalloc doesn't need extra cnt > 0 check. + */ + ids = kcalloc(cnt, sizeof(u32), GFP_USER | __GFP_NOWARN); + if (!ids) + return -ENOMEM; + nospc = bpf_prog_array_copy_core(array, ids, cnt); + err = copy_to_user(prog_ids, ids, cnt * sizeof(u32)); + kfree(ids); + if (err) + return -EFAULT; + if (nospc) + return -ENOSPC; + return 0; } -void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *array, +void bpf_prog_array_delete_safe(struct bpf_prog_array *array, struct bpf_prog *old_prog) { - struct bpf_prog_array_item *item = array->items; + struct bpf_prog_array_item *item; - for (; item->prog; item++) + for (item = array->items; item->prog; item++) if (item->prog == old_prog) { WRITE_ONCE(item->prog, &dummy_bpf_prog.prog); break; } } -int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array, +int bpf_prog_array_copy(struct bpf_prog_array *old_array, struct bpf_prog *exclude_prog, struct bpf_prog *include_prog, struct bpf_prog_array **new_array) @@ -1855,6 +1972,9 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array, } } + if (exclude_prog && !found_exclude) + return -ENOENT; + /* How many progs (not NULL) will be in the new array? */ new_prog_cnt = carry_prog_cnt; if (include_prog) @@ -1874,13 +1994,12 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array, /* Fill in the new prog array */ if (carry_prog_cnt) { existing = old_array->items; - for (; existing->prog; existing++) { + for (; existing->prog; existing++) if (existing->prog != exclude_prog && existing->prog != &dummy_bpf_prog.prog) { array->items[new_prog_idx++].prog = existing->prog; } - } } if (include_prog) array->items[new_prog_idx++].prog = include_prog; @@ -1889,58 +2008,24 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array, return 0; } -int bpf_prog_array_length(struct bpf_prog_array __rcu *array) +int bpf_prog_array_copy_info(struct bpf_prog_array *array, + u32 *prog_ids, u32 request_cnt, + u32 *prog_cnt) { - struct bpf_prog_array_item *item; u32 cnt = 0; - rcu_read_lock(); - item = rcu_dereference(array)->items; - for (; item->prog; item++) - if (item->prog != &dummy_bpf_prog.prog) - cnt++; - rcu_read_unlock(); - return cnt; -} + if (array) + cnt = bpf_prog_array_length(array); -bool bpf_prog_array_is_empty(struct bpf_prog_array *array) -{ - struct bpf_prog_array_item *item; - for (item = array->items; item->prog; item++) - if (item->prog != &dummy_bpf_prog.prog) - return false; - return true; -} + *prog_cnt = cnt; -int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *array, - __u32 __user *prog_ids, u32 cnt) -{ - unsigned long err = 0; - bool nospc; - u32 *ids; - /* users of this function are doing: - * cnt = bpf_prog_array_length(); - * if (cnt > 0) - * bpf_prog_array_copy_to_user(..., cnt); - * so below kcalloc doesn't need extra cnt > 0 check, but - * bpf_prog_array_length() releases rcu lock and - * prog array could have been swapped with empty or larger array, - * so always copy 'cnt' prog_ids to the user. - * In a rare race the user will see zero prog_ids - */ - ids = kcalloc(cnt, sizeof(u32), GFP_USER); - if (!ids) - return -ENOMEM; - rcu_read_lock(); - nospc = bpf_prog_array_copy_core(array, ids, cnt); - rcu_read_unlock(); - err = copy_to_user(prog_ids, ids, cnt * sizeof(u32)); - kfree(ids); - if (err) - return -EFAULT; - if (nospc) - return -ENOSPC; - return 0; + /* return early if user requested only program count or nothing to copy */ + if (!request_cnt || !cnt) + return 0; + + /* this function is called under trace/bpf_trace.c: bpf_event_mutex */ + return bpf_prog_array_copy_core(array, prog_ids, request_cnt) ? -ENOSPC + : 0; } static void bpf_prog_free_deferred(struct work_struct *work) @@ -1951,8 +2036,10 @@ static void bpf_prog_free_deferred(struct work_struct *work) aux = container_of(work, struct bpf_prog_aux, work); if (bpf_prog_is_dev_bound(aux)) bpf_prog_offload_destroy(aux->prog); - - +#ifdef CONFIG_PERF_EVENTS + if (aux->prog->has_callchain_buf) + put_callchain_buffers(); +#endif for (i = 0; i < aux->func_cnt; i++) bpf_jit_free(aux->func[i]); if (aux->func_cnt) { @@ -2003,6 +2090,9 @@ BPF_CALL_0(bpf_user_rnd_u32) const struct bpf_func_proto bpf_map_lookup_elem_proto __weak; const struct bpf_func_proto bpf_map_update_elem_proto __weak; const struct bpf_func_proto bpf_map_delete_elem_proto __weak; +const struct bpf_func_proto bpf_map_push_elem_proto __weak; +const struct bpf_func_proto bpf_map_pop_elem_proto __weak; +const struct bpf_func_proto bpf_map_peek_elem_proto __weak; const struct bpf_func_proto bpf_spin_lock_proto __weak; const struct bpf_func_proto bpf_spin_unlock_proto __weak; @@ -2029,6 +2119,7 @@ bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, { return -ENOTSUPP; } +EXPORT_SYMBOL_GPL(bpf_event_output); /* Always built-in helper functions. */ const struct bpf_func_proto bpf_tail_call_proto = { @@ -2061,6 +2152,15 @@ bool __weak bpf_helper_changes_pkt_data(void *func) return false; } +/* Return TRUE if the JIT backend wants verifier to enable sub-register usage + * analysis code and wants explicit zero extension inserted by verifier. + * Otherwise, return FALSE. + */ +bool __weak bpf_jit_needs_zext(void) +{ + return false; +} + /* To execute LD_ABS/LD_IND instructions __bpf_prog_run() may call * skb_copy_bits(), so provide a weak definition of it for NET-less config. */ @@ -2070,11 +2170,12 @@ int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to, return -EFAULT; } +DEFINE_STATIC_KEY_FALSE(bpf_stats_enabled_key); +EXPORT_SYMBOL(bpf_stats_enabled_key); + /* All definitions of tracepoints related to BPF. */ #define CREATE_TRACE_POINTS #include EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_exception); - -EXPORT_TRACEPOINT_SYMBOL_GPL(bpf_prog_get_type); -EXPORT_TRACEPOINT_SYMBOL_GPL(bpf_prog_put_rcu); +EXPORT_TRACEPOINT_SYMBOL_GPL(xdp_bulk_tx); diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 0b2dcf0990e0..19be747f4e5a 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -1,7 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0-only /* bpf/cpumap.c * * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc. - * Released under terms in GPL version 2. See COPYING. */ /* The 'cpumap' is primarily used as a backend map for XDP BPF helper @@ -19,32 +19,46 @@ #include #include #include +#include #include #include #include #include +#include + +#include /* netif_receive_skb_core */ +#include /* eth_type_trans */ /* General idea: XDP packets getting XDP redirected to another CPU, * will maximum be stored/queued for one driver ->poll() call. It is - * guaranteed that setting flush bit and flush operation happen on + * guaranteed that queueing the frame and the flush operation happen on * same CPU. Thus, cpu_map_flush operation can deduct via this_cpu_ptr() * which queue in bpf_cpu_map_entry contains packets. */ #define CPU_MAP_BULK_SIZE 8 /* 8 == one cacheline on 64-bit archs */ +struct bpf_cpu_map_entry; +struct bpf_cpu_map; + struct xdp_bulk_queue { void *q[CPU_MAP_BULK_SIZE]; + struct list_head flush_node; + struct bpf_cpu_map_entry *obj; unsigned int count; }; /* Struct for every remote "destination" CPU in map */ struct bpf_cpu_map_entry { + u32 cpu; /* kthread CPU and map index */ + int map_id; /* Back reference to map */ u32 qsize; /* Queue size placeholder for map lookup */ /* XDP can run multiple RX-ring queues, need __percpu enqueue store */ struct xdp_bulk_queue __percpu *bulkq; + struct bpf_cpu_map *cmap; + /* Queue with potential multi-producers, and single-consumer kthread */ struct ptr_ring *queue; struct task_struct *kthread; @@ -58,23 +72,17 @@ struct bpf_cpu_map { struct bpf_map map; /* Below members specific for map type */ struct bpf_cpu_map_entry **cpu_map; - unsigned long __percpu *flush_needed; + struct list_head __percpu *flush_list; }; -static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu, - struct xdp_bulk_queue *bq); - -static u64 cpu_map_bitmap_size(const union bpf_attr *attr) -{ - return BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long); -} +static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx); static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) { struct bpf_cpu_map *cmap; int err = -ENOMEM; + int ret, cpu; u64 cost; - int ret; if (!capable(CAP_SYS_ADMIN)) return ERR_PTR(-EPERM); @@ -98,23 +106,21 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) /* make sure page count doesn't overflow */ cost = (u64) cmap->map.max_entries * sizeof(struct bpf_cpu_map_entry *); - cost += cpu_map_bitmap_size(attr) * num_possible_cpus(); - if (cost >= U32_MAX - PAGE_SIZE) - goto free_cmap; - cmap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; + cost += sizeof(struct list_head) * num_possible_cpus(); /* Notice returns -EPERM on if map size is larger than memlock limit */ - ret = bpf_map_precharge_memlock(cmap->map.pages); + ret = bpf_map_charge_init(&cmap->map.memory, cost); if (ret) { err = ret; goto free_cmap; } - /* A per cpu bitfield with a bit per possible CPU in map */ - cmap->flush_needed = __alloc_percpu(cpu_map_bitmap_size(attr), - __alignof__(unsigned long)); - if (!cmap->flush_needed) - goto free_cmap; + cmap->flush_list = alloc_percpu(struct list_head); + if (!cmap->flush_list) + goto free_charge; + + for_each_possible_cpu(cpu) + INIT_LIST_HEAD(per_cpu_ptr(cmap->flush_list, cpu)); /* Alloc array for possible remote "destination" CPUs */ cmap->cpu_map = bpf_map_area_alloc(cmap->map.max_entries * @@ -125,33 +131,14 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) return &cmap->map; free_percpu: - free_percpu(cmap->flush_needed); + free_percpu(cmap->flush_list); +free_charge: + bpf_map_charge_finish(&cmap->map.memory); free_cmap: kfree(cmap); return ERR_PTR(err); } -void __cpu_map_queue_destructor(void *ptr) -{ - /* The tear-down procedure should have made sure that queue is - * empty. See __cpu_map_entry_replace() and work-queue - * invoked cpu_map_kthread_stop(). Catch any broken behaviour - * gracefully and warn once. - */ - if (WARN_ON_ONCE(ptr)) - page_frag_free(ptr); -} - -static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) -{ - if (atomic_dec_and_test(&rcpu->refcnt)) { - /* The queue should be empty at this point */ - ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor); - kfree(rcpu->queue); - kfree(rcpu); - } -} - static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) { atomic_inc(&rcpu->refcnt); @@ -173,9 +160,96 @@ static void cpu_map_kthread_stop(struct work_struct *work) kthread_stop(rcpu->kthread); } +static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu, + struct xdp_frame *xdpf, + struct sk_buff *skb) +{ + unsigned int hard_start_headroom; + unsigned int frame_size; + void *pkt_data_start; + + /* Part of headroom was reserved to xdpf */ + hard_start_headroom = sizeof(struct xdp_frame) + xdpf->headroom; + + /* build_skb need to place skb_shared_info after SKB end, and + * also want to know the memory "truesize". Thus, need to + * know the memory frame size backing xdp_buff. + * + * XDP was designed to have PAGE_SIZE frames, but this + * assumption is not longer true with ixgbe and i40e. It + * would be preferred to set frame_size to 2048 or 4096 + * depending on the driver. + * frame_size = 2048; + * frame_len = frame_size - sizeof(*xdp_frame); + * + * Instead, with info avail, skb_shared_info in placed after + * packet len. This, unfortunately fakes the truesize. + * Another disadvantage of this approach, the skb_shared_info + * is not at a fixed memory location, with mixed length + * packets, which is bad for cache-line hotness. + */ + frame_size = SKB_DATA_ALIGN(xdpf->len + hard_start_headroom) + + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + + pkt_data_start = xdpf->data - hard_start_headroom; + skb = build_skb_around(skb, pkt_data_start, frame_size); + if (unlikely(!skb)) + return NULL; + + skb_reserve(skb, hard_start_headroom); + __skb_put(skb, xdpf->len); + if (xdpf->metasize) + skb_metadata_set(skb, xdpf->metasize); + + /* Essential SKB info: protocol and skb->dev */ + skb->protocol = eth_type_trans(skb, xdpf->dev_rx); + + /* Optional SKB info, currently missing: + * - HW checksum info (skb->ip_summed) + * - HW RX hash (skb_set_hash) + * - RX ring dev queue index (skb_record_rx_queue) + */ + + /* Until page_pool get SKB return path, release DMA here */ + xdp_release_frame(xdpf); + + /* Allow SKB to reuse area used by xdp_frame */ + xdp_scrub_frame(xdpf); + + return skb; +} + +static void __cpu_map_ring_cleanup(struct ptr_ring *ring) +{ + /* The tear-down procedure should have made sure that queue is + * empty. See __cpu_map_entry_replace() and work-queue + * invoked cpu_map_kthread_stop(). Catch any broken behaviour + * gracefully and warn once. + */ + struct xdp_frame *xdpf; + + while ((xdpf = ptr_ring_consume(ring))) + if (WARN_ON_ONCE(xdpf)) + xdp_return_frame(xdpf); +} + +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu) +{ + if (atomic_dec_and_test(&rcpu->refcnt)) { + /* The queue should be empty at this point */ + __cpu_map_ring_cleanup(rcpu->queue); + ptr_ring_cleanup(rcpu->queue, NULL); + kfree(rcpu->queue); + kfree(rcpu); + } +} + +#define CPUMAP_BATCH 8 + static int cpu_map_kthread_run(void *data) { struct bpf_cpu_map_entry *rcpu = data; + unsigned long last_qs = jiffies; set_current_state(TASK_INTERRUPTIBLE); @@ -185,15 +259,74 @@ static int cpu_map_kthread_run(void *data) * kthread_stop signal until queue is empty. */ while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) { - struct xdp_pkt *xdp_pkt; + unsigned int drops = 0, sched = 0; + void *frames[CPUMAP_BATCH]; + void *skbs[CPUMAP_BATCH]; + gfp_t gfp = __GFP_ZERO | GFP_ATOMIC; + int i, n, m; - schedule(); - /* Do work */ - while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) { - /* For now just "refcnt-free" */ - page_frag_free(xdp_pkt); + /* Release CPU reschedule checks */ + if (__ptr_ring_empty(rcpu->queue)) { + set_current_state(TASK_INTERRUPTIBLE); + /* Recheck to avoid lost wake-up */ + if (__ptr_ring_empty(rcpu->queue)) { + schedule(); + sched = 1; + last_qs = jiffies; + } else { + __set_current_state(TASK_RUNNING); + } + } else { + rcu_softirq_qs_periodic(last_qs); + sched = cond_resched(); } - __set_current_state(TASK_INTERRUPTIBLE); + + /* + * The bpf_cpu_map_entry is single consumer, with this + * kthread CPU pinned. Lockless access to ptr_ring + * consume side valid as no-resize allowed of queue. + */ + n = ptr_ring_consume_batched(rcpu->queue, frames, CPUMAP_BATCH); + + for (i = 0; i < n; i++) { + void *f = frames[i]; + struct page *page = virt_to_page(f); + + /* Bring struct page memory area to curr CPU. Read by + * build_skb_around via page_is_pfmemalloc(), and when + * freed written by page_frag_free call. + */ + prefetchw(page); + } + + m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, n, skbs); + if (unlikely(m == 0)) { + for (i = 0; i < n; i++) + skbs[i] = NULL; /* effect: xdp_return_frame */ + drops = n; + } + + local_bh_disable(); + for (i = 0; i < n; i++) { + struct xdp_frame *xdpf = frames[i]; + struct sk_buff *skb = skbs[i]; + int ret; + + skb = cpu_map_build_skb(rcpu, xdpf, skb); + if (!skb) { + xdp_return_frame(xdpf); + continue; + } + + /* Inject into network stack */ + ret = netif_receive_skb_core(skb); + if (ret == NET_RX_DROP) + drops++; + } + /* Feedback loop via tracepoint */ + trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched); + + local_bh_enable(); /* resched point, may call do_softirq() */ } __set_current_state(TASK_RUNNING); @@ -201,11 +334,13 @@ static int cpu_map_kthread_run(void *data) return 0; } -struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, int map_id) +static struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, + int map_id) { - gfp_t gfp = GFP_ATOMIC|__GFP_NOWARN; + gfp_t gfp = GFP_KERNEL | __GFP_NOWARN; struct bpf_cpu_map_entry *rcpu; - int numa, err; + struct xdp_bulk_queue *bq; + int numa, err, i; /* Have map->numa_node, but choose node of redirect target CPU */ numa = cpu_to_node(cpu); @@ -220,6 +355,11 @@ struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, int map_id) if (!rcpu->bulkq) goto free_rcu; + for_each_possible_cpu(i) { + bq = per_cpu_ptr(rcpu->bulkq, i); + bq->obj = rcpu; + } + /* Alloc queue */ rcpu->queue = kzalloc_node(sizeof(*rcpu->queue), gfp, numa); if (!rcpu->queue) @@ -229,7 +369,9 @@ struct bpf_cpu_map_entry *__cpu_map_entry_alloc(u32 qsize, u32 cpu, int map_id) if (err) goto free_queue; - rcpu->qsize = qsize; + rcpu->cpu = cpu; + rcpu->map_id = map_id; + rcpu->qsize = qsize; /* Setup kthread */ rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa, @@ -257,7 +399,7 @@ free_rcu: return NULL; } -void __cpu_map_entry_free(struct rcu_head *rcu) +static void __cpu_map_entry_free(struct rcu_head *rcu) { struct bpf_cpu_map_entry *rcpu; int cpu; @@ -274,7 +416,7 @@ void __cpu_map_entry_free(struct rcu_head *rcu) struct xdp_bulk_queue *bq = per_cpu_ptr(rcpu->bulkq, cpu); /* No concurrent bq_enqueue can run at this point */ - bq_flush_to_queue(rcpu, bq); + bq_flush_to_queue(bq, false); } free_percpu(rcpu->bulkq); /* Cannot kthread_stop() here, last put free rcpu resources */ @@ -300,8 +442,8 @@ void __cpu_map_entry_free(struct rcu_head *rcu) * cpu_map_kthread_stop, which waits for an RCU graze period before * stopping kthread, emptying the queue. */ -void __cpu_map_entry_replace(struct bpf_cpu_map *cmap, - u32 key_cpu, struct bpf_cpu_map_entry *rcpu) +static void __cpu_map_entry_replace(struct bpf_cpu_map *cmap, + u32 key_cpu, struct bpf_cpu_map_entry *rcpu) { struct bpf_cpu_map_entry *old_rcpu; @@ -313,7 +455,7 @@ void __cpu_map_entry_replace(struct bpf_cpu_map *cmap, } } -int cpu_map_delete_elem(struct bpf_map *map, void *key) +static int cpu_map_delete_elem(struct bpf_map *map, void *key) { struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map); u32 key_cpu = *(u32 *)key; @@ -326,8 +468,8 @@ int cpu_map_delete_elem(struct bpf_map *map, void *key) return 0; } -int cpu_map_update_elem(struct bpf_map *map, void *key, void *value, - u64 map_flags) +static int cpu_map_update_elem(struct bpf_map *map, void *key, void *value, + u64 map_flags) { struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map); struct bpf_cpu_map_entry *rcpu; @@ -347,7 +489,7 @@ int cpu_map_update_elem(struct bpf_map *map, void *key, void *value, return -EOVERFLOW; /* Make sure CPU is a valid possible cpu */ - if (!cpu_possible(key_cpu)) + if (key_cpu >= nr_cpumask_bits || !cpu_possible(key_cpu)) return -ENODEV; if (qsize == 0) { @@ -357,6 +499,7 @@ int cpu_map_update_elem(struct bpf_map *map, void *key, void *value, rcpu = __cpu_map_entry_alloc(qsize, key_cpu, map->id); if (!rcpu) return -ENOMEM; + rcpu->cmap = cmap; } rcu_read_lock(); __cpu_map_entry_replace(cmap, key_cpu, rcpu); @@ -364,7 +507,7 @@ int cpu_map_update_elem(struct bpf_map *map, void *key, void *value, return 0; } -void cpu_map_free(struct bpf_map *map) +static void cpu_map_free(struct bpf_map *map) { struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map); int cpu; @@ -378,17 +521,19 @@ void cpu_map_free(struct bpf_map *map) * It does __not__ ensure pending flush operations (if any) are * complete. */ + + bpf_clear_redirect_map(map); synchronize_rcu(); /* To ensure all pending flush operations have completed wait for flush - * bitmap to indicate all flush_needed bits to be zero on _all_ cpus. - * Because the above synchronize_rcu() ensures the map is disconnected - * from the program we can assume no new bits will be set. + * list be empty on _all_ cpus. Because the above synchronize_rcu() + * ensures the map is disconnected from the program we can assume no new + * items will be added to the list. */ for_each_online_cpu(cpu) { - unsigned long *bitmap = per_cpu_ptr(cmap->flush_needed, cpu); + struct list_head *flush_list = per_cpu_ptr(cmap->flush_list, cpu); - while (!bitmap_empty(bitmap, cmap->map.max_entries)) + while (!list_empty(flush_list)) cond_resched(); } @@ -405,7 +550,7 @@ void cpu_map_free(struct bpf_map *map) /* bq flush and cleanup happens after RCU graze-period */ __cpu_map_entry_replace(cmap, i, NULL); /* call_rcu */ } - free_percpu(cmap->flush_needed); + free_percpu(cmap->flush_list); bpf_map_area_free(cmap->cpu_map); kfree(cmap); } @@ -457,9 +602,11 @@ const struct bpf_map_ops cpu_map_ops = { .map_check_btf = map_check_no_btf, }; -static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu, - struct xdp_bulk_queue *bq) +static int bq_flush_to_queue(struct xdp_bulk_queue *bq, bool in_napi_ctx) { + struct bpf_cpu_map_entry *rcpu = bq->obj; + unsigned int processed = 0, drops = 0; + const int to_cpu = rcpu->cpu; struct ptr_ring *q; int i; @@ -470,86 +617,83 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu, spin_lock(&q->producer_lock); for (i = 0; i < bq->count; i++) { - void *xdp_pkt = bq->q[i]; + struct xdp_frame *xdpf = bq->q[i]; int err; - err = __ptr_ring_produce(q, xdp_pkt); + err = __ptr_ring_produce(q, xdpf); if (err) { - /* Free xdp_pkt */ - page_frag_free(xdp_pkt); + drops++; + if (likely(in_napi_ctx)) + xdp_return_frame_rx_napi(xdpf); + else + xdp_return_frame(xdpf); } + processed++; } bq->count = 0; spin_unlock(&q->producer_lock); + __list_del_clearprev(&bq->flush_node); + + /* Feedback loop via tracepoints */ + trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu); return 0; } -/* Notice: Will change in later patch */ -struct xdp_pkt { - void *data; - u16 len; - u16 headroom; -}; - /* Runs under RCU-read-side, plus in softirq under NAPI protection. * Thus, safe percpu variable access. */ -int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_pkt *xdp_pkt) +static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf) { + struct list_head *flush_list = this_cpu_ptr(rcpu->cmap->flush_list); struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq); if (unlikely(bq->count == CPU_MAP_BULK_SIZE)) - bq_flush_to_queue(rcpu, bq); + bq_flush_to_queue(bq, true); /* Notice, xdp_buff/page MUST be queued here, long enough for * driver to code invoking us to finished, due to driver * (e.g. ixgbe) recycle tricks based on page-refcnt. * - * Thus, incoming xdp_pkt is always queued here (else we race + * Thus, incoming xdp_frame is always queued here (else we race * with another CPU on page-refcnt and remaining driver code). * Queue time is very short, as driver will invoke flush * operation, when completing napi->poll call. */ - bq->q[bq->count++] = xdp_pkt; + bq->q[bq->count++] = xdpf; + + if (!bq->flush_node.prev) + list_add(&bq->flush_node, flush_list); + return 0; } -void __cpu_map_insert_ctx(struct bpf_map *map, u32 bit) +int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp, + struct net_device *dev_rx) { - struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map); - unsigned long *bitmap = this_cpu_ptr(cmap->flush_needed); + struct xdp_frame *xdpf; - __set_bit(bit, bitmap); + xdpf = convert_to_xdp_frame(xdp); + if (unlikely(!xdpf)) + return -EOVERFLOW; + + /* Info needed when constructing SKB on remote CPU */ + xdpf->dev_rx = dev_rx; + + bq_enqueue(rcpu, xdpf); + return 0; } void __cpu_map_flush(struct bpf_map *map) { struct bpf_cpu_map *cmap = container_of(map, struct bpf_cpu_map, map); - unsigned long *bitmap = this_cpu_ptr(cmap->flush_needed); - u32 bit; + struct list_head *flush_list = this_cpu_ptr(cmap->flush_list); + struct xdp_bulk_queue *bq, *tmp; - /* The napi->poll softirq makes sure __cpu_map_insert_ctx() - * and __cpu_map_flush() happen on same CPU. Thus, the percpu - * bitmap indicate which percpu bulkq have packets. - */ - for_each_set_bit(bit, bitmap, map->max_entries) { - struct bpf_cpu_map_entry *rcpu = READ_ONCE(cmap->cpu_map[bit]); - struct xdp_bulk_queue *bq; - - /* This is possible if entry is removed by user space - * between xdp redirect and flush op. - */ - if (unlikely(!rcpu)) - continue; - - __clear_bit(bit, bitmap); - - /* Flush all frames in bulkq to real queue */ - bq = this_cpu_ptr(rcpu->bulkq); - bq_flush_to_queue(rcpu, bq); + list_for_each_entry_safe(bq, tmp, flush_list, flush_node) { + bq_flush_to_queue(bq, true); /* If already running, costs spin_lock_irqsave + smb_mb */ - wake_up_process(rcpu->kthread); + wake_up_process(bq->obj->kthread); } } diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index 18ec86effea5..08ff40e3921c 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -1,13 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2017 Covalent IO, Inc. http://covalent.io - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ /* Devmaps primary use is as a backend map for XDP BPF helper call @@ -25,9 +17,8 @@ * datapath always has a valid copy. However, the datapath does a "flush" * operation that pushes any pending packets in the driver outside the RCU * critical section. Each bpf_dtab_netdev tracks these pending operations using - * an atomic per-cpu bitmap. The bpf_dtab_netdev object will not be destroyed - * until all bits are cleared indicating outstanding flush operations have - * completed. + * a per-cpu flush list. The bpf_dtab_netdev object will not be destroyed until + * this list is empty, indicating outstanding flush operations have completed. * * BPF syscalls may race with BPF program calls on any of the update, delete * or lookup operations. As noted above the xchg() operation also keep the @@ -54,23 +45,37 @@ * types of devmap; only the lookup and insertion is different. */ #include +#include #include +#include #define DEV_CREATE_FLAG_MASK \ (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) +#define DEV_MAP_BULK_SIZE 16 +struct bpf_dtab_netdev; + +struct xdp_bulk_queue { + struct xdp_frame *q[DEV_MAP_BULK_SIZE]; + struct list_head flush_node; + struct net_device *dev_rx; + struct bpf_dtab_netdev *obj; + unsigned int count; +}; + struct bpf_dtab_netdev { - struct net_device *dev; + struct net_device *dev; /* must be first member, due to tracepoint */ struct hlist_node index_hlist; struct bpf_dtab *dtab; - unsigned int bit; + struct xdp_bulk_queue __percpu *bulkq; struct rcu_head rcu; + unsigned int idx; /* keep track of map index for tracepoint */ }; struct bpf_dtab { struct bpf_map map; - struct bpf_dtab_netdev **netdev_map; - unsigned long __percpu *flush_needed; + struct bpf_dtab_netdev **netdev_map; /* DEVMAP type only */ + struct list_head __percpu *flush_list; struct list_head list; /* these are only used for DEVMAP_HASH type maps */ @@ -83,12 +88,13 @@ struct bpf_dtab { static DEFINE_SPINLOCK(dev_map_lock); static LIST_HEAD(dev_map_list); -static struct hlist_head *dev_map_create_hash(unsigned int entries) +static struct hlist_head *dev_map_create_hash(unsigned int entries, + int numa_node) { int i; struct hlist_head *hash; - hash = kmalloc_array(entries, sizeof(*hash), GFP_KERNEL); + hash = bpf_map_area_alloc((u64) entries * sizeof(*hash), numa_node); if (hash != NULL) for (i = 0; i < entries; i++) INIT_HLIST_HEAD(&hash[i]); @@ -96,80 +102,98 @@ static struct hlist_head *dev_map_create_hash(unsigned int entries) return hash; } -static u64 dev_map_bitmap_size(const union bpf_attr *attr) +static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab, + int idx) { - return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long); + return &dtab->dev_index_head[idx & (dtab->n_buckets - 1)]; } -static struct bpf_map *dev_map_alloc(union bpf_attr *attr) +static int dev_map_init_map(struct bpf_dtab *dtab, union bpf_attr *attr) { - struct bpf_dtab *dtab; - int err = -EINVAL; + int err, cpu; u64 cost; - if (!capable(CAP_NET_ADMIN)) - return ERR_PTR(-EPERM); - /* check sanity of attributes */ if (attr->max_entries == 0 || attr->key_size != 4 || attr->value_size != 4 || attr->map_flags & ~DEV_CREATE_FLAG_MASK) - return ERR_PTR(-EINVAL); + return -EINVAL; /* Lookup returns a pointer straight to dev->ifindex, so make sure the * verifier prevents writes from the BPF side */ attr->map_flags |= BPF_F_RDONLY_PROG; - dtab = kzalloc(sizeof(*dtab), GFP_USER); - if (!dtab) - return ERR_PTR(-ENOMEM); bpf_map_init_from_attr(&dtab->map, attr); /* make sure page count doesn't overflow */ - cost = (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *); - cost += dev_map_bitmap_size(attr) * num_possible_cpus(); - if (cost >= U32_MAX - PAGE_SIZE) - goto free_dtab; - - dtab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; + cost = (u64) sizeof(struct list_head) * num_possible_cpus(); if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) { - dtab->n_buckets = roundup_pow_of_two(dtab->map.max_entries); + /* hash table size must be power of 2; roundup_pow_of_two() can + * overflow into UB on 32-bit arches, so check that first + */ + if (dtab->map.max_entries > 1UL << 31) + return -EINVAL; - if (!dtab->n_buckets) { /* Overflow check */ - err = -EINVAL; - goto free_dtab; - } - cost += sizeof(struct hlist_head) * dtab->n_buckets; + dtab->n_buckets = roundup_pow_of_two(dtab->map.max_entries); + cost += (u64) sizeof(struct hlist_head) * dtab->n_buckets; + } else { + cost += (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *); } - /* if map size is larger than memlock limit, reject it early */ - err = bpf_map_precharge_memlock(dtab->map.pages); + /* if map size is larger than memlock limit, reject it */ + err = bpf_map_charge_init(&dtab->map.memory, cost); if (err) - goto free_dtab; + return -EINVAL; - err = -ENOMEM; + dtab->flush_list = alloc_percpu(struct list_head); + if (!dtab->flush_list) + goto free_charge; - /* A per cpu bitfield with a bit per possible net device */ - dtab->flush_needed = __alloc_percpu_gfp(dev_map_bitmap_size(attr), - __alignof__(unsigned long), - GFP_KERNEL | __GFP_NOWARN); - if (!dtab->flush_needed) - goto free_dtab; - - dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries * - sizeof(struct bpf_dtab_netdev *), - dtab->map.numa_node); - if (!dtab->netdev_map) - goto free_dtab; + for_each_possible_cpu(cpu) + INIT_LIST_HEAD(per_cpu_ptr(dtab->flush_list, cpu)); if (attr->map_type == BPF_MAP_TYPE_DEVMAP_HASH) { - dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets); + dtab->dev_index_head = dev_map_create_hash(dtab->n_buckets, + dtab->map.numa_node); if (!dtab->dev_index_head) - goto free_map_area; + goto free_percpu; spin_lock_init(&dtab->index_lock); + } else { + dtab->netdev_map = bpf_map_area_alloc((u64) dtab->map.max_entries * + sizeof(struct bpf_dtab_netdev *), + dtab->map.numa_node); + if (!dtab->netdev_map) + goto free_percpu; + } + + return 0; + +free_percpu: + free_percpu(dtab->flush_list); +free_charge: + bpf_map_charge_finish(&dtab->map.memory); + return -ENOMEM; +} + +static struct bpf_map *dev_map_alloc(union bpf_attr *attr) +{ + struct bpf_dtab *dtab; + int err; + + if (!capable(CAP_NET_ADMIN)) + return ERR_PTR(-EPERM); + + dtab = kzalloc(sizeof(*dtab), GFP_USER); + if (!dtab) + return ERR_PTR(-ENOMEM); + + err = dev_map_init_map(dtab, attr); + if (err) { + kfree(dtab); + return ERR_PTR(err); } spin_lock(&dev_map_lock); @@ -177,63 +201,67 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr) spin_unlock(&dev_map_lock); return &dtab->map; -free_map_area: - bpf_map_area_free(dtab->netdev_map); -free_dtab: - free_percpu(dtab->flush_needed); - kfree(dtab->dev_index_head); - kfree(dtab); - return ERR_PTR(err); } static void dev_map_free(struct bpf_map *map) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); - int i, cpu; + u32 i; /* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0, * so the programs (can be more than one that used this map) were - * disconnected from events. Wait for outstanding critical sections in - * these programs to complete. The rcu critical section only guarantees - * no further reads against netdev_map. It does __not__ ensure pending - * flush operations (if any) are complete. + * disconnected from events. The following synchronize_rcu() guarantees + * both rcu read critical sections complete and waits for + * preempt-disable regions (NAPI being the relevant context here) so we + * are certain there will be no further reads against the netdev_map and + * all flush operations are complete. Flush operations can only be done + * from NAPI context for this reason. */ spin_lock(&dev_map_lock); list_del_rcu(&dtab->list); spin_unlock(&dev_map_lock); + bpf_clear_redirect_map(map); synchronize_rcu(); /* Make sure prior __dev_map_entry_free() have completed. */ rcu_barrier(); - /* To ensure all pending flush operations have completed wait for flush - * bitmap to indicate all flush_needed bits to be zero on _all_ cpus. - * Because the above synchronize_rcu() ensures the map is disconnected - * from the program we can assume no new bits will be set. - */ - for_each_online_cpu(cpu) { - unsigned long *bitmap = per_cpu_ptr(dtab->flush_needed, cpu); + if (dtab->map.map_type == BPF_MAP_TYPE_DEVMAP_HASH) { + for (i = 0; i < dtab->n_buckets; i++) { + struct bpf_dtab_netdev *dev; + struct hlist_head *head; + struct hlist_node *next; - while (!bitmap_empty(bitmap, dtab->map.max_entries)) - cond_resched(); + head = dev_map_index_hash(dtab, i); + + hlist_for_each_entry_safe(dev, next, head, index_hlist) { + hlist_del_rcu(&dev->index_hlist); + free_percpu(dev->bulkq); + dev_put(dev->dev); + kfree(dev); + } + } + + bpf_map_area_free(dtab->dev_index_head); + } else { + for (i = 0; i < dtab->map.max_entries; i++) { + struct bpf_dtab_netdev *dev; + + dev = dtab->netdev_map[i]; + if (!dev) + continue; + + free_percpu(dev->bulkq); + dev_put(dev->dev); + kfree(dev); + } + + bpf_map_area_free(dtab->netdev_map); } - for (i = 0; i < dtab->map.max_entries; i++) { - struct bpf_dtab_netdev *dev; - - dev = dtab->netdev_map[i]; - if (!dev) - continue; - - dev_put(dev->dev); - kfree(dev); - } - - free_percpu(dtab->flush_needed); - bpf_map_area_free(dtab->netdev_map); - kfree(dtab->dev_index_head); + free_percpu(dtab->flush_list); kfree(dtab); } @@ -254,32 +282,20 @@ static int dev_map_get_next_key(struct bpf_map *map, void *key, void *next_key) return 0; } -static inline struct hlist_head *dev_map_index_hash(struct bpf_dtab *dtab, - int idx) -{ - return &dtab->dev_index_head[idx & (dtab->n_buckets - 1)]; -} - -static struct bpf_dtab_netdev *__dev_map_hash_lookup_elem_dtab(struct bpf_map *map, u32 key) +struct bpf_dtab_netdev *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct hlist_head *head = dev_map_index_hash(dtab, key); struct bpf_dtab_netdev *dev; - hlist_for_each_entry_rcu(dev, head, index_hlist) - if (dev->bit == key) + hlist_for_each_entry_rcu(dev, head, index_hlist, + lockdep_is_held(&dtab->index_lock)) + if (dev->idx == key) return dev; return NULL; } -struct net_device *__dev_map_hash_lookup_elem(struct bpf_map *map, u32 key) -{ - struct bpf_dtab_netdev *dev = __dev_map_hash_lookup_elem_dtab(map, key); - - return dev ? dev->dev : NULL; -} - static int dev_map_hash_get_next_key(struct bpf_map *map, void *key, void *next_key) { @@ -294,7 +310,7 @@ static int dev_map_hash_get_next_key(struct bpf_map *map, void *key, idx = *(u32 *)key; - dev = __dev_map_hash_lookup_elem_dtab(map, idx); + dev = __dev_map_hash_lookup_elem(map, idx); if (!dev) goto find_first; @@ -302,7 +318,7 @@ static int dev_map_hash_get_next_key(struct bpf_map *map, void *key, struct bpf_dtab_netdev, index_hlist); if (next_dev) { - *next = next_dev->bit; + *next = next_dev->idx; return 0; } @@ -317,7 +333,7 @@ static int dev_map_hash_get_next_key(struct bpf_map *map, void *key, struct bpf_dtab_netdev, index_hlist); if (next_dev) { - *next = next_dev->bit; + *next = next_dev->idx; return 0; } } @@ -325,96 +341,171 @@ static int dev_map_hash_get_next_key(struct bpf_map *map, void *key, return -ENOENT; } -void __dev_map_insert_ctx(struct bpf_map *map, u32 bit) +static int bq_xmit_all(struct xdp_bulk_queue *bq, u32 flags) { - struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); - unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed); + struct bpf_dtab_netdev *obj = bq->obj; + struct net_device *dev = obj->dev; + int sent = 0, drops = 0, err = 0; + int i; - __set_bit(bit, bitmap); + if (unlikely(!bq->count)) + return 0; + + for (i = 0; i < bq->count; i++) { + struct xdp_frame *xdpf = bq->q[i]; + + prefetch(xdpf); + } + + sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags); + if (sent < 0) { + err = sent; + sent = 0; + goto error; + } + drops = bq->count - sent; +out: + bq->count = 0; + + trace_xdp_devmap_xmit(&obj->dtab->map, obj->idx, + sent, drops, bq->dev_rx, dev, err); + bq->dev_rx = NULL; + __list_del_clearprev(&bq->flush_node); + return 0; +error: + /* If ndo_xdp_xmit fails with an errno, no frames have been + * xmit'ed and it's our responsibility to them free all. + */ + for (i = 0; i < bq->count; i++) { + struct xdp_frame *xdpf = bq->q[i]; + + xdp_return_frame_rx_napi(xdpf); + drops++; + } + goto out; } /* __dev_map_flush is called from xdp_do_flush_map() which _must_ be signaled * from the driver before returning from its napi->poll() routine. The poll() * routine is called either from busy_poll context or net_rx_action signaled * from NET_RX_SOFTIRQ. Either way the poll routine must complete before the - * net device can be torn down. On devmap tear down we ensure the ctx bitmap - * is zeroed before completing to ensure all flush operations have completed. + * net device can be torn down. On devmap tear down we ensure the flush list + * is empty before completing to ensure all flush operations have completed. */ void __dev_map_flush(struct bpf_map *map) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); - unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed); - u32 bit; + struct list_head *flush_list = this_cpu_ptr(dtab->flush_list); + struct xdp_bulk_queue *bq, *tmp; - for_each_set_bit(bit, bitmap, map->max_entries) { - struct bpf_dtab_netdev *dev = READ_ONCE(dtab->netdev_map[bit]); - struct net_device *netdev; - - /* This is possible if the dev entry is removed by user space - * between xdp redirect and flush op. - */ - if (unlikely(!dev)) - continue; - - __clear_bit(bit, bitmap); - netdev = dev->dev; - if (likely(netdev->netdev_ops->ndo_xdp_flush)) - netdev->netdev_ops->ndo_xdp_flush(netdev); - } + rcu_read_lock(); + list_for_each_entry_safe(bq, tmp, flush_list, flush_node) + bq_xmit_all(bq, XDP_XMIT_FLUSH); + rcu_read_unlock(); } /* rcu_read_lock (from syscall and BPF contexts) ensures that if a delete and/or * update happens in parallel here a dev_put wont happen until after reading the * ifindex. */ -struct net_device *__dev_map_lookup_elem(struct bpf_map *map, u32 key) +struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); - struct bpf_dtab_netdev *dev; + struct bpf_dtab_netdev *obj; if (key >= map->max_entries) return NULL; - dev = READ_ONCE(dtab->netdev_map[key]); - return dev ? dev->dev : NULL; + obj = READ_ONCE(dtab->netdev_map[key]); + return obj; +} + +/* Runs under RCU-read-side, plus in softirq under NAPI protection. + * Thus, safe percpu variable access. + */ +static int bq_enqueue(struct bpf_dtab_netdev *obj, struct xdp_frame *xdpf, + struct net_device *dev_rx) + +{ + struct list_head *flush_list = this_cpu_ptr(obj->dtab->flush_list); + struct xdp_bulk_queue *bq = this_cpu_ptr(obj->bulkq); + + if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) + bq_xmit_all(bq, 0); + + /* Ingress dev_rx will be the same for all xdp_frame's in + * bulk_queue, because bq stored per-CPU and must be flushed + * from net_device drivers NAPI func end. + */ + if (!bq->dev_rx) + bq->dev_rx = dev_rx; + + bq->q[bq->count++] = xdpf; + + if (!bq->flush_node.prev) + list_add(&bq->flush_node, flush_list); + + return 0; +} + +int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp, + struct net_device *dev_rx) +{ + struct net_device *dev = dst->dev; + struct xdp_frame *xdpf; + int err; + + if (!dev->netdev_ops->ndo_xdp_xmit) + return -EOPNOTSUPP; + + err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data); + if (unlikely(err)) + return err; + + xdpf = convert_to_xdp_frame(xdp); + if (unlikely(!xdpf)) + return -EOVERFLOW; + + return bq_enqueue(dst, xdpf, dev_rx); +} + +int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb, + struct bpf_prog *xdp_prog) +{ + int err; + + err = xdp_ok_fwd_dev(dst->dev, skb->len); + if (unlikely(err)) + return err; + skb->dev = dst->dev; + generic_xdp_tx(skb, xdp_prog); + + return 0; } static void *dev_map_lookup_elem(struct bpf_map *map, void *key) { - struct net_device *dev = __dev_map_lookup_elem(map, *(u32 *)key); + struct bpf_dtab_netdev *obj = __dev_map_lookup_elem(map, *(u32 *)key); + struct net_device *dev = obj ? obj->dev : NULL; return dev ? &dev->ifindex : NULL; } static void *dev_map_hash_lookup_elem(struct bpf_map *map, void *key) { - struct net_device *dev = __dev_map_hash_lookup_elem(map, *(u32 *)key); + struct bpf_dtab_netdev *obj = __dev_map_hash_lookup_elem(map, + *(u32 *)key); + struct net_device *dev = obj ? obj->dev : NULL; return dev ? &dev->ifindex : NULL; } -static void dev_map_flush_old(struct bpf_dtab_netdev *dev) -{ - if (dev->dev->netdev_ops->ndo_xdp_flush) { - struct net_device *fl = dev->dev; - unsigned long *bitmap; - int cpu; - - for_each_online_cpu(cpu) { - bitmap = per_cpu_ptr(dev->dtab->flush_needed, cpu); - __clear_bit(dev->bit, bitmap); - - fl->netdev_ops->ndo_xdp_flush(dev->dev); - } - } -} - static void __dev_map_entry_free(struct rcu_head *rcu) { struct bpf_dtab_netdev *dev; dev = container_of(rcu, struct bpf_dtab_netdev, rcu); - dev_map_flush_old(dev); + free_percpu(dev->bulkq); dev_put(dev->dev); kfree(dev); } @@ -423,18 +514,17 @@ static int dev_map_delete_elem(struct bpf_map *map, void *key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct bpf_dtab_netdev *old_dev; - int k = *(u32 *)key; + u32 k = *(u32 *)key; if (k >= map->max_entries) return -EINVAL; /* Use call_rcu() here to ensure any rcu critical sections have - * completed, but this does not guarantee a flush has happened - * yet. Because driver side rcu_read_lock/unlock only protects the - * running XDP program. However, for pending flush operations the - * dev and ctx are stored in another per cpu map. And additionally, - * the driver tear down ensures all soft irqs are complete before - * removing the net device in the case of dev_put equals zero. + * completed as well as any flush operations because call_rcu + * will wait for preempt-disable region to complete, NAPI in this + * context. And additionally, the driver tear down ensures all + * soft irqs are complete before removing the net device in the + * case of dev_put equals zero. */ old_dev = xchg(&dtab->netdev_map[k], NULL); if (old_dev) @@ -446,13 +536,13 @@ static int dev_map_hash_delete_elem(struct bpf_map *map, void *key) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); struct bpf_dtab_netdev *old_dev; - int k = *(u32 *)key; + u32 k = *(u32 *)key; unsigned long flags; int ret = -ENOENT; spin_lock_irqsave(&dtab->index_lock, flags); - old_dev = __dev_map_hash_lookup_elem_dtab(map, k); + old_dev = __dev_map_hash_lookup_elem(map, k); if (old_dev) { dtab->items--; hlist_del_init_rcu(&old_dev->index_hlist); @@ -471,31 +561,45 @@ static struct bpf_dtab_netdev *__dev_map_alloc_node(struct net *net, { gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN; struct bpf_dtab_netdev *dev; + struct xdp_bulk_queue *bq; + int cpu; dev = kmalloc_node(sizeof(*dev), gfp, dtab->map.numa_node); if (!dev) return ERR_PTR(-ENOMEM); + dev->bulkq = __alloc_percpu_gfp(sizeof(*dev->bulkq), + sizeof(void *), gfp); + if (!dev->bulkq) { + kfree(dev); + return ERR_PTR(-ENOMEM); + } + + for_each_possible_cpu(cpu) { + bq = per_cpu_ptr(dev->bulkq, cpu); + bq->obj = dev; + } + dev->dev = dev_get_by_index(net, ifindex); if (!dev->dev) { + free_percpu(dev->bulkq); kfree(dev); return ERR_PTR(-EINVAL); } - dev->bit = idx; + dev->idx = idx; dev->dtab = dtab; return dev; } -static int dev_map_update_elem(struct bpf_map *map, void *key, void *value, - u64 map_flags) +static int __dev_map_update_elem(struct net *net, struct bpf_map *map, + void *key, void *value, u64 map_flags) { struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); - struct net *net = current->nsproxy->net_ns; struct bpf_dtab_netdev *dev, *old_dev; - u32 i = *(u32 *)key; u32 ifindex = *(u32 *)value; + u32 i = *(u32 *)key; if (unlikely(map_flags > BPF_EXIST)) return -EINVAL; @@ -523,6 +627,12 @@ static int dev_map_update_elem(struct bpf_map *map, void *key, void *value, return 0; } +static int dev_map_update_elem(struct bpf_map *map, void *key, void *value, + u64 map_flags) +{ + return __dev_map_update_elem(current->nsproxy->net_ns, + map, key, value, map_flags); +} static int __dev_map_hash_update_elem(struct net *net, struct bpf_map *map, void *key, void *value, u64 map_flags) @@ -532,19 +642,22 @@ static int __dev_map_hash_update_elem(struct net *net, struct bpf_map *map, u32 ifindex = *(u32 *)value; u32 idx = *(u32 *)key; unsigned long flags; + int err = -EEXIST; if (unlikely(map_flags > BPF_EXIST || !ifindex)) return -EINVAL; - old_dev = __dev_map_hash_lookup_elem_dtab(map, idx); + spin_lock_irqsave(&dtab->index_lock, flags); + + old_dev = __dev_map_hash_lookup_elem(map, idx); if (old_dev && (map_flags & BPF_NOEXIST)) - return -EEXIST; + goto out_err; dev = __dev_map_alloc_node(net, dtab, ifindex, idx); - if (IS_ERR(dev)) - return PTR_ERR(dev); - - spin_lock_irqsave(&dtab->index_lock, flags); + if (IS_ERR(dev)) { + err = PTR_ERR(dev); + goto out_err; + } if (old_dev) { hlist_del_rcu(&old_dev->index_hlist); @@ -565,6 +678,10 @@ static int __dev_map_hash_update_elem(struct net *net, struct bpf_map *map, call_rcu(&old_dev->rcu, __dev_map_entry_free); return 0; + +out_err: + spin_unlock_irqrestore(&dtab->index_lock, flags); + return err; } static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value, @@ -591,8 +708,35 @@ const struct bpf_map_ops dev_map_hash_ops = { .map_lookup_elem = dev_map_hash_lookup_elem, .map_update_elem = dev_map_hash_update_elem, .map_delete_elem = dev_map_hash_delete_elem, + .map_check_btf = map_check_no_btf, }; +static void dev_map_hash_remove_netdev(struct bpf_dtab *dtab, + struct net_device *netdev) +{ + unsigned long flags; + u32 i; + + spin_lock_irqsave(&dtab->index_lock, flags); + for (i = 0; i < dtab->n_buckets; i++) { + struct bpf_dtab_netdev *dev; + struct hlist_head *head; + struct hlist_node *next; + + head = dev_map_index_hash(dtab, i); + + hlist_for_each_entry_safe(dev, next, head, index_hlist) { + if (netdev != dev->dev) + continue; + + dtab->items--; + hlist_del_rcu(&dev->index_hlist); + call_rcu(&dev->rcu, __dev_map_entry_free); + } + } + spin_unlock_irqrestore(&dtab->index_lock, flags); +} + static int dev_map_notification(struct notifier_block *notifier, ulong event, void *ptr) { @@ -609,6 +753,11 @@ static int dev_map_notification(struct notifier_block *notifier, */ rcu_read_lock(); list_for_each_entry_rcu(dtab, &dev_map_list, list) { + if (dtab->map.map_type == BPF_MAP_TYPE_DEVMAP_HASH) { + dev_map_hash_remove_netdev(dtab, netdev); + continue; + } + for (i = 0; i < dtab->map.max_entries; i++) { struct bpf_dtab_netdev *dev, *odev; @@ -635,6 +784,9 @@ static struct notifier_block dev_map_notifier = { static int __init dev_map_init(void) { + /* Assure tracepoint shadow struct _bpf_dtab_netdev is in sync */ + BUILD_BUG_ON(offsetof(struct bpf_dtab_netdev, dev) != + offsetof(struct _bpf_dtab_netdev, dev)); register_netdevice_notifier(&dev_map_notifier); return 0; } diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c index d70920831285..ff1dd7d45b58 100644 --- a/kernel/bpf/disasm.c +++ b/kernel/bpf/disasm.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com * Copyright (c) 2016 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include @@ -47,8 +39,8 @@ static const char *__func_imm_name(const struct bpf_insn_cbs *cbs, { if (cbs && cbs->cb_imm) return cbs->cb_imm(cbs->private_data, insn, full_imm); - snprintf(buff, len, "0x%llx", (unsigned long long)full_imm); + snprintf(buff, len, "0x%llx", (unsigned long long)full_imm); return buff; } @@ -67,7 +59,7 @@ const char *const bpf_class_string[8] = { [BPF_STX] = "stx", [BPF_ALU] = "alu", [BPF_JMP] = "jmp", - [BPF_RET] = "BUG", + [BPF_JMP32] = "jmp32", [BPF_ALU64] = "alu64", }; @@ -136,23 +128,22 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs, else print_bpf_end_insn(verbose, cbs->private_data, insn); } else if (BPF_OP(insn->code) == BPF_NEG) { - verbose(cbs->private_data, "(%02x) r%d = %s-r%d\n", - insn->code, insn->dst_reg, - class == BPF_ALU ? "(u32) " : "", + verbose(cbs->private_data, "(%02x) %c%d = -%c%d\n", + insn->code, class == BPF_ALU ? 'w' : 'r', + insn->dst_reg, class == BPF_ALU ? 'w' : 'r', insn->dst_reg); } else if (BPF_SRC(insn->code) == BPF_X) { - verbose(cbs->private_data, "(%02x) %sr%d %s %sr%d\n", - insn->code, class == BPF_ALU ? "(u32) " : "", + verbose(cbs->private_data, "(%02x) %c%d %s %c%d\n", + insn->code, class == BPF_ALU ? 'w' : 'r', insn->dst_reg, bpf_alu_string[BPF_OP(insn->code) >> 4], - class == BPF_ALU ? "(u32) " : "", + class == BPF_ALU ? 'w' : 'r', insn->src_reg); } else { - verbose(cbs->private_data, "(%02x) %sr%d %s %s%d\n", - insn->code, class == BPF_ALU ? "(u32) " : "", + verbose(cbs->private_data, "(%02x) %c%d %s %d\n", + insn->code, class == BPF_ALU ? 'w' : 'r', insn->dst_reg, bpf_alu_string[BPF_OP(insn->code) >> 4], - class == BPF_ALU ? "(u32) " : "", insn->imm); } } else if (class == BPF_STX) { @@ -171,15 +162,17 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs, else verbose(cbs->private_data, "BUG_%02x\n", insn->code); } else if (class == BPF_ST) { - if (BPF_MODE(insn->code) != BPF_MEM) { + if (BPF_MODE(insn->code) == BPF_MEM) { + verbose(cbs->private_data, "(%02x) *(%s *)(r%d %+d) = %d\n", + insn->code, + bpf_ldst_string[BPF_SIZE(insn->code) >> 3], + insn->dst_reg, + insn->off, insn->imm); + } else if (BPF_MODE(insn->code) == 0xc0 /* BPF_NOSPEC, no UAPI */) { + verbose(cbs->private_data, "(%02x) nospec\n", insn->code); + } else { verbose(cbs->private_data, "BUG_st_%02x\n", insn->code); - return; } - verbose(cbs->private_data, "(%02x) *(%s *)(r%d %+d) = %d\n", - insn->code, - bpf_ldst_string[BPF_SIZE(insn->code) >> 3], - insn->dst_reg, - insn->off, insn->imm); } else if (class == BPF_LDX) { if (BPF_MODE(insn->code) != BPF_MEM) { verbose(cbs->private_data, "BUG_ldx_%02x\n", insn->code); @@ -221,11 +214,12 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs, verbose(cbs->private_data, "BUG_ld_%02x\n", insn->code); return; } - } else if (class == BPF_JMP) { + } else if (class == BPF_JMP32 || class == BPF_JMP) { u8 opcode = BPF_OP(insn->code); if (opcode == BPF_CALL) { char tmp[64]; + if (insn->src_reg == BPF_PSEUDO_CALL) { verbose(cbs->private_data, "(%02x) call pc%s\n", insn->code, @@ -244,13 +238,18 @@ void print_bpf_insn(const struct bpf_insn_cbs *cbs, } else if (insn->code == (BPF_JMP | BPF_EXIT)) { verbose(cbs->private_data, "(%02x) exit\n", insn->code); } else if (BPF_SRC(insn->code) == BPF_X) { - verbose(cbs->private_data, "(%02x) if r%d %s r%d goto pc%+d\n", - insn->code, insn->dst_reg, + verbose(cbs->private_data, + "(%02x) if %c%d %s %c%d goto pc%+d\n", + insn->code, class == BPF_JMP32 ? 'w' : 'r', + insn->dst_reg, bpf_jmp_string[BPF_OP(insn->code) >> 4], + class == BPF_JMP32 ? 'w' : 'r', insn->src_reg, insn->off); } else { - verbose(cbs->private_data, "(%02x) if r%d %s 0x%x goto pc%+d\n", - insn->code, insn->dst_reg, + verbose(cbs->private_data, + "(%02x) if %c%d %s 0x%x goto pc%+d\n", + insn->code, class == BPF_JMP32 ? 'w' : 'r', + insn->dst_reg, bpf_jmp_string[BPF_OP(insn->code) >> 4], insn->imm, insn->off); } diff --git a/kernel/bpf/disasm.h b/kernel/bpf/disasm.h index 786ebe260000..e546b18d27da 100644 --- a/kernel/bpf/disasm.h +++ b/kernel/bpf/disasm.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com * Copyright (c) 2016 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef __BPF_DISASM_H__ @@ -34,6 +26,7 @@ typedef const char *(*bpf_insn_revmap_call_t)(void *private_data, typedef const char *(*bpf_insn_print_imm_t)(void *private_data, const struct bpf_insn *insn, __u64 full_imm); + struct bpf_insn_cbs { bpf_insn_print_t cb_print; bpf_insn_revmap_call_t cb_call; @@ -44,5 +37,4 @@ struct bpf_insn_cbs { void print_bpf_insn(const struct bpf_insn_cbs *cbs, const struct bpf_insn *insn, bool allow_ptr_leaks); - #endif diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index b169df182ee5..ddcf0d46228c 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -1,26 +1,21 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com * Copyright (c) 2016 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include +#include #include #include #include +#include +#include #include "percpu_freelist.h" #include "bpf_lru_list.h" #include "map_in_map.h" #define HTAB_CREATE_FLAG_MASK \ (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE | \ - BPF_F_RDONLY | BPF_F_WRONLY) + BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED) struct bucket { struct hlist_nulls_head head; @@ -39,6 +34,7 @@ struct bpf_htab { atomic_t count; /* number of elements in this hashtable */ u32 n_buckets; /* number of hash buckets */ u32 elem_size; /* size of each element in bytes */ + u32 hashrnd; }; /* each htab element is struct htab_elem + key + value */ @@ -114,6 +110,7 @@ static void htab_free_elems(struct bpf_htab *htab) pptr = htab_elem_get_ptr(get_htab_elem(htab, i), htab->map.key_size); free_percpu(pptr); + cond_resched(); } free_elems: bpf_map_area_free(htab->elems); @@ -159,6 +156,7 @@ static int prealloc_init(struct bpf_htab *htab) goto free_elems; htab_elem_set_ptr(get_htab_elem(htab, i), htab->map.key_size, pptr); + cond_resched(); } skip_percpu_elems: @@ -225,6 +223,78 @@ static int alloc_extra_elems(struct bpf_htab *htab) } /* Called from syscall */ +static int htab_map_alloc_check(union bpf_attr *attr) +{ + bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH || + attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH); + bool lru = (attr->map_type == BPF_MAP_TYPE_LRU_HASH || + attr->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH); + /* percpu_lru means each cpu has its own LRU list. + * it is different from BPF_MAP_TYPE_PERCPU_HASH where + * the map's value itself is percpu. percpu_lru has + * nothing to do with the map's value. + */ + bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU); + bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC); + bool zero_seed = (attr->map_flags & BPF_F_ZERO_SEED); + int numa_node = bpf_map_attr_numa_node(attr); + + BUILD_BUG_ON(offsetof(struct htab_elem, htab) != + offsetof(struct htab_elem, hash_node.pprev)); + BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) != + offsetof(struct htab_elem, hash_node.pprev)); + + if (lru && !capable(CAP_SYS_ADMIN)) + /* LRU implementation is much complicated than other + * maps. Hence, limit to CAP_SYS_ADMIN for now. + */ + return -EPERM; + + if (zero_seed && !capable(CAP_SYS_ADMIN)) + /* Guard against local DoS, and discourage production use. */ + return -EPERM; + + if (attr->map_flags & ~HTAB_CREATE_FLAG_MASK || + !bpf_map_flags_access_ok(attr->map_flags)) + return -EINVAL; + + if (!lru && percpu_lru) + return -EINVAL; + + if (lru && !prealloc) + return -ENOTSUPP; + + if (numa_node != NUMA_NO_NODE && (percpu || percpu_lru)) + return -EINVAL; + + /* check sanity of attributes. + * value_size == 0 may be allowed in the future to use map as a set + */ + if (attr->max_entries == 0 || attr->key_size == 0 || + attr->value_size == 0) + return -EINVAL; + + if (attr->key_size > MAX_BPF_STACK) + /* eBPF programs initialize keys on stack, so they cannot be + * larger than max stack size + */ + return -E2BIG; + + if (attr->value_size >= KMALLOC_MAX_SIZE - + MAX_BPF_STACK - sizeof(struct htab_elem)) + /* if value_size is bigger, the user space won't be able to + * access the elements via bpf syscall. This check also makes + * sure that the elem_size doesn't overflow and it's + * kmalloc-able later in htab_map_update_elem() + */ + return -E2BIG; + /* percpu map value size is bound by PCPU_MIN_UNIT_SIZE */ + if (percpu && round_up(attr->value_size, 8) > PCPU_MIN_UNIT_SIZE) + return -E2BIG; + + return 0; +} + static struct bpf_map *htab_map_alloc(union bpf_attr *attr) { bool percpu = (attr->map_type == BPF_MAP_TYPE_PERCPU_HASH || @@ -242,41 +312,12 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) int err, i; u64 cost; - BUILD_BUG_ON(offsetof(struct htab_elem, htab) != - offsetof(struct htab_elem, hash_node.pprev)); - BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) != - offsetof(struct htab_elem, hash_node.pprev)); - - if (lru && !capable(CAP_SYS_ADMIN)) - /* LRU implementation is much complicated than other - * maps. Hence, limit to CAP_SYS_ADMIN for now. - */ - return ERR_PTR(-EPERM); - - if (attr->map_flags & ~HTAB_CREATE_FLAG_MASK) - /* reserved bits should not be used */ - return ERR_PTR(-EINVAL); - - if (!lru && percpu_lru) - return ERR_PTR(-EINVAL); - - if (lru && !prealloc) - return ERR_PTR(-ENOTSUPP); - htab = kzalloc(sizeof(*htab), GFP_USER); if (!htab) return ERR_PTR(-ENOMEM); bpf_map_init_from_attr(&htab->map, attr); - /* check sanity of attributes. - * value_size == 0 may be allowed in the future to use map as a set - */ - err = -EINVAL; - if (htab->map.max_entries == 0 || htab->map.key_size == 0 || - htab->map.value_size == 0) - goto free_htab; - if (percpu_lru) { /* ensure each CPU's lru list has >=1 elements. * since we are at it, make each lru list has the same @@ -298,22 +339,6 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) htab->n_buckets = roundup_pow_of_two(htab->map.max_entries); - err = -E2BIG; - if (htab->map.key_size > MAX_BPF_STACK) - /* eBPF programs initialize keys on stack, so they cannot be - * larger than max stack size - */ - goto free_htab; - - if (htab->map.value_size >= KMALLOC_MAX_SIZE - - MAX_BPF_STACK - sizeof(struct htab_elem)) - /* if value_size is bigger, the user space won't be able to - * access the elements via bpf syscall. This check also makes - * sure that the elem_size doesn't overflow and it's - * kmalloc-able later in htab_map_update_elem() - */ - goto free_htab; - htab->elem_size = sizeof(struct htab_elem) + round_up(htab->map.key_size, 8); if (percpu) @@ -334,14 +359,8 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) else cost += (u64) htab->elem_size * num_possible_cpus(); - if (cost >= U32_MAX - PAGE_SIZE) - /* make sure page count doesn't overflow */ - goto free_htab; - - htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; - - /* if map size is larger than memlock limit, reject it early */ - err = bpf_map_precharge_memlock(htab->map.pages); + /* if map size is larger than memlock limit, reject it */ + err = bpf_map_charge_init(&htab->map.memory, cost); if (err) goto free_htab; @@ -350,7 +369,12 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) sizeof(struct bucket), htab->map.numa_node); if (!htab->buckets) - goto free_htab; + goto free_charge; + + if (htab->map.map_flags & BPF_F_ZERO_SEED) + htab->hashrnd = 0; + else + htab->hashrnd = get_random_int(); for (i = 0; i < htab->n_buckets; i++) { INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); @@ -378,14 +402,16 @@ free_prealloc: prealloc_destroy(htab); free_buckets: bpf_map_area_free(htab->buckets); +free_charge: + bpf_map_charge_finish(&htab->map.memory); free_htab: kfree(htab); return ERR_PTR(err); } -static inline u32 htab_map_hash(const void *key, u32 key_len) +static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd) { - return jhash(key, key_len, 0); + return jhash(key, key_len, hashrnd); } static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash) @@ -451,7 +477,7 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key) key_size = map->key_size; - hash = htab_map_hash(key, key_size); + hash = htab_map_hash(key, key_size, htab->hashrnd); head = select_bucket(htab, hash); @@ -486,7 +512,9 @@ static u32 htab_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf) struct bpf_insn *insn = insn_buf; const int ret = BPF_REG_0; - *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem); + BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem, + (void *(*)(struct bpf_map *map, void *key))NULL)); + *insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem)); *insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1); *insn++ = BPF_ALU64_IMM(BPF_ADD, ret, offsetof(struct htab_elem, key) + @@ -525,7 +553,9 @@ static u32 htab_lru_map_gen_lookup(struct bpf_map *map, const int ret = BPF_REG_0; const int ref_reg = BPF_REG_1; - *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem); + BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem, + (void *(*)(struct bpf_map *map, void *key))NULL)); + *insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem)); *insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 4); *insn++ = BPF_LDX_MEM(BPF_B, ref_reg, ret, offsetof(struct htab_elem, lru_node) + @@ -586,7 +616,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key) if (!key) goto find_first_elem; - hash = htab_map_hash(key, key_size); + hash = htab_map_hash(key, key_size, htab->hashrnd); head = select_bucket(htab, hash); @@ -651,7 +681,7 @@ static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l) if (map->ops->map_fd_put_ptr) { ptr = fd_htab_map_get_ptr(map, l); - map->ops->map_fd_put_ptr(map, ptr, true); + map->ops->map_fd_put_ptr(ptr); } } @@ -686,6 +716,32 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr, } } +static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr, + void *value, bool onallcpus) +{ + /* When using prealloc and not setting the initial value on all cpus, + * zero-fill element values for other cpus (just as what happens when + * not using prealloc). Otherwise, bpf program has no way to ensure + * known initial values for cpus other than current one + * (onallcpus=false always when coming from bpf prog). + */ + if (htab_is_prealloc(htab) && !onallcpus) { + u32 size = round_up(htab->map.value_size, 8); + int current_cpu = raw_smp_processor_id(); + int cpu; + + for_each_possible_cpu(cpu) { + if (cpu == current_cpu) + bpf_long_memcpy(per_cpu_ptr(pptr, cpu), value, + size); + else + memset(per_cpu_ptr(pptr, cpu), 0, size); + } + } else { + pcpu_copy_value(htab, pptr, value, onallcpus); + } +} + static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab) { return htab->map.map_type == BPF_MAP_TYPE_HASH_OF_MAPS && @@ -756,7 +812,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key, } } - pcpu_copy_value(htab, pptr, value, onallcpus); + pcpu_init_value(htab, pptr, value, onallcpus); if (!prealloc) htab_elem_set_ptr(l_new, key_size, pptr); @@ -802,7 +858,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, u32 key_size, hash; int ret; - if (unlikely(map_flags > BPF_EXIST)) + if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST)) /* unknown flags */ return -EINVAL; @@ -810,11 +866,33 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, key_size = map->key_size; - hash = htab_map_hash(key, key_size); + hash = htab_map_hash(key, key_size, htab->hashrnd); b = __select_bucket(htab, hash); head = &b->head; + if (unlikely(map_flags & BPF_F_LOCK)) { + if (unlikely(!map_value_has_spin_lock(map))) + return -EINVAL; + /* find an element without taking the bucket lock */ + l_old = lookup_nulls_elem_raw(head, hash, key, key_size, + htab->n_buckets); + ret = check_flags(htab, l_old, map_flags); + if (ret) + return ret; + if (l_old) { + /* grab the element lock and update value in place */ + copy_map_value_locked(map, + l_old->key + round_up(key_size, 8), + value, false); + return 0; + } + /* fall through, grab the bucket lock and lookup again. + * 99.9% chance that the element won't be found, + * but second lookup under lock has to be done. + */ + } + /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); @@ -880,7 +958,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value, key_size = map->key_size; - hash = htab_map_hash(key, key_size); + hash = htab_map_hash(key, key_size, htab->hashrnd); b = __select_bucket(htab, hash); head = &b->head; @@ -945,36 +1023,11 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key, key_size = map->key_size; - hash = htab_map_hash(key, key_size); + hash = htab_map_hash(key, key_size, htab->hashrnd); b = __select_bucket(htab, hash); head = &b->head; - if (unlikely(map_flags & BPF_F_LOCK)) { - if (unlikely(!map_value_has_spin_lock(map))) - return -EINVAL; - - /* find an element without taking the bucket lock */ - l_old = lookup_nulls_elem_raw(head, hash, key, key_size, - htab->n_buckets); - ret = check_flags(htab, l_old, map_flags); - if (ret) - return ret; - - if (l_old) { - /* grab the element lock and update value in place */ - copy_map_value_locked(map, - l_old->key + round_up(key_size, 8), - value, false); - return 0; - } - - /* fall through, grab the bucket lock and lookup again. - * 99.9% chance that the element won't be found, - * but second lookup under lock has to be done. - */ - } - /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); @@ -1015,7 +1068,7 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, u32 key_size, hash; int ret; - if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST)) + if (unlikely(map_flags > BPF_EXIST)) /* unknown flags */ return -EINVAL; @@ -1023,7 +1076,7 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, key_size = map->key_size; - hash = htab_map_hash(key, key_size); + hash = htab_map_hash(key, key_size, htab->hashrnd); b = __select_bucket(htab, hash); head = &b->head; @@ -1055,7 +1108,7 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, pcpu_copy_value(htab, htab_elem_get_ptr(l_old, key_size), value, onallcpus); } else { - pcpu_copy_value(htab, htab_elem_get_ptr(l_new, key_size), + pcpu_init_value(htab, htab_elem_get_ptr(l_new, key_size), value, onallcpus); hlist_nulls_add_head_rcu(&l_new->hash_node, head); l_new = NULL; @@ -1096,7 +1149,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key) key_size = map->key_size; - hash = htab_map_hash(key, key_size); + hash = htab_map_hash(key, key_size, htab->hashrnd); b = __select_bucket(htab, hash); head = &b->head; @@ -1128,7 +1181,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key) key_size = map->key_size; - hash = htab_map_hash(key, key_size); + hash = htab_map_hash(key, key_size, htab->hashrnd); b = __select_bucket(htab, hash); head = &b->head; @@ -1189,7 +1242,29 @@ static void htab_map_free(struct bpf_map *map) kfree(htab); } +static void htab_map_seq_show_elem(struct bpf_map *map, void *key, + struct seq_file *m) +{ + void *value; + + rcu_read_lock(); + + value = htab_map_lookup_elem(map, key); + if (!value) { + rcu_read_unlock(); + return; + } + + btf_type_seq_show(map->btf, map->btf_key_type_id, key, m); + seq_puts(m, ": "); + btf_type_seq_show(map->btf, map->btf_value_type_id, value, m); + seq_puts(m, "\n"); + + rcu_read_unlock(); +} + const struct bpf_map_ops htab_map_ops = { + .map_alloc_check = htab_map_alloc_check, .map_alloc = htab_map_alloc, .map_free = htab_map_free, .map_get_next_key = htab_map_get_next_key, @@ -1197,9 +1272,11 @@ const struct bpf_map_ops htab_map_ops = { .map_update_elem = htab_map_update_elem, .map_delete_elem = htab_map_delete_elem, .map_gen_lookup = htab_map_gen_lookup, + .map_seq_show_elem = htab_map_seq_show_elem, }; const struct bpf_map_ops htab_lru_map_ops = { + .map_alloc_check = htab_map_alloc_check, .map_alloc = htab_map_alloc, .map_free = htab_map_free, .map_get_next_key = htab_map_get_next_key, @@ -1208,6 +1285,7 @@ const struct bpf_map_ops htab_lru_map_ops = { .map_update_elem = htab_lru_map_update_elem, .map_delete_elem = htab_lru_map_delete_elem, .map_gen_lookup = htab_lru_map_gen_lookup, + .map_seq_show_elem = htab_map_seq_show_elem, }; /* Called from eBPF program */ @@ -1283,29 +1361,62 @@ int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value, return ret; } +static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key, + struct seq_file *m) +{ + struct htab_elem *l; + void __percpu *pptr; + int cpu; + + rcu_read_lock(); + + l = __htab_map_lookup_elem(map, key); + if (!l) { + rcu_read_unlock(); + return; + } + + btf_type_seq_show(map->btf, map->btf_key_type_id, key, m); + seq_puts(m, ": {\n"); + pptr = htab_elem_get_ptr(l, map->key_size); + for_each_possible_cpu(cpu) { + seq_printf(m, "\tcpu%d: ", cpu); + btf_type_seq_show(map->btf, map->btf_value_type_id, + per_cpu_ptr(pptr, cpu), m); + seq_puts(m, "\n"); + } + seq_puts(m, "}\n"); + + rcu_read_unlock(); +} + const struct bpf_map_ops htab_percpu_map_ops = { + .map_alloc_check = htab_map_alloc_check, .map_alloc = htab_map_alloc, .map_free = htab_map_free, .map_get_next_key = htab_map_get_next_key, .map_lookup_elem = htab_percpu_map_lookup_elem, .map_update_elem = htab_percpu_map_update_elem, .map_delete_elem = htab_map_delete_elem, + .map_seq_show_elem = htab_percpu_map_seq_show_elem, }; const struct bpf_map_ops htab_lru_percpu_map_ops = { + .map_alloc_check = htab_map_alloc_check, .map_alloc = htab_map_alloc, .map_free = htab_map_free, .map_get_next_key = htab_map_get_next_key, .map_lookup_elem = htab_lru_percpu_map_lookup_elem, .map_update_elem = htab_lru_percpu_map_update_elem, .map_delete_elem = htab_lru_map_delete_elem, + .map_seq_show_elem = htab_percpu_map_seq_show_elem, }; -static struct bpf_map *fd_htab_map_alloc(union bpf_attr *attr) +static int fd_htab_map_alloc_check(union bpf_attr *attr) { if (attr->value_size != sizeof(u32)) - return ERR_PTR(-EINVAL); - return htab_map_alloc(attr); + return -EINVAL; + return htab_map_alloc_check(attr); } static void fd_htab_map_free(struct bpf_map *map) @@ -1322,7 +1433,7 @@ static void fd_htab_map_free(struct bpf_map *map) hlist_nulls_for_each_entry_safe(l, n, head, hash_node) { void *ptr = fd_htab_map_get_ptr(map, l); - map->ops->map_fd_put_ptr(map, ptr, false); + map->ops->map_fd_put_ptr(ptr); } } @@ -1363,7 +1474,7 @@ int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file, ret = htab_map_update_elem(map, key, &ptr, map_flags); if (ret) - map->ops->map_fd_put_ptr(map, ptr, false); + map->ops->map_fd_put_ptr(ptr); return ret; } @@ -1376,7 +1487,7 @@ static struct bpf_map *htab_of_map_alloc(union bpf_attr *attr) if (IS_ERR(inner_map_meta)) return inner_map_meta; - map = fd_htab_map_alloc(attr); + map = htab_map_alloc(attr); if (IS_ERR(map)) { bpf_map_meta_free(inner_map_meta); return map; @@ -1403,7 +1514,9 @@ static u32 htab_of_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn = insn_buf; const int ret = BPF_REG_0; - *insn++ = BPF_EMIT_CALL((u64 (*)(u64, u64, u64, u64, u64))__htab_map_lookup_elem); + BUILD_BUG_ON(!__same_type(&__htab_map_lookup_elem, + (void *(*)(struct bpf_map *map, void *key))NULL)); + *insn++ = BPF_EMIT_CALL(BPF_CAST_CALL(__htab_map_lookup_elem)); *insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 2); *insn++ = BPF_ALU64_IMM(BPF_ADD, ret, offsetof(struct htab_elem, key) + @@ -1420,6 +1533,7 @@ static void htab_of_map_free(struct bpf_map *map) } const struct bpf_map_ops htab_of_maps_map_ops = { + .map_alloc_check = fd_htab_map_alloc_check, .map_alloc = htab_of_map_alloc, .map_free = htab_of_map_free, .map_get_next_key = htab_map_get_next_key, diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index a2cf6744d88a..5859edfb5ad9 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1,13 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include @@ -18,6 +10,9 @@ #include #include #include +#include + +#include "../../lib/kstrtox.h" /* If kernel subsystem is allowing eBPF programs to call this function, * inside its own verifier_ops->get_func_proto() callback it should return @@ -76,6 +71,47 @@ const struct bpf_func_proto bpf_map_delete_elem_proto = { .arg2_type = ARG_PTR_TO_MAP_KEY, }; +BPF_CALL_3(bpf_map_push_elem, struct bpf_map *, map, void *, value, u64, flags) +{ + return map->ops->map_push_elem(map, value, flags); +} + +const struct bpf_func_proto bpf_map_push_elem_proto = { + .func = bpf_map_push_elem, + .gpl_only = false, + .pkt_access = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_MAP_VALUE, + .arg3_type = ARG_ANYTHING, +}; + +BPF_CALL_2(bpf_map_pop_elem, struct bpf_map *, map, void *, value) +{ + return map->ops->map_pop_elem(map, value); +} + +const struct bpf_func_proto bpf_map_pop_elem_proto = { + .func = bpf_map_pop_elem, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_UNINIT_MAP_VALUE, +}; + +BPF_CALL_2(bpf_map_peek_elem, struct bpf_map *, map, void *, value) +{ + return map->ops->map_peek_elem(map, value); +} + +const struct bpf_func_proto bpf_map_peek_elem_proto = { + .func = bpf_map_peek_elem, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_UNINIT_MAP_VALUE, +}; + const struct bpf_func_proto bpf_get_prandom_u32_proto = { .func = bpf_user_rnd_u32, .gpl_only = false, @@ -193,6 +229,7 @@ const struct bpf_func_proto bpf_get_current_comm_proto = { }; #if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK) + static inline void __bpf_spin_lock(struct bpf_spin_lock *lock) { arch_spinlock_t *l = (void *)lock; @@ -200,6 +237,7 @@ static inline void __bpf_spin_lock(struct bpf_spin_lock *lock) __u32 val; arch_spinlock_t lock; } u = { .lock = __ARCH_SPIN_LOCK_UNLOCKED }; + compiletime_assert(u.val == 0, "__ARCH_SPIN_LOCK_UNLOCKED not 0"); BUILD_BUG_ON(sizeof(*l) != sizeof(__u32)); BUILD_BUG_ON(sizeof(*lock) != sizeof(__u32)); @@ -209,13 +247,16 @@ static inline void __bpf_spin_lock(struct bpf_spin_lock *lock) static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock) { arch_spinlock_t *l = (void *)lock; + arch_spin_unlock(l); } #else + static inline void __bpf_spin_lock(struct bpf_spin_lock *lock) { atomic_t *l = (void *)lock; + BUILD_BUG_ON(sizeof(*l) != sizeof(*lock)); do { atomic_cond_read_relaxed(l, !VAL); @@ -225,17 +266,26 @@ static inline void __bpf_spin_lock(struct bpf_spin_lock *lock) static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock) { atomic_t *l = (void *)lock; + atomic_set_release(l, 0); } + #endif static DEFINE_PER_CPU(unsigned long, irqsave_flags); -notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock) + +static inline void __bpf_spin_lock_irqsave(struct bpf_spin_lock *lock) { unsigned long flags; + local_irq_save(flags); __bpf_spin_lock(lock); __this_cpu_write(irqsave_flags, flags); +} + +NOTRACE_BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock) +{ + __bpf_spin_lock_irqsave(lock); return 0; } @@ -246,12 +296,18 @@ const struct bpf_func_proto bpf_spin_lock_proto = { .arg1_type = ARG_PTR_TO_SPIN_LOCK, }; -notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock) +static inline void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock) { unsigned long flags; + flags = __this_cpu_read(irqsave_flags); __bpf_spin_unlock(lock); local_irq_restore(flags); +} + +NOTRACE_BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock) +{ + __bpf_spin_unlock_irqrestore(lock); return 0; } @@ -266,14 +322,15 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, bool lock_src) { struct bpf_spin_lock *lock; + if (lock_src) lock = src + map->spin_lock_off; else lock = dst + map->spin_lock_off; preempt_disable(); - ____bpf_spin_lock(lock); + __bpf_spin_lock_irqsave(lock); copy_map_value(map, dst, src); - ____bpf_spin_unlock(lock); + __bpf_spin_unlock_irqrestore(lock); preempt_enable(); } @@ -281,17 +338,20 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, BPF_CALL_0(bpf_get_current_cgroup_id) { struct cgroup *cgrp = task_dfl_cgroup(current); + return cgrp->kn->id.id; } + const struct bpf_func_proto bpf_get_current_cgroup_id_proto = { .func = bpf_get_current_cgroup_id, .gpl_only = false, .ret_type = RET_INTEGER, }; -#endif #ifdef CONFIG_CGROUP_BPF -DECLARE_PER_CPU(void*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); +DECLARE_PER_CPU(struct bpf_cgroup_storage*, + bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); + BPF_CALL_2(bpf_get_local_storage, struct bpf_map *, map, u64, flags) { /* flags argument is not used now, @@ -308,8 +368,10 @@ BPF_CALL_2(bpf_get_local_storage, struct bpf_map *, map, u64, flags) ptr = &READ_ONCE(storage->buf)->data[0]; else ptr = this_cpu_ptr(storage->percpu_buf); + return (unsigned long)ptr; } + const struct bpf_func_proto bpf_get_local_storage_proto = { .func = bpf_get_local_storage, .gpl_only = false, @@ -318,3 +380,132 @@ const struct bpf_func_proto bpf_get_local_storage_proto = { .arg2_type = ARG_ANYTHING, }; #endif + +#define BPF_STRTOX_BASE_MASK 0x1F + +static int __bpf_strtoull(const char *buf, size_t buf_len, u64 flags, + unsigned long long *res, bool *is_negative) +{ + unsigned int base = flags & BPF_STRTOX_BASE_MASK; + const char *cur_buf = buf; + size_t cur_len = buf_len; + unsigned int consumed; + size_t val_len; + char str[64]; + + if (!buf || !buf_len || !res || !is_negative) + return -EINVAL; + + if (base != 0 && base != 8 && base != 10 && base != 16) + return -EINVAL; + + if (flags & ~BPF_STRTOX_BASE_MASK) + return -EINVAL; + + while (cur_buf < buf + buf_len && isspace(*cur_buf)) + ++cur_buf; + + *is_negative = (cur_buf < buf + buf_len && *cur_buf == '-'); + if (*is_negative) + ++cur_buf; + + consumed = cur_buf - buf; + cur_len -= consumed; + if (!cur_len) + return -EINVAL; + + cur_len = min(cur_len, sizeof(str) - 1); + memcpy(str, cur_buf, cur_len); + str[cur_len] = '\0'; + cur_buf = str; + + cur_buf = _parse_integer_fixup_radix(cur_buf, &base); + val_len = _parse_integer(cur_buf, base, res); + + if (val_len & KSTRTOX_OVERFLOW) + return -ERANGE; + + if (val_len == 0) + return -EINVAL; + + cur_buf += val_len; + consumed += cur_buf - str; + + return consumed; +} + +static int __bpf_strtoll(const char *buf, size_t buf_len, u64 flags, + long long *res) +{ + unsigned long long _res; + bool is_negative; + int err; + + err = __bpf_strtoull(buf, buf_len, flags, &_res, &is_negative); + if (err < 0) + return err; + if (is_negative) { + if ((long long)-_res > 0) + return -ERANGE; + *res = -_res; + } else { + if ((long long)_res < 0) + return -ERANGE; + *res = _res; + } + return err; +} + +BPF_CALL_4(bpf_strtol, const char *, buf, size_t, buf_len, u64, flags, + s64 *, res) +{ + long long _res; + int err; + + err = __bpf_strtoll(buf, buf_len, flags, &_res); + if (err < 0) + return err; + if (_res != (long)_res) + return -ERANGE; + *res = _res; + return err; +} + +const struct bpf_func_proto bpf_strtol_proto = { + .func = bpf_strtol, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_MEM, + .arg2_type = ARG_CONST_SIZE, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_LONG, +}; + +BPF_CALL_4(bpf_strtoul, const char *, buf, size_t, buf_len, u64, flags, + u64 *, res) +{ + unsigned long long _res; + bool is_negative; + int err; + + err = __bpf_strtoull(buf, buf_len, flags, &_res, &is_negative); + if (err < 0) + return err; + if (is_negative) + return -EINVAL; + if (_res != (unsigned long)_res) + return -ERANGE; + *res = _res; + return err; +} + +const struct bpf_func_proto bpf_strtoul_proto = { + .func = bpf_strtoul, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_MEM, + .arg2_type = ARG_CONST_SIZE, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_LONG, +}; +#endif diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 0f27179d8358..b9e5fdfa2be4 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Minimal file system backend for holding eBPF maps and programs, * used by bpf(2) object pinning. @@ -5,10 +6,6 @@ * Authors: * * Daniel Borkmann - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * version 2 as published by the Free Software Foundation. */ #include @@ -154,14 +151,17 @@ struct map_iter { void *key; bool done; }; + static struct map_iter *map_iter(struct seq_file *m) { return m->private; } + static struct bpf_map *seq_file_to_map(struct seq_file *m) { return file_inode(m->file)->i_private; } + static void map_iter_free(struct map_iter *iter) { if (iter) { @@ -169,87 +169,116 @@ static void map_iter_free(struct map_iter *iter) kfree(iter); } } + static struct map_iter *map_iter_alloc(struct bpf_map *map) { struct map_iter *iter; + iter = kzalloc(sizeof(*iter), GFP_KERNEL | __GFP_NOWARN); if (!iter) goto error; + iter->key = kzalloc(map->key_size, GFP_KERNEL | __GFP_NOWARN); if (!iter->key) goto error; + return iter; + error: map_iter_free(iter); return NULL; } + static void *map_seq_next(struct seq_file *m, void *v, loff_t *pos) { struct bpf_map *map = seq_file_to_map(m); void *key = map_iter(m)->key; + void *prev_key; + + (*pos)++; if (map_iter(m)->done) return NULL; + if (unlikely(v == SEQ_START_TOKEN)) - goto done; - if (map->ops->map_get_next_key(map, key, key)) { + prev_key = NULL; + else + prev_key = key; + + rcu_read_lock(); + if (map->ops->map_get_next_key(map, prev_key, key)) { map_iter(m)->done = true; - return NULL; + key = NULL; } -done: - ++(*pos); + rcu_read_unlock(); return key; } + static void *map_seq_start(struct seq_file *m, loff_t *pos) { if (map_iter(m)->done) return NULL; + return *pos ? map_iter(m)->key : SEQ_START_TOKEN; } + static void map_seq_stop(struct seq_file *m, void *v) { } + static int map_seq_show(struct seq_file *m, void *v) { struct bpf_map *map = seq_file_to_map(m); void *key = map_iter(m)->key; + if (unlikely(v == SEQ_START_TOKEN)) { seq_puts(m, "# WARNING!! The output is for debug purpose only\n"); seq_puts(m, "# WARNING!! The output format will change\n"); } else { map->ops->map_seq_show_elem(map, key, m); } + return 0; } + static const struct seq_operations bpffs_map_seq_ops = { .start = map_seq_start, .next = map_seq_next, .show = map_seq_show, .stop = map_seq_stop, }; + static int bpffs_map_open(struct inode *inode, struct file *file) { struct bpf_map *map = inode->i_private; struct map_iter *iter; struct seq_file *m; int err; + iter = map_iter_alloc(map); if (!iter) return -ENOMEM; + err = seq_open(file, &bpffs_map_seq_ops); if (err) { map_iter_free(iter); return err; } + m = file->private_data; m->private = iter; + return 0; } + static int bpffs_map_release(struct inode *inode, struct file *file) { struct seq_file *m = file->private_data; + map_iter_free(map_iter(m)); + return seq_release(inode, file); } + /* bpffs_map_fops should only implement the basic * read operation for a BPF map. The purpose is to * provide a simple user intuitive way to do @@ -266,6 +295,15 @@ static const struct file_operations bpffs_map_fops = { .release = bpffs_map_release, }; +static int bpffs_obj_open(struct inode *inode, struct file *file) +{ + return -EIO; +} + +static const struct file_operations bpffs_obj_fops = { + .open = bpffs_obj_open, +}; + static int bpf_mkobj_ops(struct dentry *dentry, umode_t mode, void *raw, const struct inode_operations *iops, const struct file_operations *fops) @@ -276,8 +314,8 @@ static int bpf_mkobj_ops(struct dentry *dentry, umode_t mode, void *raw, return PTR_ERR(inode); inode->i_op = iops; - inode->i_private = raw; inode->i_fop = fops; + inode->i_private = raw; bpf_dentry_finalize(dentry, inode, dir); return 0; @@ -285,19 +323,25 @@ static int bpf_mkobj_ops(struct dentry *dentry, umode_t mode, void *raw, static int bpf_mkprog(struct dentry *dentry, umode_t mode, void *arg) { - return bpf_mkobj_ops(dentry, mode, arg, &bpf_prog_iops, NULL); + return bpf_mkobj_ops(dentry, mode, arg, &bpf_prog_iops, + &bpffs_obj_fops); } static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg) { struct bpf_map *map = arg; + return bpf_mkobj_ops(dentry, mode, arg, &bpf_map_iops, - map->btf ? &bpffs_map_fops : NULL); + bpf_map_support_seq_show(map) ? + &bpffs_map_fops : &bpffs_obj_fops); } static struct dentry * bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags) { + /* Dots in names (e.g. "/sys/fs/bpf/foo.bar") are reserved for future + * extensions. + */ if (strchr(dentry->d_name.name, '.')) return ERR_PTR(-EPERM); @@ -350,6 +394,7 @@ static int bpf_obj_do_pin(const struct filename *pathname, void *raw, return PTR_ERR(dentry); mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask()); + ret = security_path_mknod(&path, dentry, mode, 0); if (ret) goto out; @@ -395,13 +440,6 @@ int bpf_obj_pin_user(u32 ufd, const char __user *pathname) ret = bpf_obj_do_pin(pname, raw, type); if (ret != 0) bpf_any_put(raw, type); - if ((trace_bpf_obj_pin_prog_enabled() || - trace_bpf_obj_pin_map_enabled()) && !ret) { - if (type == BPF_TYPE_PROG) - trace_bpf_obj_pin_prog(raw, ufd, pname); - if (type == BPF_TYPE_MAP) - trace_bpf_obj_pin_map(raw, ufd, pname); - } out: putname(pname); return ret; @@ -468,15 +506,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags) else goto out; - if (ret < 0) { + if (ret < 0) bpf_any_put(raw, type); - } else if (trace_bpf_obj_get_prog_enabled() || - trace_bpf_obj_get_map_enabled()) { - if (type == BPF_TYPE_PROG) - trace_bpf_obj_get_prog(raw, ret, pname); - if (type == BPF_TYPE_MAP) - trace_bpf_obj_get_map(raw, ret, pname); - } out: putname(pname); return ret; @@ -500,6 +531,9 @@ static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type if (ret < 0) return ERR_PTR(ret); + if (!bpf_prog_get_ok(prog, &type, false)) + return ERR_PTR(-EINVAL); + return bpf_prog_inc(prog); } @@ -530,9 +564,8 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root) return 0; } -static void bpf_destroy_inode_deferred(struct rcu_head *head) +static void bpf_free_inode(struct inode *inode) { - struct inode *inode = container_of(head, struct inode, i_rcu); enum bpf_type type; if (S_ISLNK(inode->i_mode)) @@ -542,16 +575,11 @@ static void bpf_destroy_inode_deferred(struct rcu_head *head) free_inode_nonrcu(inode); } -static void bpf_destroy_inode(struct inode *inode) -{ - call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred); -} - static const struct super_operations bpf_super_ops = { .statfs = simple_statfs, .drop_inode = generic_delete_inode, .show_options = bpf_show_options, - .destroy_inode = bpf_destroy_inode, + .free_inode = bpf_free_inode, }; enum { diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c index 08860b9963c7..addd6fdceec8 100644 --- a/kernel/bpf/local_storage.c +++ b/kernel/bpf/local_storage.c @@ -9,12 +9,12 @@ #include #include -DEFINE_PER_CPU(void*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); +DEFINE_PER_CPU(struct bpf_cgroup_storage*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); #ifdef CONFIG_CGROUP_BPF #define LOCAL_STORAGE_CREATE_FLAG_MASK \ - (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) + (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK) struct bpf_cgroup_storage_map { struct bpf_map map; @@ -152,12 +152,14 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key, } new = kmalloc_node(sizeof(struct bpf_storage_buffer) + - map->value_size, __GFP_ZERO | GFP_USER, + map->value_size, + __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN, map->numa_node); if (!new) return -ENOMEM; memcpy(&new->data[0], value, map->value_size); + check_and_init_map_lock(map, new->data); new = xchg(&storage->buf, new); kfree_rcu(new, rcu); @@ -173,12 +175,14 @@ int bpf_percpu_cgroup_storage_copy(struct bpf_map *_map, void *_key, struct bpf_cgroup_storage *storage; int cpu, off = 0; u32 size; + rcu_read_lock(); storage = cgroup_storage_lookup(map, key, false); if (!storage) { rcu_read_unlock(); return -ENOENT; } + /* per_cpu areas are zero-filled and bpf programs can only * access 'value_size' of them, so copying rounded areas * will not leak any kernel data @@ -201,14 +205,17 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *_map, void *_key, struct bpf_cgroup_storage *storage; int cpu, off = 0; u32 size; + if (map_flags != BPF_ANY && map_flags != BPF_EXIST) return -EINVAL; + rcu_read_lock(); storage = cgroup_storage_lookup(map, key, false); if (!storage) { rcu_read_unlock(); return -ENOENT; } + /* the user space will provide round_up(value_size, 8) bytes that * will be copied into per-cpu area. bpf programs can only access * value_size of it. During lookup the same extra bytes will be @@ -265,28 +272,38 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr) { int numa_node = bpf_map_attr_numa_node(attr); struct bpf_cgroup_storage_map *map; + struct bpf_map_memory mem; + int ret; if (attr->key_size != sizeof(struct bpf_cgroup_storage_key)) return ERR_PTR(-EINVAL); + if (attr->value_size == 0) + return ERR_PTR(-EINVAL); + if (attr->value_size > PAGE_SIZE) return ERR_PTR(-E2BIG); - if (attr->map_flags & ~LOCAL_STORAGE_CREATE_FLAG_MASK) - /* reserved bits should not be used */ + if (attr->map_flags & ~LOCAL_STORAGE_CREATE_FLAG_MASK || + !bpf_map_flags_access_ok(attr->map_flags)) return ERR_PTR(-EINVAL); if (attr->max_entries) /* max_entries is not used and enforced to be 0 */ return ERR_PTR(-EINVAL); + ret = bpf_map_charge_init(&mem, sizeof(struct bpf_cgroup_storage_map)); + if (ret < 0) + return ERR_PTR(ret); + map = kmalloc_node(sizeof(struct bpf_cgroup_storage_map), __GFP_ZERO | GFP_USER, numa_node); - if (!map) + if (!map) { + bpf_map_charge_finish(&mem); return ERR_PTR(-ENOMEM); + } - map->map.pages = round_up(sizeof(struct bpf_cgroup_storage_map), - PAGE_SIZE) >> PAGE_SHIFT; + bpf_map_charge_move(&map->map.memory, &mem); /* copy mandatory map attributes */ bpf_map_init_from_attr(&map->map, attr); @@ -328,12 +345,14 @@ static int cgroup_storage_check_btf(const struct bpf_map *map, * __u32 attach_type; * }; */ + /* * Key_type must be a structure with two fields. */ if (BTF_INFO_KIND(key_type->info) != BTF_KIND_STRUCT || BTF_INFO_VLEN(key_type->info) != 2) return -EINVAL; + /* * The first field must be a 64 bit integer at 0 offset. */ @@ -341,13 +360,13 @@ static int cgroup_storage_check_btf(const struct bpf_map *map, size = FIELD_SIZEOF(struct bpf_cgroup_storage_key, cgroup_inode_id); if (!btf_member_is_reg_int(btf, key_type, m, 0, size)) return -EINVAL; + /* * The second field must be a 32 bit integer at 64 bit offset. */ m++; offset = offsetof(struct bpf_cgroup_storage_key, attach_type); size = FIELD_SIZEOF(struct bpf_cgroup_storage_key, attach_type); - if (!btf_member_is_reg_int(btf, key_type, m, offset, size)) return -EINVAL; @@ -361,9 +380,9 @@ static void cgroup_storage_seq_show_elem(struct bpf_map *map, void *_key, struct bpf_cgroup_storage_key *key = _key; struct bpf_cgroup_storage *storage; int cpu; + rcu_read_lock(); storage = cgroup_storage_lookup(map_to_storage(map), key, false); - if (!storage) { rcu_read_unlock(); return; @@ -371,7 +390,6 @@ static void cgroup_storage_seq_show_elem(struct bpf_map *map, void *_key, btf_type_seq_show(map->btf, map->btf_key_type_id, key, m); stype = cgroup_storage_type(map); - if (stype == BPF_CGROUP_STORAGE_SHARED) { seq_puts(m, ": "); btf_type_seq_show(map->btf, map->btf_value_type_id, @@ -442,6 +460,7 @@ void bpf_cgroup_storage_release(struct bpf_prog *prog, struct bpf_map *_map) static size_t bpf_cgroup_storage_calculate_size(struct bpf_map *map, u32 *pages) { size_t size; + if (cgroup_storage_type(map) == BPF_CGROUP_STORAGE_SHARED) { size = sizeof(struct bpf_storage_buffer) + map->value_size; *pages = round_up(sizeof(struct bpf_cgroup_storage) + size, @@ -451,6 +470,7 @@ static size_t bpf_cgroup_storage_calculate_size(struct bpf_map *map, u32 *pages) *pages = round_up(round_up(size, 8) * num_possible_cpus(), PAGE_SIZE) >> PAGE_SHIFT; } + return size; } @@ -468,20 +488,22 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog, return NULL; size = bpf_cgroup_storage_calculate_size(map, &pages); + if (bpf_map_charge_memlock(map, pages)) return ERR_PTR(-EPERM); storage = kmalloc_node(sizeof(struct bpf_cgroup_storage), __GFP_ZERO | GFP_USER, map->numa_node); - if (!storage) goto enomem; flags = __GFP_ZERO | GFP_USER; + if (stype == BPF_CGROUP_STORAGE_SHARED) { storage->buf = kmalloc_node(size, flags, map->numa_node); if (!storage->buf) goto enomem; + check_and_init_map_lock(map, storage->buf->data); } else { storage->percpu_buf = __alloc_percpu_gfp(size, 8, flags); if (!storage->percpu_buf) @@ -491,6 +513,7 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog, storage->map = (struct bpf_cgroup_storage_map *)map; return storage; + enomem: bpf_map_uncharge_memlock(map, pages); kfree(storage); @@ -501,6 +524,7 @@ static void free_shared_cgroup_storage_rcu(struct rcu_head *rcu) { struct bpf_cgroup_storage *storage = container_of(rcu, struct bpf_cgroup_storage, rcu); + kfree(storage->buf); kfree(storage); } @@ -509,6 +533,7 @@ static void free_percpu_cgroup_storage_rcu(struct rcu_head *rcu) { struct bpf_cgroup_storage *storage = container_of(rcu, struct bpf_cgroup_storage, rcu); + free_percpu(storage->percpu_buf); kfree(storage); } @@ -523,6 +548,7 @@ void bpf_cgroup_storage_free(struct bpf_cgroup_storage *storage) return; map = &storage->map->map; + bpf_cgroup_storage_calculate_size(map, &pages); bpf_map_uncharge_memlock(map, pages); diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 3c1877c56d6c..f726ceb8d7e9 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -1,15 +1,13 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Longest prefix match list implementation * * Copyright (c) 2016,2017 Daniel Mack * Copyright (c) 2016 David Herrmann - * - * This file is subject to the terms and conditions of version 2 of the GNU - * General Public License. See the file COPYING in the main directory of the - * Linux distribution for more details. */ #include +#include #include #include #include @@ -167,20 +165,59 @@ static size_t longest_prefix_match(const struct lpm_trie *trie, const struct lpm_trie_node *node, const struct bpf_lpm_trie_key *key) { - size_t prefixlen = 0; - size_t i; + u32 limit = min(node->prefixlen, key->prefixlen); + u32 prefixlen = 0, i = 0; - for (i = 0; i < trie->data_size; i++) { - size_t b; + BUILD_BUG_ON(offsetof(struct lpm_trie_node, data) % sizeof(u32)); + BUILD_BUG_ON(offsetof(struct bpf_lpm_trie_key, data) % sizeof(u32)); - b = 8 - fls(node->data[i] ^ key->data[i]); - prefixlen += b; +#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && defined(CONFIG_64BIT) - if (prefixlen >= node->prefixlen || prefixlen >= key->prefixlen) - return min(node->prefixlen, key->prefixlen); + /* data_size >= 16 has very small probability. + * We do not use a loop for optimal code generation. + */ + if (trie->data_size >= 8) { + u64 diff = be64_to_cpu(*(__be64 *)node->data ^ + *(__be64 *)key->data); - if (b < 8) - break; + prefixlen = 64 - fls64(diff); + if (prefixlen >= limit) + return limit; + if (diff) + return prefixlen; + i = 8; + } +#endif + + while (trie->data_size >= i + 4) { + u32 diff = be32_to_cpu(*(__be32 *)&node->data[i] ^ + *(__be32 *)&key->data[i]); + + prefixlen += 32 - fls(diff); + if (prefixlen >= limit) + return limit; + if (diff) + return prefixlen; + i += 4; + } + + if (trie->data_size >= i + 2) { + u16 diff = be16_to_cpu(*(__be16 *)&node->data[i] ^ + *(__be16 *)&key->data[i]); + + prefixlen += 16 - fls(diff); + if (prefixlen >= limit) + return limit; + if (diff) + return prefixlen; + i += 2; + } + + if (trie->data_size >= i + 1) { + prefixlen += 8 - fls(node->data[i] ^ key->data[i]); + + if (prefixlen >= limit) + return limit; } return prefixlen; @@ -327,6 +364,10 @@ static int trie_update_elem(struct bpf_map *map, * simply assign the @new_node to that slot and be done. */ if (!node) { + if (flags == BPF_EXIST) { + ret = -ENOENT; + goto out; + } rcu_assign_pointer(*slot, new_node); goto out; } @@ -335,18 +376,31 @@ static int trie_update_elem(struct bpf_map *map, * which already has the correct data array set. */ if (node->prefixlen == matchlen) { + if (!(node->flags & LPM_TREE_NODE_FLAG_IM)) { + if (flags == BPF_NOEXIST) { + ret = -EEXIST; + goto out; + } + trie->n_entries--; + } else if (flags == BPF_EXIST) { + ret = -ENOENT; + goto out; + } + new_node->child[0] = node->child[0]; new_node->child[1] = node->child[1]; - if (!(node->flags & LPM_TREE_NODE_FLAG_IM)) - trie->n_entries--; - rcu_assign_pointer(*slot, new_node); kfree_rcu(node, rcu); goto out; } + if (flags == BPF_EXIST) { + ret = -ENOENT; + goto out; + } + /* If the new node matches the prefix completely, it must be inserted * as an ancestor. Simply insert it between @node and *@slot. */ @@ -393,10 +447,100 @@ out: return ret; } -static int trie_delete_elem(struct bpf_map *map, void *key) +/* Called from syscall or from eBPF program */ +static int trie_delete_elem(struct bpf_map *map, void *_key) { - /* TODO */ - return -ENOSYS; + struct lpm_trie *trie = container_of(map, struct lpm_trie, map); + struct bpf_lpm_trie_key *key = _key; + struct lpm_trie_node __rcu **trim, **trim2; + struct lpm_trie_node *node, *parent; + unsigned long irq_flags; + unsigned int next_bit; + size_t matchlen = 0; + int ret = 0; + + if (key->prefixlen > trie->max_prefixlen) + return -EINVAL; + + raw_spin_lock_irqsave(&trie->lock, irq_flags); + + /* Walk the tree looking for an exact key/length match and keeping + * track of the path we traverse. We will need to know the node + * we wish to delete, and the slot that points to the node we want + * to delete. We may also need to know the nodes parent and the + * slot that contains it. + */ + trim = &trie->root; + trim2 = trim; + parent = NULL; + while ((node = rcu_dereference_protected( + *trim, lockdep_is_held(&trie->lock)))) { + matchlen = longest_prefix_match(trie, node, key); + + if (node->prefixlen != matchlen || + node->prefixlen == key->prefixlen) + break; + + parent = node; + trim2 = trim; + next_bit = extract_bit(key->data, node->prefixlen); + trim = &node->child[next_bit]; + } + + if (!node || node->prefixlen != key->prefixlen || + node->prefixlen != matchlen || + (node->flags & LPM_TREE_NODE_FLAG_IM)) { + ret = -ENOENT; + goto out; + } + + trie->n_entries--; + + /* If the node we are removing has two children, simply mark it + * as intermediate and we are done. + */ + if (rcu_access_pointer(node->child[0]) && + rcu_access_pointer(node->child[1])) { + node->flags |= LPM_TREE_NODE_FLAG_IM; + goto out; + } + + /* If the parent of the node we are about to delete is an intermediate + * node, and the deleted node doesn't have any children, we can delete + * the intermediate parent as well and promote its other child + * up the tree. Doing this maintains the invariant that all + * intermediate nodes have exactly 2 children and that there are no + * unnecessary intermediate nodes in the tree. + */ + if (parent && (parent->flags & LPM_TREE_NODE_FLAG_IM) && + !node->child[0] && !node->child[1]) { + if (node == rcu_access_pointer(parent->child[0])) + rcu_assign_pointer( + *trim2, rcu_access_pointer(parent->child[1])); + else + rcu_assign_pointer( + *trim2, rcu_access_pointer(parent->child[0])); + kfree_rcu(parent, rcu); + kfree_rcu(node, rcu); + goto out; + } + + /* The node we are removing has either zero or one child. If there + * is a child, move it into the removed node's slot then delete + * the node. Otherwise just clear the slot and delete the node. + */ + if (node->child[0]) + rcu_assign_pointer(*trim, rcu_access_pointer(node->child[0])); + else if (node->child[1]) + rcu_assign_pointer(*trim, rcu_access_pointer(node->child[1])); + else + RCU_INIT_POINTER(*trim, NULL); + kfree_rcu(node, rcu); + +out: + raw_spin_unlock_irqrestore(&trie->lock, irq_flags); + + return ret; } #define LPM_DATA_SIZE_MAX 256 @@ -411,7 +555,7 @@ static int trie_delete_elem(struct bpf_map *map, void *key) #define LPM_KEY_SIZE_MIN LPM_KEY_SIZE(LPM_DATA_SIZE_MIN) #define LPM_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_NUMA_NODE | \ - BPF_F_RDONLY | BPF_F_WRONLY) + BPF_F_ACCESS_MASK) static struct bpf_map *trie_alloc(union bpf_attr *attr) { @@ -426,6 +570,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr) if (attr->max_entries == 0 || !(attr->map_flags & BPF_F_NO_PREALLOC) || attr->map_flags & ~LPM_CREATE_FLAG_MASK || + !bpf_map_flags_access_ok(attr->map_flags) || attr->key_size < LPM_KEY_SIZE_MIN || attr->key_size > LPM_KEY_SIZE_MAX || attr->value_size < LPM_VAL_SIZE_MIN || @@ -436,6 +581,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr) if (!trie) return ERR_PTR(-ENOMEM); + /* copy mandatory map attributes */ bpf_map_init_from_attr(&trie->map, attr); trie->data_size = attr->key_size - offsetof(struct bpf_lpm_trie_key, data); @@ -444,14 +590,8 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr) cost_per_node = sizeof(struct lpm_trie_node) + attr->value_size + trie->data_size; cost += (u64) attr->max_entries * cost_per_node; - if (cost >= U32_MAX - PAGE_SIZE) { - ret = -E2BIG; - goto out_err; - } - trie->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; - - ret = bpf_map_precharge_memlock(trie->map.pages); + ret = bpf_map_charge_init(&trie->map.memory, cost); if (ret) goto out_err; diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index 1ed945522f0e..fab4fb134547 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -1,8 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2017 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. */ #include #include @@ -24,7 +21,9 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) * is a runtime binding. Doing static check alone * in the verifier is not enough. */ - if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY) { + if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY || + inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE || + inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) { fdput(f); return ERR_PTR(-ENOTSUPP); } @@ -56,6 +55,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) inner_map_meta->value_size = inner_map->value_size; inner_map_meta->map_flags = inner_map->map_flags; inner_map_meta->max_entries = inner_map->max_entries; + inner_map_meta->spin_lock_off = inner_map->spin_lock_off; /* Misc members not needed in bpf_map_meta_equal() check. */ inner_map_meta->ops = inner_map->ops; @@ -106,7 +106,7 @@ void *bpf_map_fd_get_ptr(struct bpf_map *map, return inner_map; } -void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool need_defer) +void bpf_map_fd_put_ptr(void *ptr) { /* ptr->ops->map_free() has to go through one * rcu grace period by itself. diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h index 1e652a7bf60e..a507bf6ef8b9 100644 --- a/kernel/bpf/map_in_map.h +++ b/kernel/bpf/map_in_map.h @@ -1,8 +1,5 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* Copyright (c) 2017 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. */ #ifndef __MAP_IN_MAP_H__ #define __MAP_IN_MAP_H__ @@ -18,7 +15,7 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0, const struct bpf_map *meta1); void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd); -void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool need_defer); +void bpf_map_fd_put_ptr(void *ptr); u32 bpf_map_fd_sys_lookup_elem(void *ptr); #endif diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c index e60bb5c3465e..3668a0bc18ec 100644 --- a/kernel/bpf/offload.c +++ b/kernel/bpf/offload.c @@ -1,21 +1,62 @@ +/* + * Copyright (C) 2017-2018 Netronome Systems, Inc. + * + * This software is licensed under the GNU General License Version 2, + * June 1991 as shown in the file COPYING in the top-level directory of this + * source tree. + * + * THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" + * WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, + * BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS + * FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE + * OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME + * THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + */ + #include #include #include #include #include +#include #include #include #include +#include #include #include -/* Protects bpf_prog_offload_devs, bpf_map_offload_devs and offload members +/* Protects offdevs, members of bpf_offload_netdev and offload members * of all progs. * RTNL lock cannot be taken when holding this lock. */ static DECLARE_RWSEM(bpf_devs_lock); -static LIST_HEAD(bpf_prog_offload_devs); -static LIST_HEAD(bpf_map_offload_devs); + +struct bpf_offload_dev { + const struct bpf_prog_offload_ops *ops; + struct list_head netdevs; + void *priv; +}; + +struct bpf_offload_netdev { + struct rhash_head l; + struct net_device *netdev; + struct bpf_offload_dev *offdev; + struct list_head progs; + struct list_head maps; + struct list_head offdev_netdevs; +}; + +static const struct rhashtable_params offdevs_params = { + .nelem_hint = 4, + .key_len = sizeof(struct net_device *), + .key_offset = offsetof(struct bpf_offload_netdev, netdev), + .head_offset = offsetof(struct bpf_offload_netdev, l), + .automatic_shrinking = true, +}; + +static struct rhashtable offdevs; +static bool offdevs_inited; static int bpf_dev_offload_check(struct net_device *netdev) { @@ -26,13 +67,25 @@ static int bpf_dev_offload_check(struct net_device *netdev) return 0; } +static struct bpf_offload_netdev * +bpf_offload_find_netdev(struct net_device *netdev) +{ + lockdep_assert_held(&bpf_devs_lock); + + if (!offdevs_inited) + return NULL; + return rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params); +} + int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr) { + struct bpf_offload_netdev *ondev; struct bpf_prog_offload *offload; int err; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; + if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS && + attr->prog_type != BPF_PROG_TYPE_XDP) + return -EINVAL; if (attr->prog_flags) return -EINVAL; @@ -45,19 +98,19 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr) offload->netdev = dev_get_by_index(current->nsproxy->net_ns, attr->prog_ifindex); - err = bpf_dev_offload_check(offload->netdev); if (err) goto err_maybe_put; down_write(&bpf_devs_lock); - if (offload->netdev->reg_state != NETREG_REGISTERED) { + ondev = bpf_offload_find_netdev(offload->netdev); + if (!ondev) { err = -EINVAL; goto err_unlock; } - + offload->offdev = ondev->offdev; prog->aux->offload = offload; - list_add_tail(&offload->offloads, &bpf_prog_offload_devs); + list_add_tail(&offload->offloads, &ondev->progs); dev_put(offload->netdev); up_write(&bpf_devs_lock); @@ -71,38 +124,20 @@ err_maybe_put: return err; } -static int __bpf_offload_ndo(struct bpf_prog *prog, enum bpf_netdev_command cmd, - struct netdev_bpf *data) +int bpf_prog_offload_verifier_prep(struct bpf_prog *prog) { - struct net_device *netdev = prog->aux->offload->netdev; + struct bpf_prog_offload *offload; + int ret = -ENODEV; - ASSERT_RTNL(); + down_read(&bpf_devs_lock); + offload = prog->aux->offload; + if (offload) { + ret = offload->offdev->ops->prepare(prog); + offload->dev_state = !ret; + } + up_read(&bpf_devs_lock); - if (!netdev) - return -ENODEV; - - data->command = cmd; - - return netdev->netdev_ops->ndo_bpf(netdev, data); -} - -int bpf_prog_offload_verifier_prep(struct bpf_verifier_env *env) -{ - struct netdev_bpf data = {}; - int err; - - data.verifier.prog = env->prog; - - rtnl_lock(); - err = __bpf_offload_ndo(env->prog, BPF_OFFLOAD_VERIFIER_PREP, &data); - if (err) - goto exit_unlock; - - env->prog->aux->offload->dev_ops = data.verifier.ops; - env->prog->aux->offload->dev_state = true; -exit_unlock: - rtnl_unlock(); - return err; + return ret; } int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env, @@ -110,55 +145,103 @@ int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env, { struct bpf_prog_offload *offload; int ret = -ENODEV; + down_read(&bpf_devs_lock); offload = env->prog->aux->offload; - if (offload->netdev) - ret = offload->dev_ops->insn_hook(env, insn_idx, prev_insn_idx); + if (offload) + ret = offload->offdev->ops->insn_hook(env, insn_idx, + prev_insn_idx); up_read(&bpf_devs_lock); + return ret; } +int bpf_prog_offload_finalize(struct bpf_verifier_env *env) +{ + struct bpf_prog_offload *offload; + int ret = -ENODEV; + + down_read(&bpf_devs_lock); + offload = env->prog->aux->offload; + if (offload) { + if (offload->offdev->ops->finalize) + ret = offload->offdev->ops->finalize(env); + else + ret = 0; + } + up_read(&bpf_devs_lock); + + return ret; +} + +void +bpf_prog_offload_replace_insn(struct bpf_verifier_env *env, u32 off, + struct bpf_insn *insn) +{ + const struct bpf_prog_offload_ops *ops; + struct bpf_prog_offload *offload; + int ret = -EOPNOTSUPP; + + down_read(&bpf_devs_lock); + offload = env->prog->aux->offload; + if (offload) { + ops = offload->offdev->ops; + if (!offload->opt_failed && ops->replace_insn) + ret = ops->replace_insn(env, off, insn); + offload->opt_failed |= ret; + } + up_read(&bpf_devs_lock); +} + +void +bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt) +{ + struct bpf_prog_offload *offload; + int ret = -EOPNOTSUPP; + + down_read(&bpf_devs_lock); + offload = env->prog->aux->offload; + if (offload) { + if (!offload->opt_failed && offload->offdev->ops->remove_insns) + ret = offload->offdev->ops->remove_insns(env, off, cnt); + offload->opt_failed |= ret; + } + up_read(&bpf_devs_lock); +} + static void __bpf_prog_offload_destroy(struct bpf_prog *prog) { struct bpf_prog_offload *offload = prog->aux->offload; - struct netdev_bpf data = {}; - - data.offload.prog = prog; if (offload->dev_state) - WARN_ON(__bpf_offload_ndo(prog, BPF_OFFLOAD_DESTROY, &data)); + offload->offdev->ops->destroy(prog); /* Make sure BPF_PROG_GET_NEXT_ID can't find this dead program */ bpf_prog_free_id(prog, true); - offload->dev_state = false; list_del_init(&offload->offloads); - offload->netdev = NULL; + kfree(offload); + prog->aux->offload = NULL; } void bpf_prog_offload_destroy(struct bpf_prog *prog) { - struct bpf_prog_offload *offload = prog->aux->offload; - - rtnl_lock(); down_write(&bpf_devs_lock); - __bpf_prog_offload_destroy(prog); + if (prog->aux->offload) + __bpf_prog_offload_destroy(prog); up_write(&bpf_devs_lock); - rtnl_unlock(); - - kfree(offload); } static int bpf_prog_offload_translate(struct bpf_prog *prog) { - struct netdev_bpf data = {}; - int ret; + struct bpf_prog_offload *offload; + int ret = -ENODEV; - data.offload.prog = prog; - - rtnl_lock(); - ret = __bpf_offload_ndo(prog, BPF_OFFLOAD_TRANSLATE, &data); - rtnl_unlock(); + down_read(&bpf_devs_lock); + offload = prog->aux->offload; + if (offload) + ret = offload->offdev->ops->translate(prog); + up_read(&bpf_devs_lock); return ret; } @@ -215,17 +298,40 @@ int bpf_prog_offload_info_fill(struct bpf_prog_info *info, .prog = prog, .info = info, }; + struct bpf_prog_aux *aux = prog->aux; struct inode *ns_inode; struct path ns_path; - int res; + char __user *uinsns; + void *res; + u32 ulen; res = ns_get_path_cb(&ns_path, bpf_prog_offload_info_fill_ns, &args); - if (res) { + if (IS_ERR(res)) { if (!info->ifindex) return -ENODEV; - return res; + return PTR_ERR(res); } + down_read(&bpf_devs_lock); + + if (!aux->offload) { + up_read(&bpf_devs_lock); + return -ENODEV; + } + + ulen = info->jited_prog_len; + info->jited_prog_len = aux->offload->jited_len; + if (info->jited_prog_len && ulen) { + uinsns = u64_to_user_ptr(info->jited_prog_insns); + ulen = min_t(u32, info->jited_prog_len, ulen); + if (copy_to_user(uinsns, aux->offload->jited_image, ulen)) { + up_read(&bpf_devs_lock); + return -EFAULT; + } + } + + up_read(&bpf_devs_lock); + ns_inode = ns_path.dentry->d_inode; info->netns_dev = new_encode_dev(ns_inode->i_sb->s_dev); info->netns_ino = ns_inode->i_ino; @@ -234,9 +340,6 @@ int bpf_prog_offload_info_fill(struct bpf_prog_info *info, return 0; } -const struct bpf_verifier_ops bpf_offload_verifier_ops = { -}; - const struct bpf_prog_ops bpf_offload_prog_ops = { }; @@ -245,45 +348,66 @@ static int bpf_map_offload_ndo(struct bpf_offloaded_map *offmap, { struct netdev_bpf data = {}; struct net_device *netdev; + ASSERT_RTNL(); + data.command = cmd; data.offmap = offmap; /* Caller must make sure netdev is valid */ netdev = offmap->netdev; + return netdev->netdev_ops->ndo_bpf(netdev, &data); } + struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr) { struct net *net = current->nsproxy->net_ns; + struct bpf_offload_netdev *ondev; struct bpf_offloaded_map *offmap; int err; + if (!capable(CAP_SYS_ADMIN)) return ERR_PTR(-EPERM); - if (attr->map_type != BPF_MAP_TYPE_HASH) + if (attr->map_type != BPF_MAP_TYPE_ARRAY && + attr->map_type != BPF_MAP_TYPE_HASH) return ERR_PTR(-EINVAL); + offmap = kzalloc(sizeof(*offmap), GFP_USER); if (!offmap) return ERR_PTR(-ENOMEM); + bpf_map_init_from_attr(&offmap->map, attr); + rtnl_lock(); down_write(&bpf_devs_lock); offmap->netdev = __dev_get_by_index(net, attr->map_ifindex); err = bpf_dev_offload_check(offmap->netdev); if (err) goto err_unlock; + + ondev = bpf_offload_find_netdev(offmap->netdev); + if (!ondev) { + err = -EINVAL; + goto err_unlock; + } + err = bpf_map_offload_ndo(offmap, BPF_OFFLOAD_MAP_ALLOC); if (err) goto err_unlock; - list_add_tail(&offmap->offloads, &bpf_map_offload_devs); + + list_add_tail(&offmap->offloads, &ondev->maps); up_write(&bpf_devs_lock); rtnl_unlock(); + return &offmap->map; + err_unlock: up_write(&bpf_devs_lock); rtnl_unlock(); kfree(offmap); return ERR_PTR(err); } + static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap) { WARN_ON(bpf_map_offload_ndo(offmap, BPF_OFFLOAD_MAP_FREE)); @@ -292,59 +416,75 @@ static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap) list_del_init(&offmap->offloads); offmap->netdev = NULL; } + void bpf_map_offload_map_free(struct bpf_map *map) { struct bpf_offloaded_map *offmap = map_to_offmap(map); + rtnl_lock(); down_write(&bpf_devs_lock); if (offmap->netdev) __bpf_map_offload_destroy(offmap); up_write(&bpf_devs_lock); rtnl_unlock(); + kfree(offmap); } + int bpf_map_offload_lookup_elem(struct bpf_map *map, void *key, void *value) { struct bpf_offloaded_map *offmap = map_to_offmap(map); int ret = -ENODEV; + down_read(&bpf_devs_lock); if (offmap->netdev) ret = offmap->dev_ops->map_lookup_elem(offmap, key, value); up_read(&bpf_devs_lock); + return ret; } + int bpf_map_offload_update_elem(struct bpf_map *map, void *key, void *value, u64 flags) { struct bpf_offloaded_map *offmap = map_to_offmap(map); int ret = -ENODEV; + if (unlikely(flags > BPF_EXIST)) return -EINVAL; + down_read(&bpf_devs_lock); if (offmap->netdev) ret = offmap->dev_ops->map_update_elem(offmap, key, value, flags); up_read(&bpf_devs_lock); + return ret; } + int bpf_map_offload_delete_elem(struct bpf_map *map, void *key) { struct bpf_offloaded_map *offmap = map_to_offmap(map); int ret = -ENODEV; + down_read(&bpf_devs_lock); if (offmap->netdev) ret = offmap->dev_ops->map_delete_elem(offmap, key); up_read(&bpf_devs_lock); + return ret; } + int bpf_map_offload_get_next_key(struct bpf_map *map, void *key, void *next_key) { struct bpf_offloaded_map *offmap = map_to_offmap(map); int ret = -ENODEV; + down_read(&bpf_devs_lock); if (offmap->netdev) ret = offmap->dev_ops->map_get_next_key(offmap, key, next_key); up_read(&bpf_devs_lock); + return ret; } @@ -352,13 +492,16 @@ struct ns_get_path_bpf_map_args { struct bpf_offloaded_map *offmap; struct bpf_map_info *info; }; + static struct ns_common *bpf_map_offload_info_fill_ns(void *private_data) { struct ns_get_path_bpf_map_args *args = private_data; struct ns_common *ns; struct net *net; + rtnl_lock(); down_read(&bpf_devs_lock); + if (args->offmap->netdev) { args->info->ifindex = args->offmap->netdev->ifindex; net = dev_net(args->offmap->netdev); @@ -368,10 +511,13 @@ static struct ns_common *bpf_map_offload_info_fill_ns(void *private_data) args->info->ifindex = 0; ns = NULL; } + up_read(&bpf_devs_lock); rtnl_unlock(); + return ns; } + int bpf_map_offload_info_fill(struct bpf_map_info *info, struct bpf_map *map) { struct ns_get_path_bpf_map_args args = { @@ -380,83 +526,187 @@ int bpf_map_offload_info_fill(struct bpf_map_info *info, struct bpf_map *map) }; struct inode *ns_inode; struct path ns_path; - int res; + void *res; + res = ns_get_path_cb(&ns_path, bpf_map_offload_info_fill_ns, &args); - if (res) { + if (IS_ERR(res)) { if (!info->ifindex) return -ENODEV; - return res; + return PTR_ERR(res); } + ns_inode = ns_path.dentry->d_inode; info->netns_dev = new_encode_dev(ns_inode->i_sb->s_dev); info->netns_ino = ns_inode->i_ino; path_put(&ns_path); + return 0; } -bool bpf_offload_dev_match(struct bpf_prog *prog, struct bpf_map *map) +static bool __bpf_offload_dev_match(struct bpf_prog *prog, + struct net_device *netdev) { - struct bpf_offloaded_map *offmap; + struct bpf_offload_netdev *ondev1, *ondev2; struct bpf_prog_offload *offload; - bool ret; - if (!!bpf_prog_is_dev_bound(prog->aux) != !!bpf_map_is_dev_bound(map)) - return false; + if (!bpf_prog_is_dev_bound(prog->aux)) - return true; - down_read(&bpf_devs_lock); + return false; + offload = prog->aux->offload; - offmap = map_to_offmap(map); - ret = offload && offload->netdev == offmap->netdev; - up_read(&bpf_devs_lock); - return ret; -} -static void bpf_offload_orphan_all_progs(struct net_device *netdev) -{ - struct bpf_prog_offload *offload, *tmp; - list_for_each_entry_safe(offload, tmp, &bpf_prog_offload_devs, offloads) - if (offload->netdev == netdev) - __bpf_prog_offload_destroy(offload->prog); -} -static void bpf_offload_orphan_all_maps(struct net_device *netdev) -{ - struct bpf_offloaded_map *offmap, *tmp; - list_for_each_entry_safe(offmap, tmp, &bpf_map_offload_devs, offloads) - if (offmap->netdev == netdev) - __bpf_map_offload_destroy(offmap); + if (!offload) + return false; + if (offload->netdev == netdev) + return true; + + ondev1 = bpf_offload_find_netdev(offload->netdev); + ondev2 = bpf_offload_find_netdev(netdev); + + return ondev1 && ondev2 && ondev1->offdev == ondev2->offdev; } -static int bpf_offload_notification(struct notifier_block *notifier, - ulong event, void *ptr) +bool bpf_offload_dev_match(struct bpf_prog *prog, struct net_device *netdev) { - struct net_device *netdev = netdev_notifier_info_to_dev(ptr); + bool ret; + + down_read(&bpf_devs_lock); + ret = __bpf_offload_dev_match(prog, netdev); + up_read(&bpf_devs_lock); + + return ret; +} +EXPORT_SYMBOL_GPL(bpf_offload_dev_match); + +bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map) +{ + struct bpf_offloaded_map *offmap; + bool ret; + + if (!bpf_map_is_dev_bound(map)) + return bpf_map_offload_neutral(map); + offmap = map_to_offmap(map); + + down_read(&bpf_devs_lock); + ret = __bpf_offload_dev_match(prog, offmap->netdev); + up_read(&bpf_devs_lock); + + return ret; +} + +int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev, + struct net_device *netdev) +{ + struct bpf_offload_netdev *ondev; + int err; + + ondev = kzalloc(sizeof(*ondev), GFP_KERNEL); + if (!ondev) + return -ENOMEM; + + ondev->netdev = netdev; + ondev->offdev = offdev; + INIT_LIST_HEAD(&ondev->progs); + INIT_LIST_HEAD(&ondev->maps); + + down_write(&bpf_devs_lock); + err = rhashtable_insert_fast(&offdevs, &ondev->l, offdevs_params); + if (err) { + netdev_warn(netdev, "failed to register for BPF offload\n"); + goto err_unlock_free; + } + + list_add(&ondev->offdev_netdevs, &offdev->netdevs); + up_write(&bpf_devs_lock); + return 0; + +err_unlock_free: + up_write(&bpf_devs_lock); + kfree(ondev); + return err; +} +EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_register); + +void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev, + struct net_device *netdev) +{ + struct bpf_offload_netdev *ondev, *altdev; + struct bpf_offloaded_map *offmap, *mtmp; + struct bpf_prog_offload *offload, *ptmp; ASSERT_RTNL(); - switch (event) { - case NETDEV_UNREGISTER: - /* ignore namespace changes */ - if (netdev->reg_state != NETREG_UNREGISTERING) - break; + down_write(&bpf_devs_lock); + ondev = rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params); + if (WARN_ON(!ondev)) + goto unlock; - down_write(&bpf_devs_lock); - bpf_offload_orphan_all_progs(netdev); - bpf_offload_orphan_all_maps(netdev); - up_write(&bpf_devs_lock); - break; - default: - break; + WARN_ON(rhashtable_remove_fast(&offdevs, &ondev->l, offdevs_params)); + list_del(&ondev->offdev_netdevs); + + /* Try to move the objects to another netdev of the device */ + altdev = list_first_entry_or_null(&offdev->netdevs, + struct bpf_offload_netdev, + offdev_netdevs); + if (altdev) { + list_for_each_entry(offload, &ondev->progs, offloads) + offload->netdev = altdev->netdev; + list_splice_init(&ondev->progs, &altdev->progs); + + list_for_each_entry(offmap, &ondev->maps, offloads) + offmap->netdev = altdev->netdev; + list_splice_init(&ondev->maps, &altdev->maps); + } else { + list_for_each_entry_safe(offload, ptmp, &ondev->progs, offloads) + __bpf_prog_offload_destroy(offload->prog); + list_for_each_entry_safe(offmap, mtmp, &ondev->maps, offloads) + __bpf_map_offload_destroy(offmap); } - return NOTIFY_OK; + + WARN_ON(!list_empty(&ondev->progs)); + WARN_ON(!list_empty(&ondev->maps)); + kfree(ondev); +unlock: + up_write(&bpf_devs_lock); } +EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_unregister); -static struct notifier_block bpf_offload_notifier = { - .notifier_call = bpf_offload_notification, -}; - -static int __init bpf_offload_init(void) +struct bpf_offload_dev * +bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv) { - register_netdevice_notifier(&bpf_offload_notifier); - return 0; -} + struct bpf_offload_dev *offdev; + int err; -subsys_initcall(bpf_offload_init); + down_write(&bpf_devs_lock); + if (!offdevs_inited) { + err = rhashtable_init(&offdevs, &offdevs_params); + if (err) { + up_write(&bpf_devs_lock); + return ERR_PTR(err); + } + offdevs_inited = true; + } + up_write(&bpf_devs_lock); + + offdev = kzalloc(sizeof(*offdev), GFP_KERNEL); + if (!offdev) + return ERR_PTR(-ENOMEM); + + offdev->ops = ops; + offdev->priv = priv; + INIT_LIST_HEAD(&offdev->netdevs); + + return offdev; +} +EXPORT_SYMBOL_GPL(bpf_offload_dev_create); + +void bpf_offload_dev_destroy(struct bpf_offload_dev *offdev) +{ + WARN_ON(!list_empty(&offdev->netdevs)); + kfree(offdev); +} +EXPORT_SYMBOL_GPL(bpf_offload_dev_destroy); + +void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev) +{ + return offdev->priv; +} +EXPORT_SYMBOL_GPL(bpf_offload_dev_priv); diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c index 0c1b4ba9e90e..6e090140b924 100644 --- a/kernel/bpf/percpu_freelist.c +++ b/kernel/bpf/percpu_freelist.c @@ -1,8 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2016 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. */ #include "percpu_freelist.h" diff --git a/kernel/bpf/percpu_freelist.h b/kernel/bpf/percpu_freelist.h index c3960118e617..fbf8a8a28979 100644 --- a/kernel/bpf/percpu_freelist.h +++ b/kernel/bpf/percpu_freelist.h @@ -1,8 +1,5 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* Copyright (c) 2016 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. */ #ifndef __PERCPU_FREELIST_H__ #define __PERCPU_FREELIST_H__ diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c new file mode 100644 index 000000000000..26ba7cb01136 --- /dev/null +++ b/kernel/bpf/queue_stack_maps.c @@ -0,0 +1,304 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * queue_stack_maps.c: BPF queue and stack maps + * + * Copyright (c) 2018 Politecnico di Torino + */ +#include +#include +#include +#include +#include "percpu_freelist.h" + +#define QUEUE_STACK_CREATE_FLAG_MASK \ + (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK) + +struct bpf_queue_stack { + struct bpf_map map; + raw_spinlock_t lock; + u32 head, tail; + u32 size; /* max_entries + 1 */ + + char elements[0] __aligned(8); +}; + +static struct bpf_queue_stack *bpf_queue_stack(struct bpf_map *map) +{ + return container_of(map, struct bpf_queue_stack, map); +} + +static bool queue_stack_map_is_empty(struct bpf_queue_stack *qs) +{ + return qs->head == qs->tail; +} + +static bool queue_stack_map_is_full(struct bpf_queue_stack *qs) +{ + u32 head = qs->head + 1; + + if (unlikely(head >= qs->size)) + head = 0; + + return head == qs->tail; +} + +/* Called from syscall */ +static int queue_stack_map_alloc_check(union bpf_attr *attr) +{ + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* check sanity of attributes */ + if (attr->max_entries == 0 || attr->key_size != 0 || + attr->value_size == 0 || + attr->map_flags & ~QUEUE_STACK_CREATE_FLAG_MASK || + !bpf_map_flags_access_ok(attr->map_flags)) + return -EINVAL; + + if (attr->value_size > KMALLOC_MAX_SIZE) + /* if value_size is bigger, the user space won't be able to + * access the elements. + */ + return -E2BIG; + + return 0; +} + +static struct bpf_map *queue_stack_map_alloc(union bpf_attr *attr) +{ + int ret, numa_node = bpf_map_attr_numa_node(attr); + struct bpf_map_memory mem = {0}; + struct bpf_queue_stack *qs; + u64 size, queue_size, cost; + + size = (u64) attr->max_entries + 1; + cost = queue_size = sizeof(*qs) + size * attr->value_size; + + ret = bpf_map_charge_init(&mem, cost); + if (ret < 0) + return ERR_PTR(ret); + + qs = bpf_map_area_alloc(queue_size, numa_node); + if (!qs) { + bpf_map_charge_finish(&mem); + return ERR_PTR(-ENOMEM); + } + + memset(qs, 0, sizeof(*qs)); + + bpf_map_init_from_attr(&qs->map, attr); + + bpf_map_charge_move(&qs->map.memory, &mem); + qs->size = size; + + raw_spin_lock_init(&qs->lock); + + return &qs->map; +} + +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */ +static void queue_stack_map_free(struct bpf_map *map) +{ + struct bpf_queue_stack *qs = bpf_queue_stack(map); + + /* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0, + * so the programs (can be more than one that used this map) were + * disconnected from events. Wait for outstanding critical sections in + * these programs to complete + */ + synchronize_rcu(); + + bpf_map_area_free(qs); +} + +static int __queue_map_get(struct bpf_map *map, void *value, bool delete) +{ + struct bpf_queue_stack *qs = bpf_queue_stack(map); + unsigned long flags; + int err = 0; + void *ptr; + + if (in_nmi()) { + if (!raw_spin_trylock_irqsave(&qs->lock, flags)) + return -EBUSY; + } else { + raw_spin_lock_irqsave(&qs->lock, flags); + } + + if (queue_stack_map_is_empty(qs)) { + memset(value, 0, qs->map.value_size); + err = -ENOENT; + goto out; + } + + ptr = &qs->elements[qs->tail * qs->map.value_size]; + memcpy(value, ptr, qs->map.value_size); + + if (delete) { + if (unlikely(++qs->tail >= qs->size)) + qs->tail = 0; + } + +out: + raw_spin_unlock_irqrestore(&qs->lock, flags); + return err; +} + + +static int __stack_map_get(struct bpf_map *map, void *value, bool delete) +{ + struct bpf_queue_stack *qs = bpf_queue_stack(map); + unsigned long flags; + int err = 0; + void *ptr; + u32 index; + + if (in_nmi()) { + if (!raw_spin_trylock_irqsave(&qs->lock, flags)) + return -EBUSY; + } else { + raw_spin_lock_irqsave(&qs->lock, flags); + } + + if (queue_stack_map_is_empty(qs)) { + memset(value, 0, qs->map.value_size); + err = -ENOENT; + goto out; + } + + index = qs->head - 1; + if (unlikely(index >= qs->size)) + index = qs->size - 1; + + ptr = &qs->elements[index * qs->map.value_size]; + memcpy(value, ptr, qs->map.value_size); + + if (delete) + qs->head = index; + +out: + raw_spin_unlock_irqrestore(&qs->lock, flags); + return err; +} + +/* Called from syscall or from eBPF program */ +static int queue_map_peek_elem(struct bpf_map *map, void *value) +{ + return __queue_map_get(map, value, false); +} + +/* Called from syscall or from eBPF program */ +static int stack_map_peek_elem(struct bpf_map *map, void *value) +{ + return __stack_map_get(map, value, false); +} + +/* Called from syscall or from eBPF program */ +static int queue_map_pop_elem(struct bpf_map *map, void *value) +{ + return __queue_map_get(map, value, true); +} + +/* Called from syscall or from eBPF program */ +static int stack_map_pop_elem(struct bpf_map *map, void *value) +{ + return __stack_map_get(map, value, true); +} + +/* Called from syscall or from eBPF program */ +static int queue_stack_map_push_elem(struct bpf_map *map, void *value, + u64 flags) +{ + struct bpf_queue_stack *qs = bpf_queue_stack(map); + unsigned long irq_flags; + int err = 0; + void *dst; + + /* BPF_EXIST is used to force making room for a new element in case the + * map is full + */ + bool replace = (flags & BPF_EXIST); + + /* Check supported flags for queue and stack maps */ + if (flags & BPF_NOEXIST || flags > BPF_EXIST) + return -EINVAL; + + if (in_nmi()) { + if (!raw_spin_trylock_irqsave(&qs->lock, irq_flags)) + return -EBUSY; + } else { + raw_spin_lock_irqsave(&qs->lock, irq_flags); + } + + if (queue_stack_map_is_full(qs)) { + if (!replace) { + err = -E2BIG; + goto out; + } + /* advance tail pointer to overwrite oldest element */ + if (unlikely(++qs->tail >= qs->size)) + qs->tail = 0; + } + + dst = &qs->elements[qs->head * qs->map.value_size]; + memcpy(dst, value, qs->map.value_size); + + if (unlikely(++qs->head >= qs->size)) + qs->head = 0; + +out: + raw_spin_unlock_irqrestore(&qs->lock, irq_flags); + return err; +} + +/* Called from syscall or from eBPF program */ +static void *queue_stack_map_lookup_elem(struct bpf_map *map, void *key) +{ + return NULL; +} + +/* Called from syscall or from eBPF program */ +static int queue_stack_map_update_elem(struct bpf_map *map, void *key, + void *value, u64 flags) +{ + return -EINVAL; +} + +/* Called from syscall or from eBPF program */ +static int queue_stack_map_delete_elem(struct bpf_map *map, void *key) +{ + return -EINVAL; +} + +/* Called from syscall */ +static int queue_stack_map_get_next_key(struct bpf_map *map, void *key, + void *next_key) +{ + return -EINVAL; +} + +const struct bpf_map_ops queue_map_ops = { + .map_alloc_check = queue_stack_map_alloc_check, + .map_alloc = queue_stack_map_alloc, + .map_free = queue_stack_map_free, + .map_lookup_elem = queue_stack_map_lookup_elem, + .map_update_elem = queue_stack_map_update_elem, + .map_delete_elem = queue_stack_map_delete_elem, + .map_push_elem = queue_stack_map_push_elem, + .map_pop_elem = queue_map_pop_elem, + .map_peek_elem = queue_map_peek_elem, + .map_get_next_key = queue_stack_map_get_next_key, +}; + +const struct bpf_map_ops stack_map_ops = { + .map_alloc_check = queue_stack_map_alloc_check, + .map_alloc = queue_stack_map_alloc, + .map_free = queue_stack_map_free, + .map_lookup_elem = queue_stack_map_lookup_elem, + .map_update_elem = queue_stack_map_update_elem, + .map_delete_elem = queue_stack_map_delete_elem, + .map_push_elem = queue_stack_map_push_elem, + .map_pop_elem = stack_map_pop_elem, + .map_peek_elem = stack_map_peek_elem, + .map_get_next_key = queue_stack_map_get_next_key, +}; diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c new file mode 100644 index 000000000000..50c083ba978c --- /dev/null +++ b/kernel/bpf/reuseport_array.c @@ -0,0 +1,360 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2018 Facebook + */ +#include +#include +#include +#include + +struct reuseport_array { + struct bpf_map map; + struct sock __rcu *ptrs[]; +}; + +static struct reuseport_array *reuseport_array(struct bpf_map *map) +{ + return (struct reuseport_array *)map; +} + +/* The caller must hold the reuseport_lock */ +void bpf_sk_reuseport_detach(struct sock *sk) +{ + struct sock __rcu **socks; + + write_lock_bh(&sk->sk_callback_lock); + socks = sk->sk_user_data; + if (socks) { + WRITE_ONCE(sk->sk_user_data, NULL); + /* + * Do not move this NULL assignment outside of + * sk->sk_callback_lock because there is + * a race with reuseport_array_free() + * which does not hold the reuseport_lock. + */ + RCU_INIT_POINTER(*socks, NULL); + } + write_unlock_bh(&sk->sk_callback_lock); +} + +static int reuseport_array_alloc_check(union bpf_attr *attr) +{ + if (attr->value_size != sizeof(u32) && + attr->value_size != sizeof(u64)) + return -EINVAL; + + return array_map_alloc_check(attr); +} + +static void *reuseport_array_lookup_elem(struct bpf_map *map, void *key) +{ + struct reuseport_array *array = reuseport_array(map); + u32 index = *(u32 *)key; + + if (unlikely(index >= array->map.max_entries)) + return NULL; + + return rcu_dereference(array->ptrs[index]); +} + +/* Called from syscall only */ +static int reuseport_array_delete_elem(struct bpf_map *map, void *key) +{ + struct reuseport_array *array = reuseport_array(map); + u32 index = *(u32 *)key; + struct sock *sk; + int err; + + if (index >= map->max_entries) + return -E2BIG; + + if (!rcu_access_pointer(array->ptrs[index])) + return -ENOENT; + + spin_lock_bh(&reuseport_lock); + + sk = rcu_dereference_protected(array->ptrs[index], + lockdep_is_held(&reuseport_lock)); + if (sk) { + write_lock_bh(&sk->sk_callback_lock); + WRITE_ONCE(sk->sk_user_data, NULL); + RCU_INIT_POINTER(array->ptrs[index], NULL); + write_unlock_bh(&sk->sk_callback_lock); + err = 0; + } else { + err = -ENOENT; + } + + spin_unlock_bh(&reuseport_lock); + + return err; +} + +static void reuseport_array_free(struct bpf_map *map) +{ + struct reuseport_array *array = reuseport_array(map); + struct sock *sk; + u32 i; + + synchronize_rcu(); + + /* + * ops->map_*_elem() will not be able to access this + * array now. Hence, this function only races with + * bpf_sk_reuseport_detach() which was triggerred by + * close() or disconnect(). + * + * This function and bpf_sk_reuseport_detach() are + * both removing sk from "array". Who removes it + * first does not matter. + * + * The only concern here is bpf_sk_reuseport_detach() + * may access "array" which is being freed here. + * bpf_sk_reuseport_detach() access this "array" + * through sk->sk_user_data _and_ with sk->sk_callback_lock + * held which is enough because this "array" is not freed + * until all sk->sk_user_data has stopped referencing this "array". + * + * Hence, due to the above, taking "reuseport_lock" is not + * needed here. + */ + + /* + * Since reuseport_lock is not taken, sk is accessed under + * rcu_read_lock() + */ + rcu_read_lock(); + for (i = 0; i < map->max_entries; i++) { + sk = rcu_dereference(array->ptrs[i]); + if (sk) { + write_lock_bh(&sk->sk_callback_lock); + /* + * No need for WRITE_ONCE(). At this point, + * no one is reading it without taking the + * sk->sk_callback_lock. + */ + sk->sk_user_data = NULL; + write_unlock_bh(&sk->sk_callback_lock); + RCU_INIT_POINTER(array->ptrs[i], NULL); + } + } + rcu_read_unlock(); + + /* + * Once reaching here, all sk->sk_user_data is not + * referenceing this "array". "array" can be freed now. + */ + bpf_map_area_free(array); +} + +static struct bpf_map *reuseport_array_alloc(union bpf_attr *attr) +{ + int err, numa_node = bpf_map_attr_numa_node(attr); + struct reuseport_array *array; + struct bpf_map_memory mem; + u64 array_size; + + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EPERM); + + array_size = sizeof(*array); + array_size += (u64)attr->max_entries * sizeof(struct sock *); + + err = bpf_map_charge_init(&mem, array_size); + if (err) + return ERR_PTR(err); + + /* allocate all map elements and zero-initialize them */ + array = bpf_map_area_alloc(array_size, numa_node); + if (!array) { + bpf_map_charge_finish(&mem); + return ERR_PTR(-ENOMEM); + } + + /* copy mandatory map attributes */ + bpf_map_init_from_attr(&array->map, attr); + bpf_map_charge_move(&array->map.memory, &mem); + + return &array->map; +} + +int bpf_fd_reuseport_array_lookup_elem(struct bpf_map *map, void *key, + void *value) +{ + struct sock *sk; + int err; + + if (map->value_size != sizeof(u64)) + return -ENOSPC; + + rcu_read_lock(); + sk = reuseport_array_lookup_elem(map, key); + if (sk) { + *(u64 *)value = sock_gen_cookie(sk); + err = 0; + } else { + err = -ENOENT; + } + rcu_read_unlock(); + + return err; +} + +static int +reuseport_array_update_check(const struct reuseport_array *array, + const struct sock *nsk, + const struct sock *osk, + const struct sock_reuseport *nsk_reuse, + u32 map_flags) +{ + if (osk && map_flags == BPF_NOEXIST) + return -EEXIST; + + if (!osk && map_flags == BPF_EXIST) + return -ENOENT; + + if (nsk->sk_protocol != IPPROTO_UDP && nsk->sk_protocol != IPPROTO_TCP) + return -ENOTSUPP; + + if (nsk->sk_family != AF_INET && nsk->sk_family != AF_INET6) + return -ENOTSUPP; + + if (nsk->sk_type != SOCK_STREAM && nsk->sk_type != SOCK_DGRAM) + return -ENOTSUPP; + + /* + * sk must be hashed (i.e. listening in the TCP case or binded + * in the UDP case) and + * it must also be a SO_REUSEPORT sk (i.e. reuse cannot be NULL). + * + * Also, sk will be used in bpf helper that is protected by + * rcu_read_lock(). + */ + if (!sock_flag(nsk, SOCK_RCU_FREE) || !sk_hashed(nsk) || !nsk_reuse) + return -EINVAL; + + /* READ_ONCE because the sk->sk_callback_lock may not be held here */ + if (READ_ONCE(nsk->sk_user_data)) + return -EBUSY; + + return 0; +} + +/* + * Called from syscall only. + * The "nsk" in the fd refcnt. + * The "osk" and "reuse" are protected by reuseport_lock. + */ +int bpf_fd_reuseport_array_update_elem(struct bpf_map *map, void *key, + void *value, u64 map_flags) +{ + struct reuseport_array *array = reuseport_array(map); + struct sock *free_osk = NULL, *osk, *nsk; + struct sock_reuseport *reuse; + u32 index = *(u32 *)key; + struct socket *socket; + int err, fd; + + if (map_flags > BPF_EXIST) + return -EINVAL; + + if (index >= map->max_entries) + return -E2BIG; + + if (map->value_size == sizeof(u64)) { + u64 fd64 = *(u64 *)value; + + if (fd64 > S32_MAX) + return -EINVAL; + fd = fd64; + } else { + fd = *(int *)value; + } + + socket = sockfd_lookup(fd, &err); + if (!socket) + return err; + + nsk = socket->sk; + if (!nsk) { + err = -EINVAL; + goto put_file; + } + + /* Quick checks before taking reuseport_lock */ + err = reuseport_array_update_check(array, nsk, + rcu_access_pointer(array->ptrs[index]), + rcu_access_pointer(nsk->sk_reuseport_cb), + map_flags); + if (err) + goto put_file; + + spin_lock_bh(&reuseport_lock); + /* + * Some of the checks only need reuseport_lock + * but it is done under sk_callback_lock also + * for simplicity reason. + */ + write_lock_bh(&nsk->sk_callback_lock); + + osk = rcu_dereference_protected(array->ptrs[index], + lockdep_is_held(&reuseport_lock)); + reuse = rcu_dereference_protected(nsk->sk_reuseport_cb, + lockdep_is_held(&reuseport_lock)); + err = reuseport_array_update_check(array, nsk, osk, reuse, map_flags); + if (err) + goto put_file_unlock; + + /* Ensure reuse->reuseport_id is set */ + err = reuseport_get_id(reuse); + if (err < 0) + goto put_file_unlock; + + WRITE_ONCE(nsk->sk_user_data, &array->ptrs[index]); + rcu_assign_pointer(array->ptrs[index], nsk); + free_osk = osk; + err = 0; + +put_file_unlock: + write_unlock_bh(&nsk->sk_callback_lock); + + if (free_osk) { + write_lock_bh(&free_osk->sk_callback_lock); + WRITE_ONCE(free_osk->sk_user_data, NULL); + write_unlock_bh(&free_osk->sk_callback_lock); + } + + spin_unlock_bh(&reuseport_lock); +put_file: + fput(socket->file); + return err; +} + +/* Called from syscall */ +static int reuseport_array_get_next_key(struct bpf_map *map, void *key, + void *next_key) +{ + struct reuseport_array *array = reuseport_array(map); + u32 index = key ? *(u32 *)key : U32_MAX; + u32 *next = (u32 *)next_key; + + if (index >= array->map.max_entries) { + *next = 0; + return 0; + } + + if (index == array->map.max_entries - 1) + return -ENOENT; + + *next = index + 1; + return 0; +} + +const struct bpf_map_ops reuseport_array_ops = { + .map_alloc_check = reuseport_array_alloc_check, + .map_alloc = reuseport_array_alloc, + .map_free = reuseport_array_free, + .map_lookup_elem = reuseport_array_lookup_elem, + .map_get_next_key = reuseport_array_get_next_key, + .map_delete_elem = reuseport_array_delete_elem, +}; diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index ed71857961cd..bd8516d96745 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -1,24 +1,25 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2016 Facebook - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. */ #include #include #include #include #include +#include +#include +#include #include "percpu_freelist.h" -#define STACK_CREATE_FLAG_MASK \ - (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) +#define STACK_CREATE_FLAG_MASK \ + (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY | \ + BPF_F_STACK_BUILD_ID) struct stack_map_bucket { struct pcpu_freelist_node fnode; u32 hash; u32 nr; - u64 ip[]; + u64 data[]; }; struct bpf_stack_map { @@ -29,6 +30,34 @@ struct bpf_stack_map { struct stack_map_bucket *buckets[]; }; +/* irq_work to run up_read() for build_id lookup in nmi context */ +struct stack_map_irq_work { + struct irq_work irq_work; + struct rw_semaphore *sem; +}; + +static void do_up_read(struct irq_work *entry) +{ + struct stack_map_irq_work *work; + + work = container_of(entry, struct stack_map_irq_work, irq_work); + up_read_non_owner(work->sem); + work->sem = NULL; +} + +static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work); + +static inline bool stack_map_use_build_id(struct bpf_map *map) +{ + return (map->map_flags & BPF_F_STACK_BUILD_ID); +} + +static inline int stack_map_data_size(struct bpf_map *map) +{ + return stack_map_use_build_id(map) ? + sizeof(struct bpf_stack_build_id) : sizeof(u64); +} + static int prealloc_elems_and_freelist(struct bpf_stack_map *smap) { u64 elem_size = sizeof(struct stack_map_bucket) + @@ -58,6 +87,7 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr) { u32 value_size = attr->value_size; struct bpf_stack_map *smap; + struct bpf_map_memory mem; u64 cost, n_buckets; int err; @@ -69,8 +99,16 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr) /* check sanity of attributes */ if (attr->max_entries == 0 || attr->key_size != 4 || - value_size < 8 || value_size % 8 || - value_size / 8 > sysctl_perf_event_max_stack) + value_size < 8 || value_size % 8) + return ERR_PTR(-EINVAL); + + BUILD_BUG_ON(sizeof(struct bpf_stack_build_id) % sizeof(u64)); + if (attr->map_flags & BPF_F_STACK_BUILD_ID) { + if (value_size % sizeof(struct bpf_stack_build_id) || + value_size / sizeof(struct bpf_stack_build_id) + > sysctl_perf_event_max_stack) + return ERR_PTR(-EINVAL); + } else if (value_size / 8 > sysctl_perf_event_max_stack) return ERR_PTR(-EINVAL); /* hash table size must be power of 2; roundup_pow_of_two() can overflow @@ -82,51 +120,242 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr) n_buckets = roundup_pow_of_two(attr->max_entries); cost = n_buckets * sizeof(struct stack_map_bucket *) + sizeof(*smap); - if (cost >= U32_MAX - PAGE_SIZE) - return ERR_PTR(-E2BIG); + err = bpf_map_charge_init(&mem, cost + attr->max_entries * + (sizeof(struct stack_map_bucket) + (u64)value_size)); + if (err) + return ERR_PTR(err); smap = bpf_map_area_alloc(cost, bpf_map_attr_numa_node(attr)); - if (!smap) + if (!smap) { + bpf_map_charge_finish(&mem); return ERR_PTR(-ENOMEM); - - err = -E2BIG; - cost += n_buckets * (value_size + sizeof(struct stack_map_bucket)); - if (cost >= U32_MAX - PAGE_SIZE) - goto free_smap; + } bpf_map_init_from_attr(&smap->map, attr); smap->map.value_size = value_size; smap->n_buckets = n_buckets; - smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; - - err = bpf_map_precharge_memlock(smap->map.pages); - if (err) - goto free_smap; err = get_callchain_buffers(sysctl_perf_event_max_stack); if (err) - goto free_smap; + goto free_charge; err = prealloc_elems_and_freelist(smap); if (err) goto put_buffers; + bpf_map_charge_move(&smap->map.memory, &mem); + return &smap->map; put_buffers: put_callchain_buffers(); -free_smap: +free_charge: + bpf_map_charge_finish(&mem); bpf_map_area_free(smap); return ERR_PTR(err); } +#define BPF_BUILD_ID 3 +/* + * Parse build id from the note segment. This logic can be shared between + * 32-bit and 64-bit system, because Elf32_Nhdr and Elf64_Nhdr are + * identical. + */ +static inline int stack_map_parse_build_id(void *page_addr, + unsigned char *build_id, + void *note_start, + Elf32_Word note_size) +{ + Elf32_Word note_offs = 0, new_offs; + + /* check for overflow */ + if (note_start < page_addr || note_start + note_size < note_start) + return -EINVAL; + + /* only supports note that fits in the first page */ + if (note_start + note_size > page_addr + PAGE_SIZE) + return -EINVAL; + + while (note_offs + sizeof(Elf32_Nhdr) < note_size) { + Elf32_Nhdr *nhdr = (Elf32_Nhdr *)(note_start + note_offs); + + if (nhdr->n_type == BPF_BUILD_ID && + nhdr->n_namesz == sizeof("GNU") && + nhdr->n_descsz > 0 && + nhdr->n_descsz <= BPF_BUILD_ID_SIZE) { + memcpy(build_id, + note_start + note_offs + + ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr), + nhdr->n_descsz); + memset(build_id + nhdr->n_descsz, 0, + BPF_BUILD_ID_SIZE - nhdr->n_descsz); + return 0; + } + new_offs = note_offs + sizeof(Elf32_Nhdr) + + ALIGN(nhdr->n_namesz, 4) + ALIGN(nhdr->n_descsz, 4); + if (new_offs <= note_offs) /* overflow */ + break; + note_offs = new_offs; + } + return -EINVAL; +} + +/* Parse build ID from 32-bit ELF */ +static int stack_map_get_build_id_32(void *page_addr, + unsigned char *build_id) +{ + Elf32_Ehdr *ehdr = (Elf32_Ehdr *)page_addr; + Elf32_Phdr *phdr; + int i; + + /* only supports phdr that fits in one page */ + if (ehdr->e_phnum > + (PAGE_SIZE - sizeof(Elf32_Ehdr)) / sizeof(Elf32_Phdr)) + return -EINVAL; + + phdr = (Elf32_Phdr *)(page_addr + sizeof(Elf32_Ehdr)); + + for (i = 0; i < ehdr->e_phnum; ++i) + if (phdr[i].p_type == PT_NOTE) + return stack_map_parse_build_id(page_addr, build_id, + page_addr + phdr[i].p_offset, + phdr[i].p_filesz); + return -EINVAL; +} + +/* Parse build ID from 64-bit ELF */ +static int stack_map_get_build_id_64(void *page_addr, + unsigned char *build_id) +{ + Elf64_Ehdr *ehdr = (Elf64_Ehdr *)page_addr; + Elf64_Phdr *phdr; + int i; + + /* only supports phdr that fits in one page */ + if (ehdr->e_phnum > + (PAGE_SIZE - sizeof(Elf64_Ehdr)) / sizeof(Elf64_Phdr)) + return -EINVAL; + + phdr = (Elf64_Phdr *)(page_addr + sizeof(Elf64_Ehdr)); + + for (i = 0; i < ehdr->e_phnum; ++i) + if (phdr[i].p_type == PT_NOTE) + return stack_map_parse_build_id(page_addr, build_id, + page_addr + phdr[i].p_offset, + phdr[i].p_filesz); + return -EINVAL; +} + +/* Parse build ID of ELF file mapped to vma */ +static int stack_map_get_build_id(struct vm_area_struct *vma, + unsigned char *build_id) +{ + Elf32_Ehdr *ehdr; + struct page *page; + void *page_addr; + int ret; + + /* only works for page backed storage */ + if (!vma->vm_file) + return -EINVAL; + + page = find_get_page(vma->vm_file->f_mapping, 0); + if (!page) + return -EFAULT; /* page not mapped */ + + ret = -EINVAL; + page_addr = kmap_atomic(page); + ehdr = (Elf32_Ehdr *)page_addr; + + /* compare magic x7f "ELF" */ + if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0) + goto out; + + /* only support executable file and shared object file */ + if (ehdr->e_type != ET_EXEC && ehdr->e_type != ET_DYN) + goto out; + + if (ehdr->e_ident[EI_CLASS] == ELFCLASS32) + ret = stack_map_get_build_id_32(page_addr, build_id); + else if (ehdr->e_ident[EI_CLASS] == ELFCLASS64) + ret = stack_map_get_build_id_64(page_addr, build_id); +out: + kunmap_atomic(page_addr); + put_page(page); + return ret; +} + +static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs, + u64 *ips, u32 trace_nr, bool user) +{ + int i; + struct vm_area_struct *vma; + bool irq_work_busy = false; + struct stack_map_irq_work *work = NULL; + + if (irqs_disabled()) { + work = this_cpu_ptr(&up_read_work); + if (work->irq_work.flags & IRQ_WORK_BUSY) + /* cannot queue more up_read, fallback */ + irq_work_busy = true; + } + + /* + * We cannot do up_read() when the irq is disabled, because of + * risk to deadlock with rq_lock. To do build_id lookup when the + * irqs are disabled, we need to run up_read() in irq_work. We use + * a percpu variable to do the irq_work. If the irq_work is + * already used by another lookup, we fall back to report ips. + * + * Same fallback is used for kernel stack (!user) on a stackmap + * with build_id. + */ + if (!user || !current || !current->mm || irq_work_busy || + down_read_trylock(¤t->mm->mmap_sem) == 0) { + /* cannot access current->mm, fall back to ips */ + for (i = 0; i < trace_nr; i++) { + id_offs[i].status = BPF_STACK_BUILD_ID_IP; + id_offs[i].ip = ips[i]; + memset(id_offs[i].build_id, 0, BPF_BUILD_ID_SIZE); + } + return; + } + + for (i = 0; i < trace_nr; i++) { + vma = find_vma(current->mm, ips[i]); + if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) { + /* per entry fall back to ips */ + id_offs[i].status = BPF_STACK_BUILD_ID_IP; + id_offs[i].ip = ips[i]; + memset(id_offs[i].build_id, 0, BPF_BUILD_ID_SIZE); + continue; + } + id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i] + - vma->vm_start; + id_offs[i].status = BPF_STACK_BUILD_ID_VALID; + } + + if (!work) { + up_read(¤t->mm->mmap_sem); + } else { + work->sem = ¤t->mm->mmap_sem; + irq_work_queue(&work->irq_work); + /* + * The irq_work will release the mmap_sem with + * up_read_non_owner(). The rwsem_release() is called + * here to release the lock from lockdep's perspective. + */ + rwsem_release(¤t->mm->mmap_sem.dep_map, 1, _RET_IP_); + } +} + BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, u64, flags) { struct bpf_stack_map *smap = container_of(map, struct bpf_stack_map, map); struct perf_callchain_entry *trace; struct stack_map_bucket *bucket, *new_bucket, *old_bucket; - u32 max_depth = map->value_size / 8; + u32 max_depth = map->value_size / stack_map_data_size(map); /* stack_map_alloc() checks that max_depth <= sysctl_perf_event_max_stack */ u32 init_nr = sysctl_perf_event_max_stack - max_depth; u32 skip = flags & BPF_F_SKIP_FIELD_MASK; @@ -134,6 +363,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, bool user = flags & BPF_F_USER_STACK; bool kernel = !user; u64 *ips; + bool hash_matches; if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | BPF_F_FAST_STACK_CMP | BPF_F_REUSE_STACKID))) @@ -162,24 +392,45 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map, id = hash & (smap->n_buckets - 1); bucket = READ_ONCE(smap->buckets[id]); - if (bucket && bucket->hash == hash) { - if (flags & BPF_F_FAST_STACK_CMP) + hash_matches = bucket && bucket->hash == hash; + /* fast cmp */ + if (hash_matches && flags & BPF_F_FAST_STACK_CMP) + return id; + + if (stack_map_use_build_id(map)) { + /* for build_id+offset, pop a bucket before slow cmp */ + new_bucket = (struct stack_map_bucket *) + pcpu_freelist_pop(&smap->freelist); + if (unlikely(!new_bucket)) + return -ENOMEM; + new_bucket->nr = trace_nr; + stack_map_get_build_id_offset( + (struct bpf_stack_build_id *)new_bucket->data, + ips, trace_nr, user); + trace_len = trace_nr * sizeof(struct bpf_stack_build_id); + if (hash_matches && bucket->nr == trace_nr && + memcmp(bucket->data, new_bucket->data, trace_len) == 0) { + pcpu_freelist_push(&smap->freelist, &new_bucket->fnode); return id; - if (bucket->nr == trace_nr && - memcmp(bucket->ip, ips, trace_len) == 0) + } + if (bucket && !(flags & BPF_F_REUSE_STACKID)) { + pcpu_freelist_push(&smap->freelist, &new_bucket->fnode); + return -EEXIST; + } + } else { + if (hash_matches && bucket->nr == trace_nr && + memcmp(bucket->data, ips, trace_len) == 0) return id; + if (bucket && !(flags & BPF_F_REUSE_STACKID)) + return -EEXIST; + + new_bucket = (struct stack_map_bucket *) + pcpu_freelist_pop(&smap->freelist); + if (unlikely(!new_bucket)) + return -ENOMEM; + memcpy(new_bucket->data, ips, trace_len); } - /* this call stack is not in the map, try to add it */ - if (bucket && !(flags & BPF_F_REUSE_STACKID)) - return -EEXIST; - - new_bucket = (struct stack_map_bucket *) - pcpu_freelist_pop(&smap->freelist); - if (unlikely(!new_bucket)) - return -ENOMEM; - - memcpy(new_bucket->ip, ips, trace_len); new_bucket->hash = hash; new_bucket->nr = trace_nr; @@ -198,10 +449,77 @@ const struct bpf_func_proto bpf_get_stackid_proto = { .arg3_type = ARG_ANYTHING, }; +BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size, + u64, flags) +{ + u32 init_nr, trace_nr, copy_len, elem_size, num_elem; + bool user_build_id = flags & BPF_F_USER_BUILD_ID; + u32 skip = flags & BPF_F_SKIP_FIELD_MASK; + bool user = flags & BPF_F_USER_STACK; + struct perf_callchain_entry *trace; + bool kernel = !user; + int err = -EINVAL; + u64 *ips; + + if (unlikely(flags & ~(BPF_F_SKIP_FIELD_MASK | BPF_F_USER_STACK | + BPF_F_USER_BUILD_ID))) + goto clear; + if (kernel && user_build_id) + goto clear; + + elem_size = (user && user_build_id) ? sizeof(struct bpf_stack_build_id) + : sizeof(u64); + if (unlikely(size % elem_size)) + goto clear; + + num_elem = size / elem_size; + if (sysctl_perf_event_max_stack < num_elem) + init_nr = 0; + else + init_nr = sysctl_perf_event_max_stack - num_elem; + trace = get_perf_callchain(regs, init_nr, kernel, user, + sysctl_perf_event_max_stack, false, false); + if (unlikely(!trace)) + goto err_fault; + + trace_nr = trace->nr - init_nr; + if (trace_nr < skip) + goto err_fault; + + trace_nr -= skip; + trace_nr = (trace_nr <= num_elem) ? trace_nr : num_elem; + copy_len = trace_nr * elem_size; + ips = trace->ip + skip + init_nr; + if (user && user_build_id) + stack_map_get_build_id_offset(buf, ips, trace_nr, user); + else + memcpy(buf, ips, copy_len); + + if (size > copy_len) + memset(buf + copy_len, 0, size - copy_len); + return copy_len; + +err_fault: + err = -EFAULT; +clear: + memset(buf, 0, size); + return err; +} + +const struct bpf_func_proto bpf_get_stack_proto = { + .func = bpf_get_stack, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_UNINIT_MEM, + .arg3_type = ARG_CONST_SIZE_OR_ZERO, + .arg4_type = ARG_ANYTHING, +}; + /* Called from eBPF program */ static void *stack_map_lookup_elem(struct bpf_map *map, void *key) { - return NULL; + return ERR_PTR(-EOPNOTSUPP); } /* Called from syscall */ @@ -218,8 +536,8 @@ int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value) if (!bucket) return -ENOENT; - trace_len = bucket->nr * sizeof(u64); - memcpy(value, bucket->ip, trace_len); + trace_len = bucket->nr * stack_map_data_size(map); + memcpy(value, bucket->data, trace_len); memset(value + trace_len, 0, map->value_size - trace_len); old_bucket = xchg(&smap->buckets[id], bucket); @@ -228,9 +546,33 @@ int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value) return 0; } -static int stack_map_get_next_key(struct bpf_map *map, void *key, void *next_key) +static int stack_map_get_next_key(struct bpf_map *map, void *key, + void *next_key) { - return -EINVAL; + struct bpf_stack_map *smap = container_of(map, + struct bpf_stack_map, map); + u32 id; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + if (!key) { + id = 0; + } else { + id = *(u32 *)key; + if (id >= smap->n_buckets || !smap->buckets[id]) + id = 0; + else + id++; + } + + while (id < smap->n_buckets && !smap->buckets[id]) + id++; + + if (id >= smap->n_buckets) + return -ENOENT; + + *(u32 *)next_key = id; + return 0; } static int stack_map_update_elem(struct bpf_map *map, void *key, void *value, @@ -272,7 +614,7 @@ static void stack_map_free(struct bpf_map *map) put_callchain_buffers(); } -const struct bpf_map_ops stack_map_ops = { +const struct bpf_map_ops stack_trace_map_ops = { .map_alloc = stack_map_alloc, .map_free = stack_map_free, .map_get_next_key = stack_map_get_next_key, @@ -281,3 +623,16 @@ const struct bpf_map_ops stack_map_ops = { .map_delete_elem = stack_map_delete_elem, .map_check_btf = map_check_no_btf, }; + +static int __init stack_map_init(void) +{ + int cpu; + struct stack_map_irq_work *work; + + for_each_possible_cpu(cpu) { + work = per_cpu_ptr(&up_read_work, cpu); + init_irq_work(&work->irq_work, do_up_read); + } + return 0; +} +subsys_initcall(stack_map_init); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 261067c99afa..f5a7b31e6f07 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1,16 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include +#include #include #include #include @@ -18,7 +11,9 @@ #include #include #include +#include #include +#include #include #include #include @@ -27,7 +22,7 @@ #include #include #include -#include +#include #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PROG_ARRAY || \ (map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \ @@ -106,47 +101,60 @@ const struct bpf_map_ops bpf_map_offload_ops = { static struct bpf_map *find_and_alloc_map(union bpf_attr *attr) { const struct bpf_map_ops *ops; + u32 type = attr->map_type; struct bpf_map *map; int err; - if (attr->map_type >= ARRAY_SIZE(bpf_map_types)) + if (type >= ARRAY_SIZE(bpf_map_types)) return ERR_PTR(-EINVAL); - ops = bpf_map_types[attr->map_type]; + type = array_index_nospec(type, ARRAY_SIZE(bpf_map_types)); + ops = bpf_map_types[type]; if (!ops) return ERR_PTR(-EINVAL); - if (attr->map_ifindex) - ops = &bpf_map_offload_ops; if (ops->map_alloc_check) { err = ops->map_alloc_check(attr); if (err) return ERR_PTR(err); } + if (attr->map_ifindex) + ops = &bpf_map_offload_ops; map = ops->map_alloc(attr); if (IS_ERR(map)) return map; map->ops = ops; - map->map_type = attr->map_type; + map->map_type = type; return map; } -void *bpf_map_area_alloc(size_t size, int numa_node) +void *bpf_map_area_alloc(u64 size, int numa_node) { - /* We definitely need __GFP_NORETRY, so OOM killer doesn't - * trigger under memory pressure as we really just want to - * fail instead. + /* We really just want to fail instead of triggering OOM killer + * under memory pressure, therefore we set __GFP_NORETRY to kmalloc, + * which is used for lower order allocation requests. + * + * It has been observed that higher order allocation requests done by + * vmalloc with __GFP_NORETRY being set might fail due to not trying + * to reclaim memory from the page cache, thus we set + * __GFP_RETRY_MAYFAIL to avoid such situations. */ - const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO; + + const gfp_t flags = __GFP_NOWARN | __GFP_ZERO; void *area; + if (size >= SIZE_MAX) + return NULL; + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) { - area = kmalloc_node(size, GFP_USER | flags, numa_node); + area = kmalloc_node(size, GFP_USER | __GFP_NORETRY | flags, + numa_node); if (area != NULL) return area; } - return __vmalloc_node_flags_caller(size, numa_node, GFP_KERNEL | flags, - __builtin_return_address(0)); + return __vmalloc_node_flags_caller(size, numa_node, + GFP_KERNEL | __GFP_RETRY_MAYFAIL | + flags, __builtin_return_address(0)); } void bpf_map_area_free(void *area) @@ -154,29 +162,28 @@ void bpf_map_area_free(void *area) kvfree(area); } +static u32 bpf_map_flags_retain_permanent(u32 flags) +{ + /* Some map creation flags are not tied to the map object but + * rather to the map fd instead, so they have no meaning upon + * map object inspection since multiple file descriptors with + * different (access) properties can exist here. Thus, given + * this has zero meaning for the map itself, lets clear these + * from here. + */ + return flags & ~(BPF_F_RDONLY | BPF_F_WRONLY); +} + void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr) { map->map_type = attr->map_type; map->key_size = attr->key_size; map->value_size = attr->value_size; map->max_entries = attr->max_entries; - map->map_flags = attr->map_flags; + map->map_flags = bpf_map_flags_retain_permanent(attr->map_flags); map->numa_node = bpf_map_attr_numa_node(attr); } -int bpf_map_precharge_memlock(u32 pages) -{ - struct user_struct *user = get_current_user(); - unsigned long memlock_limit, cur; - - memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - cur = atomic_long_read(&user->locked_vm); - free_uid(user); - if (cur + pages > memlock_limit) - return -EPERM; - return 0; -} - static int bpf_charge_memlock(struct user_struct *user, u32 pages) { unsigned long memlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; @@ -190,54 +197,75 @@ static int bpf_charge_memlock(struct user_struct *user, u32 pages) static void bpf_uncharge_memlock(struct user_struct *user, u32 pages) { - atomic_long_sub(pages, &user->locked_vm); + if (user) + atomic_long_sub(pages, &user->locked_vm); } -static int bpf_map_init_memlock(struct bpf_map *map) + +int bpf_map_charge_init(struct bpf_map_memory *mem, u64 size) { - struct user_struct *user = get_current_user(); + u32 pages = round_up(size, PAGE_SIZE) >> PAGE_SHIFT; + struct user_struct *user; int ret; - ret = bpf_charge_memlock(user, map->pages); + if (size >= U32_MAX - PAGE_SIZE) + return -E2BIG; + + user = get_current_user(); + ret = bpf_charge_memlock(user, pages); if (ret) { free_uid(user); return ret; } - map->user = user; - return ret; + + mem->pages = pages; + mem->user = user; + + return 0; } -static void bpf_map_release_memlock(struct bpf_map *map) +void bpf_map_charge_finish(struct bpf_map_memory *mem) { - struct user_struct *user = map->user; + bpf_uncharge_memlock(mem->user, mem->pages); + free_uid(mem->user); +} - bpf_uncharge_memlock(user, map->pages); - free_uid(user); +void bpf_map_charge_move(struct bpf_map_memory *dst, + struct bpf_map_memory *src) +{ + *dst = *src; + + /* Make sure src will not be used for the redundant uncharging. */ + memset(src, 0, sizeof(struct bpf_map_memory)); } int bpf_map_charge_memlock(struct bpf_map *map, u32 pages) { int ret; - ret = bpf_charge_memlock(map->user, pages); + + ret = bpf_charge_memlock(map->memory.user, pages); if (ret) return ret; - map->pages += pages; + map->memory.pages += pages; return ret; } + void bpf_map_uncharge_memlock(struct bpf_map *map, u32 pages) { - bpf_uncharge_memlock(map->user, pages); - map->pages -= pages; + bpf_uncharge_memlock(map->memory.user, pages); + map->memory.pages -= pages; } static int bpf_map_alloc_id(struct bpf_map *map) { int id; + idr_preload(GFP_KERNEL); spin_lock_bh(&map_idr_lock); id = idr_alloc_cyclic(&map_idr, map, 1, INT_MAX, GFP_ATOMIC); if (id > 0) map->id = id; spin_unlock_bh(&map_idr_lock); + idr_preload_end(); if (WARN_ON_ONCE(!id)) return -ENOSPC; @@ -275,11 +303,13 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock) static void bpf_map_free_deferred(struct work_struct *work) { struct bpf_map *map = container_of(work, struct bpf_map, work); + struct bpf_map_memory mem; - bpf_map_release_memlock(map); + bpf_map_charge_move(&mem, &map->memory); security_bpf_map_free(map); /* implementation dependent freeing */ map->ops->map_free(map); + bpf_map_charge_finish(&mem); } static void bpf_map_put_uref(struct bpf_map *map) @@ -308,6 +338,7 @@ void bpf_map_put(struct bpf_map *map) { __bpf_map_put(map, true); } +EXPORT_SYMBOL_GPL(bpf_map_put); void bpf_map_put_with_uref(struct bpf_map *map) { @@ -326,6 +357,18 @@ static int bpf_map_release(struct inode *inode, struct file *filp) return 0; } +static fmode_t map_get_sys_perms(struct bpf_map *map, struct fd f) +{ + fmode_t mode = f.file->f_mode; + + /* Our file permissions may have been overridden by global + * map permissions facing syscall side. + */ + if (READ_ONCE(map->frozen)) + mode &= ~FMODE_CAN_WRITE; + return mode; +} + #ifdef CONFIG_PROC_FS static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp) { @@ -346,13 +389,17 @@ static void bpf_map_show_fdinfo(struct seq_file *m, struct file *filp) "value_size:\t%u\n" "max_entries:\t%u\n" "map_flags:\t%#x\n" - "memlock:\t%llu\n", + "memlock:\t%llu\n" + "map_id:\t%u\n" + "frozen:\t%u\n", map->map_type, map->key_size, map->value_size, map->max_entries, map->map_flags, - map->pages * 1ULL << PAGE_SHIFT); + map->memory.pages * 1ULL << PAGE_SHIFT, + map->id, + READ_ONCE(map->frozen)); if (owner_prog_type) { seq_printf(m, "owner_prog_type:\t%u\n", @@ -429,10 +476,10 @@ static int bpf_obj_name_cpy(char *dst, const char *src) const char *end = src + BPF_OBJ_NAME_LEN; memset(dst, 0, BPF_OBJ_NAME_LEN); - - /* Copy all isalnum() and '_' char */ + /* Copy all isalnum(), '_' and '.' chars. */ while (src < end && *src) { - if (!isalnum(*src) && *src != '_') + if (!isalnum(*src) && + *src != '_' && *src != '.') return -EINVAL; *dst++ = *src++; } @@ -458,10 +505,17 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, const struct btf_type *key_type, *value_type; u32 key_size, value_size; int ret = 0; - key_type = btf_type_id_size(btf, &btf_key_id, &key_size); - if (!key_type || key_size != map->key_size) - return -EINVAL; + /* Some maps allow key to be unspecified. */ + if (btf_key_id) { + key_type = btf_type_id_size(btf, &btf_key_id, &key_size); + if (!key_type || key_size != map->key_size) + return -EINVAL; + } else { + key_type = btf_type_by_id(btf, 0); + if (!map->ops->map_check_btf) + return -EINVAL; + } value_type = btf_type_id_size(btf, &btf_value_id, &value_size); if (!value_type || value_size != map->value_size) @@ -470,8 +524,11 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, map->spin_lock_off = btf_find_spin_lock(btf, value_type); if (map_value_has_spin_lock(map)) { + if (map->map_flags & BPF_F_RDONLY_PROG) + return -EACCES; if (map->map_type != BPF_MAP_TYPE_HASH && map->map_type != BPF_MAP_TYPE_ARRAY && + map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && map->map_type != BPF_MAP_TYPE_SK_STORAGE) return -ENOTSUPP; if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > @@ -494,6 +551,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, static int map_create(union bpf_attr *attr) { int numa_node = bpf_map_attr_numa_node(attr); + struct bpf_map_memory mem; struct bpf_map *map; int f_flags; int err; @@ -518,31 +576,32 @@ static int map_create(union bpf_attr *attr) err = bpf_obj_name_cpy(map->name, attr->map_name); if (err) - goto free_map_nouncharge; + goto free_map; atomic_set(&map->refcnt, 1); atomic_set(&map->usercnt, 1); - if (bpf_map_support_seq_show(map) && - (attr->btf_key_type_id || attr->btf_value_type_id)) { + if (attr->btf_key_type_id || attr->btf_value_type_id) { struct btf *btf; - if (!attr->btf_key_type_id || !attr->btf_value_type_id) { + if (!attr->btf_value_type_id) { err = -EINVAL; - goto free_map_nouncharge; + goto free_map; } + btf = btf_get_by_fd(attr->btf_fd); if (IS_ERR(btf)) { err = PTR_ERR(btf); - goto free_map_nouncharge; + goto free_map; } err = map_check_btf(map, btf, attr->btf_key_type_id, - attr->btf_value_type_id); + attr->btf_value_type_id); if (err) { btf_put(btf); - goto free_map_nouncharge; + goto free_map; } + map->btf = btf; map->btf_key_type_id = attr->btf_key_type_id; map->btf_value_type_id = attr->btf_value_type_id; @@ -552,15 +611,11 @@ static int map_create(union bpf_attr *attr) err = security_bpf_map_alloc(map); if (err) - goto free_map_nouncharge; - - err = bpf_map_init_memlock(map); - if (err) - goto free_map_sec; + goto free_map; err = bpf_map_alloc_id(map); if (err) - goto free_map; + goto free_map_sec; err = bpf_map_new_fd(map, f_flags); if (err < 0) { @@ -574,16 +629,15 @@ static int map_create(union bpf_attr *attr) return err; } - trace_bpf_map_create(map, err); return err; -free_map: - bpf_map_release_memlock(map); free_map_sec: security_bpf_map_free(map); -free_map_nouncharge: +free_map: btf_put(map->btf); + bpf_map_charge_move(&mem, &map->memory); map->ops->map_free(map); + bpf_map_charge_finish(&mem); return err; } @@ -615,6 +669,7 @@ struct bpf_map *bpf_map_inc(struct bpf_map *map, bool uref) atomic_inc(&map->usercnt); return map; } +EXPORT_SYMBOL_GPL(bpf_map_inc); struct bpf_map *bpf_map_get_with_uref(u32 ufd) { @@ -632,12 +687,12 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd) } /* map_idr_lock should have been held */ -static struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map, - bool uref) +static struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, + bool uref) { int refold; - refold = __atomic_add_unless(&map->refcnt, 1, 0); + refold = atomic_fetch_add_unless(&map->refcnt, 1, 0); if (refold >= BPF_MAX_REFCNT) { __bpf_map_put(map, false); @@ -653,11 +708,32 @@ static struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map, return map; } +struct bpf_map *bpf_map_inc_not_zero(struct bpf_map *map, bool uref) +{ + spin_lock_bh(&map_idr_lock); + map = __bpf_map_inc_not_zero(map, uref); + spin_unlock_bh(&map_idr_lock); + + return map; +} +EXPORT_SYMBOL_GPL(bpf_map_inc_not_zero); + int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value) { return -ENOTSUPP; } +static void *__bpf_copy_key(void __user *ukey, u64 key_size) +{ + if (key_size) + return memdup_user(ukey, key_size); + + if (ukey) + return ERR_PTR(-EINVAL); + + return NULL; +} + /* last field in 'union bpf_attr' used by this command */ #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD flags @@ -682,8 +758,7 @@ static int map_lookup_elem(union bpf_attr *attr) map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); - - if (!(f.file->f_mode & FMODE_CAN_READ)) { + if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) { err = -EPERM; goto err_put; } @@ -694,7 +769,7 @@ static int map_lookup_elem(union bpf_attr *attr) goto err_put; } - key = memdup_user(ukey, map->key_size); + key = __bpf_copy_key(ukey, map->key_size); if (IS_ERR(key)) { err = PTR_ERR(key); goto err_put; @@ -717,8 +792,13 @@ static int map_lookup_elem(union bpf_attr *attr) if (bpf_map_is_dev_bound(map)) { err = bpf_map_offload_lookup_elem(map, key, value); - } else if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || - map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { + goto done; + } + + preempt_disable(); + this_cpu_inc(bpf_prog_active); + if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || + map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { err = bpf_percpu_hash_copy(map, key, value); } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { err = bpf_percpu_array_copy(map, key, value); @@ -726,20 +806,27 @@ static int map_lookup_elem(union bpf_attr *attr) err = bpf_percpu_cgroup_storage_copy(map, key, value); } else if (map->map_type == BPF_MAP_TYPE_STACK_TRACE) { err = bpf_stackmap_copy(map, key, value); - } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) { - err = bpf_percpu_cgroup_storage_update(map, key, value, - attr->flags); } else if (IS_FD_ARRAY(map)) { err = bpf_fd_array_map_lookup_elem(map, key, value); } else if (IS_FD_HASH(map)) { err = bpf_fd_htab_map_lookup_elem(map, key, value); + } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) { + err = bpf_fd_reuseport_array_lookup_elem(map, key, value); + } else if (map->map_type == BPF_MAP_TYPE_QUEUE || + map->map_type == BPF_MAP_TYPE_STACK) { + err = map->ops->map_peek_elem(map, value); } else { rcu_read_lock(); if (map->ops->map_lookup_elem_sys_only) ptr = map->ops->map_lookup_elem_sys_only(map, key); else ptr = map->ops->map_lookup_elem(map, key); - if (ptr) { + if (IS_ERR(ptr)) { + err = PTR_ERR(ptr); + } else if (!ptr) { + err = -ENOENT; + } else { + err = 0; if (attr->flags & BPF_F_LOCK) /* lock 'ptr' and copy everything but lock */ copy_map_value_locked(map, value, ptr, true); @@ -749,9 +836,11 @@ static int map_lookup_elem(union bpf_attr *attr) check_and_init_map_lock(map, value); } rcu_read_unlock(); - err = ptr ? 0 : -ENOENT; } + this_cpu_dec(bpf_prog_active); + preempt_enable(); +done: if (err) goto free_value; @@ -759,7 +848,6 @@ static int map_lookup_elem(union bpf_attr *attr) if (copy_to_user(uvalue, value, value_size) != 0) goto free_value; - trace_bpf_map_lookup_elem(map, ufd, key, value); err = 0; free_value: @@ -802,8 +890,7 @@ static int map_update_elem(union bpf_attr *attr) map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); - - if (!(f.file->f_mode & FMODE_CAN_WRITE)) { + if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) { err = -EPERM; goto err_put; } @@ -814,7 +901,7 @@ static int map_update_elem(union bpf_attr *attr) goto err_put; } - key = memdup_user(ukey, map->key_size); + key = __bpf_copy_key(ukey, map->key_size); if (IS_ERR(key)) { err = PTR_ERR(key); goto err_put; @@ -841,7 +928,9 @@ static int map_update_elem(union bpf_attr *attr) if (bpf_map_is_dev_bound(map)) { err = bpf_map_offload_update_elem(map, key, value, attr->flags); goto out; - } else if (map->map_type == BPF_MAP_TYPE_CPUMAP) { + } else if (map->map_type == BPF_MAP_TYPE_CPUMAP || + map->map_type == BPF_MAP_TYPE_SOCKHASH || + map->map_type == BPF_MAP_TYPE_SOCKMAP) { err = map->ops->map_update_elem(map, key, value, attr->flags); goto out; } @@ -856,10 +945,10 @@ static int map_update_elem(union bpf_attr *attr) err = bpf_percpu_hash_update(map, key, value, attr->flags); } else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) { err = bpf_percpu_array_update(map, key, value, attr->flags); - } else if (map->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || - map->map_type == BPF_MAP_TYPE_PROG_ARRAY || - map->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || - map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS) { + } else if (map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) { + err = bpf_percpu_cgroup_storage_update(map, key, value, + attr->flags); + } else if (IS_FD_ARRAY(map)) { rcu_read_lock(); err = bpf_fd_array_map_update_elem(map, f.file, key, value, attr->flags); @@ -869,6 +958,13 @@ static int map_update_elem(union bpf_attr *attr) err = bpf_fd_htab_map_update_elem(map, f.file, key, value, attr->flags); rcu_read_unlock(); + } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) { + /* rcu_read_lock() is not needed */ + err = bpf_fd_reuseport_array_update_elem(map, key, value, + attr->flags); + } else if (map->map_type == BPF_MAP_TYPE_QUEUE || + map->map_type == BPF_MAP_TYPE_STACK) { + err = map->ops->map_push_elem(map, value, attr->flags); } else { rcu_read_lock(); err = map->ops->map_update_elem(map, key, value, attr->flags); @@ -877,10 +973,7 @@ static int map_update_elem(union bpf_attr *attr) __this_cpu_dec(bpf_prog_active); preempt_enable(); maybe_wait_bpf_programs(map); - out: - if (!err) - trace_bpf_map_update_elem(map, ufd, key, value); free_value: kfree(value); free_key: @@ -908,13 +1001,12 @@ static int map_delete_elem(union bpf_attr *attr) map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); - - if (!(f.file->f_mode & FMODE_CAN_WRITE)) { + if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) { err = -EPERM; goto err_put; } - key = memdup_user(ukey, map->key_size); + key = __bpf_copy_key(ukey, map->key_size); if (IS_ERR(key)) { err = PTR_ERR(key); goto err_put; @@ -933,10 +1025,7 @@ static int map_delete_elem(union bpf_attr *attr) __this_cpu_dec(bpf_prog_active); preempt_enable(); maybe_wait_bpf_programs(map); - out: - if (!err) - trace_bpf_map_delete_elem(map, ufd, key); kfree(key); err_put: fdput(f); @@ -963,14 +1052,13 @@ static int map_get_next_key(union bpf_attr *attr) map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); - - if (!(f.file->f_mode & FMODE_CAN_READ)) { + if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) { err = -EPERM; goto err_put; } if (ukey) { - key = memdup_user(ukey, map->key_size); + key = __bpf_copy_key(ukey, map->key_size); if (IS_ERR(key)) { err = PTR_ERR(key); goto err_put; @@ -1000,7 +1088,6 @@ out: if (copy_to_user(unext_key, next_key, map->key_size) != 0) goto free_next_key; - trace_bpf_map_next_key(map, ufd, key, next_key); err = 0; free_next_key: @@ -1012,6 +1099,101 @@ err_put: return err; } +#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value + +static int map_lookup_and_delete_elem(union bpf_attr *attr) +{ + void __user *ukey = u64_to_user_ptr(attr->key); + void __user *uvalue = u64_to_user_ptr(attr->value); + int ufd = attr->map_fd; + struct bpf_map *map; + void *key, *value; + u32 value_size; + struct fd f; + int err; + + if (CHECK_ATTR(BPF_MAP_LOOKUP_AND_DELETE_ELEM)) + return -EINVAL; + + f = fdget(ufd); + map = __bpf_map_get(f); + if (IS_ERR(map)) + return PTR_ERR(map); + if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ) || + !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) { + err = -EPERM; + goto err_put; + } + + key = __bpf_copy_key(ukey, map->key_size); + if (IS_ERR(key)) { + err = PTR_ERR(key); + goto err_put; + } + + value_size = map->value_size; + + err = -ENOMEM; + value = kmalloc(value_size, GFP_USER | __GFP_NOWARN); + if (!value) + goto free_key; + + if (map->map_type == BPF_MAP_TYPE_QUEUE || + map->map_type == BPF_MAP_TYPE_STACK) { + err = map->ops->map_pop_elem(map, value); + } else { + err = -ENOTSUPP; + } + + if (err) + goto free_value; + + if (copy_to_user(uvalue, value, value_size) != 0) { + err = -EFAULT; + goto free_value; + } + + err = 0; + +free_value: + kfree(value); +free_key: + kfree(key); +err_put: + fdput(f); + return err; +} + +#define BPF_MAP_FREEZE_LAST_FIELD map_fd + +static int map_freeze(const union bpf_attr *attr) +{ + int err = 0, ufd = attr->map_fd; + struct bpf_map *map; + struct fd f; + + if (CHECK_ATTR(BPF_MAP_FREEZE)) + return -EINVAL; + + f = fdget(ufd); + map = __bpf_map_get(f); + if (IS_ERR(map)) + return PTR_ERR(map); + if (READ_ONCE(map->frozen)) { + err = -EBUSY; + goto err_put; + } + if (!capable(CAP_SYS_ADMIN)) { + err = -EPERM; + goto err_put; + } + + WRITE_ONCE(map->frozen, true); +err_put: + fdput(f); + return err; +} + static const struct bpf_prog_ops * const bpf_prog_types[] = { #define BPF_PROG_TYPE(_id, _name) \ [_id] = & _name ## _prog_ops, @@ -1023,11 +1205,17 @@ static const struct bpf_prog_ops * const bpf_prog_types[] = { static int find_prog_type(enum bpf_prog_type type, struct bpf_prog *prog) { - if (type >= ARRAY_SIZE(bpf_prog_types) || !bpf_prog_types[type]) + const struct bpf_prog_ops *ops; + + if (type >= ARRAY_SIZE(bpf_prog_types)) + return -EINVAL; + type = array_index_nospec(type, ARRAY_SIZE(bpf_prog_types)); + ops = bpf_prog_types[type]; + if (!ops) return -EINVAL; if (!bpf_prog_is_dev_bound(prog->aux)) - prog->aux->ops = bpf_prog_types[type]; + prog->aux->ops = ops; else prog->aux->ops = &bpf_offload_prog_ops; prog->type = type; @@ -1102,11 +1290,13 @@ static int bpf_prog_alloc_id(struct bpf_prog *prog) { int id; + idr_preload(GFP_KERNEL); spin_lock_bh(&prog_idr_lock); id = idr_alloc_cyclic(&prog_idr, prog, 1, INT_MAX, GFP_ATOMIC); if (id > 0) prog->aux->id = id; spin_unlock_bh(&prog_idr_lock); + idr_preload_end(); /* id is in [1, INT_MAX) */ if (WARN_ON_ONCE(!id)) @@ -1131,8 +1321,8 @@ void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock) __acquire(&prog_idr_lock); idr_remove(&prog_idr, prog->aux->id); - prog->aux->id = 0; + if (do_idr_lock) spin_unlock_bh(&prog_idr_lock); else @@ -1143,24 +1333,32 @@ static void __bpf_prog_put_rcu(struct rcu_head *rcu) { struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu); + kvfree(aux->func_info); free_used_maps(aux); bpf_prog_uncharge_memlock(aux->prog); security_bpf_prog_free(aux); bpf_prog_free(aux->prog); } +static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred) +{ + bpf_prog_kallsyms_del_all(prog); + btf_put(prog->aux->btf); + bpf_prog_free_linfo(prog); + + if (deferred) + call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); + else + __bpf_prog_put_rcu(&prog->aux->rcu); +} + static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock) { if (atomic_dec_and_test(&prog->aux->refcnt)) { - trace_bpf_prog_put_rcu(prog); + perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_UNLOAD, 0); /* bpf_prog_free_id() must be called first */ bpf_prog_free_id(prog, do_idr_lock); - bpf_prog_kallsyms_del(prog); - btf_put(prog->aux->btf); - kvfree(prog->aux->func_info); - bpf_prog_free_linfo(prog); - - call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); + __bpf_prog_put_noref(prog, true); } } @@ -1178,22 +1376,54 @@ static int bpf_prog_release(struct inode *inode, struct file *filp) return 0; } +static void bpf_prog_get_stats(const struct bpf_prog *prog, + struct bpf_prog_stats *stats) +{ + u64 nsecs = 0, cnt = 0; + int cpu; + + for_each_possible_cpu(cpu) { + const struct bpf_prog_stats *st; + unsigned int start; + u64 tnsecs, tcnt; + + st = per_cpu_ptr(prog->aux->stats, cpu); + do { + start = u64_stats_fetch_begin_irq(&st->syncp); + tnsecs = st->nsecs; + tcnt = st->cnt; + } while (u64_stats_fetch_retry_irq(&st->syncp, start)); + nsecs += tnsecs; + cnt += tcnt; + } + stats->nsecs = nsecs; + stats->cnt = cnt; +} + #ifdef CONFIG_PROC_FS static void bpf_prog_show_fdinfo(struct seq_file *m, struct file *filp) { const struct bpf_prog *prog = filp->private_data; char prog_tag[sizeof(prog->tag) * 2 + 1] = { }; + struct bpf_prog_stats stats; + bpf_prog_get_stats(prog, &stats); bin2hex(prog_tag, prog->tag, sizeof(prog->tag)); seq_printf(m, "prog_type:\t%u\n" "prog_jited:\t%u\n" "prog_tag:\t%s\n" - "memlock:\t%llu\n", + "memlock:\t%llu\n" + "prog_id:\t%u\n" + "run_time_ns:\t%llu\n" + "run_cnt:\t%llu\n", prog->type, prog->jited, prog_tag, - prog->pages * 1ULL << PAGE_SHIFT); + prog->pages * 1ULL << PAGE_SHIFT, + prog->aux->id, + stats.nsecs, + stats.cnt); } #endif @@ -1262,7 +1492,7 @@ struct bpf_prog *bpf_prog_inc_not_zero(struct bpf_prog *prog) { int refold; - refold = __atomic_add_unless(&prog->aux->refcnt, 1, 0); + refold = atomic_fetch_add_unless(&prog->aux->refcnt, 1, 0); if (refold >= BPF_MAX_REFCNT) { __bpf_prog_put(prog, false); @@ -1276,7 +1506,23 @@ struct bpf_prog *bpf_prog_inc_not_zero(struct bpf_prog *prog) } EXPORT_SYMBOL_GPL(bpf_prog_inc_not_zero); -static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *attach_type) +bool bpf_prog_get_ok(struct bpf_prog *prog, + enum bpf_prog_type *attach_type, bool attach_drv) +{ + /* not an attachment, just a refcount inc, always allow */ + if (!attach_type) + return true; + + if (prog->type != *attach_type) + return false; + if (bpf_prog_is_dev_bound(prog->aux) && !attach_drv) + return false; + + return true; +} + +static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *attach_type, + bool attach_drv) { struct fd f = fdget(ufd); struct bpf_prog *prog; @@ -1284,7 +1530,7 @@ static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *attach_type) prog = ____bpf_prog_get(f); if (IS_ERR(prog)) return prog; - if (attach_type && (prog->type != *attach_type || prog->aux->offload)) { + if (!bpf_prog_get_ok(prog, attach_type, attach_drv)) { prog = ERR_PTR(-EINVAL); goto out; } @@ -1297,18 +1543,15 @@ out: struct bpf_prog *bpf_prog_get(u32 ufd) { - return __bpf_prog_get(ufd, NULL); + return __bpf_prog_get(ufd, NULL, false); } -struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type) +struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type, + bool attach_drv) { - struct bpf_prog *prog = __bpf_prog_get(ufd, &type); - - if (!IS_ERR(prog)) - trace_bpf_prog_get_type(prog); - return prog; + return __bpf_prog_get(ufd, &type, attach_drv); } -EXPORT_SYMBOL_GPL(bpf_prog_get_type); +EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev); /* Initially all BPF programs could be loaded w/o specifying * expected_attach_type. Later for some of them specifying expected_attach_type @@ -1364,6 +1607,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type, default: return -EINVAL; } + case BPF_PROG_TYPE_CGROUP_SKB: + switch (expected_attach_type) { + case BPF_CGROUP_INET_INGRESS: + case BPF_CGROUP_INET_EGRESS: + return 0; + default: + return -EINVAL; + } case BPF_PROG_TYPE_CGROUP_SOCKOPT: switch (expected_attach_type) { case BPF_CGROUP_SETSOCKOPT: @@ -1377,19 +1628,6 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type, } } -static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog, - enum bpf_attach_type attach_type) -{ - switch (prog->type) { - case BPF_PROG_TYPE_CGROUP_SOCK: - case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: - case BPF_PROG_TYPE_CGROUP_SOCKOPT: - return attach_type == prog->expected_attach_type ? 0 : -EINVAL; - default: - return 0; - } -} - /* last field in 'union bpf_attr' used by this command */ #define BPF_PROG_LOAD_LAST_FIELD line_info_cnt @@ -1404,9 +1642,17 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) if (CHECK_ATTR(BPF_PROG_LOAD)) return -EINVAL; - if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT) + if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | + BPF_F_ANY_ALIGNMENT | + BPF_F_TEST_STATE_FREQ | + BPF_F_TEST_RND_HI32)) return -EINVAL; + if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && + (attr->prog_flags & BPF_F_ANY_ALIGNMENT) && + !capable(CAP_SYS_ADMIN)) + return -EPERM; + /* copy eBPF program license from user space */ if (strncpy_from_user(license, u64_to_user_ptr(attr->license), sizeof(license) - 1) < 0) @@ -1416,13 +1662,9 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) /* eBPF programs must be GPL compatible to use GPL-ed functions */ is_gpl = license_is_gpl_compatible(license); - if (attr->insn_cnt == 0 || attr->insn_cnt > BPF_MAXINSNS) + if (attr->insn_cnt == 0 || + attr->insn_cnt > (capable(CAP_SYS_ADMIN) ? BPF_COMPLEXITY_LIMIT_INSNS : BPF_MAXINSNS)) return -E2BIG; - - if (type == BPF_PROG_TYPE_KPROBE && - attr->kern_version != LINUX_VERSION_CODE) - return -EINVAL; - if (type != BPF_PROG_TYPE_SOCKET_FILTER && type != BPF_PROG_TYPE_CGROUP_SKB && !capable(CAP_SYS_ADMIN)) @@ -1439,6 +1681,8 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) prog->expected_attach_type = attr->expected_attach_type; + prog->aux->offload_requested = !!attr->prog_ifindex; + err = security_bpf_prog_alloc(prog->aux); if (err) goto free_prog_nouncharge; @@ -1460,7 +1704,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) atomic_set(&prog->aux->refcnt, 1); prog->gpl_compatible = is_gpl ? 1 : 0; - if (attr->prog_ifindex) { + if (bpf_prog_is_dev_bound(prog->aux)) { err = bpf_prog_offload_init(prog, attr); if (err) goto free_prog; @@ -1471,7 +1715,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) if (err < 0) goto free_prog; - prog->aux->load_time = ktime_get_boot_ns(); + prog->aux->load_time = ktime_get_boottime_ns(); err = bpf_obj_name_cpy(prog->aux->name, attr->prog_name); if (err) goto free_prog; @@ -1504,7 +1748,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) * be using bpf_prog_put() given the program is exposed. */ bpf_prog_kallsyms_add(prog); - trace_bpf_prog_load(prog, err); + perf_event_bpf_event(prog, PERF_BPF_EVENT_PROG_LOAD, 0); err = bpf_prog_new_fd(prog); if (err < 0) @@ -1512,8 +1756,12 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) return err; free_used_maps: - bpf_prog_free_linfo(prog); - free_used_maps(prog->aux); + /* In case we have subprogs, we need to wait for a grace + * period before we can tear down JIT memory since symbols + * are already exposed under kallsyms. + */ + __bpf_prog_put_noref(prog, prog->aux->func_cnt); + return err; free_prog: bpf_prog_uncharge_memlock(prog); free_prog_sec: @@ -1556,17 +1804,19 @@ static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp) bpf_probe_unregister(raw_tp->btp, raw_tp->prog); bpf_prog_put(raw_tp->prog); } + bpf_put_raw_tracepoint(raw_tp->btp); kfree(raw_tp); return 0; } static const struct file_operations bpf_raw_tp_fops = { - .release = bpf_raw_tracepoint_release, - .read = bpf_dummy_read, - .write = bpf_dummy_write, + .release = bpf_raw_tracepoint_release, + .read = bpf_dummy_read, + .write = bpf_dummy_write, }; #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd + static int bpf_raw_tracepoint_open(const union bpf_attr *attr) { struct bpf_raw_tracepoint *raw_tp; @@ -1576,17 +1826,19 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) int tp_fd, err; if (strncpy_from_user(tp_name, u64_to_user_ptr(attr->raw_tracepoint.name), - sizeof(tp_name) - 1) < 0) + sizeof(tp_name) - 1) < 0) return -EFAULT; tp_name[sizeof(tp_name) - 1] = 0; - btp = bpf_find_raw_tracepoint(tp_name); + btp = bpf_get_raw_tracepoint(tp_name); if (!btp) return -ENOENT; raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER); - if (!raw_tp) - return -ENOMEM; + if (!raw_tp) { + err = -ENOMEM; + goto out_put_btp; + } raw_tp->btp = btp; prog = bpf_prog_get(attr->raw_tracepoint.prog_fd); @@ -1594,7 +1846,6 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) err = PTR_ERR(prog); goto out_free_tp; } - if (prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT && prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE) { err = -EINVAL; @@ -1607,7 +1858,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) raw_tp->prog = prog; tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp, - O_CLOEXEC); + O_CLOEXEC); if (tp_fd < 0) { bpf_probe_unregister(raw_tp->btp, prog); err = tp_fd; @@ -1619,10 +1870,27 @@ out_put_prog: bpf_prog_put(prog); out_free_tp: kfree(raw_tp); +out_put_btp: + bpf_put_raw_tracepoint(btp); return err; } -#ifdef CONFIG_CGROUP_BPF +static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog, + enum bpf_attach_type attach_type) +{ + switch (prog->type) { + case BPF_PROG_TYPE_CGROUP_SOCK: + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: + case BPF_PROG_TYPE_CGROUP_SOCKOPT: + return attach_type == prog->expected_attach_type ? 0 : -EINVAL; + case BPF_PROG_TYPE_CGROUP_SKB: + return prog->enforce_expected_attach_type && + prog->expected_attach_type != attach_type ? + -EINVAL : 0; + default: + return 0; + } +} #define BPF_PROG_ATTACH_LAST_FIELD attach_flags @@ -1633,7 +1901,6 @@ static int bpf_prog_attach(const union bpf_attr *attr) { enum bpf_prog_type ptype; struct bpf_prog *prog; - struct cgroup *cgrp; int ret; if (!capable(CAP_NET_ADMIN)) @@ -1668,15 +1935,18 @@ static int bpf_prog_attach(const union bpf_attr *attr) case BPF_CGROUP_SOCK_OPS: ptype = BPF_PROG_TYPE_SOCK_OPS; break; - case BPF_CGROUP_DEVICE: - ptype = BPF_PROG_TYPE_CGROUP_DEVICE; - break; + case BPF_CGROUP_DEVICE: + ptype = BPF_PROG_TYPE_CGROUP_DEVICE; + break; case BPF_SK_MSG_VERDICT: - ret = sock_map_get_from_fd(attr, prog); + ptype = BPF_PROG_TYPE_SK_MSG; break; case BPF_SK_SKB_STREAM_PARSER: case BPF_SK_SKB_STREAM_VERDICT: - ret = sock_map_get_from_fd(attr, prog); + ptype = BPF_PROG_TYPE_SK_SKB; + break; + case BPF_LIRC_MODE2: + ptype = BPF_PROG_TYPE_LIRC_MODE2; break; case BPF_FLOW_DISSECTOR: ptype = BPF_PROG_TYPE_FLOW_DISSECTOR; @@ -1701,18 +1971,23 @@ static int bpf_prog_attach(const union bpf_attr *attr) return -EINVAL; } - cgrp = cgroup_get_from_fd(attr->target_fd); - if (IS_ERR(cgrp)) { - bpf_prog_put(prog); - return PTR_ERR(cgrp); + switch (ptype) { + case BPF_PROG_TYPE_SK_SKB: + case BPF_PROG_TYPE_SK_MSG: + ret = sock_map_get_from_fd(attr, prog); + break; + case BPF_PROG_TYPE_LIRC_MODE2: + ret = lirc_prog_attach(attr, prog); + break; + case BPF_PROG_TYPE_FLOW_DISSECTOR: + ret = skb_flow_dissector_bpf_prog_attach(attr, prog); + break; + default: + ret = cgroup_bpf_prog_attach(attr, ptype, prog); } - ret = cgroup_bpf_attach(cgrp, prog, attr->attach_type, - attr->attach_flags); if (ret) bpf_prog_put(prog); - cgroup_put(cgrp); - return ret; } @@ -1721,9 +1996,6 @@ static int bpf_prog_attach(const union bpf_attr *attr) static int bpf_prog_detach(const union bpf_attr *attr) { enum bpf_prog_type ptype; - struct bpf_prog *prog; - struct cgroup *cgrp; - int ret; if (!capable(CAP_NET_ADMIN)) return -EPERM; @@ -1754,14 +2026,16 @@ static int bpf_prog_detach(const union bpf_attr *attr) case BPF_CGROUP_SOCK_OPS: ptype = BPF_PROG_TYPE_SOCK_OPS; break; - case BPF_CGROUP_DEVICE: - ptype = BPF_PROG_TYPE_CGROUP_DEVICE; - break; + case BPF_CGROUP_DEVICE: + ptype = BPF_PROG_TYPE_CGROUP_DEVICE; + break; case BPF_SK_MSG_VERDICT: - return sock_map_get_from_fd(attr, NULL); + return sock_map_prog_detach(attr, BPF_PROG_TYPE_SK_MSG); case BPF_SK_SKB_STREAM_PARSER: case BPF_SK_SKB_STREAM_VERDICT: - return sock_map_get_from_fd(attr, NULL); + return sock_map_prog_detach(attr, BPF_PROG_TYPE_SK_SKB); + case BPF_LIRC_MODE2: + return lirc_prog_detach(attr); case BPF_FLOW_DISSECTOR: return skb_flow_dissector_bpf_prog_detach(attr); case BPF_CGROUP_SYSCTL: @@ -1775,19 +2049,7 @@ static int bpf_prog_detach(const union bpf_attr *attr) return -EINVAL; } - cgrp = cgroup_get_from_fd(attr->target_fd); - if (IS_ERR(cgrp)) - return PTR_ERR(cgrp); - - prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype); - if (IS_ERR(prog)) - prog = NULL; - - ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, 0); - if (prog) - bpf_prog_put(prog); - cgroup_put(cgrp); - return ret; + return cgroup_bpf_prog_detach(attr, ptype); } #define BPF_PROG_QUERY_LAST_FIELD query.prog_cnt @@ -1795,9 +2057,6 @@ static int bpf_prog_detach(const union bpf_attr *attr) static int bpf_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) { - struct cgroup *cgrp; - int ret; - if (!capable(CAP_NET_ADMIN)) return -EPERM; if (CHECK_ATTR(BPF_PROG_QUERY)) @@ -1820,23 +2079,23 @@ static int bpf_prog_query(const union bpf_attr *attr, case BPF_CGROUP_UDP4_RECVMSG: case BPF_CGROUP_UDP6_RECVMSG: case BPF_CGROUP_SOCK_OPS: + case BPF_CGROUP_DEVICE: case BPF_CGROUP_SYSCTL: case BPF_CGROUP_GETSOCKOPT: case BPF_CGROUP_SETSOCKOPT: break; + case BPF_LIRC_MODE2: + return lirc_prog_query(attr, uattr); + case BPF_FLOW_DISSECTOR: + return skb_flow_dissector_prog_query(attr, uattr); default: return -EINVAL; } - cgrp = cgroup_get_from_fd(attr->query.target_fd); - if (IS_ERR(cgrp)) - return PTR_ERR(cgrp); - ret = cgroup_bpf_query(cgrp, attr, uattr); - cgroup_put(cgrp); - return ret; -} -#endif /* CONFIG_CGROUP_BPF */ -#define BPF_PROG_TEST_RUN_LAST_FIELD test.duration + return cgroup_bpf_prog_query(attr, uattr); +} + +#define BPF_PROG_TEST_RUN_LAST_FIELD test.ctx_out static int bpf_prog_test_run(const union bpf_attr *attr, union bpf_attr __user *uattr) @@ -1844,9 +2103,19 @@ static int bpf_prog_test_run(const union bpf_attr *attr, struct bpf_prog *prog; int ret = -ENOTSUPP; + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; if (CHECK_ATTR(BPF_PROG_TEST_RUN)) return -EINVAL; + if ((attr->test.ctx_size_in && !attr->test.ctx_in) || + (!attr->test.ctx_size_in && attr->test.ctx_in)) + return -EINVAL; + + if ((attr->test.ctx_size_out && !attr->test.ctx_out) || + (!attr->test.ctx_size_out && attr->test.ctx_out)) + return -EINVAL; + prog = bpf_prog_get(attr->test.prog_fd); if (IS_ERR(prog)) return PTR_ERR(prog); @@ -1941,7 +2210,7 @@ static int bpf_map_get_fd_by_id(const union bpf_attr *attr) spin_lock_bh(&map_idr_lock); map = idr_find(&map_idr, id); if (map) - map = bpf_map_inc_not_zero(map, true); + map = __bpf_map_inc_not_zero(map, true); else map = ERR_PTR(-ENOENT); spin_unlock_bh(&map_idr_lock); @@ -1980,7 +2249,8 @@ static const struct bpf_map *bpf_map_from_imm(const struct bpf_prog *prog, return NULL; } -static struct bpf_insn *bpf_insn_prepare_dump(const struct bpf_prog *prog) +static struct bpf_insn *bpf_insn_prepare_dump(const struct bpf_prog *prog, + const struct cred *f_cred) { const struct bpf_map *map; struct bpf_insn *insns; @@ -1999,12 +2269,11 @@ static struct bpf_insn *bpf_insn_prepare_dump(const struct bpf_prog *prog) insns[i].imm = BPF_FUNC_tail_call; /* fall-through */ } - if (insns[i].code == (BPF_JMP | BPF_CALL) || insns[i].code == (BPF_JMP | BPF_CALL_ARGS)) { if (insns[i].code == (BPF_JMP | BPF_CALL_ARGS)) insns[i].code = BPF_JMP | BPF_CALL; - if (!bpf_dump_raw_ok()) + if (!bpf_dump_raw_ok(f_cred)) insns[i].imm = 0; continue; } @@ -2014,20 +2283,12 @@ static struct bpf_insn *bpf_insn_prepare_dump(const struct bpf_prog *prog) imm = ((u64)insns[i + 1].imm << 32) | (u32)insns[i].imm; map = bpf_map_from_imm(prog, imm, &off, &type); - if (map) { insns[i].src_reg = type; insns[i].imm = map->id; insns[i + 1].imm = off; continue; } - - if (!bpf_dump_raw_ok() && - imm == (unsigned long)prog->aux) { - insns[i].imm = 0; - insns[i + 1].imm = 0; - continue; - } } return insns; @@ -2044,15 +2305,19 @@ static int set_info_rec_size(struct bpf_prog_info *info) * zero. In this case, the kernel will set the expected * _rec_size back to the info. */ - if ((info->func_info_cnt || info->func_info_rec_size) && + + if ((info->nr_func_info || info->func_info_rec_size) && info->func_info_rec_size != sizeof(struct bpf_func_info)) return -EINVAL; - if ((info->line_info_cnt || info->line_info_rec_size) && + + if ((info->nr_line_info || info->line_info_rec_size) && info->line_info_rec_size != sizeof(struct bpf_line_info)) return -EINVAL; - if ((info->jited_line_info_cnt || info->jited_line_info_rec_size) && + + if ((info->nr_jited_line_info || info->jited_line_info_rec_size) && info->jited_line_info_rec_size != sizeof(__u64)) return -EINVAL; + info->func_info_rec_size = sizeof(struct bpf_func_info); info->line_info_rec_size = sizeof(struct bpf_line_info); info->jited_line_info_rec_size = sizeof(__u64); @@ -2060,14 +2325,15 @@ static int set_info_rec_size(struct bpf_prog_info *info) return 0; } - -static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, +static int bpf_prog_get_info_by_fd(struct file *file, + struct bpf_prog *prog, const union bpf_attr *attr, union bpf_attr __user *uattr) { struct bpf_prog_info __user *uinfo = u64_to_user_ptr(attr->info.info); struct bpf_prog_info info; u32 info_len = attr->info.info_len; + struct bpf_prog_stats stats; char __user *uinsns; u32 ulen; int err; @@ -2108,40 +2374,32 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, if (err) return err; + bpf_prog_get_stats(prog, &stats); + info.run_time_ns = stats.nsecs; + info.run_cnt = stats.cnt; + if (!capable(CAP_SYS_ADMIN)) { info.jited_prog_len = 0; info.xlated_prog_len = 0; info.nr_jited_ksyms = 0; info.nr_jited_func_lens = 0; - info.func_info_cnt = 0; - info.line_info_cnt = 0; - info.jited_line_info_cnt = 0; + info.nr_func_info = 0; + info.nr_line_info = 0; + info.nr_jited_line_info = 0; goto done; } - ulen = info.jited_prog_len; - info.jited_prog_len = prog->jited_len; - if (info.jited_prog_len && ulen) { - if (bpf_dump_raw_ok()) { - uinsns = u64_to_user_ptr(info.jited_prog_insns); - ulen = min_t(u32, info.jited_prog_len, ulen); - if (copy_to_user(uinsns, prog->bpf_func, ulen)) - return -EFAULT; - } else { - info.jited_prog_insns = 0; - } - } - ulen = info.xlated_prog_len; info.xlated_prog_len = bpf_prog_insn_size(prog); if (info.xlated_prog_len && ulen) { struct bpf_insn *insns_sanitized; bool fault; - if (prog->blinded && !bpf_dump_raw_ok()) { + + if (prog->blinded && !bpf_dump_raw_ok(file->f_cred)) { info.xlated_prog_insns = 0; goto done; } - insns_sanitized = bpf_insn_prepare_dump(prog); + insns_sanitized = bpf_insn_prepare_dump(prog, file->f_cred); if (!insns_sanitized) return -ENOMEM; uinsns = u64_to_user_ptr(info.xlated_prog_insns); @@ -2156,14 +2414,63 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, err = bpf_prog_offload_info_fill(&info, prog); if (err) return err; + goto done; + } + + /* NOTE: the following code is supposed to be skipped for offload. + * bpf_prog_offload_info_fill() is the place to fill similar fields + * for offload. + */ + ulen = info.jited_prog_len; + if (prog->aux->func_cnt) { + u32 i; + + info.jited_prog_len = 0; + for (i = 0; i < prog->aux->func_cnt; i++) + info.jited_prog_len += prog->aux->func[i]->jited_len; + } else { + info.jited_prog_len = prog->jited_len; + } + + if (info.jited_prog_len && ulen) { + if (bpf_dump_raw_ok(file->f_cred)) { + uinsns = u64_to_user_ptr(info.jited_prog_insns); + ulen = min_t(u32, info.jited_prog_len, ulen); + + /* for multi-function programs, copy the JITed + * instructions for all the functions + */ + if (prog->aux->func_cnt) { + u32 len, free, i; + u8 *img; + + free = ulen; + for (i = 0; i < prog->aux->func_cnt; i++) { + len = prog->aux->func[i]->jited_len; + len = min_t(u32, len, free); + img = (u8 *) prog->aux->func[i]->bpf_func; + if (copy_to_user(uinsns, img, len)) + return -EFAULT; + uinsns += len; + free -= len; + if (!free) + break; + } + } else { + if (copy_to_user(uinsns, prog->bpf_func, ulen)) + return -EFAULT; + } + } else { + info.jited_prog_insns = 0; + } } ulen = info.nr_jited_ksyms; - info.nr_jited_ksyms = prog->aux->func_cnt; - if (info.nr_jited_ksyms && ulen) { - if (bpf_dump_raw_ok()) { + info.nr_jited_ksyms = prog->aux->func_cnt ? : 1; + if (ulen) { + if (bpf_dump_raw_ok(file->f_cred)) { + unsigned long ksym_addr; u64 __user *user_ksyms; - ulong ksym_addr; u32 i; /* copy the address of the kernel symbol @@ -2171,10 +2478,17 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, */ ulen = min_t(u32, info.nr_jited_ksyms, ulen); user_ksyms = u64_to_user_ptr(info.jited_ksyms); - for (i = 0; i < ulen; i++) { - ksym_addr = (ulong) prog->aux->func[i]->bpf_func; - ksym_addr &= PAGE_MASK; - if (put_user((u64) ksym_addr, &user_ksyms[i])) + if (prog->aux->func_cnt) { + for (i = 0; i < ulen; i++) { + ksym_addr = (unsigned long) + prog->aux->func[i]->bpf_func; + if (put_user((u64) ksym_addr, + &user_ksyms[i])) + return -EFAULT; + } + } else { + ksym_addr = (unsigned long) prog->bpf_func; + if (put_user((u64) ksym_addr, &user_ksyms[0])) return -EFAULT; } } else { @@ -2183,18 +2497,25 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } ulen = info.nr_jited_func_lens; - info.nr_jited_func_lens = prog->aux->func_cnt; - if (info.nr_jited_func_lens && ulen) { - if (bpf_dump_raw_ok()) { + info.nr_jited_func_lens = prog->aux->func_cnt ? : 1; + if (ulen) { + if (bpf_dump_raw_ok(file->f_cred)) { u32 __user *user_lens; u32 func_len, i; /* copy the JITed image lengths for each function */ ulen = min_t(u32, info.nr_jited_func_lens, ulen); user_lens = u64_to_user_ptr(info.jited_func_lens); - for (i = 0; i < ulen; i++) { - func_len = prog->aux->func[i]->jited_len; - if (put_user(func_len, &user_lens[i])) + if (prog->aux->func_cnt) { + for (i = 0; i < ulen; i++) { + func_len = + prog->aux->func[i]->jited_len; + if (put_user(func_len, &user_lens[i])) + return -EFAULT; + } + } else { + func_len = prog->jited_len; + if (put_user(func_len, &user_lens[0])) return -EFAULT; } } else { @@ -2202,61 +2523,45 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } } - if (prog->aux->btf) { - u32 krec_size = sizeof(struct bpf_func_info); - u32 ucnt, urec_size; - + if (prog->aux->btf) info.btf_id = btf_id(prog->aux->btf); - ucnt = info.func_info_cnt; - info.func_info_cnt = prog->aux->func_info_cnt; - urec_size = info.func_info_rec_size; - info.func_info_rec_size = krec_size; - if (ucnt) { - /* expect passed-in urec_size is what the kernel expects */ - if (urec_size != info.func_info_rec_size) - return -EINVAL; + ulen = info.nr_func_info; + info.nr_func_info = prog->aux->func_info_cnt; + if (info.nr_func_info && ulen) { + char __user *user_finfo; - if (bpf_dump_raw_ok()) { - char __user *user_finfo; - user_finfo = u64_to_user_ptr(info.func_info); - ucnt = min_t(u32, info.func_info_cnt, ucnt); - if (copy_to_user(user_finfo, prog->aux->func_info, - krec_size * ucnt)) - return -EFAULT; - } else { - info.func_info_cnt = 0; - } - } - } else { - info.func_info_cnt = 0; + user_finfo = u64_to_user_ptr(info.func_info); + ulen = min_t(u32, info.nr_func_info, ulen); + if (copy_to_user(user_finfo, prog->aux->func_info, + info.func_info_rec_size * ulen)) + return -EFAULT; } - ulen = info.line_info_cnt; - info.line_info_cnt = prog->aux->nr_linfo; - if (info.line_info_cnt && ulen) { - if (bpf_dump_raw_ok()) { - __u8 __user *user_linfo; - user_linfo = u64_to_user_ptr(info.line_info); - ulen = min_t(u32, info.line_info_cnt, ulen); - if (copy_to_user(user_linfo, prog->aux->linfo, - info.line_info_rec_size * ulen)) - return -EFAULT; - } else { - info.line_info = 0; - } + ulen = info.nr_line_info; + info.nr_line_info = prog->aux->nr_linfo; + if (info.nr_line_info && ulen) { + __u8 __user *user_linfo; + + user_linfo = u64_to_user_ptr(info.line_info); + ulen = min_t(u32, info.nr_line_info, ulen); + if (copy_to_user(user_linfo, prog->aux->linfo, + info.line_info_rec_size * ulen)) + return -EFAULT; } - ulen = info.jited_line_info_cnt; + + ulen = info.nr_jited_line_info; if (prog->aux->jited_linfo) - info.jited_line_info_cnt = prog->aux->nr_linfo; + info.nr_jited_line_info = prog->aux->nr_linfo; else - info.jited_line_info_cnt = 0; - if (info.jited_line_info_cnt && ulen) { - if (bpf_dump_raw_ok()) { + info.nr_jited_line_info = 0; + if (info.nr_jited_line_info && ulen) { + if (bpf_dump_raw_ok(file->f_cred)) { __u64 __user *user_linfo; u32 i; + user_linfo = u64_to_user_ptr(info.jited_line_info); - ulen = min_t(u32, info.jited_line_info_cnt, ulen); + ulen = min_t(u32, info.nr_jited_line_info, ulen); for (i = 0; i < ulen; i++) { if (put_user((__u64)(long)prog->aux->jited_linfo[i], &user_linfo[i])) @@ -2267,6 +2572,28 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog, } } + ulen = info.nr_prog_tags; + info.nr_prog_tags = prog->aux->func_cnt ? : 1; + if (ulen) { + __u8 __user (*user_prog_tags)[BPF_TAG_SIZE]; + u32 i; + + user_prog_tags = u64_to_user_ptr(info.prog_tags); + ulen = min_t(u32, info.nr_prog_tags, ulen); + if (prog->aux->func_cnt) { + for (i = 0; i < ulen; i++) { + if (copy_to_user(user_prog_tags[i], + prog->aux->func[i]->tag, + BPF_TAG_SIZE)) + return -EFAULT; + } + } else { + if (copy_to_user(user_prog_tags[0], + prog->tag, BPF_TAG_SIZE)) + return -EFAULT; + } + } + done: if (copy_to_user(uinfo, &info, info_len) || put_user(info_len, &uattr->info.info_len)) @@ -2275,7 +2602,8 @@ done: return 0; } -static int bpf_map_get_info_by_fd(struct bpf_map *map, +static int bpf_map_get_info_by_fd(struct file *file, + struct bpf_map *map, const union bpf_attr *attr, union bpf_attr __user *uattr) { @@ -2317,6 +2645,22 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map, return 0; } +static int bpf_btf_get_info_by_fd(struct file *file, + struct btf *btf, + const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + struct bpf_btf_info __user *uinfo = u64_to_user_ptr(attr->info.info); + u32 info_len = attr->info.info_len; + int err; + + err = bpf_check_uarg_tail_zero(uinfo, sizeof(*uinfo), info_len); + if (err) + return err; + + return btf_get_info_by_fd(btf, attr, uattr); +} + #define BPF_OBJ_GET_INFO_BY_FD_LAST_FIELD info.info static int bpf_obj_get_info_by_fd(const union bpf_attr *attr, @@ -2334,13 +2678,13 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr, return -EBADFD; if (f.file->f_op == &bpf_prog_fops) - err = bpf_prog_get_info_by_fd(f.file->private_data, attr, + err = bpf_prog_get_info_by_fd(f.file, f.file->private_data, attr, uattr); else if (f.file->f_op == &bpf_map_fops) - err = bpf_map_get_info_by_fd(f.file->private_data, attr, + err = bpf_map_get_info_by_fd(f.file, f.file->private_data, attr, uattr); else if (f.file->f_op == &btf_fops) - err = btf_get_info_by_fd(f.file->private_data, attr, uattr); + err = bpf_btf_get_info_by_fd(f.file, f.file->private_data, attr, uattr); else err = -EINVAL; @@ -2374,6 +2718,134 @@ static int bpf_btf_get_fd_by_id(const union bpf_attr *attr) return btf_get_fd_by_id(attr->btf_id); } +static int bpf_task_fd_query_copy(const union bpf_attr *attr, + union bpf_attr __user *uattr, + u32 prog_id, u32 fd_type, + const char *buf, u64 probe_offset, + u64 probe_addr) +{ + char __user *ubuf = u64_to_user_ptr(attr->task_fd_query.buf); + u32 len = buf ? strlen(buf) : 0, input_len; + int err = 0; + + if (put_user(len, &uattr->task_fd_query.buf_len)) + return -EFAULT; + input_len = attr->task_fd_query.buf_len; + if (input_len && ubuf) { + if (!len) { + /* nothing to copy, just make ubuf NULL terminated */ + char zero = '\0'; + + if (put_user(zero, ubuf)) + return -EFAULT; + } else if (input_len >= len + 1) { + /* ubuf can hold the string with NULL terminator */ + if (copy_to_user(ubuf, buf, len + 1)) + return -EFAULT; + } else { + /* ubuf cannot hold the string with NULL terminator, + * do a partial copy with NULL terminator. + */ + char zero = '\0'; + + err = -ENOSPC; + if (copy_to_user(ubuf, buf, input_len - 1)) + return -EFAULT; + if (put_user(zero, ubuf + input_len - 1)) + return -EFAULT; + } + } + + if (put_user(prog_id, &uattr->task_fd_query.prog_id) || + put_user(fd_type, &uattr->task_fd_query.fd_type) || + put_user(probe_offset, &uattr->task_fd_query.probe_offset) || + put_user(probe_addr, &uattr->task_fd_query.probe_addr)) + return -EFAULT; + + return err; +} + +#define BPF_TASK_FD_QUERY_LAST_FIELD task_fd_query.probe_addr + +static int bpf_task_fd_query(const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + pid_t pid = attr->task_fd_query.pid; + u32 fd = attr->task_fd_query.fd; + const struct perf_event *event; + struct files_struct *files; + struct task_struct *task; + struct file *file; + int err; + + if (CHECK_ATTR(BPF_TASK_FD_QUERY)) + return -EINVAL; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (attr->task_fd_query.flags != 0) + return -EINVAL; + + rcu_read_lock(); + task = get_pid_task(find_vpid(pid), PIDTYPE_PID); + rcu_read_unlock(); + if (!task) + return -ENOENT; + + files = get_files_struct(task); + put_task_struct(task); + if (!files) + return -ENOENT; + + err = 0; + spin_lock(&files->file_lock); + file = fcheck_files(files, fd); + if (!file) + err = -EBADF; + else + get_file(file); + spin_unlock(&files->file_lock); + put_files_struct(files); + + if (err) + goto out; + + if (file->f_op == &bpf_raw_tp_fops) { + struct bpf_raw_tracepoint *raw_tp = file->private_data; + struct bpf_raw_event_map *btp = raw_tp->btp; + + err = bpf_task_fd_query_copy(attr, uattr, + raw_tp->prog->aux->id, + BPF_FD_TYPE_RAW_TRACEPOINT, + btp->tp->name, 0, 0); + goto put_file; + } + + event = perf_get_event(file); + if (!IS_ERR(event)) { + u64 probe_offset, probe_addr; + u32 prog_id, fd_type; + const char *buf; + + err = bpf_get_perf_event_info(event, &prog_id, &fd_type, + &buf, &probe_offset, + &probe_addr); + if (!err) + err = bpf_task_fd_query_copy(attr, uattr, prog_id, + fd_type, buf, + probe_offset, + probe_addr); + goto put_file; + } + + err = -ENOTSUPP; +put_file: + fput(file); +out: + return err; +} + SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) { union bpf_attr attr; @@ -2412,6 +2884,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz case BPF_MAP_GET_NEXT_KEY: err = map_get_next_key(&attr); break; + case BPF_MAP_FREEZE: + err = map_freeze(&attr); + break; case BPF_PROG_LOAD: err = bpf_prog_load(&attr, uattr); break; @@ -2421,7 +2896,6 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz case BPF_OBJ_GET: err = bpf_obj_get(&attr); break; -#ifdef CONFIG_CGROUP_BPF case BPF_PROG_ATTACH: err = bpf_prog_attach(&attr); break; @@ -2431,7 +2905,6 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz case BPF_PROG_QUERY: err = bpf_prog_query(&attr, uattr); break; -#endif case BPF_PROG_TEST_RUN: err = bpf_prog_test_run(&attr, uattr); break; @@ -2443,6 +2916,10 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz err = bpf_obj_get_next_id(&attr, uattr, &map_idr, &map_idr_lock); break; + case BPF_BTF_GET_NEXT_ID: + err = bpf_obj_get_next_id(&attr, uattr, + &btf_idr, &btf_idr_lock); + break; case BPF_PROG_GET_FD_BY_ID: err = bpf_prog_get_fd_by_id(&attr); break; @@ -2461,6 +2938,12 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz case BPF_BTF_GET_FD_BY_ID: err = bpf_btf_get_fd_by_id(&attr); break; + case BPF_TASK_FD_QUERY: + err = bpf_task_fd_query(&attr, uattr); + break; + case BPF_MAP_LOOKUP_AND_DELETE_ELEM: + err = map_lookup_and_delete_elem(&attr); + break; default: err = -EINVAL; break; diff --git a/kernel/bpf/sysfs_btf.c b/kernel/bpf/sysfs_btf.c new file mode 100644 index 000000000000..11b3380887fa --- /dev/null +++ b/kernel/bpf/sysfs_btf.c @@ -0,0 +1,45 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Provide kernel BTF information for introspection and use by eBPF tools. + */ +#include +#include +#include +#include +#include + +/* See scripts/link-vmlinux.sh, gen_btf() func for details */ +extern char __weak __start_BTF[]; +extern char __weak __stop_BTF[]; + +static ssize_t +btf_vmlinux_read(struct file *file, struct kobject *kobj, + struct bin_attribute *bin_attr, + char *buf, loff_t off, size_t len) +{ + memcpy(buf, __start_BTF + off, len); + return len; +} + +static struct bin_attribute bin_attr_btf_vmlinux __ro_after_init = { + .attr = { .name = "vmlinux", .mode = 0444, }, + .read = btf_vmlinux_read, +}; + +static struct kobject *btf_kobj; + +static int __init btf_vmlinux_init(void) +{ + bin_attr_btf_vmlinux.size = __stop_BTF - __start_BTF; + + if (!__start_BTF || bin_attr_btf_vmlinux.size == 0) + return 0; + + btf_kobj = kobject_create_and_add("btf", kernel_kobj); + if (!btf_kobj) + return -ENOMEM; + + return sysfs_create_bin_file(btf_kobj, &bin_attr_btf_vmlinux); +} + +subsys_initcall(btf_vmlinux_init); diff --git a/kernel/bpf/tnum.c b/kernel/bpf/tnum.c index 1f4bf68c12db..d4f335a9a899 100644 --- a/kernel/bpf/tnum.c +++ b/kernel/bpf/tnum.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* tnum: tracked (or tristate) numbers * * A tnum tracks knowledge about the bits of a value. Each bit can be either @@ -43,6 +44,21 @@ struct tnum tnum_rshift(struct tnum a, u8 shift) return TNUM(a.value >> shift, a.mask >> shift); } +struct tnum tnum_arshift(struct tnum a, u8 min_shift, u8 insn_bitness) +{ + /* if a.value is negative, arithmetic shifting by minimum shift + * will have larger negative offset compared to more shifting. + * If a.value is nonnegative, arithmetic shifting by minimum shift + * will have larger positive offset compare to more shifting. + */ + if (insn_bitness == 32) + return TNUM((u32)(((s32)a.value) >> min_shift), + (u32)(((s32)a.mask) >> min_shift)); + else + return TNUM((s64)a.value >> min_shift, + (s64)a.mask >> min_shift); +} + struct tnum tnum_add(struct tnum a, struct tnum b) { u64 sm, sv, sigma, chi, mu; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ee2c13fa1f5d..013b9062c47c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1,33 +1,7 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com * Copyright (c) 2016 Facebook * Copyright (c) 2018 Covalent IO, Inc. http://covalent.io - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * The following reference types represent a potential reference to a kernel - * resource which, after first being allocated, must be checked and freed by - * the BPF program: - * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET - * - * When the verifier sees a helper call return a reference type, it allocates a - * pointer id for the reference and stores it in the current function state. - * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into - * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type - * passes through a NULL-check conditional. For the branch wherein the state is - * changed to CONST_IMM, the verifier releases the reference. - * - * For each helper function that allocates a reference, such as - * bpf_sk_lookup_tcp(), there is a corresponding release function, such as - * bpf_sk_release(). When a reference type passes into the release function, - * the verifier also releases the reference. If any unchecked or unreleased - * reference remains at the end of the program, the verifier rejects it. */ #include #include @@ -43,6 +17,8 @@ #include #include #include +#include +#include #include "disasm.h" @@ -100,8 +76,8 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = { * (like pointer plus pointer becomes SCALAR_VALUE type) * * When verifier sees load or store instructions the type of base register - * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK. These are three pointer - * types recognized by check_mem_access() function. + * can be: PTR_TO_MAP_VALUE, PTR_TO_CTX, PTR_TO_STACK, PTR_TO_SOCKET. These are + * four pointer types recognized by check_mem_access() function. * * PTR_TO_MAP_VALUE means that this register is pointing to 'map element value' * and the range of [ptr, ptr + map's value_size) is accessible. @@ -160,6 +136,24 @@ static const struct bpf_verifier_ops * const bpf_verifier_ops[] = { * * After the call R0 is set to return type of the function and registers R1-R5 * are set to NOT_INIT to indicate that they are no longer readable. + * + * The following reference types represent a potential reference to a kernel + * resource which, after first being allocated, must be checked and freed by + * the BPF program: + * - PTR_TO_SOCKET_OR_NULL, PTR_TO_SOCKET + * + * When the verifier sees a helper call return a reference type, it allocates a + * pointer id for the reference and stores it in the current function state. + * Similar to the way that PTR_TO_MAP_VALUE_OR_NULL is converted into + * PTR_TO_MAP_VALUE, PTR_TO_SOCKET_OR_NULL becomes PTR_TO_SOCKET when the type + * passes through a NULL-check conditional. For the branch wherein the state is + * changed to CONST_IMM, the verifier releases the reference. + * + * For each helper function that allocates a reference, such as + * bpf_sk_lookup_tcp(), there is a corresponding release function, such as + * bpf_sk_release(). When a reference type passes into the release function, + * the verifier also releases the reference. If any unchecked or unreleased + * reference remains at the end of the program, the verifier rejects it. */ /* verifier_state + insn_idx are pushed to stack when branch is encountered */ @@ -174,13 +168,14 @@ struct bpf_verifier_stack_elem { struct bpf_verifier_stack_elem *next; }; -#define BPF_COMPLEXITY_LIMIT_INSNS 131072 -#define BPF_COMPLEXITY_LIMIT_STACK 1024 +#define BPF_COMPLEXITY_LIMIT_JMP_SEQ 8192 +#define BPF_COMPLEXITY_LIMIT_STATES 64 #define BPF_MAP_PTR_UNPRIV 1UL #define BPF_MAP_PTR_POISON ((void *)((0xeB9FUL << 1) + \ POISON_POINTER_DELTA)) #define BPF_MAP_PTR(X) ((struct bpf_map *)((X) & ~BPF_MAP_PTR_UNPRIV)) + static bool bpf_map_ptr_poisoned(const struct bpf_insn_aux_data *aux) { return BPF_MAP_PTR(aux->map_state) == BPF_MAP_PTR_POISON; @@ -206,20 +201,41 @@ struct bpf_call_arg_meta { bool pkt_access; int regno; int access_size; - s64 msize_smax_value; - u64 msize_umax_value; - int ptr_id; + u64 msize_max_value; + int ref_obj_id; int func_id; }; static DEFINE_MUTEX(bpf_verifier_lock); +static const struct bpf_line_info * +find_linfo(const struct bpf_verifier_env *env, u32 insn_off) +{ + const struct bpf_line_info *linfo; + const struct bpf_prog *prog; + u32 i, nr_linfo; + + prog = env->prog; + nr_linfo = prog->aux->nr_linfo; + + if (!nr_linfo || insn_off >= prog->len) + return NULL; + + linfo = prog->aux->linfo; + for (i = 1; i < nr_linfo; i++) + if (insn_off < linfo[i].insn_off) + break; + + return &linfo[i - 1]; +} + void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt, va_list args) { unsigned int n; n = vscnprintf(log->kbuf, BPF_VERIFIER_TMP_LOG_SIZE, fmt, args); + WARN_ONCE(n >= BPF_VERIFIER_TMP_LOG_SIZE - 1, "verifier log line truncated - local buffer too short\n"); @@ -263,6 +279,42 @@ __printf(2, 3) static void verbose(void *private_data, const char *fmt, ...) va_end(args); } +static const char *ltrim(const char *s) +{ + while (isspace(*s)) + s++; + + return s; +} + +__printf(3, 4) static void verbose_linfo(struct bpf_verifier_env *env, + u32 insn_off, + const char *prefix_fmt, ...) +{ + const struct bpf_line_info *linfo; + + if (!bpf_verifier_log_needed(&env->log)) + return; + + linfo = find_linfo(env, insn_off); + if (!linfo || linfo == env->prev_linfo) + return; + + if (prefix_fmt) { + va_list args; + + va_start(args, prefix_fmt); + bpf_verifier_vlog(&env->log, prefix_fmt, args); + va_end(args); + } + + verbose(env, "%s\n", + ltrim(btf_name_by_offset(env->prog->aux->btf, + linfo->line_off))); + + env->prev_linfo = linfo; +} + static bool type_is_pkt_pointer(enum bpf_reg_type type) { return type == PTR_TO_PACKET || @@ -273,7 +325,8 @@ static bool type_is_sk_pointer(enum bpf_reg_type type) { return type == PTR_TO_SOCKET || type == PTR_TO_SOCK_COMMON || - type == PTR_TO_TCP_SOCK; + type == PTR_TO_TCP_SOCK || + type == PTR_TO_XDP_SOCK; } static bool reg_type_may_be_null(enum bpf_reg_type type) @@ -284,35 +337,23 @@ static bool reg_type_may_be_null(enum bpf_reg_type type) type == PTR_TO_TCP_SOCK_OR_NULL; } -static bool type_is_refcounted(enum bpf_reg_type type) -{ - return type == PTR_TO_SOCKET; -} - -static bool type_is_refcounted_or_null(enum bpf_reg_type type) -{ - return type == PTR_TO_SOCKET || type == PTR_TO_SOCKET_OR_NULL; -} - -static bool reg_is_refcounted(const struct bpf_reg_state *reg) -{ - return type_is_refcounted(reg->type); -} - static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg) { return reg->type == PTR_TO_MAP_VALUE && map_value_has_spin_lock(reg->map_ptr); } -static bool reg_is_refcounted_or_null(const struct bpf_reg_state *reg) +static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type) { - return type_is_refcounted_or_null(reg->type); + return type == PTR_TO_SOCKET || + type == PTR_TO_SOCKET_OR_NULL || + type == PTR_TO_TCP_SOCK || + type == PTR_TO_TCP_SOCK_OR_NULL; } -static bool arg_type_is_refcounted(enum bpf_arg_type type) +static bool arg_type_may_be_refcounted(enum bpf_arg_type type) { - return type == ARG_PTR_TO_SOCKET; + return type == ARG_PTR_TO_SOCK_COMMON; } /* Determine whether the function releases some resources allocated by another @@ -327,7 +368,14 @@ static bool is_release_function(enum bpf_func_id func_id) static bool is_acquire_function(enum bpf_func_id func_id) { return func_id == BPF_FUNC_sk_lookup_tcp || - func_id == BPF_FUNC_sk_lookup_udp; + func_id == BPF_FUNC_sk_lookup_udp || + func_id == BPF_FUNC_skc_lookup_tcp; +} + +static bool is_ptr_cast_function(enum bpf_func_id func_id) +{ + return func_id == BPF_FUNC_tcp_sock || + func_id == BPF_FUNC_sk_fullsock; } /* string representation of 'enum bpf_reg_type' */ @@ -350,17 +398,27 @@ static const char * const reg_type_str[] = { [PTR_TO_TCP_SOCK] = "tcp_sock", [PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null", [PTR_TO_TP_BUFFER] = "tp_buffer", + [PTR_TO_XDP_SOCK] = "xdp_sock", +}; + +static char slot_type_char[] = { + [STACK_INVALID] = '?', + [STACK_SPILL] = 'r', + [STACK_MISC] = 'm', + [STACK_ZERO] = '0', }; static void print_liveness(struct bpf_verifier_env *env, enum bpf_reg_liveness live) { - if (live & (REG_LIVE_READ | REG_LIVE_WRITTEN)) + if (live & (REG_LIVE_READ | REG_LIVE_WRITTEN | REG_LIVE_DONE)) verbose(env, "_"); if (live & REG_LIVE_READ) verbose(env, "r"); if (live & REG_LIVE_WRITTEN) verbose(env, "w"); + if (live & REG_LIVE_DONE) + verbose(env, "D"); } static struct bpf_func_state *func(struct bpf_verifier_env *env, @@ -380,7 +438,6 @@ static void print_verifier_state(struct bpf_verifier_env *env, if (state->frameno) verbose(env, " frame%d:", state->frameno); - for (i = 0; i < MAX_BPF_REG; i++) { reg = &state->regs[i]; t = reg->type; @@ -389,14 +446,16 @@ static void print_verifier_state(struct bpf_verifier_env *env, verbose(env, " R%d", i); print_liveness(env, reg->live); verbose(env, "=%s", reg_type_str[t]); + if (t == SCALAR_VALUE && reg->precise) + verbose(env, "P"); if ((t == SCALAR_VALUE || t == PTR_TO_STACK) && tnum_is_const(reg->var_off)) { /* reg->off should be 0 for SCALAR_VALUE */ verbose(env, "%lld", reg->var_off.value + reg->off); - if (t == PTR_TO_STACK) - verbose(env, ",call_%d", func(env, reg)->callsite); } else { verbose(env, "(id=%d", reg->id); + if (reg_type_may_be_refcounted_or_null(t)) + verbose(env, ",ref_obj_id=%d", reg->ref_obj_id); if (t != SCALAR_VALUE) verbose(env, ",off=%d", reg->off); if (type_is_pkt_pointer(t)) @@ -439,12 +498,31 @@ static void print_verifier_state(struct bpf_verifier_env *env, } } for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) { + char types_buf[BPF_REG_SIZE + 1]; + bool valid = false; + int j; + + for (j = 0; j < BPF_REG_SIZE; j++) { + if (state->stack[i].slot_type[j] != STACK_INVALID) + valid = true; + types_buf[j] = slot_type_char[ + state->stack[i].slot_type[j]]; + } + types_buf[BPF_REG_SIZE] = 0; + if (!valid) + continue; + verbose(env, " fp%d", (-i - 1) * BPF_REG_SIZE); + print_liveness(env, state->stack[i].spilled_ptr.live); if (state->stack[i].slot_type[0] == STACK_SPILL) { - verbose(env, " fp%d", - (-i - 1) * BPF_REG_SIZE); - print_liveness(env, state->stack[i].spilled_ptr.live); - verbose(env, "=%s", - reg_type_str[state->stack[i].spilled_ptr.type]); + reg = &state->stack[i].spilled_ptr; + t = reg->type; + verbose(env, "=%s", reg_type_str[t]); + if (t == SCALAR_VALUE && reg->precise) + verbose(env, "P"); + if (t == SCALAR_VALUE && tnum_is_const(reg->var_off)) + verbose(env, "%lld", reg->var_off.value + reg->off); + } else { + verbose(env, "=%s", types_buf); } } if (state->acquired_refs && state->refs[0].id) { @@ -471,12 +549,12 @@ static int copy_##NAME##_state(struct bpf_func_state *dst, \ sizeof(*src->FIELD) * (src->COUNT / SIZE)); \ return 0; \ } - /* copy_reference_state() */ COPY_STATE_FN(reference, acquired_refs, refs, 1) /* copy_stack_state() */ COPY_STATE_FN(stack, allocated_stack, stack, BPF_REG_SIZE) #undef COPY_STATE_FN + #define REALLOC_STATE_FN(NAME, COUNT, FIELD, SIZE) \ static int realloc_##NAME##_state(struct bpf_func_state *state, int size, \ bool copy_old) \ @@ -543,12 +621,14 @@ static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx) struct bpf_func_state *state = cur_func(env); int new_ofs = state->acquired_refs; int id, err; + err = realloc_reference_state(state, state->acquired_refs + 1, true); if (err) return err; id = ++env->id_gen; state->refs[new_ofs].id = id; state->refs[new_ofs].insn_idx = insn_idx; + return id; } @@ -585,11 +665,20 @@ static int transfer_reference_state(struct bpf_func_state *dst, static void free_func_state(struct bpf_func_state *state) { + if (!state) + return; kfree(state->refs); kfree(state->stack); kfree(state); } +static void clear_jmp_history(struct bpf_verifier_state *state) +{ + kfree(state->jmp_history); + state->jmp_history = NULL; + state->jmp_history_cnt = 0; +} + static void free_verifier_state(struct bpf_verifier_state *state, bool free_self) { @@ -599,7 +688,7 @@ static void free_verifier_state(struct bpf_verifier_state *state, free_func_state(state->frame[i]); state->frame[i] = NULL; } - + clear_jmp_history(state); if (free_self) kfree(state); } @@ -616,10 +705,8 @@ static int copy_func_state(struct bpf_func_state *dst, false); if (err) return err; - memcpy(dst, src, offsetof(struct bpf_func_state, acquired_refs)); err = copy_reference_state(dst, src); - if (err) return err; return copy_stack_state(dst, src); @@ -629,18 +716,30 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state, const struct bpf_verifier_state *src) { struct bpf_func_state *dst; + u32 jmp_sz = sizeof(struct bpf_idx_pair) * src->jmp_history_cnt; int i, err; + if (dst_state->jmp_history_cnt < src->jmp_history_cnt) { + kfree(dst_state->jmp_history); + dst_state->jmp_history = kmalloc(jmp_sz, GFP_USER); + if (!dst_state->jmp_history) + return -ENOMEM; + } + memcpy(dst_state->jmp_history, src->jmp_history, jmp_sz); + dst_state->jmp_history_cnt = src->jmp_history_cnt; + /* if dst has more stack frames then src frame, free them */ for (i = src->curframe + 1; i <= dst_state->curframe; i++) { free_func_state(dst_state->frame[i]); dst_state->frame[i] = NULL; } - + dst_state->speculative = src->speculative; dst_state->curframe = src->curframe; - dst_state->parent = src->parent; dst_state->active_spin_lock = src->active_spin_lock; - + dst_state->branches = src->branches; + dst_state->parent = src->parent; + dst_state->first_insn_idx = src->first_insn_idx; + dst_state->last_insn_idx = src->last_insn_idx; for (i = 0; i <= src->curframe; i++) { dst = dst_state->frame[i]; if (!dst) { @@ -653,10 +752,26 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state, if (err) return err; } - return 0; } +static void update_branch_counts(struct bpf_verifier_env *env, struct bpf_verifier_state *st) +{ + while (st) { + u32 br = --st->branches; + + /* WARN_ON(br > 1) technically makes sense here, + * but see comment in push_stack(), hence: + */ + WARN_ONCE((int)br < 0, + "BUG update_branch_counts:branches_to_explore=%d\n", + br); + if (br) + break; + st = st->parent; + } +} + static int pop_stack(struct bpf_verifier_env *env, int *prev_insn_idx, int *insn_idx) { @@ -688,8 +803,8 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx, bool speculative) { - struct bpf_verifier_stack_elem *elem; struct bpf_verifier_state *cur = env->cur_state; + struct bpf_verifier_stack_elem *elem; int err; elem = kzalloc(sizeof(struct bpf_verifier_stack_elem), GFP_KERNEL); @@ -699,18 +814,33 @@ static struct bpf_verifier_state *push_stack(struct bpf_verifier_env *env, elem->insn_idx = insn_idx; elem->prev_insn_idx = prev_insn_idx; elem->next = env->head; - elem->st.speculative |= speculative; env->head = elem; env->stack_size++; err = copy_verifier_state(&elem->st, cur); if (err) goto err; - if (env->stack_size > BPF_COMPLEXITY_LIMIT_STACK) { - verbose(env, "BPF program is too complex\n"); + elem->st.speculative |= speculative; + if (env->stack_size > BPF_COMPLEXITY_LIMIT_JMP_SEQ) { + verbose(env, "The sequence of %d jumps is too complex.\n", + env->stack_size); goto err; } + if (elem->st.parent) { + ++elem->st.parent->branches; + /* WARN_ON(branches > 2) technically makes sense here, + * but + * 1. speculative states will bump 'branches' for non-branch + * instructions + * 2. is_state_visited() heuristics may decide not to create + * a new state for a sequence of branches and all such current + * and cloned states will be pointing to a single parent state + * which might have large 'branches' count. + */ + } return &elem->st; err: + free_verifier_state(env->cur_state, true); + env->cur_state = NULL; /* pop all elements and return */ while (!pop_stack(env, NULL, NULL)); return NULL; @@ -720,19 +850,18 @@ err: static const int caller_saved[CALLER_SAVED_REGS] = { BPF_REG_0, BPF_REG_1, BPF_REG_2, BPF_REG_3, BPF_REG_4, BPF_REG_5 }; -#define CALLEE_SAVED_REGS 5 -static const int callee_saved[CALLEE_SAVED_REGS] = { - BPF_REG_6, BPF_REG_7, BPF_REG_8, BPF_REG_9 -}; -static void __mark_reg_not_init(struct bpf_reg_state *reg); +static void __mark_reg_not_init(const struct bpf_verifier_env *env, + struct bpf_reg_state *reg); /* Mark the unknown part of a register (variable offset or scalar value) as * known to have the value @imm. */ static void __mark_reg_known(struct bpf_reg_state *reg, u64 imm) { - reg->id = 0; + /* Clear id, off, and union(map_ptr, range) */ + memset(((u8 *)reg) + sizeof(reg->type), 0, + offsetof(struct bpf_reg_state, var_off) - sizeof(reg->type)); reg->var_off = tnum_const(imm); reg->smin_value = (s64)imm; reg->smax_value = (s64)imm; @@ -748,6 +877,25 @@ static void __mark_reg_known_zero(struct bpf_reg_state *reg) __mark_reg_known(reg, 0); } +static void __mark_reg_const_zero(struct bpf_reg_state *reg) +{ + __mark_reg_known(reg, 0); + reg->type = SCALAR_VALUE; +} + +static void mark_reg_known_zero(struct bpf_verifier_env *env, + struct bpf_reg_state *regs, u32 regno) +{ + if (WARN_ON(regno >= MAX_BPF_REG)) { + verbose(env, "mark_reg_known_zero(regs, %u)\n", regno); + /* Something bad happened, let's kill all regs */ + for (regno = 0; regno < MAX_BPF_REG; regno++) + __mark_reg_not_init(env, regs + regno); + return; + } + __mark_reg_known_zero(regs + regno); +} + static bool reg_is_pkt_pointer(const struct bpf_reg_state *reg) { return type_is_pkt_pointer(reg->type); @@ -773,19 +921,6 @@ static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg, tnum_equals_const(reg->var_off, 0); } -static void mark_reg_known_zero(struct bpf_verifier_env *env, - struct bpf_reg_state *regs, u32 regno) -{ - if (WARN_ON(regno >= MAX_BPF_REG)) { - verbose(env, "mark_reg_known_zero(env, regs, %u)\n", regno); - /* Something bad happened, let's kill all regs */ - for (regno = 0; regno < MAX_BPF_REG; regno++) - __mark_reg_not_init(regs + regno); - return; - } - __mark_reg_known_zero(regs + regno); -} - /* Attempts to improve min/max values based on var_off information */ static void __update_reg_bounds(struct bpf_reg_state *reg) { @@ -853,13 +988,19 @@ static void __mark_reg_unbounded(struct bpf_reg_state *reg) } /* Mark a register as having a completely unknown (scalar) value. */ -static void __mark_reg_unknown(struct bpf_reg_state *reg) +static void __mark_reg_unknown(const struct bpf_verifier_env *env, + struct bpf_reg_state *reg) { + /* + * Clear type, id, off, and union(map_ptr, range) and + * padding between 'type' and union + */ + memset(reg, 0, offsetof(struct bpf_reg_state, var_off)); reg->type = SCALAR_VALUE; - reg->id = 0; - reg->off = 0; reg->var_off = tnum_unknown; reg->frameno = 0; + reg->precise = env->subprog_cnt > 1 || !env->allow_ptr_leaks ? + true : false; __mark_reg_unbounded(reg); } @@ -867,18 +1008,19 @@ static void mark_reg_unknown(struct bpf_verifier_env *env, struct bpf_reg_state *regs, u32 regno) { if (WARN_ON(regno >= MAX_BPF_REG)) { - verbose(env, "mark_reg_unknown(env, regs, %u)\n", regno); - /* Something bad happened, let's kill all regs */ - for (regno = 0; regno < MAX_BPF_REG; regno++) - __mark_reg_not_init(regs + regno); + verbose(env, "mark_reg_unknown(regs, %u)\n", regno); + /* Something bad happened, let's kill all regs except FP */ + for (regno = 0; regno < BPF_REG_FP; regno++) + __mark_reg_not_init(env, regs + regno); return; } - __mark_reg_unknown(regs + regno); + __mark_reg_unknown(env, regs + regno); } -static void __mark_reg_not_init(struct bpf_reg_state *reg) +static void __mark_reg_not_init(const struct bpf_verifier_env *env, + struct bpf_reg_state *reg) { - __mark_reg_unknown(reg); + __mark_reg_unknown(env, reg); reg->type = NOT_INIT; } @@ -886,15 +1028,16 @@ static void mark_reg_not_init(struct bpf_verifier_env *env, struct bpf_reg_state *regs, u32 regno) { if (WARN_ON(regno >= MAX_BPF_REG)) { - verbose(env, "mark_reg_not_init(env, regs, %u)\n", regno); - /* Something bad happened, let's kill all regs */ - for (regno = 0; regno < MAX_BPF_REG; regno++) - __mark_reg_not_init(regs + regno); + verbose(env, "mark_reg_not_init(regs, %u)\n", regno); + /* Something bad happened, let's kill all regs except FP */ + for (regno = 0; regno < BPF_REG_FP; regno++) + __mark_reg_not_init(env, regs + regno); return; } - __mark_reg_not_init(regs + regno); + __mark_reg_not_init(env, regs + regno); } +#define DEF_NOT_SUBREG (0) static void init_reg_state(struct bpf_verifier_env *env, struct bpf_func_state *state) { @@ -904,6 +1047,8 @@ static void init_reg_state(struct bpf_verifier_env *env, for (i = 0; i < MAX_BPF_REG; i++) { mark_reg_not_init(env, regs, i); regs[i].live = REG_LIVE_NONE; + regs[i].parent = NULL; + regs[i].subreg_def = DEF_NOT_SUBREG; } /* frame pointer */ @@ -947,34 +1092,32 @@ static int find_subprog(struct bpf_verifier_env *env, int off) sizeof(env->subprog_info[0]), cmp_subprogs); if (!p) return -ENOENT; - return p - env->subprog_info; + } static int add_subprog(struct bpf_verifier_env *env, int off) { int insn_cnt = env->prog->len; int ret; + if (off >= insn_cnt || off < 0) { verbose(env, "call to invalid destination\n"); return -EINVAL; } - ret = find_subprog(env, off); if (ret >= 0) return 0; - - if (env->subprog_cnt > BPF_MAX_SUBPROGS) { + if (env->subprog_cnt >= BPF_MAX_SUBPROGS) { verbose(env, "too many subprograms\n"); return -E2BIG; } - env->subprog_info[env->subprog_cnt++].start = off; sort(env->subprog_info, env->subprog_cnt, sizeof(env->subprog_info[0]), cmp_subprogs, NULL); - return 0; } + static int check_subprogs(struct bpf_verifier_env *env) { int i, ret, subprog_start, subprog_end, off, cur_subprog = 0; @@ -997,28 +1140,31 @@ static int check_subprogs(struct bpf_verifier_env *env) verbose(env, "function calls to other bpf functions are allowed for root only\n"); return -EPERM; } - if (bpf_prog_is_dev_bound(env->prog->aux)) { - verbose(env, "funcation calls in offloaded programs are not supported yet\n"); - return -EINVAL; - } ret = add_subprog(env, i + insn[i].imm + 1); if (ret < 0) return ret; } - if (env->log.level > 1) + /* Add a fake 'exit' subprog which could simplify subprog iteration + * logic. 'subprog_cnt' should not be increased. + */ + subprog[env->subprog_cnt].start = insn_cnt; + + if (env->log.level & BPF_LOG_LEVEL2) for (i = 0; i < env->subprog_cnt; i++) verbose(env, "func#%d @%d\n", i, subprog[i].start); /* now check that all jumps are within the same subprog */ - subprog_start = 0; - if (env->subprog_cnt == cur_subprog + 1) - subprog_end = insn_cnt; - else - subprog_end = subprog[cur_subprog + 1].start; + subprog_start = subprog[cur_subprog].start; + subprog_end = subprog[cur_subprog + 1].start; for (i = 0; i < insn_cnt; i++) { u8 code = insn[i].code; - if (BPF_CLASS(code) != BPF_JMP) + + if (code == (BPF_JMP | BPF_CALL) && + insn[i].imm == BPF_FUNC_tail_call && + insn[i].src_reg != BPF_PSEUDO_CALL) + subprog[cur_subprog].has_tail_call = true; + if (BPF_CLASS(code) != BPF_JMP && BPF_CLASS(code) != BPF_JMP32) goto next; if (BPF_OP(code) == BPF_EXIT || BPF_OP(code) == BPF_CALL) goto next; @@ -1038,124 +1184,676 @@ next: verbose(env, "last insn is not an exit or jmp\n"); return -EINVAL; } - cur_subprog++; subprog_start = subprog_end; - if (env->subprog_cnt == cur_subprog + 1) - subprog_end = insn_cnt; - else + cur_subprog++; + if (cur_subprog < env->subprog_cnt) subprog_end = subprog[cur_subprog + 1].start; } } return 0; } -struct bpf_verifier_state *skip_callee(struct bpf_verifier_env *env, - const struct bpf_verifier_state *state, - struct bpf_verifier_state *parent, - u32 regno) -{ - struct bpf_verifier_state *tmp = NULL; - - /* 'parent' could be a state of caller and - * 'state' could be a state of callee. In such case - * parent->curframe < state->curframe - * and it's ok for r1 - r5 registers - * - * 'parent' could be a callee's state after it bpf_exit-ed. - * In such case parent->curframe > state->curframe - * and it's ok for r0 only - */ - if (parent->curframe == state->curframe || - (parent->curframe < state->curframe && - regno >= BPF_REG_1 && regno <= BPF_REG_5) || - (parent->curframe > state->curframe && - regno == BPF_REG_0)) - return parent; - - if (parent->curframe > state->curframe && - regno >= BPF_REG_6) { - /* for callee saved regs we have to skip the whole chain - * of states that belong to callee and mark as LIVE_READ - * the registers before the call - */ - tmp = parent; - while (tmp && tmp->curframe != state->curframe) { - tmp = tmp->parent; - } - if (!tmp) - goto bug; - parent = tmp; - } else { - goto bug; - } - - return parent; -bug: - verbose(env, "verifier bug regno %d tmp %p\n", regno, tmp); - verbose(env, "regno %d parent frame %d current frame %d\n", - regno, parent->curframe, state->curframe); - return 0; -} - +/* Parentage chain of this register (or stack slot) should take care of all + * issues like callee-saved registers, stack slot allocation time, etc. + */ static int mark_reg_read(struct bpf_verifier_env *env, - const struct bpf_verifier_state *state, - struct bpf_verifier_state *parent, - u32 regno) + const struct bpf_reg_state *state, + struct bpf_reg_state *parent, u8 flag) { bool writes = parent == state->parent; /* Observe write marks */ - - if (regno == BPF_REG_FP) - /* We don't need to worry about FP liveness because it's read-only */ - return 0; + int cnt = 0; while (parent) { /* if read wasn't screened by an earlier write ... */ - if (writes && state->frame[state->curframe]->regs[regno].live & REG_LIVE_WRITTEN) + if (writes && state->live & REG_LIVE_WRITTEN) break; - parent = skip_callee(env, state, parent, regno); - if (!parent) + if (parent->live & REG_LIVE_DONE) { + verbose(env, "verifier BUG type %s var_off %lld off %d\n", + reg_type_str[parent->type], + parent->var_off.value, parent->off); return -EFAULT; + } + /* The first condition is more likely to be true than the + * second, checked it first. + */ + if ((parent->live & REG_LIVE_READ) == flag || + parent->live & REG_LIVE_READ64) + /* The parentage chain never changes and + * this parent was already marked as LIVE_READ. + * There is no need to keep walking the chain again and + * keep re-marking all parents as LIVE_READ. + * This case happens when the same register is read + * multiple times without writes into it in-between. + * Also, if parent has the stronger REG_LIVE_READ64 set, + * then no need to set the weak REG_LIVE_READ32. + */ + break; /* ... then we depend on parent's value */ - parent->frame[parent->curframe]->regs[regno].live |= REG_LIVE_READ; + parent->live |= flag; + /* REG_LIVE_READ64 overrides REG_LIVE_READ32. */ + if (flag == REG_LIVE_READ64) + parent->live &= ~REG_LIVE_READ32; state = parent; parent = state->parent; writes = true; + cnt++; } + + if (env->longest_mark_read_walk < cnt) + env->longest_mark_read_walk = cnt; return 0; } +/* This function is supposed to be used by the following 32-bit optimization + * code only. It returns TRUE if the source or destination register operates + * on 64-bit, otherwise return FALSE. + */ +static bool is_reg64(struct bpf_verifier_env *env, struct bpf_insn *insn, + u32 regno, struct bpf_reg_state *reg, enum reg_arg_type t) +{ + u8 code, class, op; + + code = insn->code; + class = BPF_CLASS(code); + op = BPF_OP(code); + if (class == BPF_JMP) { + /* BPF_EXIT for "main" will reach here. Return TRUE + * conservatively. + */ + if (op == BPF_EXIT) + return true; + if (op == BPF_CALL) { + /* BPF to BPF call will reach here because of marking + * caller saved clobber with DST_OP_NO_MARK for which we + * don't care the register def because they are anyway + * marked as NOT_INIT already. + */ + if (insn->src_reg == BPF_PSEUDO_CALL) + return false; + /* Helper call will reach here because of arg type + * check, conservatively return TRUE. + */ + if (t == SRC_OP) + return true; + + return false; + } + } + + if (class == BPF_ALU64 || class == BPF_JMP || + /* BPF_END always use BPF_ALU class. */ + (class == BPF_ALU && op == BPF_END && insn->imm == 64)) + return true; + + if (class == BPF_ALU || class == BPF_JMP32) + return false; + + if (class == BPF_LDX) { + if (t != SRC_OP) + return BPF_SIZE(code) == BPF_DW; + /* LDX source must be ptr. */ + return true; + } + + if (class == BPF_STX) { + if (reg->type != SCALAR_VALUE) + return true; + return BPF_SIZE(code) == BPF_DW; + } + + if (class == BPF_LD) { + u8 mode = BPF_MODE(code); + + /* LD_IMM64 */ + if (mode == BPF_IMM) + return true; + + /* Both LD_IND and LD_ABS return 32-bit data. */ + if (t != SRC_OP) + return false; + + /* Implicit ctx ptr. */ + if (regno == BPF_REG_6) + return true; + + /* Explicit source could be any width. */ + return true; + } + + if (class == BPF_ST) + /* The only source register for BPF_ST is a ptr. */ + return true; + + /* Conservatively return true at default. */ + return true; +} + +/* Return TRUE if INSN doesn't have explicit value define. */ +static bool insn_no_def(struct bpf_insn *insn) +{ + u8 class = BPF_CLASS(insn->code); + + return (class == BPF_JMP || class == BPF_JMP32 || + class == BPF_STX || class == BPF_ST); +} + +/* Return TRUE if INSN has defined any 32-bit value explicitly. */ +static bool insn_has_def32(struct bpf_verifier_env *env, struct bpf_insn *insn) +{ + if (insn_no_def(insn)) + return false; + + return !is_reg64(env, insn, insn->dst_reg, NULL, DST_OP); +} + +static void mark_insn_zext(struct bpf_verifier_env *env, + struct bpf_reg_state *reg) +{ + s32 def_idx = reg->subreg_def; + + if (def_idx == DEF_NOT_SUBREG) + return; + + env->insn_aux_data[def_idx - 1].zext_dst = true; + /* The dst will be zero extended, so won't be sub-register anymore. */ + reg->subreg_def = DEF_NOT_SUBREG; +} + static int check_reg_arg(struct bpf_verifier_env *env, u32 regno, enum reg_arg_type t) { struct bpf_verifier_state *vstate = env->cur_state; struct bpf_func_state *state = vstate->frame[vstate->curframe]; - struct bpf_reg_state *regs = state->regs; + struct bpf_insn *insn = env->prog->insnsi + env->insn_idx; + struct bpf_reg_state *reg, *regs = state->regs; + bool rw64; if (regno >= MAX_BPF_REG) { verbose(env, "R%d is invalid\n", regno); return -EINVAL; } + reg = ®s[regno]; + rw64 = is_reg64(env, insn, regno, reg, t); if (t == SRC_OP) { /* check whether register used as source operand can be read */ - if (regs[regno].type == NOT_INIT) { + if (reg->type == NOT_INIT) { verbose(env, "R%d !read_ok\n", regno); return -EACCES; } - return mark_reg_read(env, vstate, vstate->parent, regno); + /* We don't need to worry about FP liveness because it's read-only */ + if (regno == BPF_REG_FP) + return 0; + + if (rw64) + mark_insn_zext(env, reg); + + return mark_reg_read(env, reg, reg->parent, + rw64 ? REG_LIVE_READ64 : REG_LIVE_READ32); } else { /* check whether register used as dest operand can be written to */ if (regno == BPF_REG_FP) { verbose(env, "frame pointer is read only\n"); return -EACCES; } - regs[regno].live |= REG_LIVE_WRITTEN; + reg->live |= REG_LIVE_WRITTEN; + reg->subreg_def = rw64 ? DEF_NOT_SUBREG : env->insn_idx + 1; if (t == DST_OP) mark_reg_unknown(env, regs, regno); } return 0; } +/* for any branch, call, exit record the history of jmps in the given state */ +static int push_jmp_history(struct bpf_verifier_env *env, + struct bpf_verifier_state *cur) +{ + u32 cnt = cur->jmp_history_cnt; + struct bpf_idx_pair *p; + + cnt++; + p = krealloc(cur->jmp_history, cnt * sizeof(*p), GFP_USER); + if (!p) + return -ENOMEM; + p[cnt - 1].idx = env->insn_idx; + p[cnt - 1].prev_idx = env->prev_insn_idx; + cur->jmp_history = p; + cur->jmp_history_cnt = cnt; + return 0; +} + +/* Backtrack one insn at a time. If idx is not at the top of recorded + * history then previous instruction came from straight line execution. + */ +static int get_prev_insn_idx(struct bpf_verifier_state *st, int i, + u32 *history) +{ + u32 cnt = *history; + + if (cnt && st->jmp_history[cnt - 1].idx == i) { + i = st->jmp_history[cnt - 1].prev_idx; + (*history)--; + } else { + i--; + } + return i; +} + +/* For given verifier state backtrack_insn() is called from the last insn to + * the first insn. Its purpose is to compute a bitmask of registers and + * stack slots that needs precision in the parent verifier state. + */ +static int backtrack_insn(struct bpf_verifier_env *env, int idx, + u32 *reg_mask, u64 *stack_mask) +{ + const struct bpf_insn_cbs cbs = { + .cb_print = verbose, + .private_data = env, + }; + struct bpf_insn *insn = env->prog->insnsi + idx; + u8 class = BPF_CLASS(insn->code); + u8 opcode = BPF_OP(insn->code); + u8 mode = BPF_MODE(insn->code); + u32 dreg = 1u << insn->dst_reg; + u32 sreg = 1u << insn->src_reg; + u32 spi; + + if (insn->code == 0) + return 0; + if (env->log.level & BPF_LOG_LEVEL) { + verbose(env, "regs=%x stack=%llx before ", *reg_mask, *stack_mask); + verbose(env, "%d: ", idx); + print_bpf_insn(&cbs, insn, env->allow_ptr_leaks); + } + + if (class == BPF_ALU || class == BPF_ALU64) { + if (!(*reg_mask & dreg)) + return 0; + if (opcode == BPF_END || opcode == BPF_NEG) { + /* sreg is reserved and unused + * dreg still need precision before this insn + */ + return 0; + } else if (opcode == BPF_MOV) { + if (BPF_SRC(insn->code) == BPF_X) { + /* dreg = sreg + * dreg needs precision after this insn + * sreg needs precision before this insn + */ + *reg_mask &= ~dreg; + *reg_mask |= sreg; + } else { + /* dreg = K + * dreg needs precision after this insn. + * Corresponding register is already marked + * as precise=true in this verifier state. + * No further markings in parent are necessary + */ + *reg_mask &= ~dreg; + } + } else { + if (BPF_SRC(insn->code) == BPF_X) { + /* dreg += sreg + * both dreg and sreg need precision + * before this insn + */ + *reg_mask |= sreg; + } /* else dreg += K + * dreg still needs precision before this insn + */ + } + } else if (class == BPF_LDX) { + if (!(*reg_mask & dreg)) + return 0; + *reg_mask &= ~dreg; + + /* scalars can only be spilled into stack w/o losing precision. + * Load from any other memory can be zero extended. + * The desire to keep that precision is already indicated + * by 'precise' mark in corresponding register of this state. + * No further tracking necessary. + */ + if (insn->src_reg != BPF_REG_FP) + return 0; + if (BPF_SIZE(insn->code) != BPF_DW) + return 0; + + /* dreg = *(u64 *)[fp - off] was a fill from the stack. + * that [fp - off] slot contains scalar that needs to be + * tracked with precision + */ + spi = (-insn->off - 1) / BPF_REG_SIZE; + if (spi >= 64) { + verbose(env, "BUG spi %d\n", spi); + WARN_ONCE(1, "verifier backtracking bug"); + return -EFAULT; + } + *stack_mask |= 1ull << spi; + } else if (class == BPF_STX || class == BPF_ST) { + if (*reg_mask & dreg) + /* stx & st shouldn't be using _scalar_ dst_reg + * to access memory. It means backtracking + * encountered a case of pointer subtraction. + */ + return -ENOTSUPP; + /* scalars can only be spilled into stack */ + if (insn->dst_reg != BPF_REG_FP) + return 0; + if (BPF_SIZE(insn->code) != BPF_DW) + return 0; + spi = (-insn->off - 1) / BPF_REG_SIZE; + if (spi >= 64) { + verbose(env, "BUG spi %d\n", spi); + WARN_ONCE(1, "verifier backtracking bug"); + return -EFAULT; + } + if (!(*stack_mask & (1ull << spi))) + return 0; + *stack_mask &= ~(1ull << spi); + if (class == BPF_STX) + *reg_mask |= sreg; + } else if (class == BPF_JMP || class == BPF_JMP32) { + if (opcode == BPF_CALL) { + if (insn->src_reg == BPF_PSEUDO_CALL) + return -ENOTSUPP; + /* regular helper call sets R0 */ + *reg_mask &= ~1; + if (*reg_mask & 0x3f) { + /* if backtracing was looking for registers R1-R5 + * they should have been found already. + */ + verbose(env, "BUG regs %x\n", *reg_mask); + WARN_ONCE(1, "verifier backtracking bug"); + return -EFAULT; + } + } else if (opcode == BPF_EXIT) { + return -ENOTSUPP; + } else if (BPF_SRC(insn->code) == BPF_X) { + if (!(*reg_mask & (dreg | sreg))) + return 0; + /* dreg sreg + * Both dreg and sreg need precision before + * this insn. If only sreg was marked precise + * before it would be equally necessary to + * propagate it to dreg. + */ + *reg_mask |= (sreg | dreg); + /* else dreg K + * Only dreg still needs precision before + * this insn, so for the K-based conditional + * there is nothing new to be marked. + */ + } + } else if (class == BPF_LD) { + if (!(*reg_mask & dreg)) + return 0; + *reg_mask &= ~dreg; + /* It's ld_imm64 or ld_abs or ld_ind. + * For ld_imm64 no further tracking of precision + * into parent is necessary + */ + if (mode == BPF_IND || mode == BPF_ABS) + /* to be analyzed */ + return -ENOTSUPP; + } + return 0; +} + +/* the scalar precision tracking algorithm: + * . at the start all registers have precise=false. + * . scalar ranges are tracked as normal through alu and jmp insns. + * . once precise value of the scalar register is used in: + * . ptr + scalar alu + * . if (scalar cond K|scalar) + * . helper_call(.., scalar, ...) where ARG_CONST is expected + * backtrack through the verifier states and mark all registers and + * stack slots with spilled constants that these scalar regisers + * should be precise. + * . during state pruning two registers (or spilled stack slots) + * are equivalent if both are not precise. + * + * Note the verifier cannot simply walk register parentage chain, + * since many different registers and stack slots could have been + * used to compute single precise scalar. + * + * The approach of starting with precise=true for all registers and then + * backtrack to mark a register as not precise when the verifier detects + * that program doesn't care about specific value (e.g., when helper + * takes register as ARG_ANYTHING parameter) is not safe. + * + * It's ok to walk single parentage chain of the verifier states. + * It's possible that this backtracking will go all the way till 1st insn. + * All other branches will be explored for needing precision later. + * + * The backtracking needs to deal with cases like: + * R8=map_value(id=0,off=0,ks=4,vs=1952,imm=0) R9_w=map_value(id=0,off=40,ks=4,vs=1952,imm=0) + * r9 -= r8 + * r5 = r9 + * if r5 > 0x79f goto pc+7 + * R5_w=inv(id=0,umax_value=1951,var_off=(0x0; 0x7ff)) + * r5 += 1 + * ... + * call bpf_perf_event_output#25 + * where .arg5_type = ARG_CONST_SIZE_OR_ZERO + * + * and this case: + * r6 = 1 + * call foo // uses callee's r6 inside to compute r0 + * r0 += r6 + * if r0 == 0 goto + * + * to track above reg_mask/stack_mask needs to be independent for each frame. + * + * Also if parent's curframe > frame where backtracking started, + * the verifier need to mark registers in both frames, otherwise callees + * may incorrectly prune callers. This is similar to + * commit 7640ead93924 ("bpf: verifier: make sure callees don't prune with caller differences") + * + * For now backtracking falls back into conservative marking. + */ +static void mark_all_scalars_precise(struct bpf_verifier_env *env, + struct bpf_verifier_state *st) +{ + struct bpf_func_state *func; + struct bpf_reg_state *reg; + int i, j; + + /* big hammer: mark all scalars precise in this path. + * pop_stack may still get !precise scalars. + */ + for (; st; st = st->parent) + for (i = 0; i <= st->curframe; i++) { + func = st->frame[i]; + for (j = 0; j < BPF_REG_FP; j++) { + reg = &func->regs[j]; + if (reg->type != SCALAR_VALUE) + continue; + reg->precise = true; + } + for (j = 0; j < func->allocated_stack / BPF_REG_SIZE; j++) { + if (func->stack[j].slot_type[0] != STACK_SPILL) + continue; + reg = &func->stack[j].spilled_ptr; + if (reg->type != SCALAR_VALUE) + continue; + reg->precise = true; + } + } +} + +static int __mark_chain_precision(struct bpf_verifier_env *env, int regno, + int spi) +{ + struct bpf_verifier_state *st = env->cur_state; + int first_idx = st->first_insn_idx; + int last_idx = env->insn_idx; + struct bpf_func_state *func; + struct bpf_reg_state *reg; + u32 reg_mask = regno >= 0 ? 1u << regno : 0; + u64 stack_mask = spi >= 0 ? 1ull << spi : 0; + bool skip_first = true; + bool new_marks = false; + int i, err; + + if (!env->allow_ptr_leaks) + /* backtracking is root only for now */ + return 0; + + func = st->frame[st->curframe]; + if (regno >= 0) { + reg = &func->regs[regno]; + if (reg->type != SCALAR_VALUE) { + WARN_ONCE(1, "backtracing misuse"); + return -EFAULT; + } + if (!reg->precise) + new_marks = true; + else + reg_mask = 0; + reg->precise = true; + } + + while (spi >= 0) { + if (func->stack[spi].slot_type[0] != STACK_SPILL) { + stack_mask = 0; + break; + } + reg = &func->stack[spi].spilled_ptr; + if (reg->type != SCALAR_VALUE) { + stack_mask = 0; + break; + } + if (!reg->precise) + new_marks = true; + else + stack_mask = 0; + reg->precise = true; + break; + } + + if (!new_marks) + return 0; + if (!reg_mask && !stack_mask) + return 0; + for (;;) { + DECLARE_BITMAP(mask, 64); + u32 history = st->jmp_history_cnt; + + if (env->log.level & BPF_LOG_LEVEL) + verbose(env, "last_idx %d first_idx %d\n", last_idx, first_idx); + for (i = last_idx;;) { + if (skip_first) { + err = 0; + skip_first = false; + } else { + err = backtrack_insn(env, i, ®_mask, &stack_mask); + } + if (err == -ENOTSUPP) { + mark_all_scalars_precise(env, st); + return 0; + } else if (err) { + return err; + } + if (!reg_mask && !stack_mask) + /* Found assignment(s) into tracked register in this state. + * Since this state is already marked, just return. + * Nothing to be tracked further in the parent state. + */ + return 0; + if (i == first_idx) + break; + i = get_prev_insn_idx(st, i, &history); + if (i >= env->prog->len) { + /* This can happen if backtracking reached insn 0 + * and there are still reg_mask or stack_mask + * to backtrack. + * It means the backtracking missed the spot where + * particular register was initialized with a constant. + */ + verbose(env, "BUG backtracking idx %d\n", i); + WARN_ONCE(1, "verifier backtracking bug"); + return -EFAULT; + } + } + st = st->parent; + if (!st) + break; + + new_marks = false; + func = st->frame[st->curframe]; + bitmap_from_u64(mask, reg_mask); + for_each_set_bit(i, mask, 32) { + reg = &func->regs[i]; + if (reg->type != SCALAR_VALUE) { + reg_mask &= ~(1u << i); + continue; + } + if (!reg->precise) + new_marks = true; + reg->precise = true; + } + + bitmap_from_u64(mask, stack_mask); + for_each_set_bit(i, mask, 64) { + if (i >= func->allocated_stack / BPF_REG_SIZE) { + /* the sequence of instructions: + * 2: (bf) r3 = r10 + * 3: (7b) *(u64 *)(r3 -8) = r0 + * 4: (79) r4 = *(u64 *)(r10 -8) + * doesn't contain jmps. It's backtracked + * as a single block. + * During backtracking insn 3 is not recognized as + * stack access, so at the end of backtracking + * stack slot fp-8 is still marked in stack_mask. + * However the parent state may not have accessed + * fp-8 and it's "unallocated" stack space. + * In such case fallback to conservative. + */ + mark_all_scalars_precise(env, st); + return 0; + } + + if (func->stack[i].slot_type[0] != STACK_SPILL) { + stack_mask &= ~(1ull << i); + continue; + } + reg = &func->stack[i].spilled_ptr; + if (reg->type != SCALAR_VALUE) { + stack_mask &= ~(1ull << i); + continue; + } + if (!reg->precise) + new_marks = true; + reg->precise = true; + } + if (env->log.level & BPF_LOG_LEVEL) { + print_verifier_state(env, func); + verbose(env, "parent %s regs=%x stack=%llx marks\n", + new_marks ? "didn't have" : "already had", + reg_mask, stack_mask); + } + + if (!reg_mask && !stack_mask) + break; + if (!new_marks) + break; + + last_idx = st->last_insn_idx; + first_idx = st->first_insn_idx; + } + return 0; +} + +static int mark_chain_precision(struct bpf_verifier_env *env, int regno) +{ + return __mark_chain_precision(env, regno, -1); +} + +static int mark_chain_precision_stack(struct bpf_verifier_env *env, int spi) +{ + return __mark_chain_precision(env, -1, spi); +} + static bool is_spillable_regtype(enum bpf_reg_type type) { switch (type) { @@ -1174,12 +1872,45 @@ static bool is_spillable_regtype(enum bpf_reg_type type) case PTR_TO_SOCK_COMMON_OR_NULL: case PTR_TO_TCP_SOCK: case PTR_TO_TCP_SOCK_OR_NULL: + case PTR_TO_XDP_SOCK: return true; default: return false; } } +/* Does this register contain a constant zero? */ +static bool register_is_null(struct bpf_reg_state *reg) +{ + return reg->type == SCALAR_VALUE && tnum_equals_const(reg->var_off, 0); +} + +static bool register_is_const(struct bpf_reg_state *reg) +{ + return reg->type == SCALAR_VALUE && tnum_is_const(reg->var_off); +} + +static bool __is_pointer_value(bool allow_ptr_leaks, + const struct bpf_reg_state *reg) +{ + if (allow_ptr_leaks) + return false; + + return reg->type != SCALAR_VALUE; +} + +static void save_register_state(struct bpf_func_state *state, + int spi, struct bpf_reg_state *reg) +{ + int i; + + state->stack[spi].spilled_ptr = *reg; + state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN; + + for (i = 0; i < BPF_REG_SIZE; i++) + state->stack[spi].slot_type[i] = STACK_SPILL; +} + /* check_stack_read/write functions track spill/fill of registers, * stack boundary and alignment are checked in check_mem_access() */ @@ -1189,7 +1920,8 @@ static int check_stack_write(struct bpf_verifier_env *env, { struct bpf_func_state *cur; /* state of the current function */ int i, slot = -off - 1, spi = slot / BPF_REG_SIZE, err; - enum bpf_reg_type type; + u32 dst_reg = env->prog->insnsi[insn_idx].dst_reg; + struct bpf_reg_state *reg = NULL; err = realloc_func_state(state, round_up(slot + 1, BPF_REG_SIZE), state->acquired_refs, true); @@ -1206,108 +1938,88 @@ static int check_stack_write(struct bpf_verifier_env *env, } cur = env->cur_state->frame[env->cur_state->curframe]; - if (value_regno >= 0 && - is_spillable_regtype((type = cur->regs[value_regno].type))) { + if (value_regno >= 0) + reg = &cur->regs[value_regno]; + if (!env->allow_ptr_leaks) { + bool sanitize = reg && is_spillable_regtype(reg->type); + for (i = 0; i < size; i++) { + u8 type = state->stack[spi].slot_type[i]; + + if (type != STACK_MISC && type != STACK_ZERO) { + sanitize = true; + break; + } + } + + if (sanitize) + env->insn_aux_data[insn_idx].sanitize_stack_spill = true; + } + + if (reg && size == BPF_REG_SIZE && register_is_const(reg) && + !register_is_null(reg) && env->allow_ptr_leaks) { + if (dst_reg != BPF_REG_FP) { + /* The backtracking logic can only recognize explicit + * stack slot address like [fp - 8]. Other spill of + * scalar via different register has to be conervative. + * Backtrack from here and mark all registers as precise + * that contributed into 'reg' being a constant. + */ + err = mark_chain_precision(env, value_regno); + if (err) + return err; + } + save_register_state(state, spi, reg); + } else if (reg && is_spillable_regtype(reg->type)) { /* register containing pointer is being spilled into stack */ if (size != BPF_REG_SIZE) { + verbose_linfo(env, insn_idx, "; "); verbose(env, "invalid size of register spill\n"); return -EACCES; } - - if (state != cur && type == PTR_TO_STACK) { + if (state != cur && reg->type == PTR_TO_STACK) { verbose(env, "cannot spill pointers to stack into stack frame of the caller\n"); return -EINVAL; } - - /* save register state */ - state->stack[spi].spilled_ptr = cur->regs[value_regno]; - state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN; - - for (i = 0; i < BPF_REG_SIZE; i++) { - if (state->stack[spi].slot_type[i] == STACK_MISC && - !env->allow_ptr_leaks) { - int *poff = &env->insn_aux_data[insn_idx].sanitize_stack_off; - int soff = (-spi - 1) * BPF_REG_SIZE; - - /* detected reuse of integer stack slot with a pointer - * which means either llvm is reusing stack slot or - * an attacker is trying to exploit CVE-2018-3639 - * (speculative store bypass) - * Have to sanitize that slot with preemptive - * store of zero. - */ - if (*poff && *poff != soff) { - /* disallow programs where single insn stores - * into two different stack slots, since verifier - * cannot sanitize them - */ - verbose(env, "insn %d cannot access two stack slots fp%d and fp%d", - insn_idx, *poff, soff); - return -EINVAL; - } - *poff = soff; - } - state->stack[spi].slot_type[i] = STACK_SPILL; - } + save_register_state(state, spi, reg); } else { - /* regular write of data into stack */ - state->stack[spi].spilled_ptr = (struct bpf_reg_state) {}; + u8 type = STACK_MISC; + /* regular write of data into stack destroys any spilled ptr */ + state->stack[spi].spilled_ptr.type = NOT_INIT; + /* Mark slots as STACK_MISC if they belonged to spilled ptr. */ + if (state->stack[spi].slot_type[0] == STACK_SPILL) + for (i = 0; i < BPF_REG_SIZE; i++) + state->stack[spi].slot_type[i] = STACK_MISC; + + /* only mark the slot as written if all 8 bytes were written + * otherwise read propagation may incorrectly stop too soon + * when stack slots are partially written. + * This heuristic means that read propagation will be + * conservative, since it will add reg_live_read marks + * to stack slots all the way to first state when programs + * writes+reads less than 8 bytes + */ + if (size == BPF_REG_SIZE) + state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN; + + /* when we zero initialize stack slots mark them as such */ + if (reg && register_is_null(reg)) { + /* backtracking doesn't work for STACK_ZERO yet. */ + err = mark_chain_precision(env, value_regno); + if (err) + return err; + type = STACK_ZERO; + } + + /* Mark slots affected by this stack write. */ for (i = 0; i < size; i++) state->stack[spi].slot_type[(slot - i) % BPF_REG_SIZE] = - STACK_MISC; + type; } return 0; } -/* registers of every function are unique and mark_reg_read() propagates - * the liveness in the following cases: - * - from callee into caller for R1 - R5 that were used as arguments - * - from caller into callee for R0 that used as result of the call - * - from caller to the same caller skipping states of the callee for R6 - R9, - * since R6 - R9 are callee saved by implicit function prologue and - * caller's R6 != callee's R6, so when we propagate liveness up to - * parent states we need to skip callee states for R6 - R9. - * - * stack slot marking is different, since stacks of caller and callee are - * accessible in both (since caller can pass a pointer to caller's stack to - * callee which can pass it to another function), hence mark_stack_slot_read() - * has to propagate the stack liveness to all parent states at given frame number. - * Consider code: - * f1() { - * ptr = fp - 8; - * *ptr = ctx; - * call f2 { - * .. = *ptr; - * } - * .. = *ptr; - * } - * First *ptr is reading from f1's stack and mark_stack_slot_read() has - * to mark liveness at the f1's frame and not f2's frame. - * Second *ptr is also reading from f1's stack and mark_stack_slot_read() has - * to propagate liveness to f2 states at f1's frame level and further into - * f1 states at f1's frame level until write into that stack slot - */ -static void mark_stack_slot_read(struct bpf_verifier_env *env, - const struct bpf_verifier_state *state, - struct bpf_verifier_state *parent, - int slot, int frameno) -{ - bool writes = parent == state->parent; /* Observe write marks */ - - while (parent) { - /* if read wasn't screened by an earlier write ... */ - if (writes && state->frame[frameno]->stack[slot].spilled_ptr.live & REG_LIVE_WRITTEN) - break; - /* ... then we depend on parent's value */ - parent->frame[frameno]->stack[slot].spilled_ptr.live |= REG_LIVE_READ; - state = parent; - parent = state->parent; - writes = true; - } -} - static int check_stack_read(struct bpf_verifier_env *env, struct bpf_func_state *reg_state /* func where register points to */, int off, int size, int value_regno) @@ -1315,6 +2027,7 @@ static int check_stack_read(struct bpf_verifier_env *env, struct bpf_verifier_state *vstate = env->cur_state; struct bpf_func_state *state = vstate->frame[vstate->curframe]; int i, slot = -off - 1, spi = slot / BPF_REG_SIZE; + struct bpf_reg_state *reg; u8 *stype; if (reg_state->allocated_stack <= slot) { @@ -1323,11 +2036,21 @@ static int check_stack_read(struct bpf_verifier_env *env, return -EACCES; } stype = reg_state->stack[spi].slot_type; + reg = ®_state->stack[spi].spilled_ptr; if (stype[0] == STACK_SPILL) { if (size != BPF_REG_SIZE) { - verbose(env, "invalid size of register spill\n"); - return -EACCES; + if (reg->type != SCALAR_VALUE) { + verbose_linfo(env, env->insn_idx, "; "); + verbose(env, "invalid size of register fill\n"); + return -EACCES; + } + if (value_regno >= 0) { + mark_reg_unknown(env, state->regs, value_regno); + state->regs[value_regno].live |= REG_LIVE_WRITTEN; + } + mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64); + return 0; } for (i = 1; i < BPF_REG_SIZE; i++) { if (stype[(slot - i) % BPF_REG_SIZE] != STACK_SPILL) { @@ -1338,24 +2061,64 @@ static int check_stack_read(struct bpf_verifier_env *env, if (value_regno >= 0) { /* restore register state from stack */ - state->regs[value_regno] = reg_state->stack[spi].spilled_ptr; - mark_stack_slot_read(env, vstate, vstate->parent, spi, - reg_state->frameno); + state->regs[value_regno] = *reg; + /* mark reg as written since spilled pointer state likely + * has its liveness marks cleared by is_state_visited() + * which resets stack/reg liveness for state transitions + */ + state->regs[value_regno].live |= REG_LIVE_WRITTEN; + } else if (__is_pointer_value(env->allow_ptr_leaks, reg)) { + /* If value_regno==-1, the caller is asking us whether + * it is acceptable to use this value as a SCALAR_VALUE + * (e.g. for XADD). + * We must not allow unprivileged callers to do that + * with spilled pointers. + */ + verbose(env, "leaking pointer from stack off %d\n", + off); + return -EACCES; } - return 0; + mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64); } else { + int zeros = 0; + for (i = 0; i < size; i++) { - if (stype[(slot - i) % BPF_REG_SIZE] != STACK_MISC) { - verbose(env, "invalid read from stack off %d+%d size %d\n", - off, i, size); - return -EACCES; + if (stype[(slot - i) % BPF_REG_SIZE] == STACK_MISC) + continue; + if (stype[(slot - i) % BPF_REG_SIZE] == STACK_ZERO) { + zeros++; + continue; } + verbose(env, "invalid read from stack off %d+%d size %d\n", + off, i, size); + return -EACCES; + } + mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64); + if (value_regno >= 0) { + if (zeros == size) { + /* any size read into register is zero extended, + * so the whole register == const_zero + */ + __mark_reg_const_zero(&state->regs[value_regno]); + /* backtracking doesn't support STACK_ZERO yet, + * so mark it precise here, so that later + * backtracking can stop here. + * Backtracking may not need this if this register + * doesn't participate in pointer adjustment. + * Forward propagation of precise flag is not + * necessary either. This mark is only to stop + * backtracking. Any register that contributed + * to const 0 was marked precise before spill. + */ + state->regs[value_regno].precise = true; + } else { + /* have read misc data from the stack */ + mark_reg_unknown(env, state->regs, value_regno); + } + state->regs[value_regno].live |= REG_LIVE_WRITTEN; } - if (value_regno >= 0) - /* have read misc data from the stack */ - mark_reg_unknown(env, state->regs, value_regno); - return 0; } + return 0; } static int check_stack_access(struct bpf_verifier_env *env, @@ -1370,7 +2133,7 @@ static int check_stack_access(struct bpf_verifier_env *env, char tn_buf[48]; tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); - verbose(env, "variable stack access var_off=%s off=%d size=%d", + verbose(env, "variable stack access var_off=%s off=%d size=%d\n", tn_buf, off, size); return -EACCES; } @@ -1383,14 +2146,37 @@ static int check_stack_access(struct bpf_verifier_env *env, return 0; } +static int check_map_access_type(struct bpf_verifier_env *env, u32 regno, + int off, int size, enum bpf_access_type type) +{ + struct bpf_reg_state *regs = cur_regs(env); + struct bpf_map *map = regs[regno].map_ptr; + u32 cap = bpf_map_flags_to_cap(map); + + if (type == BPF_WRITE && !(cap & BPF_MAP_CAN_WRITE)) { + verbose(env, "write into map forbidden, value_size=%d off=%d size=%d\n", + map->value_size, off, size); + return -EACCES; + } + + if (type == BPF_READ && !(cap & BPF_MAP_CAN_READ)) { + verbose(env, "read from map forbidden, value_size=%d off=%d size=%d\n", + map->value_size, off, size); + return -EACCES; + } + + return 0; +} + /* check read/write into map element returned by bpf_map_lookup_elem() */ static int __check_map_access(struct bpf_verifier_env *env, u32 regno, int off, - int size) + int size, bool zero_size_allowed) { struct bpf_reg_state *regs = cur_regs(env); struct bpf_map *map = regs[regno].map_ptr; - if (off < 0 || size <= 0 || off + size > map->value_size) { + if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) || + off + size > map->value_size) { verbose(env, "invalid access to map value, value_size=%d off=%d size=%d\n", map->value_size, off, size); return -EACCES; @@ -1400,7 +2186,7 @@ static int __check_map_access(struct bpf_verifier_env *env, u32 regno, int off, /* check read/write into a map element with possible variable offset */ static int check_map_access(struct bpf_verifier_env *env, u32 regno, - int off, int size) + int off, int size, bool zero_size_allowed) { struct bpf_verifier_state *vstate = env->cur_state; struct bpf_func_state *state = vstate->frame[vstate->curframe]; @@ -1411,7 +2197,7 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, * need to try adding each of min_value and max_value to off * to make sure our theoretical access will be safe. */ - if (env->log.level) + if (env->log.level & BPF_LOG_LEVEL) print_verifier_state(env, state); /* The minimum value is only important with signed @@ -1428,10 +2214,11 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, regno); return -EACCES; } - - err = __check_map_access(env, regno, reg->smin_value + off, size); + err = __check_map_access(env, regno, reg->smin_value + off, size, + zero_size_allowed); if (err) { - verbose(env, "R%d min value is outside of the array range\n", regno); + verbose(env, "R%d min value is outside of the array range\n", + regno); return err; } @@ -1444,13 +2231,15 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, regno); return -EACCES; } - - err = __check_map_access(env, regno, reg->umax_value + off, size); + err = __check_map_access(env, regno, reg->umax_value + off, size, + zero_size_allowed); if (err) - verbose(env, "R%d max value is outside of the array range\n", regno); + verbose(env, "R%d max value is outside of the array range\n", + regno); if (map_value_has_spin_lock(reg->map_ptr)) { u32 lock = reg->map_ptr->spin_lock_off; + /* if any part of struct bpf_spin_lock can be touched by * load/store reject this program. * To check that [x1, x2) overlaps with [y1, y2) @@ -1462,7 +2251,6 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, return -EACCES; } } - return err; } @@ -1473,40 +2261,49 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env, enum bpf_access_type t) { switch (env->prog->type) { + /* Program types only with direct read access go here! */ case BPF_PROG_TYPE_LWT_IN: case BPF_PROG_TYPE_LWT_OUT: - /* dst_input() and dst_output() can't write for now */ + case BPF_PROG_TYPE_LWT_SEG6LOCAL: + case BPF_PROG_TYPE_SK_REUSEPORT: + case BPF_PROG_TYPE_FLOW_DISSECTOR: + case BPF_PROG_TYPE_CGROUP_SKB: if (t == BPF_WRITE) return false; /* fallthrough */ + + /* Program types with direct read + write access go here! */ case BPF_PROG_TYPE_SCHED_CLS: case BPF_PROG_TYPE_SCHED_ACT: case BPF_PROG_TYPE_XDP: case BPF_PROG_TYPE_LWT_XMIT: case BPF_PROG_TYPE_SK_SKB: case BPF_PROG_TYPE_SK_MSG: - case BPF_PROG_TYPE_FLOW_DISSECTOR: if (meta) return meta->pkt_access; env->seen_direct_write = true; return true; + case BPF_PROG_TYPE_CGROUP_SOCKOPT: if (t == BPF_WRITE) env->seen_direct_write = true; + return true; + default: return false; } } static int __check_packet_access(struct bpf_verifier_env *env, u32 regno, - int off, int size) + int off, int size, bool zero_size_allowed) { struct bpf_reg_state *regs = cur_regs(env); struct bpf_reg_state *reg = ®s[regno]; - if (off < 0 || size <= 0 || (u64)off + size > reg->range) { + if (off < 0 || size < 0 || (size == 0 && !zero_size_allowed) || + (u64)off + size > reg->range) { verbose(env, "invalid access to packet, off=%d size=%d, R%d(id=%d,off=%d,r=%d)\n", off, size, regno, reg->id, reg->off, reg->range); return -EACCES; @@ -1515,7 +2312,7 @@ static int __check_packet_access(struct bpf_verifier_env *env, u32 regno, } static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off, - int size) + int size, bool zero_size_allowed) { struct bpf_reg_state *regs = cur_regs(env); struct bpf_reg_state *reg = ®s[regno]; @@ -1534,11 +2331,22 @@ static int check_packet_access(struct bpf_verifier_env *env, u32 regno, int off, regno); return -EACCES; } - err = __check_packet_access(env, regno, off, size); + err = __check_packet_access(env, regno, off, size, zero_size_allowed); if (err) { verbose(env, "R%d offset is outside of the packet\n", regno); return err; } + + /* __check_packet_access has made sure "off + size - 1" is within u16. + * reg->umax_value can't be bigger than MAX_PACKET_OFF which is 0xffff, + * otherwise find_good_pkt_pointers would have refused to set range info + * that __check_packet_access would have rejected this pkt access. + * Therefore, "off + reg->umax_value + size - 1" won't overflow u32. + */ + env->prog->aux->max_pkt_offset = + max_t(u32, env->prog->aux->max_pkt_offset, + off + reg->umax_value + size - 1); + return err; } @@ -1550,10 +2358,6 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, .reg_type = *reg_type, }; - /* for analyzer ctx accesses are already validated and converted */ - if (env->analyzer_ops) - return 0; - if (env->ops->is_valid_access && env->ops->is_valid_access(off, size, t, env->prog, &info)) { /* A non zero info.ctx_field_size indicates that this field is a @@ -1563,9 +2367,9 @@ static int check_ctx_access(struct bpf_verifier_env *env, int insn_idx, int off, * will only allow for whole field access and rejects any other * type of narrower access. */ - env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size; *reg_type = info.reg_type; + env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size; /* remember the offset of last byte accessed in ctx */ if (env->prog->aux->max_ctx_offset < off + size) env->prog->aux->max_ctx_offset = off + size; @@ -1581,6 +2385,8 @@ static int check_flow_keys_access(struct bpf_verifier_env *env, int off, { if (size < 0 || off < 0 || (u64)off + size > sizeof(struct bpf_flow_keys)) { + verbose(env, "invalid access to flow keys off=%d size=%d\n", + off, size); return -EACCES; } return 0; @@ -1596,6 +2402,8 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, bool valid; if (reg->smin_value < 0) { + verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n", + regno); return -EACCES; } @@ -1609,10 +2417,14 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, case PTR_TO_TCP_SOCK: valid = bpf_tcp_sock_is_valid_access(off, size, t, &info); break; + case PTR_TO_XDP_SOCK: + valid = bpf_xdp_sock_is_valid_access(off, size, t, &info); + break; default: valid = false; } + if (valid) { env->insn_aux_data[insn_idx].ctx_field_size = info.ctx_field_size; @@ -1621,18 +2433,10 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, verbose(env, "R%d invalid %s access off=%d size=%d\n", regno, reg_type_str[reg->type], off, size); + return -EACCES; } -static bool __is_pointer_value(bool allow_ptr_leaks, - const struct bpf_reg_state *reg) -{ - if (allow_ptr_leaks) - return false; - - return reg->type != SCALAR_VALUE; -} - static struct bpf_reg_state *reg_state(struct bpf_verifier_env *env, int regno) { return cur_regs(env) + regno; @@ -1640,12 +2444,12 @@ static struct bpf_reg_state *reg_state(struct bpf_verifier_env *env, int regno) static bool is_pointer_value(struct bpf_verifier_env *env, int regno) { - return __is_pointer_value(env->allow_ptr_leaks, cur_regs(env) + regno); + return __is_pointer_value(env->allow_ptr_leaks, reg_state(env, regno)); } static bool is_ctx_reg(struct bpf_verifier_env *env, int regno) { - const struct bpf_reg_state *reg = cur_regs(env) + regno; + const struct bpf_reg_state *reg = reg_state(env, regno); return reg->type == PTR_TO_CTX; } @@ -1653,14 +2457,23 @@ static bool is_ctx_reg(struct bpf_verifier_env *env, int regno) static bool is_sk_reg(struct bpf_verifier_env *env, int regno) { const struct bpf_reg_state *reg = reg_state(env, regno); + return type_is_sk_pointer(reg->type); } static bool is_pkt_reg(struct bpf_verifier_env *env, int regno) { - const struct bpf_reg_state *reg = cur_regs(env) + regno; + const struct bpf_reg_state *reg = reg_state(env, regno); - return reg->type == PTR_TO_PACKET; + return type_is_pkt_pointer(reg->type); +} + +static bool is_flow_key_reg(struct bpf_verifier_env *env, int regno) +{ + const struct bpf_reg_state *reg = reg_state(env, regno); + + /* Separate to is_ctx_reg() since we still want to allow BPF_ST here. */ + return reg->type == PTR_TO_FLOW_KEYS; } static int check_pkt_ptr_alignment(struct bpf_verifier_env *env, @@ -1689,7 +2502,8 @@ static int check_pkt_ptr_alignment(struct bpf_verifier_env *env, char tn_buf[48]; tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); - verbose(env, "misaligned packet access off %d+%s+%d+%d size %d\n", + verbose(env, + "misaligned packet access off %d+%s+%d+%d size %d\n", ip_align, tn_buf, reg->off, off, size); return -EACCES; } @@ -1761,6 +2575,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env, case PTR_TO_TCP_SOCK: pointer_desc = "tcp_sock "; break; + case PTR_TO_XDP_SOCK: + pointer_desc = "xdp_sock "; + break; default: break; } @@ -1793,11 +2610,35 @@ static int check_max_stack_depth(struct bpf_verifier_env *env) int depth = 0, frame = 0, idx = 0, i = 0, subprog_end; struct bpf_subprog_info *subprog = env->subprog_info; struct bpf_insn *insn = env->prog->insnsi; - int insn_cnt = env->prog->len; int ret_insn[MAX_CALL_FRAMES]; int ret_prog[MAX_CALL_FRAMES]; process_func: + /* protect against potential stack overflow that might happen when + * bpf2bpf calls get combined with tailcalls. Limit the caller's stack + * depth for such case down to 256 so that the worst case scenario + * would result in 8k stack size (32 which is tailcall limit * 256 = + * 8k). + * + * To get the idea what might happen, see an example: + * func1 -> sub rsp, 128 + * subfunc1 -> sub rsp, 256 + * tailcall1 -> add rsp, 256 + * func2 -> sub rsp, 192 (total stack size = 128 + 192 = 320) + * subfunc2 -> sub rsp, 64 + * subfunc22 -> sub rsp, 128 + * tailcall2 -> add rsp, 128 + * func3 -> sub rsp, 32 (total stack size 128 + 192 + 64 + 32 = 416) + * + * tailcall will unwind the current stack frame but it will not get rid + * of caller's stack as shown on the example above. + */ + if (idx && subprog[idx].has_tail_call && depth >= 256) { + verbose(env, + "tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n", + depth); + return -EACCES; + } /* round up to 32-bytes, since this is granularity * of interpreter stack size */ @@ -1807,12 +2648,8 @@ process_func: frame + 1, depth); return -EACCES; } - continue_func: - if (env->subprog_cnt == idx + 1) - subprog_end = insn_cnt; - else - subprog_end = subprog[idx + 1].start; + subprog_end = subprog[idx + 1].start; for (; i < subprog_end; i++) { if (insn[i].code != (BPF_JMP | BPF_CALL)) continue; @@ -1821,6 +2658,7 @@ continue_func: /* remember insn and function to return to */ ret_insn[frame] = i + 1; ret_prog[frame] = idx; + /* find the callee */ i = i + insn[i].imm + 1; idx = find_subprog(env, i); @@ -1831,8 +2669,9 @@ continue_func: } frame++; if (frame >= MAX_CALL_FRAMES) { - WARN_ONCE(1, "verifier bug. Call stack is too deep\n"); - return -EFAULT; + verbose(env, "the call stack of %d frames is too deep !\n", + frame); + return -E2BIG; } goto process_func; } @@ -1841,7 +2680,6 @@ continue_func: */ if (frame == 0) return 0; - depth -= round_up(max_t(u32, subprog[idx].stack_depth, 1), 32); frame--; i = ret_insn[frame]; @@ -1854,14 +2692,13 @@ static int get_callee_stack_depth(struct bpf_verifier_env *env, const struct bpf_insn *insn, int idx) { int start = idx + insn->imm + 1, subprog; - subprog = find_subprog(env, start); + subprog = find_subprog(env, start); if (subprog < 0) { WARN_ONCE(1, "verifier bug. No program starts at insn %d\n", start); return -EFAULT; } - return env->subprog_info[subprog].stack_depth; } #endif @@ -1900,16 +2737,15 @@ static int check_tp_buffer_access(struct bpf_verifier_env *env, regno, off, size); return -EACCES; } - if (!tnum_is_const(reg->var_off) || reg->var_off.value) { char tn_buf[48]; + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); verbose(env, "R%d invalid variable buffer offset: off=%d, var_off=%s", regno, off, tn_buf); return -EACCES; } - if (off + size > env->prog->aux->max_tp_access) env->prog->aux->max_tp_access = off + size; @@ -1940,6 +2776,41 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size) reg->smax_value = reg->umax_value; } +static bool bpf_map_is_rdonly(const struct bpf_map *map) +{ + return (map->map_flags & BPF_F_RDONLY_PROG) && map->frozen; +} + +static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val) +{ + void *ptr; + u64 addr; + int err; + + err = map->ops->map_direct_value_addr(map, &addr, off); + if (err) + return err; + ptr = (void *)(long)addr + off; + + switch (size) { + case sizeof(u8): + *val = (u64)*(u8 *)ptr; + break; + case sizeof(u16): + *val = (u64)*(u16 *)ptr; + break; + case sizeof(u32): + *val = (u64)*(u32 *)ptr; + break; + case sizeof(u64): + *val = *(u64 *)ptr; + break; + default: + return -EINVAL; + } + return 0; +} + /* check whether memory at (regno + off) is accessible for t = (read | write) * if t==write, value_regno is a register which value is stored into memory * if t==read, value_regno is a register which will receive the value from memory @@ -1973,11 +2844,31 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn verbose(env, "R%d leaks addr into map\n", value_regno); return -EACCES; } + err = check_map_access_type(env, regno, off, size, t); + if (err) + return err; + err = check_map_access(env, regno, off, size, false); + if (!err && t == BPF_READ && value_regno >= 0) { + struct bpf_map *map = reg->map_ptr; - err = check_map_access(env, regno, off, size); - if (!err && t == BPF_READ && value_regno >= 0) - mark_reg_unknown(env, regs, value_regno); + /* if map is read-only, track its contents as scalars */ + if (tnum_is_const(reg->var_off) && + bpf_map_is_rdonly(map) && + map->ops->map_direct_value_addr) { + int map_off = off + reg->var_off.value; + u64 val = 0; + err = bpf_map_direct_read(map, map_off, size, + &val); + if (err) + return err; + + regs[value_regno].type = SCALAR_VALUE; + __mark_reg_known(®s[value_regno], val); + } else { + mark_reg_unknown(env, regs, value_regno); + } + } } else if (reg->type == PTR_TO_CTX) { enum bpf_reg_type reg_type = SCALAR_VALUE; @@ -1986,6 +2877,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn verbose(env, "R%d leaks addr into ctx\n", value_regno); return -EACCES; } + err = check_ctx_reg(env, reg, regno); if (err < 0) return err; @@ -1993,19 +2885,23 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn err = check_ctx_access(env, insn_idx, off, size, t, ®_type); if (!err && t == BPF_READ && value_regno >= 0) { /* ctx access returns either a scalar, or a - * PTR_TO_PACKET[_END]. In the latter case, we know - * the offset is zero. + * PTR_TO_PACKET[_META,_END]. In the latter + * case, we know the offset is zero. */ if (reg_type == SCALAR_VALUE) { mark_reg_unknown(env, regs, value_regno); } else { - mark_reg_known_zero(env, regs, value_regno); + mark_reg_known_zero(env, regs, + value_regno); if (reg_type_may_be_null(reg_type)) regs[value_regno].id = ++env->id_gen; + /* A load of ctx field could have different + * actual load size with the one encoded in the + * insn. When the dst is PTR, it is for sure not + * a sub-register. + */ + regs[value_regno].subreg_def = DEF_NOT_SUBREG; } - regs[value_regno].id = 0; - regs[value_regno].off = 0; - regs[value_regno].range = 0; regs[value_regno].type = reg_type; } @@ -2025,25 +2921,32 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn value_regno, insn_idx); else err = check_stack_read(env, state, off, size, - value_regno); + value_regno); } else if (reg_is_pkt_pointer(reg)) { if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) { + verbose(env, "cannot write into packet\n"); return -EACCES; } if (t == BPF_WRITE && value_regno >= 0 && is_pointer_value(env, value_regno)) { + verbose(env, "R%d leaks addr into packet\n", + value_regno); return -EACCES; } - err = check_packet_access(env, regno, off, size); + err = check_packet_access(env, regno, off, size, false); if (!err && t == BPF_READ && value_regno >= 0) mark_reg_unknown(env, regs, value_regno); } else if (reg->type == PTR_TO_FLOW_KEYS) { if (t == BPF_WRITE && value_regno >= 0 && is_pointer_value(env, value_regno)) { + verbose(env, "R%d leaks addr into flow keys\n", + value_regno); return -EACCES; } err = check_flow_keys_access(env, off, size); + if (!err && t == BPF_READ && value_regno >= 0) + mark_reg_unknown(env, regs, value_regno); } else if (type_is_sk_pointer(reg->type)) { if (t == BPF_WRITE) { verbose(env, "R%d cannot write into %s\n", @@ -2058,6 +2961,8 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn if (!err && t == BPF_READ && value_regno >= 0) mark_reg_unknown(env, regs, value_regno); } else { + verbose(env, "R%d invalid mem access '%s'\n", regno, + reg_type_str[reg->type]); return -EACCES; } @@ -2095,11 +3000,12 @@ static int check_xadd(struct bpf_verifier_env *env, int insn_idx, struct bpf_ins } if (is_ctx_reg(env, insn->dst_reg) || - is_pkt_reg(env, insn->dst_reg) || + is_pkt_reg(env, insn->dst_reg) || + is_flow_key_reg(env, insn->dst_reg) || is_sk_reg(env, insn->dst_reg)) { verbose(env, "BPF_XADD stores into R%d %s is not allowed\n", - insn->dst_reg, is_ctx_reg(env, insn->dst_reg) ? - "context" : "packet"); + insn->dst_reg, + reg_type_str[reg_state(env, insn->dst_reg)->type]); return -EACCES; } @@ -2114,10 +3020,27 @@ static int check_xadd(struct bpf_verifier_env *env, int insn_idx, struct bpf_ins BPF_SIZE(insn->code), BPF_WRITE, -1, true); } -/* Does this register contain a constant zero? */ -static bool register_is_null(struct bpf_reg_state *reg) +static int __check_stack_boundary(struct bpf_verifier_env *env, u32 regno, + int off, int access_size, + bool zero_size_allowed) { - return reg->type == SCALAR_VALUE && tnum_equals_const(reg->var_off, 0); + struct bpf_reg_state *reg = reg_state(env, regno); + + if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 || + access_size < 0 || (access_size == 0 && !zero_size_allowed)) { + if (tnum_is_const(reg->var_off)) { + verbose(env, "invalid stack type R%d off=%d access_size=%d\n", + regno, off, access_size); + } else { + char tn_buf[48]; + + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); + verbose(env, "invalid stack type R%d var_off=%s access_size=%d\n", + regno, tn_buf, access_size); + } + return -EACCES; + } + return 0; } /* when register 'regno' is passed into function that will read 'access_size' @@ -2130,9 +3053,9 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno, int access_size, bool zero_size_allowed, struct bpf_call_arg_meta *meta) { - struct bpf_reg_state *reg = cur_regs(env) + regno; + struct bpf_reg_state *reg = reg_state(env, regno); struct bpf_func_state *state = func(env, reg); - int off, i, slot, spi; + int err, min_off, max_off, i, j, slot, spi; if (reg->type != PTR_TO_STACK) { /* Allow zero-byte read from NULL, regardless of pointer type */ @@ -2146,21 +3069,57 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno, return -EACCES; } - /* Only allow fixed-offset stack reads */ - if (!tnum_is_const(reg->var_off)) { - char tn_buf[48]; + if (tnum_is_const(reg->var_off)) { + min_off = max_off = reg->var_off.value + reg->off; + err = __check_stack_boundary(env, regno, min_off, access_size, + zero_size_allowed); + if (err) + return err; + } else { + /* Variable offset is prohibited for unprivileged mode for + * simplicity since it requires corresponding support in + * Spectre masking for stack ALU. + * See also retrieve_ptr_limit(). + */ + if (!env->allow_ptr_leaks) { + char tn_buf[48]; - tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); - verbose(env, "invalid variable stack read R%d var_off=%s\n", - regno, tn_buf); - return -EACCES; - } - off = reg->off + reg->var_off.value; - if (off >= 0 || off < -MAX_BPF_STACK || off + access_size > 0 || - access_size <= 0) { - verbose(env, "invalid stack type R%d off=%d access_size=%d\n", - regno, off, access_size); - return -EACCES; + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); + verbose(env, "R%d indirect variable offset stack access prohibited for !root, var_off=%s\n", + regno, tn_buf); + return -EACCES; + } + /* Only initialized buffer on stack is allowed to be accessed + * with variable offset. With uninitialized buffer it's hard to + * guarantee that whole memory is marked as initialized on + * helper return since specific bounds are unknown what may + * cause uninitialized stack leaking. + */ + if (meta && meta->raw_mode) + meta = NULL; + + if (reg->smax_value >= BPF_MAX_VAR_OFF || + reg->smax_value <= -BPF_MAX_VAR_OFF) { + verbose(env, "R%d unbounded indirect variable offset stack access\n", + regno); + return -EACCES; + } + min_off = reg->smin_value + reg->off; + max_off = reg->smax_value + reg->off; + err = __check_stack_boundary(env, regno, min_off, access_size, + zero_size_allowed); + if (err) { + verbose(env, "R%d min value is outside of stack bound\n", + regno); + return err; + } + err = __check_stack_boundary(env, regno, max_off, access_size, + zero_size_allowed); + if (err) { + verbose(env, "R%d max value is outside of stack bound\n", + regno); + return err; + } } if (meta && meta->raw_mode) { @@ -2169,19 +3128,50 @@ static int check_stack_boundary(struct bpf_verifier_env *env, int regno, return 0; } - for (i = 0; i < access_size; i++) { - slot = -(off + i) - 1; - spi = slot / BPF_REG_SIZE; - if (state->allocated_stack <= slot || - state->stack[spi].slot_type[slot % BPF_REG_SIZE] != - STACK_MISC) { - verbose(env, "invalid indirect read from stack off %d+%d size %d\n", - off, i, access_size); - return -EACCES; - } - } + for (i = min_off; i < max_off + access_size; i++) { + u8 *stype; - return update_stack_depth(env, state, off); + slot = -i - 1; + spi = slot / BPF_REG_SIZE; + if (state->allocated_stack <= slot) + goto err; + stype = &state->stack[spi].slot_type[slot % BPF_REG_SIZE]; + if (*stype == STACK_MISC) + goto mark; + if (*stype == STACK_ZERO) { + /* helper can write anything into the stack */ + *stype = STACK_MISC; + goto mark; + } + if (state->stack[spi].slot_type[0] == STACK_SPILL && + state->stack[spi].spilled_ptr.type == SCALAR_VALUE) { + __mark_reg_unknown(env, &state->stack[spi].spilled_ptr); + for (j = 0; j < BPF_REG_SIZE; j++) + state->stack[spi].slot_type[j] = STACK_MISC; + goto mark; + } + +err: + if (tnum_is_const(reg->var_off)) { + verbose(env, "invalid indirect read from stack off %d+%d size %d\n", + min_off, i - min_off, access_size); + } else { + char tn_buf[48]; + + tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); + verbose(env, "invalid indirect read from stack var_off %s+%d size %d\n", + tn_buf, i - min_off, access_size); + } + return -EACCES; +mark: + /* reading any byte out of 8-byte 'spill_slot' will cause + * the whole slot to be marked as 'read' + */ + mark_reg_read(env, &state->stack[spi].spilled_ptr, + state->stack[spi].spilled_ptr.parent, + REG_LIVE_READ64); + } + return update_stack_depth(env, state, min_off); } static int check_helper_mem_access(struct bpf_verifier_env *env, int regno, @@ -2193,11 +3183,15 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno, switch (reg->type) { case PTR_TO_PACKET: case PTR_TO_PACKET_META: - return check_packet_access(env, regno, reg->off, access_size); - case PTR_TO_FLOW_KEYS: - return check_flow_keys_access(env, reg->off, access_size); + return check_packet_access(env, regno, reg->off, access_size, + zero_size_allowed); case PTR_TO_MAP_VALUE: - return check_map_access(env, regno, reg->off, access_size); + if (check_map_access_type(env, regno, reg->off, access_size, + meta && meta->raw_mode ? BPF_WRITE : + BPF_READ)) + return -EACCES; + return check_map_access(env, regno, reg->off, access_size, + zero_size_allowed); default: /* scalar_value|ptr_to_stack or invalid ptr */ return check_stack_boundary(env, regno, access_size, zero_size_allowed, meta); @@ -2231,6 +3225,7 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno, bool is_const = tnum_is_const(reg->var_off); struct bpf_map *map = reg->map_ptr; u64 val = reg->var_off.value; + if (reg->type != PTR_TO_MAP_VALUE) { verbose(env, "R%d is not a pointer to map_value\n", regno); return -EINVAL; @@ -2285,7 +3280,6 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno, } cur->active_spin_lock = 0; } - return 0; } @@ -2314,6 +3308,7 @@ static int int_ptr_type_to_size(enum bpf_arg_type type) return sizeof(u32); else if (type == ARG_PTR_TO_LONG) return sizeof(u64); + return -EINVAL; } @@ -2334,7 +3329,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, if (arg_type == ARG_ANYTHING) { if (is_pointer_value(env, regno)) { - verbose(env, "R%d leaks addr into helper function\n", regno); + verbose(env, "R%d leaks addr into helper function\n", + regno); return -EACCES; } return 0; @@ -2379,16 +3375,15 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, /* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */ if (!type_is_sk_pointer(type)) goto err_type; - } else if (arg_type == ARG_PTR_TO_SOCKET) { - expected_type = PTR_TO_SOCKET; - if (type != expected_type) - goto err_type; - if (meta->ptr_id || !reg->id) { - verbose(env, "verifier internal error: mismatched references meta=%d, reg=%d\n", - meta->ptr_id, reg->id); - return -EFAULT; + if (reg->ref_obj_id) { + if (meta->ref_obj_id) { + verbose(env, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n", + regno, reg->ref_obj_id, + meta->ref_obj_id); + return -EFAULT; + } + meta->ref_obj_id = reg->ref_obj_id; } - meta->ptr_id = reg->id; } else if (arg_type == ARG_PTR_TO_SOCKET) { expected_type = PTR_TO_SOCKET; if (type != expected_type) @@ -2471,13 +3466,11 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, /* remember the mem_size which may be used later * to refine return values. */ - meta->msize_smax_value = reg->smax_value; - meta->msize_umax_value = reg->umax_value; + meta->msize_max_value = reg->umax_value; /* The register is SCALAR_VALUE; the access check * happens using its boundaries. */ - if (!tnum_is_const(reg->var_off)) /* For unprivileged variable accesses, disable raw * mode so that the program is required to @@ -2508,8 +3501,11 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, err = check_helper_mem_access(env, regno - 1, reg->umax_value, zero_size_allowed, meta); + if (!err) + err = mark_chain_precision(env, regno); } else if (arg_type_is_int_ptr(arg_type)) { int size = int_ptr_type_to_size(arg_type); + err = check_helper_mem_access(env, regno, size, false, meta); if (err) return err; @@ -2537,7 +3533,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, break; case BPF_MAP_TYPE_PERF_EVENT_ARRAY: if (func_id != BPF_FUNC_perf_event_read && - func_id != BPF_FUNC_perf_event_output) + func_id != BPF_FUNC_perf_event_output && + func_id != BPF_FUNC_perf_event_read_value) goto error; break; case BPF_MAP_TYPE_STACK_TRACE: @@ -2560,11 +3557,18 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_map_lookup_elem) goto error; break; - /* Restrict bpf side of cpumap, open when use-cases appear */ + /* Restrict bpf side of cpumap and xskmap, open when use-cases + * appear. + */ case BPF_MAP_TYPE_CPUMAP: if (func_id != BPF_FUNC_redirect_map) goto error; break; + case BPF_MAP_TYPE_XSKMAP: + if (func_id != BPF_FUNC_redirect_map && + func_id != BPF_FUNC_map_lookup_elem) + goto error; + break; case BPF_MAP_TYPE_ARRAY_OF_MAPS: case BPF_MAP_TYPE_HASH_OF_MAPS: if (func_id != BPF_FUNC_map_lookup_elem) @@ -2577,11 +3581,6 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_msg_redirect_map) goto error; break; - case BPF_MAP_TYPE_SK_STORAGE: - if (func_id != BPF_FUNC_sk_storage_get && - func_id != BPF_FUNC_sk_storage_delete) - goto error; - break; case BPF_MAP_TYPE_SOCKHASH: if (func_id != BPF_FUNC_sk_redirect_hash && func_id != BPF_FUNC_sock_hash_update && @@ -2589,6 +3588,22 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_msg_redirect_hash) goto error; break; + case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY: + if (func_id != BPF_FUNC_sk_select_reuseport) + goto error; + break; + case BPF_MAP_TYPE_QUEUE: + case BPF_MAP_TYPE_STACK: + if (func_id != BPF_FUNC_map_peek_elem && + func_id != BPF_FUNC_map_pop_elem && + func_id != BPF_FUNC_map_push_elem) + goto error; + break; + case BPF_MAP_TYPE_SK_STORAGE: + if (func_id != BPF_FUNC_sk_storage_get && + func_id != BPF_FUNC_sk_storage_delete) + goto error; + break; default: break; } @@ -2605,6 +3620,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, break; case BPF_FUNC_perf_event_read: case BPF_FUNC_perf_event_output: + case BPF_FUNC_perf_event_read_value: if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY) goto error; break; @@ -2619,7 +3635,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, break; case BPF_FUNC_redirect_map: if (map->map_type != BPF_MAP_TYPE_DEVMAP && - map->map_type != BPF_MAP_TYPE_DEVMAP_HASH) + map->map_type != BPF_MAP_TYPE_DEVMAP_HASH && + map->map_type != BPF_MAP_TYPE_CPUMAP && + map->map_type != BPF_MAP_TYPE_XSKMAP) goto error; break; case BPF_FUNC_sk_redirect_map: @@ -2639,6 +3657,17 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, map->map_type != BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) goto error; break; + case BPF_FUNC_sk_select_reuseport: + if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) + goto error; + break; + case BPF_FUNC_map_peek_elem: + case BPF_FUNC_map_pop_elem: + case BPF_FUNC_map_push_elem: + if (map->map_type != BPF_MAP_TYPE_QUEUE && + map->map_type != BPF_MAP_TYPE_STACK) + goto error; + break; case BPF_FUNC_sk_storage_get: case BPF_FUNC_sk_storage_delete: if (map->map_type != BPF_MAP_TYPE_SK_STORAGE) @@ -2669,6 +3698,7 @@ static bool check_raw_mode_ok(const struct bpf_func_proto *fn) count++; if (fn->arg5_type == ARG_PTR_TO_UNINIT_MEM) count++; + /* We only support one arg being in raw mode at the moment, * which is sufficient for the helper functions we have * right now. @@ -2699,37 +3729,46 @@ static bool check_arg_pair_ok(const struct bpf_func_proto *fn) check_args_pair_invalid(fn->arg3_type, fn->arg4_type) || check_args_pair_invalid(fn->arg4_type, fn->arg5_type)) return false; + return true; } -static bool check_refcount_ok(const struct bpf_func_proto *fn) +static bool check_refcount_ok(const struct bpf_func_proto *fn, int func_id) { int count = 0; - if (arg_type_is_refcounted(fn->arg1_type)) + + if (arg_type_may_be_refcounted(fn->arg1_type)) count++; - if (arg_type_is_refcounted(fn->arg2_type)) + if (arg_type_may_be_refcounted(fn->arg2_type)) count++; - if (arg_type_is_refcounted(fn->arg3_type)) + if (arg_type_may_be_refcounted(fn->arg3_type)) count++; - if (arg_type_is_refcounted(fn->arg4_type)) + if (arg_type_may_be_refcounted(fn->arg4_type)) count++; - if (arg_type_is_refcounted(fn->arg5_type)) + if (arg_type_may_be_refcounted(fn->arg5_type)) count++; + + /* A reference acquiring function cannot acquire + * another refcounted ptr. + */ + if (is_acquire_function(func_id) && count) + return false; + /* We only support one arg being unreferenced at the moment, * which is sufficient for the helper functions we have right now. */ return count <= 1; } -static int check_func_proto(const struct bpf_func_proto *fn) +static int check_func_proto(const struct bpf_func_proto *fn, int func_id) { return check_raw_mode_ok(fn) && check_arg_pair_ok(fn) && - check_refcount_ok(fn) ? 0 : -EINVAL; + check_refcount_ok(fn, func_id) ? 0 : -EINVAL; } -/* Packet data might have moved, any old PTR_TO_PACKET[_END] are now invalid, - * so turn them into unknown SCALAR_VALUE. +/* Packet data might have moved, any old PTR_TO_PACKET[_META,_END] + * are now invalid, so turn them into unknown SCALAR_VALUE. */ static void __clear_all_pkt_pointers(struct bpf_verifier_env *env, struct bpf_func_state *state) @@ -2745,7 +3784,7 @@ static void __clear_all_pkt_pointers(struct bpf_verifier_env *env, if (!reg) continue; if (reg_is_pkt_pointer_any(reg)) - __mark_reg_unknown(reg); + __mark_reg_unknown(env, reg); } } @@ -2759,18 +3798,21 @@ static void clear_all_pkt_pointers(struct bpf_verifier_env *env) } static void release_reg_references(struct bpf_verifier_env *env, - struct bpf_func_state *state, int id) + struct bpf_func_state *state, + int ref_obj_id) { struct bpf_reg_state *regs = state->regs, *reg; int i; + for (i = 0; i < MAX_BPF_REG; i++) - if (regs[i].id == id) + if (regs[i].ref_obj_id == ref_obj_id) mark_reg_unknown(env, regs, i); + bpf_for_each_spilled_reg(i, state, reg) { if (!reg) continue; - if (reg_is_refcounted(reg) && reg->id == id) - __mark_reg_unknown(reg); + if (reg->ref_obj_id == ref_obj_id) + __mark_reg_unknown(env, reg); } } @@ -2778,17 +3820,22 @@ static void release_reg_references(struct bpf_verifier_env *env, * resources. Identify all copies of the same pointer and clear the reference. */ static int release_reference(struct bpf_verifier_env *env, - struct bpf_call_arg_meta *meta) + int ref_obj_id) { struct bpf_verifier_state *vstate = env->cur_state; + int err; int i; + + err = release_reference_state(cur_func(env), ref_obj_id); + if (err) + return err; + for (i = 0; i <= vstate->curframe; i++) - release_reg_references(env, vstate->frame[i], meta->ptr_id); + release_reg_references(env, vstate->frame[i], ref_obj_id); - return release_reference_state(cur_func(env), meta->ptr_id); + return 0; } - static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, int *insn_idx) { @@ -2796,15 +3843,14 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, struct bpf_func_state *caller, *callee; int i, err, subprog, target_insn; - if (state->curframe >= MAX_CALL_FRAMES) { + if (state->curframe + 1 >= MAX_CALL_FRAMES) { verbose(env, "the call stack of %d frames is too deep\n", - state->curframe); + state->curframe + 2); return -E2BIG; } target_insn = *insn_idx + insn->imm; subprog = find_subprog(env, target_insn + 1); - if (subprog < 0) { verbose(env, "verifier bug. No program starts at insn %d\n", target_insn + 1); @@ -2821,8 +3867,8 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, callee = kzalloc(sizeof(*callee), GFP_KERNEL); if (!callee) return -ENOMEM; - state->frame[state->curframe + 1] = callee; + /* callee cannot access r0, r6 - r9 for reading and has to write * into its own stack before reading from it. * callee can read/write into caller's stack @@ -2838,27 +3884,30 @@ static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn, if (err) return err; - /* copy r1 - r5 args that callee can access */ - + /* copy r1 - r5 args that callee can access. The copy includes parent + * pointers, which connects us up to the liveness chain + */ for (i = BPF_REG_1; i <= BPF_REG_5; i++) callee->regs[i] = caller->regs[i]; - /* after the call regsiters r0 - r5 were scratched */ + /* after the call registers r0 - r5 were scratched */ for (i = 0; i < CALLER_SAVED_REGS; i++) { mark_reg_not_init(env, caller->regs, caller_saved[i]); check_reg_arg(env, caller_saved[i], DST_OP_NO_MARK); } + /* only increment it after check_reg_arg() finished */ state->curframe++; + /* and go analyze first insn of the callee */ *insn_idx = target_insn; - if (env->log.level) { + + if (env->log.level & BPF_LOG_LEVEL) { verbose(env, "caller:\n"); print_verifier_state(env, caller); verbose(env, "callee:\n"); print_verifier_state(env, callee); } - return 0; } @@ -2868,9 +3917,9 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx) struct bpf_func_state *caller, *callee; struct bpf_reg_state *r0; int err; + callee = state->frame[state->curframe]; r0 = &callee->regs[BPF_REG_0]; - if (r0->type == PTR_TO_STACK) { /* technically it's ok to return caller's stack pointer * (or caller's caller's pointer) back to the caller, @@ -2893,16 +3942,54 @@ static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx) return err; *insn_idx = callee->callsite + 1; - if (env->log.level) { + if (env->log.level & BPF_LOG_LEVEL) { verbose(env, "returning from callee:\n"); print_verifier_state(env, callee); verbose(env, "to caller at %d:\n", *insn_idx); print_verifier_state(env, caller); } - /* clear everything in the callee */ free_func_state(callee); state->frame[state->curframe + 1] = NULL; + return 0; +} + +static int do_refine_retval_range(struct bpf_verifier_env *env, + struct bpf_reg_state *regs, int ret_type, + int func_id, struct bpf_call_arg_meta *meta) +{ + struct bpf_reg_state *ret_reg = ®s[BPF_REG_0]; + struct bpf_reg_state tmp_reg = *ret_reg; + bool ret; + + if (ret_type != RET_INTEGER || + (func_id != BPF_FUNC_get_stack && + func_id != BPF_FUNC_probe_read_str)) + return 0; + + /* Error case where ret is in interval [S32MIN, -1]. */ + ret_reg->smin_value = S32_MIN; + ret_reg->smax_value = -1; + + __reg_deduce_bounds(ret_reg); + __reg_bound_offset(ret_reg); + __update_reg_bounds(ret_reg); + + ret = push_stack(env, env->insn_idx + 1, env->insn_idx, false); + if (!ret) + return -EFAULT; + + *ret_reg = tmp_reg; + + /* Success case where ret is in range [0, msize_max_value]. */ + ret_reg->smin_value = 0; + ret_reg->smax_value = meta->msize_max_value; + ret_reg->umin_value = ret_reg->smin_value; + ret_reg->umax_value = ret_reg->smax_value; + + __reg_deduce_bounds(ret_reg); + __reg_bound_offset(ret_reg); + __update_reg_bounds(ret_reg); return 0; } @@ -2912,13 +3999,35 @@ record_func_map(struct bpf_verifier_env *env, struct bpf_call_arg_meta *meta, int func_id, int insn_idx) { struct bpf_insn_aux_data *aux = &env->insn_aux_data[insn_idx]; + struct bpf_map *map = meta->map_ptr; + if (func_id != BPF_FUNC_tail_call && - func_id != BPF_FUNC_map_lookup_elem) + func_id != BPF_FUNC_map_lookup_elem && + func_id != BPF_FUNC_map_update_elem && + func_id != BPF_FUNC_map_delete_elem && + func_id != BPF_FUNC_map_push_elem && + func_id != BPF_FUNC_map_pop_elem && + func_id != BPF_FUNC_map_peek_elem) return 0; - if (meta->map_ptr == NULL) { + + if (map == NULL) { verbose(env, "kernel subsystem misconfigured verifier\n"); return -EINVAL; } + + /* In case of read-only, some additional restrictions + * need to be applied in order to prevent altering the + * state of the map from program side. + */ + if ((map->map_flags & BPF_F_RDONLY_PROG) && + (func_id == BPF_FUNC_map_delete_elem || + func_id == BPF_FUNC_map_update_elem || + func_id == BPF_FUNC_map_push_elem || + func_id == BPF_FUNC_map_pop_elem)) { + verbose(env, "write into map forbidden\n"); + return -EACCES; + } + if (!BPF_MAP_PTR(aux->map_state)) bpf_map_ptr_store(aux, meta->map_ptr, meta->map_ptr->unpriv_array); @@ -2932,6 +4041,7 @@ static int check_reference_leak(struct bpf_verifier_env *env) { struct bpf_func_state *state = cur_func(env); int i; + for (i = 0; i < state->acquired_refs; i++) { verbose(env, "Unreleased reference id=%d alloc_insn=%d\n", state->refs[i].id, state->refs[i].insn_idx); @@ -2939,23 +4049,6 @@ static int check_reference_leak(struct bpf_verifier_env *env) return state->acquired_refs ? -EINVAL : 0; } -static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type, - int func_id, - struct bpf_call_arg_meta *meta) -{ - struct bpf_reg_state *ret_reg = ®s[BPF_REG_0]; - if (ret_type != RET_INTEGER || - (func_id != BPF_FUNC_get_stack && - func_id != BPF_FUNC_probe_read_str)) - return; - - ret_reg->smax_value = meta->msize_smax_value; - ret_reg->umax_value = meta->msize_umax_value; - __reg_deduce_bounds(ret_reg); - __reg_bound_offset(ret_reg); -} - - static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx) { const struct bpf_func_proto *fn = NULL; @@ -2966,30 +4059,37 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn /* find function prototype */ if (func_id < 0 || func_id >= __BPF_FUNC_MAX_ID) { - verbose(env, "invalid func %s#%d\n", func_id_name(func_id), func_id); + verbose(env, "invalid func %s#%d\n", func_id_name(func_id), + func_id); return -EINVAL; } if (env->ops->get_func_proto) fn = env->ops->get_func_proto(func_id, env->prog); - if (!fn) { - verbose(env, "unknown func %s#%d\n", func_id_name(func_id), func_id); + verbose(env, "unknown func %s#%d\n", func_id_name(func_id), + func_id); return -EINVAL; } /* eBPF programs must be GPL compatible to use GPL-ed functions */ if (!env->prog->gpl_compatible && fn->gpl_only) { - verbose(env, "cannot call GPL only function from proprietary program\n"); + verbose(env, "cannot call GPL-restricted function from non-GPL compatible program\n"); return -EINVAL; } + /* With LD_ABS/IND some JITs save/restore skb from r1. */ changes_data = bpf_helper_changes_pkt_data(fn->func); + if (changes_data && fn->arg1_type != ARG_PTR_TO_CTX) { + verbose(env, "kernel subsystem misconfigured func %s#%d: r1 != ctx\n", + func_id_name(func_id), func_id); + return -EINVAL; + } memset(&meta, 0, sizeof(meta)); meta.pkt_access = fn->pkt_access; - err = check_func_proto(fn); + err = check_func_proto(fn, func_id); if (err) { verbose(env, "kernel subsystem misconfigured func %s#%d\n", func_id_name(func_id), func_id); @@ -3004,7 +4104,6 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn err = check_func_arg(env, BPF_REG_2, fn->arg2_type, &meta); if (err) return err; - err = check_func_arg(env, BPF_REG_3, fn->arg3_type, &meta); if (err) return err; @@ -3036,7 +4135,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn return err; } } else if (is_release_function(func_id)) { - err = release_reference(env, &meta); + err = release_reference(env, meta.ref_obj_id); if (err) { verbose(env, "func %s#%d reference has not been acquired before\n", func_id_name(func_id), func_id); @@ -3051,6 +4150,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn */ if (func_id == BPF_FUNC_get_local_storage && !register_is_null(®s[BPF_REG_2])) { + verbose(env, "get_local_storage() doesn't support non-zero flags\n"); return -EINVAL; } @@ -3060,6 +4160,9 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn check_reg_arg(env, caller_saved[i], DST_OP_NO_MARK); } + /* helper call returns 64-bit value. */ + regs[BPF_REG_0].subreg_def = DEF_NOT_SUBREG; + /* update return register (already marked as written above) */ if (fn->ret_type == RET_INTEGER) { /* sets type to SCALAR_VALUE */ @@ -3068,37 +4171,30 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn regs[BPF_REG_0].type = NOT_INIT; } else if (fn->ret_type == RET_PTR_TO_MAP_VALUE_OR_NULL || fn->ret_type == RET_PTR_TO_MAP_VALUE) { - if (fn->ret_type == RET_PTR_TO_MAP_VALUE) - regs[BPF_REG_0].type = PTR_TO_MAP_VALUE; - else - regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL; - /* There is no offset yet applied, variable or fixed */ mark_reg_known_zero(env, regs, BPF_REG_0); - regs[BPF_REG_0].off = 0; /* remember map_ptr, so that check_map_access() * can check 'value_size' boundary of memory access * to map element returned from bpf_map_lookup_elem() */ if (meta.map_ptr == NULL) { - verbose(env, "kernel subsystem misconfigured verifier\n"); + verbose(env, + "kernel subsystem misconfigured verifier\n"); return -EINVAL; } regs[BPF_REG_0].map_ptr = meta.map_ptr; - regs[BPF_REG_0].id = ++env->id_gen; + if (fn->ret_type == RET_PTR_TO_MAP_VALUE) { + regs[BPF_REG_0].type = PTR_TO_MAP_VALUE; + if (map_value_has_spin_lock(meta.map_ptr)) + regs[BPF_REG_0].id = ++env->id_gen; + } else { + regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL; + regs[BPF_REG_0].id = ++env->id_gen; + } } else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) { mark_reg_known_zero(env, regs, BPF_REG_0); regs[BPF_REG_0].type = PTR_TO_SOCKET_OR_NULL; - if (is_acquire_function(func_id)) { - int id = acquire_reference_state(env, insn_idx); - if (id < 0) - return id; - /* For release_reference() */ - regs[BPF_REG_0].id = id; - } else { - /* For mark_ptr_or_null_reg() */ - regs[BPF_REG_0].id = ++env->id_gen; - } + regs[BPF_REG_0].id = ++env->id_gen; } else if (fn->ret_type == RET_PTR_TO_SOCK_COMMON_OR_NULL) { mark_reg_known_zero(env, regs, BPF_REG_0); regs[BPF_REG_0].type = PTR_TO_SOCK_COMMON_OR_NULL; @@ -3113,12 +4209,46 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn return -EINVAL; } - do_refine_retval_range(regs, fn->ret_type, func_id, &meta); + if (is_ptr_cast_function(func_id)) { + /* For release_reference() */ + regs[BPF_REG_0].ref_obj_id = meta.ref_obj_id; + } else if (is_acquire_function(func_id)) { + int id = acquire_reference_state(env, insn_idx); + + if (id < 0) + return id; + /* For mark_ptr_or_null_reg() */ + regs[BPF_REG_0].id = id; + /* For release_reference() */ + regs[BPF_REG_0].ref_obj_id = id; + } + + err = do_refine_retval_range(env, regs, fn->ret_type, func_id, &meta); + if (err) + return err; err = check_map_func_compatibility(env, meta.map_ptr, func_id); if (err) return err; + if (func_id == BPF_FUNC_get_stack && !env->prog->has_callchain_buf) { + const char *err_str; + +#ifdef CONFIG_PERF_EVENTS + err = get_callchain_buffers(sysctl_perf_event_max_stack); + err_str = "cannot get callchain buffer for func %s#%d\n"; +#else + err = -ENOTSUPP; + err_str = "func %s#%d not supported without CONFIG_PERF_EVENTS\n"; +#endif + if (err) { + verbose(env, err_str, func_id_name(func_id), func_id); + return err; + } + + env->prog->has_callchain_buf = true; + } + if (changes_data) clear_all_pkt_pointers(env); return 0; @@ -3267,6 +4397,27 @@ struct bpf_sanitize_info { bool mask_to_left; }; +static struct bpf_verifier_state * +sanitize_speculative_path(struct bpf_verifier_env *env, + const struct bpf_insn *insn, + u32 next_idx, u32 curr_idx) +{ + struct bpf_verifier_state *branch; + struct bpf_reg_state *regs; + + branch = push_stack(env, next_idx, curr_idx, true); + if (branch && insn) { + regs = branch->frame[branch->curframe]->regs; + if (BPF_SRC(insn->code) == BPF_K) { + mark_reg_unknown(env, regs, insn->dst_reg); + } else if (BPF_SRC(insn->code) == BPF_X) { + mark_reg_unknown(env, regs, insn->dst_reg); + mark_reg_unknown(env, regs, insn->src_reg); + } + } + return branch; +} + static int sanitize_ptr_alu(struct bpf_verifier_env *env, struct bpf_insn *insn, const struct bpf_reg_state *ptr_reg, @@ -3320,6 +4471,12 @@ static int sanitize_ptr_alu(struct bpf_verifier_env *env, alu_state |= off_is_imm ? BPF_ALU_IMMEDIATE : 0; alu_state |= ptr_is_dst_reg ? BPF_ALU_SANITIZE_SRC : BPF_ALU_SANITIZE_DST; + + /* Limit pruning on unknown scalars to enable deep search for + * potential masking differences from other program paths. + */ + if (!off_is_imm) + env->explore_alu_limits = true; } err = update_alu_sanitation_state(aux, alu_state, alu_limit); @@ -3350,12 +4507,26 @@ do_sim: tmp = *dst_reg; *dst_reg = *ptr_reg; } - ret = push_stack(env, env->insn_idx + 1, env->insn_idx, true); + ret = sanitize_speculative_path(env, NULL, env->insn_idx + 1, + env->insn_idx); if (!ptr_is_dst_reg && ret) *dst_reg = tmp; return !ret ? REASON_STACK : 0; } +static void sanitize_mark_insn_seen(struct bpf_verifier_env *env) +{ + struct bpf_verifier_state *vstate = env->cur_state; + + /* If we simulate paths under speculation, we don't update the + * insn as 'seen' such that when we verify unreachable paths in + * the non-speculative domain, sanitize_dead_code() can still + * rewrite/sanitize them. + */ + if (!vstate->speculative) + env->insn_aux_data[env->insn_idx].seen = true; +} + static int sanitize_err(struct bpf_verifier_env *env, const struct bpf_insn *insn, int reason, const struct bpf_reg_state *off_reg, @@ -3417,7 +4588,7 @@ static int sanitize_check_bounds(struct bpf_verifier_env *env, } break; case PTR_TO_MAP_VALUE: - if (check_map_access(env, dst, dst_reg->off, 1)) { + if (check_map_access(env, dst, dst_reg->off, 1, false)) { verbose(env, "R%d pointer arithmetic of map value goes out of range, " "prohibited for !root\n", dst); return -EACCES; @@ -3450,7 +4621,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, umin_ptr = ptr_reg->umin_value, umax_ptr = ptr_reg->umax_value; struct bpf_sanitize_info info = {}; u8 opcode = BPF_OP(insn->code); - u32 dst = insn->dst_reg, src = insn->src_reg; + u32 dst = insn->dst_reg; int ret; dst_reg = ®s[dst]; @@ -3460,13 +4631,14 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, /* Taint dst register if offset had invalid bounds derived from * e.g. dead branches. */ - __mark_reg_unknown(dst_reg); + __mark_reg_unknown(env, dst_reg); return 0; } if (BPF_CLASS(insn->code) != BPF_ALU64) { /* 32-bit ALU ops on pointers produce (meaningless) scalars */ - verbose(env, "R%d 32-bit pointer arithmetic prohibited\n", + verbose(env, + "R%d 32-bit pointer arithmetic prohibited\n", dst); return -EACCES; } @@ -3477,6 +4649,10 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, dst, reg_type_str[ptr_reg->type]); return -EACCES; case CONST_PTR_TO_MAP: + /* smin_val represents the known value */ + if (known && smin_val == 0 && opcode == BPF_ADD) + break; + /* fall-through */ case PTR_TO_PACKET_END: case PTR_TO_SOCKET: case PTR_TO_SOCKET_OR_NULL: @@ -3484,16 +4660,10 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, case PTR_TO_SOCK_COMMON_OR_NULL: case PTR_TO_TCP_SOCK: case PTR_TO_TCP_SOCK_OR_NULL: + case PTR_TO_XDP_SOCK: verbose(env, "R%d pointer arithmetic on %s prohibited\n", dst, reg_type_str[ptr_reg->type]); return -EACCES; - case PTR_TO_MAP_VALUE: - if (!env->allow_ptr_leaks && !known && (smin_val < 0) != (smax_val < 0)) { - verbose(env, "R%d has unknown scalar with mixed signed bounds, pointer arithmetic with it prohibited for !root\n", - off_reg == dst_reg ? dst : src); - return -EACCES; - } - /* fall-through */ default: break; } @@ -3629,7 +4799,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, case BPF_AND: case BPF_OR: case BPF_XOR: - /* bitwise ops on pointers are troublesome. */ + /* bitwise ops on pointers are troublesome, prohibit. */ verbose(env, "R%d bitwise operator %s on pointer prohibited\n", dst, bpf_alu_string[opcode >> 4]); return -EACCES; @@ -3697,13 +4867,13 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, /* Taint dst register if offset had invalid bounds derived from * e.g. dead branches. */ - __mark_reg_unknown(dst_reg); + __mark_reg_unknown(env, dst_reg); return 0; } if (!src_known && opcode != BPF_ADD && opcode != BPF_SUB && opcode != BPF_AND) { - __mark_reg_unknown(dst_reg); + __mark_reg_unknown(env, dst_reg); return 0; } @@ -3861,10 +5031,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, dst_reg->umin_value <<= umin_val; dst_reg->umax_value <<= umax_val; } - if (src_known) - dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val); - else - dst_reg->var_off = tnum_lshift(tnum_unknown, umin_val); + dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val); /* We may learn something more from the var_off */ __update_reg_bounds(dst_reg); break; @@ -3892,16 +5059,42 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, */ dst_reg->smin_value = S64_MIN; dst_reg->smax_value = S64_MAX; - if (src_known) - dst_reg->var_off = tnum_rshift(dst_reg->var_off, - umin_val); - else - dst_reg->var_off = tnum_rshift(tnum_unknown, umin_val); + dst_reg->var_off = tnum_rshift(dst_reg->var_off, umin_val); dst_reg->umin_value >>= umax_val; dst_reg->umax_value >>= umin_val; /* We may learn something more from the var_off */ __update_reg_bounds(dst_reg); break; + case BPF_ARSH: + if (umax_val >= insn_bitness) { + /* Shifts greater than 31 or 63 are undefined. + * This includes shifts by a negative number. + */ + mark_reg_unknown(env, regs, insn->dst_reg); + break; + } + + /* Upon reaching here, src_known is true and + * umax_val is equal to umin_val. + */ + if (insn_bitness == 32) { + dst_reg->smin_value = (u32)(((s32)dst_reg->smin_value) >> umin_val); + dst_reg->smax_value = (u32)(((s32)dst_reg->smax_value) >> umin_val); + } else { + dst_reg->smin_value >>= umin_val; + dst_reg->smax_value >>= umin_val; + } + + dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val, + insn_bitness); + + /* blow away the dst_reg umin_value/umax_value and rely on + * dst_reg var_off to refine the result. + */ + dst_reg->umin_value = 0; + dst_reg->umax_value = U64_MAX; + __update_reg_bounds(dst_reg); + break; default: mark_reg_unknown(env, regs, insn->dst_reg); break; @@ -3929,6 +5122,7 @@ static int adjust_reg_min_max_vals(struct bpf_verifier_env *env, struct bpf_reg_state *regs = state->regs, *dst_reg, *src_reg; struct bpf_reg_state *ptr_reg = NULL, off_reg = {0}; u8 opcode = BPF_OP(insn->code); + int err; dst_reg = ®s[insn->dst_reg]; src_reg = NULL; @@ -3955,13 +5149,24 @@ static int adjust_reg_min_max_vals(struct bpf_verifier_env *env, * This is legal, but we have to reverse our * src/dest handling in computing the range */ + err = mark_chain_precision(env, insn->dst_reg); + if (err) + return err; return adjust_ptr_min_max_vals(env, insn, src_reg, dst_reg); } } else if (ptr_reg) { /* pointer += scalar */ + err = mark_chain_precision(env, insn->src_reg); + if (err) + return err; return adjust_ptr_min_max_vals(env, insn, dst_reg, src_reg); + } else if (dst_reg->precise) { + /* if dst_reg is precise, src_reg should be precise as well */ + err = mark_chain_precision(env, insn->src_reg); + if (err) + return err; } } else { /* Pretend the src is a reg with a known value, since we only @@ -4048,32 +5253,45 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) } } - /* check dest operand */ - err = check_reg_arg(env, insn->dst_reg, DST_OP); + /* check dest operand, mark as required later */ + err = check_reg_arg(env, insn->dst_reg, DST_OP_NO_MARK); if (err) return err; if (BPF_SRC(insn->code) == BPF_X) { + struct bpf_reg_state *src_reg = regs + insn->src_reg; + struct bpf_reg_state *dst_reg = regs + insn->dst_reg; + if (BPF_CLASS(insn->code) == BPF_ALU64) { /* case: R1 = R2 * copy register state to dest reg */ - regs[insn->dst_reg] = regs[insn->src_reg]; - regs[insn->dst_reg].live |= REG_LIVE_WRITTEN; + *dst_reg = *src_reg; + dst_reg->live |= REG_LIVE_WRITTEN; + dst_reg->subreg_def = DEF_NOT_SUBREG; } else { /* R1 = (u32) R2 */ if (is_pointer_value(env, insn->src_reg)) { - verbose(env, "R%d partial copy of pointer\n", + verbose(env, + "R%d partial copy of pointer\n", insn->src_reg); return -EACCES; + } else if (src_reg->type == SCALAR_VALUE) { + *dst_reg = *src_reg; + dst_reg->live |= REG_LIVE_WRITTEN; + dst_reg->subreg_def = env->insn_idx + 1; + } else { + mark_reg_unknown(env, regs, + insn->dst_reg); } - mark_reg_unknown(env, regs, insn->dst_reg); - coerce_reg_to_size(®s[insn->dst_reg], 4); + coerce_reg_to_size(dst_reg, 4); } } else { /* case: R = imm * remember the value we stored into this reg */ + /* clear any state __mark_reg_known doesn't set */ + mark_reg_unknown(env, regs, insn->dst_reg); regs[insn->dst_reg].type = SCALAR_VALUE; if (BPF_CLASS(insn->code) == BPF_ALU64) { __mark_reg_known(regs + insn->dst_reg, @@ -4117,11 +5335,6 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) return -EINVAL; } - if (opcode == BPF_ARSH && BPF_CLASS(insn->code) != BPF_ALU64) { - verbose(env, "BPF_ARSH not supported for 32 bit ALU\n"); - return -EINVAL; - } - if ((opcode == BPF_LSH || opcode == BPF_RSH || opcode == BPF_ARSH) && BPF_SRC(insn->code) == BPF_K) { int size = BPF_CLASS(insn->code) == BPF_ALU64 ? 64 : 32; @@ -4143,15 +5356,35 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) return 0; } +static void __find_good_pkt_pointers(struct bpf_func_state *state, + struct bpf_reg_state *dst_reg, + enum bpf_reg_type type, u16 new_range) +{ + struct bpf_reg_state *reg; + int i; + + for (i = 0; i < MAX_BPF_REG; i++) { + reg = &state->regs[i]; + if (reg->type == type && reg->id == dst_reg->id) + /* keep the maximum range already checked */ + reg->range = max(reg->range, new_range); + } + + bpf_for_each_spilled_reg(i, state, reg) { + if (!reg) + continue; + if (reg->type == type && reg->id == dst_reg->id) + reg->range = max(reg->range, new_range); + } +} + static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, struct bpf_reg_state *dst_reg, enum bpf_reg_type type, bool range_right_open) { - struct bpf_func_state *state = vstate->frame[vstate->curframe]; - struct bpf_reg_state *regs = state->regs, *reg; u16 new_range; - int i, j; + int i; if (dst_reg->off < 0 || (dst_reg->off == 0 && range_right_open)) @@ -4216,20 +5449,214 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, * the range won't allow anything. * dst_reg->off is known < MAX_PACKET_OFF, therefore it fits in a u16. */ - for (i = 0; i < MAX_BPF_REG; i++) - if (regs[i].type == type && regs[i].id == dst_reg->id) - /* keep the maximum range already checked */ - regs[i].range = max(regs[i].range, new_range); + for (i = 0; i <= vstate->curframe; i++) + __find_good_pkt_pointers(vstate->frame[i], dst_reg, type, + new_range); +} - for (j = 0; j <= vstate->curframe; j++) { - state = vstate->frame[j]; - bpf_for_each_spilled_reg(i, state, reg) { - if (!reg) - continue; - if (reg->type == type && reg->id == dst_reg->id) - reg->range = max(reg->range, new_range); +/* compute branch direction of the expression "if (reg opcode val) goto target;" + * and return: + * 1 - branch will be taken and "goto target" will be executed + * 0 - branch will not be taken and fall-through to next insn + * -1 - unknown. Example: "if (reg < 5)" is unknown when register value range [0,10] + */ +static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode, + bool is_jmp32) +{ + struct bpf_reg_state reg_lo; + s64 sval; + + if (__is_pointer_value(false, reg)) + return -1; + + if (is_jmp32) { + reg_lo = *reg; + reg = ®_lo; + /* For JMP32, only low 32 bits are compared, coerce_reg_to_size + * could truncate high bits and update umin/umax according to + * information of low bits. + */ + coerce_reg_to_size(reg, 4); + /* smin/smax need special handling. For example, after coerce, + * if smin_value is 0x00000000ffffffffLL, the value is -1 when + * used as operand to JMP32. It is a negative number from s32's + * point of view, while it is a positive number when seen as + * s64. The smin/smax are kept as s64, therefore, when used with + * JMP32, they need to be transformed into s32, then sign + * extended back to s64. + * + * Also, smin/smax were copied from umin/umax. If umin/umax has + * different sign bit, then min/max relationship doesn't + * maintain after casting into s32, for this case, set smin/smax + * to safest range. + */ + if ((reg->umax_value ^ reg->umin_value) & + (1ULL << 31)) { + reg->smin_value = S32_MIN; + reg->smax_value = S32_MAX; } + reg->smin_value = (s64)(s32)reg->smin_value; + reg->smax_value = (s64)(s32)reg->smax_value; + + val = (u32)val; + sval = (s64)(s32)val; + } else { + sval = (s64)val; } + + switch (opcode) { + case BPF_JEQ: + if (tnum_is_const(reg->var_off)) + return !!tnum_equals_const(reg->var_off, val); + break; + case BPF_JNE: + if (tnum_is_const(reg->var_off)) + return !tnum_equals_const(reg->var_off, val); + break; + case BPF_JSET: + if ((~reg->var_off.mask & reg->var_off.value) & val) + return 1; + if (!((reg->var_off.mask | reg->var_off.value) & val)) + return 0; + break; + case BPF_JGT: + if (reg->umin_value > val) + return 1; + else if (reg->umax_value <= val) + return 0; + break; + case BPF_JSGT: + if (reg->smin_value > sval) + return 1; + else if (reg->smax_value < sval) + return 0; + break; + case BPF_JLT: + if (reg->umax_value < val) + return 1; + else if (reg->umin_value >= val) + return 0; + break; + case BPF_JSLT: + if (reg->smax_value < sval) + return 1; + else if (reg->smin_value >= sval) + return 0; + break; + case BPF_JGE: + if (reg->umin_value >= val) + return 1; + else if (reg->umax_value < val) + return 0; + break; + case BPF_JSGE: + if (reg->smin_value >= sval) + return 1; + else if (reg->smax_value < sval) + return 0; + break; + case BPF_JLE: + if (reg->umax_value <= val) + return 1; + else if (reg->umin_value > val) + return 0; + break; + case BPF_JSLE: + if (reg->smax_value <= sval) + return 1; + else if (reg->smin_value > sval) + return 0; + break; + } + + return -1; +} + +/* Generate min value of the high 32-bit from TNUM info. */ +static u64 gen_hi_min(struct tnum var) +{ + return var.value & ~0xffffffffULL; +} + +/* Generate max value of the high 32-bit from TNUM info. */ +static u64 gen_hi_max(struct tnum var) +{ + return (var.value | var.mask) & ~0xffffffffULL; +} + +/* Return true if VAL is compared with a s64 sign extended from s32, and they + * are with the same signedness. + */ +static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg) +{ + return ((s32)sval >= 0 && + reg->smin_value >= 0 && reg->smax_value <= S32_MAX) || + ((s32)sval < 0 && + reg->smax_value <= 0 && reg->smin_value >= S32_MIN); +} + +/* Constrain the possible values of @reg with unsigned upper bound @bound. + * If @is_exclusive, @bound is an exclusive limit, otherwise it is inclusive. + * If @is_jmp32, @bound is a 32-bit value that only constrains the low 32 bits + * of @reg. + */ +static void set_upper_bound(struct bpf_reg_state *reg, u64 bound, bool is_jmp32, + bool is_exclusive) +{ + if (is_exclusive) { + /* There are no values for `reg` that make `reg<0` true. */ + if (bound == 0) + return; + bound--; + } + if (is_jmp32) { + /* Constrain the register's value in the tnum representation. + * For 64-bit comparisons this happens later in + * __reg_bound_offset(), but for 32-bit comparisons, we can be + * more precise than what can be derived from the updated + * numeric bounds. + */ + struct tnum t = tnum_range(0, bound); + + t.mask |= ~0xffffffffULL; /* upper half is unknown */ + reg->var_off = tnum_intersect(reg->var_off, t); + + /* Compute the 64-bit bound from the 32-bit bound. */ + bound += gen_hi_max(reg->var_off); + } + reg->umax_value = min(reg->umax_value, bound); +} + +/* Constrain the possible values of @reg with unsigned lower bound @bound. + * If @is_exclusive, @bound is an exclusive limit, otherwise it is inclusive. + * If @is_jmp32, @bound is a 32-bit value that only constrains the low 32 bits + * of @reg. + */ +static void set_lower_bound(struct bpf_reg_state *reg, u64 bound, bool is_jmp32, + bool is_exclusive) +{ + if (is_exclusive) { + /* There are no values for `reg` that make `reg>MAX` true. */ + if (bound == (is_jmp32 ? U32_MAX : U64_MAX)) + return; + bound++; + } + if (is_jmp32) { + /* Constrain the register's value in the tnum representation. + * For 64-bit comparisons this happens later in + * __reg_bound_offset(), but for 32-bit comparisons, we can be + * more precise than what can be derived from the updated + * numeric bounds. + */ + struct tnum t = tnum_range(bound, U32_MAX); + + t.mask |= ~0xffffffffULL; /* upper half is unknown */ + reg->var_off = tnum_intersect(reg->var_off, t); + + /* Compute the 64-bit bound from the 32-bit bound. */ + bound += gen_hi_min(reg->var_off); + } + reg->umin_value = max(reg->umin_value, bound); } /* Adjusts the register min/max values in the case that the dst_reg is the @@ -4239,8 +5666,10 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, */ static void reg_set_min_max(struct bpf_reg_state *true_reg, struct bpf_reg_state *false_reg, u64 val, - u8 opcode) + u8 opcode, bool is_jmp32) { + s64 sval; + /* If the dst_reg is a pointer, we can't learn anything about its * variable offset from the compare (unless src_reg were a pointer into * the same object, but we don't bother with that. @@ -4250,51 +5679,79 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, if (__is_pointer_value(false, false_reg)) return; + val = is_jmp32 ? (u32)val : val; + sval = is_jmp32 ? (s64)(s32)val : (s64)val; + switch (opcode) { case BPF_JEQ: - /* If this is false then we know nothing Jon Snow, but if it is - * true then we know for sure. - */ - __mark_reg_known(true_reg, val); - break; case BPF_JNE: - /* If this is true we know nothing Jon Snow, but if it is false - * we know the value for sure; + { + struct bpf_reg_state *reg = + opcode == BPF_JEQ ? true_reg : false_reg; + + /* For BPF_JEQ, if this is false we know nothing Jon Snow, but + * if it is true we know the value for sure. Likewise for + * BPF_JNE. */ - __mark_reg_known(false_reg, val); + if (is_jmp32) { + u64 old_v = reg->var_off.value; + u64 hi_mask = ~0xffffffffULL; + + reg->var_off.value = (old_v & hi_mask) | val; + reg->var_off.mask &= hi_mask; + } else { + __mark_reg_known(reg, val); + } break; - case BPF_JGT: - false_reg->umax_value = min(false_reg->umax_value, val); - true_reg->umin_value = max(true_reg->umin_value, val + 1); - break; - case BPF_JSGT: - false_reg->smax_value = min_t(s64, false_reg->smax_value, val); - true_reg->smin_value = max_t(s64, true_reg->smin_value, val + 1); - break; - case BPF_JLT: - false_reg->umin_value = max(false_reg->umin_value, val); - true_reg->umax_value = min(true_reg->umax_value, val - 1); - break; - case BPF_JSLT: - false_reg->smin_value = max_t(s64, false_reg->smin_value, val); - true_reg->smax_value = min_t(s64, true_reg->smax_value, val - 1); + } + case BPF_JSET: + false_reg->var_off = tnum_and(false_reg->var_off, + tnum_const(~val)); + if (is_power_of_2(val)) + true_reg->var_off = tnum_or(true_reg->var_off, + tnum_const(val)); break; case BPF_JGE: - false_reg->umax_value = min(false_reg->umax_value, val - 1); - true_reg->umin_value = max(true_reg->umin_value, val); + case BPF_JGT: + { + set_upper_bound(false_reg, val, is_jmp32, opcode == BPF_JGE); + set_lower_bound(true_reg, val, is_jmp32, opcode == BPF_JGT); break; + } case BPF_JSGE: - false_reg->smax_value = min_t(s64, false_reg->smax_value, val - 1); - true_reg->smin_value = max_t(s64, true_reg->smin_value, val); + case BPF_JSGT: + { + s64 false_smax = opcode == BPF_JSGT ? sval : sval - 1; + s64 true_smin = opcode == BPF_JSGT ? sval + 1 : sval; + + /* If the full s64 was not sign-extended from s32 then don't + * deduct further info. + */ + if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) + break; + false_reg->smax_value = min(false_reg->smax_value, false_smax); + true_reg->smin_value = max(true_reg->smin_value, true_smin); break; + } case BPF_JLE: - false_reg->umin_value = max(false_reg->umin_value, val + 1); - true_reg->umax_value = min(true_reg->umax_value, val); + case BPF_JLT: + { + set_lower_bound(false_reg, val, is_jmp32, opcode == BPF_JLE); + set_upper_bound(true_reg, val, is_jmp32, opcode == BPF_JLT); break; + } case BPF_JSLE: - false_reg->smin_value = max_t(s64, false_reg->smin_value, val + 1); - true_reg->smax_value = min_t(s64, true_reg->smax_value, val); + case BPF_JSLT: + { + s64 false_smin = opcode == BPF_JSLT ? sval : sval + 1; + s64 true_smax = opcode == BPF_JSLT ? sval - 1 : sval; + + if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) + break; + false_reg->smin_value = max(false_reg->smin_value, false_smin); + true_reg->smax_value = min(true_reg->smax_value, true_smax); break; + } default: break; } @@ -4317,56 +5774,79 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, */ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, struct bpf_reg_state *false_reg, u64 val, - u8 opcode) + u8 opcode, bool is_jmp32) { + s64 sval; + if (__is_pointer_value(false, false_reg)) return; + val = is_jmp32 ? (u32)val : val; + sval = is_jmp32 ? (s64)(s32)val : (s64)val; + switch (opcode) { case BPF_JEQ: - /* If this is false then we know nothing Jon Snow, but if it is - * true then we know for sure. - */ - __mark_reg_known(true_reg, val); - break; case BPF_JNE: - /* If this is true we know nothing Jon Snow, but if it is false - * we know the value for sure; - */ - __mark_reg_known(false_reg, val); + { + struct bpf_reg_state *reg = + opcode == BPF_JEQ ? true_reg : false_reg; + + if (is_jmp32) { + u64 old_v = reg->var_off.value; + u64 hi_mask = ~0xffffffffULL; + + reg->var_off.value = (old_v & hi_mask) | val; + reg->var_off.mask &= hi_mask; + } else { + __mark_reg_known(reg, val); + } break; - case BPF_JGT: - true_reg->umax_value = min(true_reg->umax_value, val - 1); - false_reg->umin_value = max(false_reg->umin_value, val); - break; - case BPF_JSGT: - true_reg->smax_value = min_t(s64, true_reg->smax_value, val - 1); - false_reg->smin_value = max_t(s64, false_reg->smin_value, val); - break; - case BPF_JLT: - true_reg->umin_value = max(true_reg->umin_value, val + 1); - false_reg->umax_value = min(false_reg->umax_value, val); - break; - case BPF_JSLT: - true_reg->smin_value = max_t(s64, true_reg->smin_value, val + 1); - false_reg->smax_value = min_t(s64, false_reg->smax_value, val); + } + case BPF_JSET: + false_reg->var_off = tnum_and(false_reg->var_off, + tnum_const(~val)); + if (is_power_of_2(val)) + true_reg->var_off = tnum_or(true_reg->var_off, + tnum_const(val)); break; case BPF_JGE: - true_reg->umax_value = min(true_reg->umax_value, val); - false_reg->umin_value = max(false_reg->umin_value, val + 1); + case BPF_JGT: + { + set_lower_bound(false_reg, val, is_jmp32, opcode == BPF_JGE); + set_upper_bound(true_reg, val, is_jmp32, opcode == BPF_JGT); break; + } case BPF_JSGE: - true_reg->smax_value = min_t(s64, true_reg->smax_value, val); - false_reg->smin_value = max_t(s64, false_reg->smin_value, val + 1); + case BPF_JSGT: + { + s64 false_smin = opcode == BPF_JSGT ? sval : sval + 1; + s64 true_smax = opcode == BPF_JSGT ? sval - 1 : sval; + + if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) + break; + false_reg->smin_value = max(false_reg->smin_value, false_smin); + true_reg->smax_value = min(true_reg->smax_value, true_smax); break; + } case BPF_JLE: - true_reg->umin_value = max(true_reg->umin_value, val); - false_reg->umax_value = min(false_reg->umax_value, val - 1); + case BPF_JLT: + { + set_upper_bound(false_reg, val, is_jmp32, opcode == BPF_JLE); + set_lower_bound(true_reg, val, is_jmp32, opcode == BPF_JLT); break; + } case BPF_JSLE: - true_reg->smin_value = max_t(s64, true_reg->smin_value, val); - false_reg->smax_value = min_t(s64, false_reg->smax_value, val - 1); + case BPF_JSLT: + { + s64 false_smax = opcode == BPF_JSLT ? sval : sval - 1; + s64 true_smin = opcode == BPF_JSLT ? sval + 1 : sval; + + if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) + break; + false_reg->smax_value = min(false_reg->smax_value, false_smax); + true_reg->smin_value = max(true_reg->smin_value, true_smin); break; + } default: break; } @@ -4452,6 +5932,9 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, if (reg->map_ptr->inner_map_meta) { reg->type = CONST_PTR_TO_MAP; reg->map_ptr = reg->map_ptr->inner_map_meta; + } else if (reg->map_ptr->map_type == + BPF_MAP_TYPE_XSKMAP) { + reg->type = PTR_TO_XDP_SOCK; } else { reg->type = PTR_TO_MAP_VALUE; } @@ -4462,18 +5945,41 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state, } else if (reg->type == PTR_TO_TCP_SOCK_OR_NULL) { reg->type = PTR_TO_TCP_SOCK; } - - if (is_null || !(reg_is_refcounted(reg) || - reg_may_point_to_spin_lock(reg))) { - /* We don't need id from this point onwards anymore, - * thus we should better reset it, so that state - * pruning has chances to take effect. + if (is_null) { + /* We don't need id and ref_obj_id from this point + * onwards anymore, thus we should better reset it, + * so that state pruning has chances to take effect. + */ + reg->id = 0; + reg->ref_obj_id = 0; + } else if (!reg_may_point_to_spin_lock(reg)) { + /* For not-NULL ptr, reg->ref_obj_id will be reset + * in release_reg_references(). + * + * reg->id is still used by spin_lock ptr. Other + * than spin_lock ptr type, reg->id can be reset. */ reg->id = 0; } } } +static void __mark_ptr_or_null_regs(struct bpf_func_state *state, u32 id, + bool is_null) +{ + struct bpf_reg_state *reg; + int i; + + for (i = 0; i < MAX_BPF_REG; i++) + mark_ptr_or_null_reg(state, &state->regs[i], id, is_null); + + bpf_for_each_spilled_reg(i, state, reg) { + if (!reg) + continue; + mark_ptr_or_null_reg(state, reg, id, is_null); + } +} + /* The logic is similar to find_good_pkt_pointers(), both could eventually * be folded together at some point. */ @@ -4481,24 +5987,20 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno, bool is_null) { struct bpf_func_state *state = vstate->frame[vstate->curframe]; - struct bpf_reg_state *reg, *regs = state->regs; + struct bpf_reg_state *regs = state->regs; + u32 ref_obj_id = regs[regno].ref_obj_id; u32 id = regs[regno].id; - int i, j; + int i; - if (reg_is_refcounted_or_null(®s[regno]) && is_null) - release_reference_state(state, id); + if (ref_obj_id && ref_obj_id == id && is_null) + /* regs[regno] is in the " == NULL" branch. + * No one could have freed the reference state before + * doing the NULL check. + */ + WARN_ON_ONCE(release_reference_state(state, id)); - for (i = 0; i < MAX_BPF_REG; i++) - mark_ptr_or_null_reg(state, ®s[i], id, is_null); - - for (j = 0; j <= vstate->curframe; j++) { - state = vstate->frame[j]; - bpf_for_each_spilled_reg(i, state, reg) { - if (!reg) - continue; - mark_ptr_or_null_reg(state, reg, id, is_null); - } - } + for (i = 0; i <= vstate->curframe; i++) + __mark_ptr_or_null_regs(vstate->frame[i], id, is_null); } static bool try_match_pkt_pointers(const struct bpf_insn *insn, @@ -4510,6 +6012,10 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn, if (BPF_SRC(insn->code) != BPF_X) return false; + /* Pointers are always 64-bit. */ + if (BPF_CLASS(insn->code) == BPF_JMP32) + return false; + switch (BPF_OP(insn->code)) { case BPF_JGT: if ((dst_reg->type == PTR_TO_PACKET && @@ -4600,18 +6106,21 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, struct bpf_verifier_state *this_branch = env->cur_state; struct bpf_verifier_state *other_branch; struct bpf_reg_state *regs = this_branch->frame[this_branch->curframe]->regs; - struct bpf_reg_state *dst_reg, *other_branch_regs; + struct bpf_reg_state *dst_reg, *other_branch_regs, *src_reg = NULL; u8 opcode = BPF_OP(insn->code); + bool is_jmp32; + int pred = -1; int err; - if (opcode > BPF_JSLE) { - verbose(env, "invalid BPF_JMP opcode %x\n", opcode); + /* Only conditional jumps are expected to reach here. */ + if (opcode == BPF_JA || opcode > BPF_JSLE) { + verbose(env, "invalid BPF_JMP/JMP32 opcode %x\n", opcode); return -EINVAL; } if (BPF_SRC(insn->code) == BPF_X) { if (insn->imm != 0) { - verbose(env, "BPF_JMP uses reserved fields\n"); + verbose(env, "BPF_JMP/JMP32 uses reserved fields\n"); return -EINVAL; } @@ -4625,9 +6134,10 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, insn->src_reg); return -EACCES; } + src_reg = ®s[insn->src_reg]; } else { if (insn->src_reg != BPF_REG_0) { - verbose(env, "BPF_JMP uses reserved fields\n"); + verbose(env, "BPF_JMP/JMP32 uses reserved fields\n"); return -EINVAL; } } @@ -4638,33 +6148,53 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, return err; dst_reg = ®s[insn->dst_reg]; + is_jmp32 = BPF_CLASS(insn->code) == BPF_JMP32; - /* detect if R == 0 where R was initialized to zero earlier */ - if (BPF_SRC(insn->code) == BPF_K && - (opcode == BPF_JEQ || opcode == BPF_JNE) && - dst_reg->type == SCALAR_VALUE && - tnum_equals_const(dst_reg->var_off, insn->imm)) { - if (opcode == BPF_JEQ) { - /* if (imm == imm) goto pc+off; - * only follow the goto, ignore fall-through - */ - *insn_idx += insn->off; - return 0; - } else { - /* if (imm != imm) goto pc+off; - * only follow fall-through branch, since - * that's where the program will go - */ - return 0; - } + if (BPF_SRC(insn->code) == BPF_K) + pred = is_branch_taken(dst_reg, insn->imm, + opcode, is_jmp32); + else if (src_reg->type == SCALAR_VALUE && + tnum_is_const(src_reg->var_off)) + pred = is_branch_taken(dst_reg, src_reg->var_off.value, + opcode, is_jmp32); + if (pred >= 0) { + err = mark_chain_precision(env, insn->dst_reg); + if (BPF_SRC(insn->code) == BPF_X && !err) + err = mark_chain_precision(env, insn->src_reg); + if (err) + return err; + } + + if (pred == 1) { + /* Only follow the goto, ignore fall-through. If needed, push + * the fall-through branch for simulation under speculative + * execution. + */ + if (!env->allow_ptr_leaks && + !sanitize_speculative_path(env, insn, *insn_idx + 1, + *insn_idx)) + return -EFAULT; + *insn_idx += insn->off; + return 0; + } else if (pred == 0) { + /* Only follow the fall-through branch, since that's where the + * program will go. If needed, push the goto branch for + * simulation under speculative execution. + */ + if (!env->allow_ptr_leaks && + !sanitize_speculative_path(env, insn, + *insn_idx + insn->off + 1, + *insn_idx)) + return -EFAULT; + return 0; } other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx, false); if (!other_branch) return -EFAULT; - other_branch_regs = other_branch->frame[other_branch->curframe]->regs; + /* detect if we are comparing against a constant value so we can adjust * our min/max values for our dst register. * this is only legit if both are scalars (or pointers to the same @@ -4673,30 +6203,51 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, * comparable. */ if (BPF_SRC(insn->code) == BPF_X) { + struct bpf_reg_state *src_reg = ®s[insn->src_reg]; + struct bpf_reg_state lo_reg0 = *dst_reg; + struct bpf_reg_state lo_reg1 = *src_reg; + struct bpf_reg_state *src_lo, *dst_lo; + + dst_lo = &lo_reg0; + src_lo = &lo_reg1; + coerce_reg_to_size(dst_lo, 4); + coerce_reg_to_size(src_lo, 4); + if (dst_reg->type == SCALAR_VALUE && - regs[insn->src_reg].type == SCALAR_VALUE) { - if (tnum_is_const(regs[insn->src_reg].var_off)) + src_reg->type == SCALAR_VALUE) { + if (tnum_is_const(src_reg->var_off) || + (is_jmp32 && tnum_is_const(src_lo->var_off))) reg_set_min_max(&other_branch_regs[insn->dst_reg], - dst_reg, regs[insn->src_reg].var_off.value, - opcode); - else if (tnum_is_const(dst_reg->var_off)) + dst_reg, + is_jmp32 + ? src_lo->var_off.value + : src_reg->var_off.value, + opcode, is_jmp32); + else if (tnum_is_const(dst_reg->var_off) || + (is_jmp32 && tnum_is_const(dst_lo->var_off))) reg_set_min_max_inv(&other_branch_regs[insn->src_reg], - ®s[insn->src_reg], - dst_reg->var_off.value, opcode); - else if (opcode == BPF_JEQ || opcode == BPF_JNE) + src_reg, + is_jmp32 + ? dst_lo->var_off.value + : dst_reg->var_off.value, + opcode, is_jmp32); + else if (!is_jmp32 && + (opcode == BPF_JEQ || opcode == BPF_JNE)) /* Comparing for equality, we can combine knowledge */ reg_combine_min_max(&other_branch_regs[insn->src_reg], &other_branch_regs[insn->dst_reg], - ®s[insn->src_reg], - ®s[insn->dst_reg], opcode); + src_reg, dst_reg, opcode); } } else if (dst_reg->type == SCALAR_VALUE) { reg_set_min_max(&other_branch_regs[insn->dst_reg], - dst_reg, insn->imm, opcode); + dst_reg, insn->imm, opcode, is_jmp32); } - /* detect if R == 0 where R is returned from bpf_map_lookup_elem() */ - if (BPF_SRC(insn->code) == BPF_K && + /* detect if R == 0 where R is returned from bpf_map_lookup_elem(). + * NOTE: these optimizations below are related with pointer comparison + * which will never be JMP32. + */ + if (!is_jmp32 && BPF_SRC(insn->code) == BPF_K && insn->imm == 0 && (opcode == BPF_JEQ || opcode == BPF_JNE) && reg_type_may_be_null(dst_reg->type)) { /* Mark all identical registers in each branch as either @@ -4709,10 +6260,11 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, } else if (!try_match_pkt_pointers(insn, dst_reg, ®s[insn->src_reg], this_branch, other_branch) && is_pointer_value(env, insn->dst_reg)) { - verbose(env, "R%d pointer comparison prohibited\n", insn->dst_reg); + verbose(env, "R%d pointer comparison prohibited\n", + insn->dst_reg); return -EACCES; } - if (env->log.level) + if (env->log.level & BPF_LOG_LEVEL) print_verifier_state(env, this_branch->frame[this_branch->curframe]); return 0; } @@ -4748,7 +6300,6 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn) map = env->used_maps[aux->map_index]; mark_reg_known_zero(env, regs, insn->dst_reg); - regs[insn->dst_reg].map_ptr = map; if (insn->src_reg == BPF_PSEUDO_MAP_VALUE) { @@ -4772,7 +6323,6 @@ static bool may_access_skb(enum bpf_prog_type type) case BPF_PROG_TYPE_SOCKET_FILTER: case BPF_PROG_TYPE_SCHED_CLS: case BPF_PROG_TYPE_SCHED_ACT: - case BPF_PROG_TYPE_CGROUP_SKB: return true; default: return false; @@ -4806,6 +6356,11 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn) return -EINVAL; } + if (!env->ops->gen_ld_abs) { + verbose(env, "bpf verifier is misconfigured\n"); + return -EINVAL; + } + if (env->subprog_cnt > 1) { /* when program has LD_ABS insn JITs and interpreter assume * that r1 == ctx == skb which is not the case for callees @@ -4846,7 +6401,8 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn) } if (regs[ctx_reg].type != PTR_TO_CTX) { - verbose(env, "at the time of BPF_LD_ABS|IND R6 != pointer to skb\n"); + verbose(env, + "at the time of BPF_LD_ABS|IND R6 != pointer to skb\n"); return -EINVAL; } @@ -4872,11 +6428,14 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn) * Already marked as written above. */ mark_reg_unknown(env, regs, BPF_REG_0); + /* ld_abs load up to 32-bit skb data. */ + regs[BPF_REG_0].subreg_def = env->insn_idx + 1; return 0; } static int check_return_code(struct bpf_verifier_env *env) { + struct tnum enforce_attach_type_range = tnum_unknown; struct bpf_reg_state *reg; struct tnum range = tnum_range(0, 1); @@ -4885,9 +6444,17 @@ static int check_return_code(struct bpf_verifier_env *env) if (env->prog->expected_attach_type == BPF_CGROUP_UDP4_RECVMSG || env->prog->expected_attach_type == BPF_CGROUP_UDP6_RECVMSG) range = tnum_range(1, 1); + break; case BPF_PROG_TYPE_CGROUP_SKB: + if (env->prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) { + range = tnum_range(0, 3); + enforce_attach_type_range = tnum_range(2, 3); + } + break; case BPF_PROG_TYPE_CGROUP_SOCK: case BPF_PROG_TYPE_SOCK_OPS: + case BPF_PROG_TYPE_CGROUP_DEVICE: + case BPF_PROG_TYPE_CGROUP_SYSCTL: case BPF_PROG_TYPE_CGROUP_SOCKOPT: break; default: @@ -4915,6 +6482,10 @@ static int check_return_code(struct bpf_verifier_env *env) verbose(env, " should have been in %s\n", tn_buf); return -EINVAL; } + + if (!tnum_is_unknown(enforce_attach_type_range) && + tnum_in(enforce_attach_type_range, reg->var_off)) + env->prog->enforce_expected_attach_type = 1; return 0; } @@ -4958,19 +6529,37 @@ enum { BRANCH = 2, }; -#define STATE_LIST_MARK ((struct bpf_verifier_state_list *) -1L) +static u32 state_htab_size(struct bpf_verifier_env *env) +{ + return env->prog->len; +} -static int *insn_stack; /* stack of insns to process */ -static int cur_stack; /* current stack index */ -static int *insn_state; +static struct bpf_verifier_state_list **explored_state( + struct bpf_verifier_env *env, + int idx) +{ + struct bpf_verifier_state *cur = env->cur_state; + struct bpf_func_state *state = cur->frame[cur->curframe]; + + return &env->explored_states[(idx ^ state->callsite) % state_htab_size(env)]; +} + +static void init_explored_state(struct bpf_verifier_env *env, int idx) +{ + env->insn_aux_data[idx].prune_point = true; +} /* t, w, e - match pseudo-code above: * t - index of current instruction * w - next instruction * e - edge */ -static int push_insn(int t, int w, int e, struct bpf_verifier_env *env) +static int push_insn(int t, int w, int e, struct bpf_verifier_env *env, + bool loop_ok) { + int *insn_stack = env->cfg.insn_stack; + int *insn_state = env->cfg.insn_state; + if (e == FALLTHROUGH && insn_state[t] >= (DISCOVERED | FALLTHROUGH)) return 0; @@ -4978,23 +6567,28 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env) return 0; if (w < 0 || w >= env->prog->len) { + verbose_linfo(env, t, "%d: ", t); verbose(env, "jump out of range from insn %d to %d\n", t, w); return -EINVAL; } if (e == BRANCH) /* mark branch target for state pruning */ - env->explored_states[w] = STATE_LIST_MARK; + init_explored_state(env, w); if (insn_state[w] == 0) { /* tree-edge */ insn_state[t] = DISCOVERED | e; insn_state[w] = DISCOVERED; - if (cur_stack >= env->prog->len) + if (env->cfg.cur_stack >= env->prog->len) return -E2BIG; - insn_stack[cur_stack++] = w; + insn_stack[env->cfg.cur_stack++] = w; return 1; } else if ((insn_state[w] & 0xF0) == DISCOVERED) { + if (loop_ok && env->allow_ptr_leaks) + return 0; + verbose_linfo(env, t, "%d: ", t); + verbose_linfo(env, w, "%d: ", w); verbose(env, "back-edge from insn %d to %d\n", t, w); return -EINVAL; } else if (insn_state[w] == EXPLORED) { @@ -5014,48 +6608,47 @@ static int check_cfg(struct bpf_verifier_env *env) { struct bpf_insn *insns = env->prog->insnsi; int insn_cnt = env->prog->len; + int *insn_stack, *insn_state; int ret = 0; int i, t; - ret = check_subprogs(env); - if (ret < 0) - return ret; - - insn_state = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL); + insn_state = env->cfg.insn_state = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL); if (!insn_state) return -ENOMEM; - insn_stack = kcalloc(insn_cnt, sizeof(int), GFP_KERNEL); + insn_stack = env->cfg.insn_stack = kvcalloc(insn_cnt, sizeof(int), GFP_KERNEL); if (!insn_stack) { - kfree(insn_state); + kvfree(insn_state); return -ENOMEM; } insn_state[0] = DISCOVERED; /* mark 1st insn as discovered */ insn_stack[0] = 0; /* 0 is the first instruction */ - cur_stack = 1; + env->cfg.cur_stack = 1; peek_stack: - if (cur_stack == 0) + if (env->cfg.cur_stack == 0) goto check_state; - t = insn_stack[cur_stack - 1]; + t = insn_stack[env->cfg.cur_stack - 1]; - if (BPF_CLASS(insns[t].code) == BPF_JMP) { + if (BPF_CLASS(insns[t].code) == BPF_JMP || + BPF_CLASS(insns[t].code) == BPF_JMP32) { u8 opcode = BPF_OP(insns[t].code); if (opcode == BPF_EXIT) { goto mark_explored; } else if (opcode == BPF_CALL) { - ret = push_insn(t, t + 1, FALLTHROUGH, env); + ret = push_insn(t, t + 1, FALLTHROUGH, env, false); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; if (t + 1 < insn_cnt) - env->explored_states[t + 1] = STATE_LIST_MARK; + init_explored_state(env, t + 1); if (insns[t].src_reg == BPF_PSEUDO_CALL) { - env->explored_states[t] = STATE_LIST_MARK; - ret = push_insn(t, t + insns[t].imm + 1, BRANCH, env); + init_explored_state(env, t); + ret = push_insn(t, t + insns[t].imm + 1, BRANCH, + env, false); if (ret == 1) goto peek_stack; else if (ret < 0) @@ -5068,26 +6661,31 @@ peek_stack: } /* unconditional jump with single edge */ ret = push_insn(t, t + insns[t].off + 1, - FALLTHROUGH, env); + FALLTHROUGH, env, true); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; + /* unconditional jmp is not a good pruning point, + * but it's marked, since backtracking needs + * to record jmp history in is_state_visited(). + */ + init_explored_state(env, t + insns[t].off + 1); /* tell verifier to check for equivalent states * after every call and jump */ if (t + 1 < insn_cnt) - env->explored_states[t + 1] = STATE_LIST_MARK; + init_explored_state(env, t + 1); } else { /* conditional jump with two edges */ - env->explored_states[t] = STATE_LIST_MARK; - ret = push_insn(t, t + 1, FALLTHROUGH, env); + init_explored_state(env, t); + ret = push_insn(t, t + 1, FALLTHROUGH, env, true); if (ret == 1) goto peek_stack; else if (ret < 0) goto err_free; - ret = push_insn(t, t + insns[t].off + 1, BRANCH, env); + ret = push_insn(t, t + insns[t].off + 1, BRANCH, env, true); if (ret == 1) goto peek_stack; else if (ret < 0) @@ -5097,7 +6695,7 @@ peek_stack: /* all other non-branch instructions with single * fall-through edge */ - ret = push_insn(t, t + 1, FALLTHROUGH, env); + ret = push_insn(t, t + 1, FALLTHROUGH, env, false); if (ret == 1) goto peek_stack; else if (ret < 0) @@ -5106,7 +6704,7 @@ peek_stack: mark_explored: insn_state[t] = EXPLORED; - if (cur_stack-- <= 0) { + if (env->cfg.cur_stack-- <= 0) { verbose(env, "pop stack internal bug\n"); ret = -EFAULT; goto err_free; @@ -5124,28 +6722,31 @@ check_state: ret = 0; /* cfg looks good */ err_free: - kfree(insn_state); - kfree(insn_stack); + kvfree(insn_state); + kvfree(insn_stack); + env->cfg.insn_state = env->cfg.insn_stack = NULL; return ret; } /* The minimum supported BTF func info size */ #define MIN_BPF_FUNCINFO_SIZE 8 #define MAX_FUNCINFO_REC_SIZE 252 + static int check_btf_func(struct bpf_verifier_env *env, const union bpf_attr *attr, union bpf_attr __user *uattr) { - u32 i, nfuncs, urec_size, min_size, prev_offset; + u32 i, nfuncs, urec_size, min_size; u32 krec_size = sizeof(struct bpf_func_info); struct bpf_func_info *krecord; const struct btf_type *type; struct bpf_prog *prog; const struct btf *btf; void __user *urecord; + u32 prev_offset = 0; int ret = 0; - nfuncs = attr->func_info_cnt; + nfuncs = attr->func_info_cnt; if (!nfuncs) return 0; @@ -5191,24 +6792,24 @@ static int check_btf_func(struct bpf_verifier_env *env, goto err_free; } - /* check insn_offset */ + /* check insn_off */ if (i == 0) { - if (krecord[i].insn_offset) { + if (krecord[i].insn_off) { verbose(env, - "nonzero insn_offset %u for the first func info record", - krecord[i].insn_offset); + "nonzero insn_off %u for the first func info record", + krecord[i].insn_off); ret = -EINVAL; goto err_free; } - } else if (krecord[i].insn_offset <= prev_offset) { + } else if (krecord[i].insn_off <= prev_offset) { verbose(env, "same or smaller insn offset (%u) than previous func info record (%u)", - krecord[i].insn_offset, prev_offset); + krecord[i].insn_off, prev_offset); ret = -EINVAL; goto err_free; } - if (env->subprog_info[i].start != krecord[i].insn_offset) { + if (env->subprog_info[i].start != krecord[i].insn_off) { verbose(env, "func_info BTF section doesn't match subprog layout in BPF program\n"); ret = -EINVAL; goto err_free; @@ -5223,7 +6824,7 @@ static int check_btf_func(struct bpf_verifier_env *env, goto err_free; } - prev_offset = krecord[i].insn_offset; + prev_offset = krecord[i].insn_off; urecord += urec_size; } @@ -5239,15 +6840,18 @@ err_free: static void adjust_btf_func(struct bpf_verifier_env *env) { int i; + if (!env->prog->aux->func_info) return; + for (i = 0; i < env->subprog_cnt; i++) - env->prog->aux->func_info[i].insn_offset = env->subprog_info[i].start; + env->prog->aux->func_info[i].insn_off = env->subprog_info[i].start; } #define MIN_BPF_LINEINFO_SIZE (offsetof(struct bpf_line_info, line_col) + \ sizeof(((struct bpf_line_info *)(0))->line_col)) #define MAX_LINEINFO_REC_SIZE MAX_FUNCINFO_REC_SIZE + static int check_btf_line(struct bpf_verifier_env *env, const union bpf_attr *attr, union bpf_attr __user *uattr) @@ -5261,9 +6865,10 @@ static int check_btf_line(struct bpf_verifier_env *env, int err; nr_linfo = attr->line_info_cnt; - if (!nr_linfo) return 0; + if (nr_linfo > INT_MAX / sizeof(struct bpf_line_info)) + return -EINVAL; rec_size = attr->line_info_rec_size; if (rec_size < MIN_BPF_LINEINFO_SIZE || @@ -5281,12 +6886,12 @@ static int check_btf_line(struct bpf_verifier_env *env, prog = env->prog; btf = prog->aux->btf; + s = 0; sub = env->subprog_info; ulinfo = u64_to_user_ptr(attr->line_info); expected_size = sizeof(struct bpf_line_info); ncopy = min_t(u32, expected_size, rec_size); - for (i = 0; i < nr_linfo; i++) { err = bpf_check_uarg_tail_zero(ulinfo, expected_size, rec_size); if (err) { @@ -5298,10 +6903,12 @@ static int check_btf_line(struct bpf_verifier_env *env, } goto err_free; } + if (copy_from_user(&linfo[i], ulinfo, ncopy)) { err = -EFAULT; goto err_free; } + /* * Check insn_off to ensure * 1) strictly increasing AND @@ -5321,12 +6928,22 @@ static int check_btf_line(struct bpf_verifier_env *env, err = -EINVAL; goto err_free; } + + if (!prog->insnsi[linfo[i].insn_off].code) { + verbose(env, + "Invalid insn code at line_info[%u].insn_off\n", + i); + err = -EINVAL; + goto err_free; + } + if (!btf_name_by_offset(btf, linfo[i].line_off) || !btf_name_by_offset(btf, linfo[i].file_name_off)) { verbose(env, "Invalid line_info[%u].line_off or .file_name_off\n", i); err = -EINVAL; goto err_free; } + if (s != env->subprog_cnt) { if (linfo[i].insn_off == sub[s].start) { sub[s].linfo_idx = i; @@ -5337,6 +6954,7 @@ static int check_btf_line(struct bpf_verifier_env *env, goto err_free; } } + prev_offset = linfo[i].insn_off; ulinfo += rec_size; } @@ -5350,6 +6968,7 @@ static int check_btf_line(struct bpf_verifier_env *env, prog->aux->linfo = linfo; prog->aux->nr_linfo = nr_linfo; + return 0; err_free: @@ -5363,18 +6982,23 @@ static int check_btf_info(struct bpf_verifier_env *env, { struct btf *btf; int err; + if (!attr->func_info_cnt && !attr->line_info_cnt) return 0; + btf = btf_get_by_fd(attr->prog_btf_fd); if (IS_ERR(btf)) return PTR_ERR(btf); env->prog->aux->btf = btf; + err = check_btf_func(env, attr, uattr); if (err) return err; + err = check_btf_line(env, attr, uattr); if (err) return err; + return 0; } @@ -5388,13 +7012,6 @@ static bool range_within(struct bpf_reg_state *old, old->smax_value >= cur->smax_value; } -/* Maximum number of register states that can exist at once */ -#define ID_MAP_SIZE (MAX_BPF_REG + MAX_BPF_STACK / BPF_REG_SIZE) -struct idpair { - u32 old; - u32 cur; -}; - /* If in the old state two registers had the same id, then they need to have * the same id in the new state as well. But that id could be different from * the old state, so we need to track the mapping from old to new ids. @@ -5405,11 +7022,11 @@ struct idpair { * So we look through our idmap to see if this old id has been seen before. If * so, we require the new id to match; otherwise, we add the id pair to the map. */ -static bool check_ids(u32 old_id, u32 cur_id, struct idpair *idmap) +static bool check_ids(u32 old_id, u32 cur_id, struct bpf_id_pair *idmap) { unsigned int i; - for (i = 0; i < ID_MAP_SIZE; i++) { + for (i = 0; i < BPF_ID_MAP_SIZE; i++) { if (!idmap[i].old) { /* Reached an empty slot; haven't seen this id before */ idmap[i].old = old_id; @@ -5424,9 +7041,105 @@ static bool check_ids(u32 old_id, u32 cur_id, struct idpair *idmap) return false; } +static void clean_func_state(struct bpf_verifier_env *env, + struct bpf_func_state *st) +{ + enum bpf_reg_liveness live; + int i, j; + + for (i = 0; i < BPF_REG_FP; i++) { + live = st->regs[i].live; + /* liveness must not touch this register anymore */ + st->regs[i].live |= REG_LIVE_DONE; + if (!(live & REG_LIVE_READ)) + /* since the register is unused, clear its state + * to make further comparison simpler + */ + __mark_reg_not_init(env, &st->regs[i]); + } + + for (i = 0; i < st->allocated_stack / BPF_REG_SIZE; i++) { + live = st->stack[i].spilled_ptr.live; + /* liveness must not touch this stack slot anymore */ + st->stack[i].spilled_ptr.live |= REG_LIVE_DONE; + if (!(live & REG_LIVE_READ)) { + __mark_reg_not_init(env, &st->stack[i].spilled_ptr); + for (j = 0; j < BPF_REG_SIZE; j++) + st->stack[i].slot_type[j] = STACK_INVALID; + } + } +} + +static void clean_verifier_state(struct bpf_verifier_env *env, + struct bpf_verifier_state *st) +{ + int i; + + if (st->frame[0]->regs[0].live & REG_LIVE_DONE) + /* all regs in this state in all frames were already marked */ + return; + + for (i = 0; i <= st->curframe; i++) + clean_func_state(env, st->frame[i]); +} + +/* the parentage chains form a tree. + * the verifier states are added to state lists at given insn and + * pushed into state stack for future exploration. + * when the verifier reaches bpf_exit insn some of the verifer states + * stored in the state lists have their final liveness state already, + * but a lot of states will get revised from liveness point of view when + * the verifier explores other branches. + * Example: + * 1: r0 = 1 + * 2: if r1 == 100 goto pc+1 + * 3: r0 = 2 + * 4: exit + * when the verifier reaches exit insn the register r0 in the state list of + * insn 2 will be seen as !REG_LIVE_READ. Then the verifier pops the other_branch + * of insn 2 and goes exploring further. At the insn 4 it will walk the + * parentage chain from insn 4 into insn 2 and will mark r0 as REG_LIVE_READ. + * + * Since the verifier pushes the branch states as it sees them while exploring + * the program the condition of walking the branch instruction for the second + * time means that all states below this branch were already explored and + * their final liveness markes are already propagated. + * Hence when the verifier completes the search of state list in is_state_visited() + * we can call this clean_live_states() function to mark all liveness states + * as REG_LIVE_DONE to indicate that 'parent' pointers of 'struct bpf_reg_state' + * will not be used. + * This function also clears the registers and stack for states that !READ + * to simplify state merging. + * + * Important note here that walking the same branch instruction in the callee + * doesn't meant that the states are DONE. The verifier has to compare + * the callsites + */ +static void clean_live_states(struct bpf_verifier_env *env, int insn, + struct bpf_verifier_state *cur) +{ + struct bpf_verifier_state_list *sl; + int i; + + sl = *explored_state(env, insn); + while (sl) { + if (sl->state.branches) + goto next; + if (sl->state.insn_idx != insn || + sl->state.curframe != cur->curframe) + goto next; + for (i = 0; i <= cur->curframe; i++) + if (sl->state.frame[i]->callsite != cur->frame[i]->callsite) + goto next; + clean_verifier_state(env, &sl->state); +next: + sl = sl->next; + } +} + /* Returns true if (rold safe implies rcur safe) */ -static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, - struct idpair *idmap) +static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold, + struct bpf_reg_state *rcur, struct bpf_id_pair *idmap) { bool equal; @@ -5434,12 +7147,14 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, /* explored state didn't use this */ return true; - equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, frameno)) == 0; + equal = memcmp(rold, rcur, offsetof(struct bpf_reg_state, parent)) == 0; + if (rold->type == PTR_TO_STACK) /* two stack pointers are equal only if they're pointing to * the same stack frame, since fp-8 in foo != fp-8 in bar */ return equal && rold->frameno == rcur->frameno; + if (equal) return true; @@ -5450,7 +7165,11 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, return false; switch (rold->type) { case SCALAR_VALUE: + if (env->explore_alu_limits) + return false; if (rcur->type == SCALAR_VALUE) { + if (!rold->precise && !rcur->precise) + return true; /* new val must satisfy old val knowledge */ return range_within(rold, rcur) && tnum_in(rold->var_off, rcur->var_off); @@ -5523,6 +7242,7 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, case PTR_TO_SOCK_COMMON_OR_NULL: case PTR_TO_TCP_SOCK: case PTR_TO_TCP_SOCK_OR_NULL: + case PTR_TO_XDP_SOCK: /* Only valid matches are exact, which memcmp() above * would have accepted */ @@ -5536,18 +7256,11 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur, return false; } -static bool stacksafe(struct bpf_func_state *old, - struct bpf_func_state *cur, - struct idpair *idmap) +static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old, + struct bpf_func_state *cur, struct bpf_id_pair *idmap) { int i, spi; - /* if explored stack has more populated slots than current stack - * such stacks are not equivalent - */ - if (old->allocated_stack > cur->allocated_stack) - return false; - /* walk slots of the explored stack and ignore any additional * slots in the current stack, since explored(safe) state * didn't use them @@ -5555,8 +7268,28 @@ static bool stacksafe(struct bpf_func_state *old, for (i = 0; i < old->allocated_stack; i++) { spi = i / BPF_REG_SIZE; + if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)) { + i += BPF_REG_SIZE - 1; + /* explored state didn't use this */ + continue; + } + if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID) continue; + + /* explored stack has more populated slots than current stack + * and these slots were used + */ + if (i >= cur->allocated_stack) + return false; + + /* if old state was safe with misc data in the stack + * it will be safe with zero-initialized stack. + * The opposite is not true + */ + if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC && + cur->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_ZERO) + continue; if (old->stack[spi].slot_type[i % BPF_REG_SIZE] != cur->stack[spi].slot_type[i % BPF_REG_SIZE]) /* Ex: old explored (safe) state has STACK_SPILL in @@ -5569,9 +7302,8 @@ static bool stacksafe(struct bpf_func_state *old, continue; if (old->stack[spi].slot_type[0] != STACK_SPILL) continue; - if (!regsafe(&old->stack[spi].spilled_ptr, - &cur->stack[spi].spilled_ptr, - idmap)) + if (!regsafe(env, &old->stack[spi].spilled_ptr, + &cur->stack[spi].spilled_ptr, idmap)) /* when explored and current stack slot are both storing * spilled registers, check that stored pointers types * are the same as well. @@ -5621,34 +7353,24 @@ static bool refsafe(struct bpf_func_state *old, struct bpf_func_state *cur) * whereas register type in current state is meaningful, it means that * the current state will reach 'bpf_exit' instruction safely */ -static bool func_states_equal(struct bpf_func_state *old, +static bool func_states_equal(struct bpf_verifier_env *env, struct bpf_func_state *old, struct bpf_func_state *cur) { - struct idpair *idmap; - bool ret = false; int i; - idmap = kcalloc(ID_MAP_SIZE, sizeof(struct idpair), GFP_KERNEL); - /* If we failed to allocate the idmap, just say it's not safe */ - if (!idmap) + memset(env->idmap_scratch, 0, sizeof(env->idmap_scratch)); + for (i = 0; i < MAX_BPF_REG; i++) + if (!regsafe(env, &old->regs[i], &cur->regs[i], + env->idmap_scratch)) + return false; + + if (!stacksafe(env, old, cur, env->idmap_scratch)) return false; - for (i = 0; i < MAX_BPF_REG; i++) { - if (!regsafe(&old->regs[i], &cur->regs[i], idmap)) - goto out_free; - } - - if (!stacksafe(old, cur, idmap)) - goto out_free; - if (!refsafe(old, cur)) - goto out_free; + return false; - ret = true; - -out_free: - kfree(idmap); - return ret; + return true; } static bool states_equal(struct bpf_verifier_env *env, @@ -5660,6 +7382,12 @@ static bool states_equal(struct bpf_verifier_env *env, if (old->curframe != cur->curframe) return false; + /* Verification state from speculative execution simulation + * must never prune a non-speculative execution one. + */ + if (old->speculative && !cur->speculative) + return false; + if (old->active_spin_lock != cur->active_spin_lock) return false; @@ -5669,81 +7397,213 @@ static bool states_equal(struct bpf_verifier_env *env, for (i = 0; i <= old->curframe; i++) { if (old->frame[i]->callsite != cur->frame[i]->callsite) return false; - if (!func_states_equal(old->frame[i], cur->frame[i])) + if (!func_states_equal(env, old->frame[i], cur->frame[i])) return false; } - return true; } +/* Return 0 if no propagation happened. Return negative error code if error + * happened. Otherwise, return the propagated bit. + */ +static int propagate_liveness_reg(struct bpf_verifier_env *env, + struct bpf_reg_state *reg, + struct bpf_reg_state *parent_reg) +{ + u8 parent_flag = parent_reg->live & REG_LIVE_READ; + u8 flag = reg->live & REG_LIVE_READ; + int err; + + /* When comes here, read flags of PARENT_REG or REG could be any of + * REG_LIVE_READ64, REG_LIVE_READ32, REG_LIVE_NONE. There is no need + * of propagation if PARENT_REG has strongest REG_LIVE_READ64. + */ + if (parent_flag == REG_LIVE_READ64 || + /* Or if there is no read flag from REG. */ + !flag || + /* Or if the read flag from REG is the same as PARENT_REG. */ + parent_flag == flag) + return 0; + + err = mark_reg_read(env, reg, parent_reg, flag); + if (err) + return err; + + return flag; +} + /* A write screens off any subsequent reads; but write marks come from the * straight-line code between a state and its parent. When we arrive at an * equivalent state (jump target or such) we didn't arrive by the straight-line * code, so read marks in the state must propagate to the parent regardless * of the state's write marks. That's what 'parent == state->parent' comparison - * in mark_reg_read() and mark_stack_slot_read() is for. + * in mark_reg_read() is for. */ static int propagate_liveness(struct bpf_verifier_env *env, const struct bpf_verifier_state *vstate, struct bpf_verifier_state *vparent) { - int i, frame, err = 0; + struct bpf_reg_state *state_reg, *parent_reg; struct bpf_func_state *state, *parent; + int i, frame, err = 0; if (vparent->curframe != vstate->curframe) { WARN(1, "propagate_live: parent frame %d current frame %d\n", vparent->curframe, vstate->curframe); return -EFAULT; } - /* Propagate read liveness of registers... */ BUILD_BUG_ON(BPF_REG_FP + 1 != MAX_BPF_REG); - /* We don't need to worry about FP liveness because it's read-only */ - for (i = 0; i < BPF_REG_FP; i++) { - if (vparent->frame[vparent->curframe]->regs[i].live & REG_LIVE_READ) - continue; - if (vstate->frame[vstate->curframe]->regs[i].live & REG_LIVE_READ) { - err = mark_reg_read(env, vstate, vparent, i); - if (err) + for (frame = 0; frame <= vstate->curframe; frame++) { + parent = vparent->frame[frame]; + state = vstate->frame[frame]; + parent_reg = parent->regs; + state_reg = state->regs; + /* We don't need to worry about FP liveness, it's read-only */ + for (i = frame < vstate->curframe ? BPF_REG_6 : 0; i < BPF_REG_FP; i++) { + err = propagate_liveness_reg(env, &state_reg[i], + &parent_reg[i]); + if (err < 0) + return err; + if (err == REG_LIVE_READ64) + mark_insn_zext(env, &parent_reg[i]); + } + + /* Propagate stack slots. */ + for (i = 0; i < state->allocated_stack / BPF_REG_SIZE && + i < parent->allocated_stack / BPF_REG_SIZE; i++) { + parent_reg = &parent->stack[i].spilled_ptr; + state_reg = &state->stack[i].spilled_ptr; + err = propagate_liveness_reg(env, state_reg, + parent_reg); + if (err < 0) return err; } } - - /* ... and stack slots */ - for (frame = 0; frame <= vstate->curframe; frame++) { - state = vstate->frame[frame]; - parent = vparent->frame[frame]; - for (i = 0; i < state->allocated_stack / BPF_REG_SIZE && - i < parent->allocated_stack / BPF_REG_SIZE; i++) { - if (parent->stack[i].slot_type[0] != STACK_SPILL) - continue; - if (state->stack[i].slot_type[0] != STACK_SPILL) - continue; - if (parent->stack[i].spilled_ptr.live & REG_LIVE_READ) - continue; - if (state->stack[i].spilled_ptr.live & REG_LIVE_READ) - mark_stack_slot_read(env, vstate, vparent, i, frame); - } - } - return err; + return 0; } +/* find precise scalars in the previous equivalent state and + * propagate them into the current state + */ +static int propagate_precision(struct bpf_verifier_env *env, + const struct bpf_verifier_state *old) +{ + struct bpf_reg_state *state_reg; + struct bpf_func_state *state; + int i, err = 0; + + state = old->frame[old->curframe]; + state_reg = state->regs; + for (i = 0; i < BPF_REG_FP; i++, state_reg++) { + if (state_reg->type != SCALAR_VALUE || + !state_reg->precise) + continue; + if (env->log.level & BPF_LOG_LEVEL2) + verbose(env, "propagating r%d\n", i); + err = mark_chain_precision(env, i); + if (err < 0) + return err; + } + + for (i = 0; i < state->allocated_stack / BPF_REG_SIZE; i++) { + if (state->stack[i].slot_type[0] != STACK_SPILL) + continue; + state_reg = &state->stack[i].spilled_ptr; + if (state_reg->type != SCALAR_VALUE || + !state_reg->precise) + continue; + if (env->log.level & BPF_LOG_LEVEL2) + verbose(env, "propagating fp%d\n", + (-i - 1) * BPF_REG_SIZE); + err = mark_chain_precision_stack(env, i); + if (err < 0) + return err; + } + return 0; +} + +static bool states_maybe_looping(struct bpf_verifier_state *old, + struct bpf_verifier_state *cur) +{ + struct bpf_func_state *fold, *fcur; + int i, fr = cur->curframe; + + if (old->curframe != fr) + return false; + + fold = old->frame[fr]; + fcur = cur->frame[fr]; + for (i = 0; i < MAX_BPF_REG; i++) + if (memcmp(&fold->regs[i], &fcur->regs[i], + offsetof(struct bpf_reg_state, parent))) + return false; + return true; +} + + static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) { struct bpf_verifier_state_list *new_sl; - struct bpf_verifier_state_list *sl; - struct bpf_verifier_state *cur = env->cur_state; - int i, j, err; + struct bpf_verifier_state_list *sl, **pprev; + struct bpf_verifier_state *cur = env->cur_state, *new; + int i, j, err, states_cnt = 0; + bool add_new_state = env->test_state_freq ? true : false; - sl = env->explored_states[insn_idx]; - if (!sl) + cur->last_insn_idx = env->prev_insn_idx; + if (!env->insn_aux_data[insn_idx].prune_point) /* this 'insn_idx' instruction wasn't marked, so we will not * be doing state search here */ return 0; - while (sl != STATE_LIST_MARK) { + /* bpf progs typically have pruning point every 4 instructions + * http://vger.kernel.org/bpfconf2019.html#session-1 + * Do not add new state for future pruning if the verifier hasn't seen + * at least 2 jumps and at least 8 instructions. + * This heuristics helps decrease 'total_states' and 'peak_states' metric. + * In tests that amounts to up to 50% reduction into total verifier + * memory consumption and 20% verifier time speedup. + */ + if (env->jmps_processed - env->prev_jmps_processed >= 2 && + env->insn_processed - env->prev_insn_processed >= 8) + add_new_state = true; + + pprev = explored_state(env, insn_idx); + sl = *pprev; + + clean_live_states(env, insn_idx, cur); + + while (sl) { + states_cnt++; + if (sl->state.insn_idx != insn_idx) + goto next; + if (sl->state.branches) { + if (states_maybe_looping(&sl->state, cur) && + states_equal(env, &sl->state, cur)) { + verbose_linfo(env, insn_idx, "; "); + verbose(env, "infinite loop detected at insn %d\n", insn_idx); + return -EINVAL; + } + /* if the verifier is processing a loop, avoid adding new state + * too often, since different loop iterations have distinct + * states and may not help future pruning. + * This threshold shouldn't be too low to make sure that + * a loop with large bound will be rejected quickly. + * The most abusive loop will be: + * r1 += 1 + * if r1 < 1000000 goto pc-2 + * 1M insn_procssed limit / 100 == 10k peak states. + * This threshold shouldn't be too high either, since states + * at the end of the loop are likely to be useful in pruning. + */ + if (env->jmps_processed - env->prev_jmps_processed < 20 && + env->insn_processed - env->prev_insn_processed < 100) + add_new_state = false; + goto miss; + } if (states_equal(env, &sl->state, cur)) { + sl->hit_cnt++; /* reached equivalent register/stack state, * prune the search. * Registers read by the continuation are read by us. @@ -5755,49 +7615,135 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx) * this state and will pop a new one. */ err = propagate_liveness(env, &sl->state, cur); + + /* if previous state reached the exit with precision and + * current state is equivalent to it (except precsion marks) + * the precision needs to be propagated back in + * the current state. + */ + err = err ? : push_jmp_history(env, cur); + err = err ? : propagate_precision(env, &sl->state); if (err) return err; return 1; } - sl = sl->next; +miss: + /* when new state is not going to be added do not increase miss count. + * Otherwise several loop iterations will remove the state + * recorded earlier. The goal of these heuristics is to have + * states from some iterations of the loop (some in the beginning + * and some at the end) to help pruning. + */ + if (add_new_state) + sl->miss_cnt++; + /* heuristic to determine whether this state is beneficial + * to keep checking from state equivalence point of view. + * Higher numbers increase max_states_per_insn and verification time, + * but do not meaningfully decrease insn_processed. + */ + if (sl->miss_cnt > sl->hit_cnt * 3 + 3) { + /* the state is unlikely to be useful. Remove it to + * speed up verification + */ + *pprev = sl->next; + if (sl->state.frame[0]->regs[0].live & REG_LIVE_DONE) { + u32 br = sl->state.branches; + + WARN_ONCE(br, + "BUG live_done but branches_to_explore %d\n", + br); + free_verifier_state(&sl->state, false); + kfree(sl); + env->peak_states--; + } else { + /* cannot free this state, since parentage chain may + * walk it later. Add it for free_list instead to + * be freed at the end of verification + */ + sl->next = env->free_list; + env->free_list = sl; + } + sl = *pprev; + continue; + } +next: + pprev = &sl->next; + sl = *pprev; } - /* there were no equivalent states, remember current one. - * technically the current state is not proven to be safe yet, + if (env->max_states_per_insn < states_cnt) + env->max_states_per_insn = states_cnt; + + if (!env->allow_ptr_leaks && states_cnt > BPF_COMPLEXITY_LIMIT_STATES) + return push_jmp_history(env, cur); + + if (!add_new_state) + return push_jmp_history(env, cur); + + /* There were no equivalent states, remember the current one. + * Technically the current state is not proven to be safe yet, * but it will either reach outer most bpf_exit (which means it's safe) - * or it will be rejected. Since there are no loops, we won't be + * or it will be rejected. When there are no loops the verifier won't be * seeing this tuple (frame[0].callsite, frame[1].callsite, .. insn_idx) - * again on the way to bpf_exit + * again on the way to bpf_exit. + * When looping the sl->state.branches will be > 0 and this state + * will not be considered for equivalence until branches == 0. */ new_sl = kzalloc(sizeof(struct bpf_verifier_state_list), GFP_KERNEL); if (!new_sl) return -ENOMEM; + env->total_states++; + env->peak_states++; + env->prev_jmps_processed = env->jmps_processed; + env->prev_insn_processed = env->insn_processed; /* add new state to the head of linked list */ - err = copy_verifier_state(&new_sl->state, cur); + new = &new_sl->state; + err = copy_verifier_state(new, cur); if (err) { - free_verifier_state(&new_sl->state, false); + free_verifier_state(new, false); kfree(new_sl); return err; } - new_sl->next = env->explored_states[insn_idx]; - env->explored_states[insn_idx] = new_sl; - /* connect new state to parentage chain */ - cur->parent = &new_sl->state; + new->insn_idx = insn_idx; + WARN_ONCE(new->branches != 1, + "BUG is_state_visited:branches_to_explore=%d insn %d\n", new->branches, insn_idx); + + cur->parent = new; + cur->first_insn_idx = insn_idx; + clear_jmp_history(cur); + new_sl->next = *explored_state(env, insn_idx); + *explored_state(env, insn_idx) = new_sl; + /* connect new state to parentage chain. Current frame needs all + * registers connected. Only r6 - r9 of the callers are alive (pushed + * to the stack implicitly by JITs) so in callers' frames connect just + * r6 - r9 as an optimization. Callers will have r1 - r5 connected to + * the state of the call instruction (with WRITTEN set), and r0 comes + * from callee with its full parentage chain, anyway. + */ /* clear write marks in current state: the writes we did are not writes * our child did, so they don't screen off its reads from us. * (There are no read marks in current state, because reads always mark * their parent and current state never has children yet. Only * explored_states can get read marks.) */ - for (i = 0; i < BPF_REG_FP; i++) - cur->frame[cur->curframe]->regs[i].live = REG_LIVE_NONE; + for (j = 0; j <= cur->curframe; j++) { + for (i = j < cur->curframe ? BPF_REG_6 : 0; i < BPF_REG_FP; i++) + cur->frame[j]->regs[i].parent = &new->frame[j]->regs[i]; + for (i = 0; i < BPF_REG_FP; i++) + cur->frame[j]->regs[i].live = REG_LIVE_NONE; + } + /* all stack frames are accessible from callee, clear them all */ for (j = 0; j <= cur->curframe; j++) { struct bpf_func_state *frame = cur->frame[j]; - for (i = 0; i < frame->allocated_stack / BPF_REG_SIZE; i++) - if (frame->stack[i].slot_type[0] == STACK_SPILL) - frame->stack[i].spilled_ptr.live = REG_LIVE_NONE; + struct bpf_func_state *newframe = new->frame[j]; + + for (i = 0; i < frame->allocated_stack / BPF_REG_SIZE; i++) { + frame->stack[i].spilled_ptr.live = REG_LIVE_NONE; + frame->stack[i].spilled_ptr.parent = + &newframe->stack[i].spilled_ptr; + } } return 0; } @@ -5813,11 +7759,13 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type) case PTR_TO_SOCK_COMMON_OR_NULL: case PTR_TO_TCP_SOCK: case PTR_TO_TCP_SOCK_OR_NULL: + case PTR_TO_XDP_SOCK: return false; default: return true; } } + /* If an instruction was previously used with particular pointer types, then we * need to be careful to avoid cases such as the below, where it may be ok * for one branch accessing the pointer, but not ok for the other branch: @@ -5841,16 +7789,18 @@ static int do_check(struct bpf_verifier_env *env) struct bpf_verifier_state *state; struct bpf_insn *insns = env->prog->insnsi; struct bpf_reg_state *regs; - int insn_cnt = env->prog->len, i; - int insn_processed = 0; + int insn_cnt = env->prog->len; bool do_print_state = false; + int prev_insn_idx = -1; + + env->prev_linfo = NULL; state = kzalloc(sizeof(struct bpf_verifier_state), GFP_KERNEL); if (!state) return -ENOMEM; - state->curframe = 0; - state->parent = NULL; + state->speculative = false; + state->branches = 1; state->frame[0] = kzalloc(sizeof(struct bpf_func_state), GFP_KERNEL); if (!state->frame[0]) { kfree(state); @@ -5867,6 +7817,7 @@ static int do_check(struct bpf_verifier_env *env) u8 class; int err; + env->prev_insn_idx = prev_insn_idx; if (env->insn_idx >= insn_cnt) { verbose(env, "invalid insn idx %d insn_cnt %d\n", env->insn_idx, insn_cnt); @@ -5876,9 +7827,10 @@ static int do_check(struct bpf_verifier_env *env) insn = &insns[env->insn_idx]; class = BPF_CLASS(insn->code); - if (++insn_processed > BPF_COMPLEXITY_LIMIT_INSNS) { - verbose(env, "BPF program is too large. Processed %d insn\n", - insn_processed); + if (++env->insn_processed > BPF_COMPLEXITY_LIMIT_INSNS) { + verbose(env, + "BPF program is too large. Processed %d insn\n", + env->insn_processed); return -E2BIG; } @@ -5887,7 +7839,7 @@ static int do_check(struct bpf_verifier_env *env) return err; if (err == 1) { /* found equivalent state, can prune the search */ - if (env->log.level) { + if (env->log.level & BPF_LOG_LEVEL) { if (do_print_state) verbose(env, "\nfrom %d to %d%s: safe\n", env->prev_insn_idx, env->insn_idx, @@ -5899,12 +7851,15 @@ static int do_check(struct bpf_verifier_env *env) goto process_bpf_exit; } + if (signal_pending(current)) + return -EAGAIN; + if (need_resched()) cond_resched(); - if (env->log.level > 1 || - (env->log.level && do_print_state)) { - if (env->log.level > 1) + if (env->log.level & BPF_LOG_LEVEL2 || + (env->log.level & BPF_LOG_LEVEL && do_print_state)) { + if (env->log.level & BPF_LOG_LEVEL2) verbose(env, "%d:", env->insn_idx); else verbose(env, "\nfrom %d to %d%s:", @@ -5915,11 +7870,13 @@ static int do_check(struct bpf_verifier_env *env) do_print_state = false; } - if (env->log.level) { + if (env->log.level & BPF_LOG_LEVEL) { const struct bpf_insn_cbs cbs = { .cb_print = verbose, .private_data = env, }; + + verbose_linfo(env, env->insn_idx, "; "); verbose(env, "%d: ", env->insn_idx); print_bpf_insn(&cbs, insn, env->allow_ptr_leaks); } @@ -5932,7 +7889,9 @@ static int do_check(struct bpf_verifier_env *env) } regs = cur_regs(env); - env->insn_aux_data[env->insn_idx].seen = true; + sanitize_mark_insn_seen(env); + prev_insn_idx = env->insn_idx; + if (class == BPF_ALU || class == BPF_ALU64) { err = check_alu_op(env, insn); if (err) @@ -5971,6 +7930,7 @@ static int do_check(struct bpf_verifier_env *env) * save type to validate intersecting paths */ *prev_src_type = src_reg_type; + } else if (reg_type_mismatch(src_reg_type, *prev_src_type)) { /* ABuser program is trying to use the same insn * dst_reg = *(u32*) (src_reg + off) @@ -6033,8 +7993,9 @@ static int do_check(struct bpf_verifier_env *env) return err; if (is_ctx_reg(env, insn->dst_reg)) { - verbose(env, "BPF_ST stores into R%d context is not allowed\n", - insn->dst_reg); + verbose(env, "BPF_ST stores into R%d %s is not allowed\n", + insn->dst_reg, + reg_type_str[reg_state(env, insn->dst_reg)->type]); return -EACCES; } @@ -6045,15 +8006,17 @@ static int do_check(struct bpf_verifier_env *env) if (err) return err; - } else if (class == BPF_JMP) { + } else if (class == BPF_JMP || class == BPF_JMP32) { u8 opcode = BPF_OP(insn->code); + env->jmps_processed++; if (opcode == BPF_CALL) { if (BPF_SRC(insn->code) != BPF_K || insn->off != 0 || (insn->src_reg != BPF_REG_0 && insn->src_reg != BPF_PSEUDO_CALL) || - insn->dst_reg != BPF_REG_0) { + insn->dst_reg != BPF_REG_0 || + class == BPF_JMP32) { verbose(env, "BPF_CALL uses reserved fields\n"); return -EINVAL; } @@ -6075,7 +8038,8 @@ static int do_check(struct bpf_verifier_env *env) if (BPF_SRC(insn->code) != BPF_K || insn->imm != 0 || insn->src_reg != BPF_REG_0 || - insn->dst_reg != BPF_REG_0) { + insn->dst_reg != BPF_REG_0 || + class == BPF_JMP32) { verbose(env, "BPF_JA uses reserved fields\n"); return -EINVAL; } @@ -6087,7 +8051,8 @@ static int do_check(struct bpf_verifier_env *env) if (BPF_SRC(insn->code) != BPF_K || insn->imm != 0 || insn->src_reg != BPF_REG_0 || - insn->dst_reg != BPF_REG_0) { + insn->dst_reg != BPF_REG_0 || + class == BPF_JMP32) { verbose(env, "BPF_EXIT uses reserved fields\n"); return -EINVAL; } @@ -6129,7 +8094,9 @@ static int do_check(struct bpf_verifier_env *env) if (err) return err; process_bpf_exit: - err = pop_stack(env, &env->prev_insn_idx, &env->insn_idx); + update_branch_counts(env, env->cur_state); + err = pop_stack(env, &prev_insn_idx, + &env->insn_idx); if (err < 0) { if (err != -ENOENT) return err; @@ -6157,7 +8124,7 @@ process_bpf_exit: return err; env->insn_idx++; - env->insn_aux_data[env->insn_idx].seen = true; + sanitize_mark_insn_seen(env); } else { verbose(env, "invalid BPF_LD mode\n"); return -EINVAL; @@ -6170,15 +8137,6 @@ process_bpf_exit: env->insn_idx++; } - verbose(env, "processed %d insns, stack depth ", insn_processed); - for (i = 0; i < env->subprog_cnt; i++) { - u32 depth = env->subprog_info[i].stack_depth; - - verbose(env, "%d", depth); - if (i + 1 < env->subprog_cnt) - verbose(env, "+"); - } - verbose(env, "\n"); env->prog->aux->stack_depth = env->subprog_info[0].stack_depth; return 0; } @@ -6234,7 +8192,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, } if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) && - !bpf_offload_dev_match(prog, map)) { + !bpf_offload_prog_map_match(prog, map)) { verbose(env, "offload device mismatch between prog and map\n"); return -EINVAL; } @@ -6301,13 +8259,14 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env) insn[1].imm != 0)) { verbose(env, "unrecognized bpf_ld_imm64 insn\n"); + return -EINVAL; } - f = fdget(insn->imm); + f = fdget(insn[0].imm); map = __bpf_map_get(f); if (IS_ERR(map)) { verbose(env, "fd %d is not pointing to valid bpf_map\n", - insn->imm); + insn[0].imm); return PTR_ERR(map); } @@ -6318,11 +8277,11 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env) } aux = &env->insn_aux_data[i]; - if (insn->src_reg == BPF_PSEUDO_MAP_FD) { addr = (unsigned long)map; } else { u32 off = insn[1].imm; + if (off >= BPF_MAX_VAR_OFF) { verbose(env, "direct value offset of %u is not allowed\n", off); fdput(f); @@ -6380,6 +8339,7 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env) if (bpf_map_is_cgroup_storage(map) && bpf_cgroup_storage_assign(env->prog, map)) { + verbose(env, "only one cgroup storage of each type is allowed\n"); fdput(f); return -EBUSY; } @@ -6388,6 +8348,13 @@ static int replace_map_fd_with_map_ptr(struct bpf_verifier_env *env) next_insn: insn++; i++; + continue; + } + + /* Basic sanity check before we invest more work here. */ + if (!bpf_opcode_in_insntable(insn->code)) { + verbose(env, "unknown opcode %02x\n", insn->code); + return -EINVAL; } } @@ -6431,34 +8398,47 @@ static void convert_pseudo_ld_imm64(struct bpf_verifier_env *env) * insni[off, off + cnt). Adjust corresponding insn_aux_data by copying * [0, off) and [off, end) to new locations, so the patched range stays zero */ -static int adjust_insn_aux_data(struct bpf_verifier_env *env, u32 prog_len, - u32 off, u32 cnt) +static void adjust_insn_aux_data(struct bpf_verifier_env *env, + struct bpf_insn_aux_data *new_data, + struct bpf_prog *new_prog, u32 off, u32 cnt) { - struct bpf_insn_aux_data *new_data, *old_data = env->insn_aux_data; + struct bpf_insn_aux_data *old_data = env->insn_aux_data; + struct bpf_insn *insn = new_prog->insnsi; + bool old_seen = old_data[off].seen; + u32 prog_len; int i; + /* aux info at OFF always needs adjustment, no matter fast path + * (cnt == 1) is taken or not. There is no guarantee INSN at OFF is the + * original insn at old prog. + */ + old_data[off].zext_dst = insn_has_def32(env, insn + off + cnt - 1); + if (cnt == 1) - return 0; - new_data = vzalloc(sizeof(struct bpf_insn_aux_data) * prog_len); - if (!new_data) - return -ENOMEM; + return; + prog_len = new_prog->len; + memcpy(new_data, old_data, sizeof(struct bpf_insn_aux_data) * off); memcpy(new_data + off + cnt - 1, old_data + off, sizeof(struct bpf_insn_aux_data) * (prog_len - off - cnt + 1)); - for (i = off; i < off + cnt - 1; i++) - new_data[i].seen = true; + for (i = off; i < off + cnt - 1; i++) { + /* Expand insni[off]'s seen count to the patched range. */ + new_data[i].seen = old_seen; + new_data[i].zext_dst = insn_has_def32(env, insn + i); + } env->insn_aux_data = new_data; vfree(old_data); - return 0; } static void adjust_subprog_starts(struct bpf_verifier_env *env, u32 off, u32 len) { int i; + if (len == 1) return; - for (i = 0; i < env->subprog_cnt; i++) { - if (env->subprog_info[i].start < off) + /* NOTE: fake 'exit' subprog should be updated as well. */ + for (i = 0; i <= env->subprog_cnt; i++) { + if (env->subprog_info[i].start <= off) continue; env->subprog_info[i].start += len - 1; } @@ -6468,24 +8448,191 @@ static struct bpf_prog *bpf_patch_insn_data(struct bpf_verifier_env *env, u32 of const struct bpf_insn *patch, u32 len) { struct bpf_prog *new_prog; + struct bpf_insn_aux_data *new_data = NULL; + + if (len > 1) { + new_data = vzalloc(array_size(env->prog->len + len - 1, + sizeof(struct bpf_insn_aux_data))); + if (!new_data) + return NULL; + } new_prog = bpf_patch_insn_single(env->prog, off, patch, len); - if (!new_prog) - return NULL; - if (adjust_insn_aux_data(env, new_prog->len, off, len)) + if (IS_ERR(new_prog)) { + if (PTR_ERR(new_prog) == -ERANGE) + verbose(env, + "insn %d cannot be patched due to 16-bit range\n", + env->insn_aux_data[off].orig_idx); + vfree(new_data); return NULL; + } + adjust_insn_aux_data(env, new_data, new_prog, off, len); adjust_subprog_starts(env, off, len); return new_prog; } -/* The verifier does more data flow analysis than llvm and will not explore - * branches that are dead at run time. Malicious programs can have dead code - * too. Therefore replace all dead at-run-time code with nops. +static int adjust_subprog_starts_after_remove(struct bpf_verifier_env *env, + u32 off, u32 cnt) +{ + int i, j; + + /* find first prog starting at or after off (first to remove) */ + for (i = 0; i < env->subprog_cnt; i++) + if (env->subprog_info[i].start >= off) + break; + /* find first prog starting at or after off + cnt (first to stay) */ + for (j = i; j < env->subprog_cnt; j++) + if (env->subprog_info[j].start >= off + cnt) + break; + /* if j doesn't start exactly at off + cnt, we are just removing + * the front of previous prog + */ + if (env->subprog_info[j].start != off + cnt) + j--; + + if (j > i) { + struct bpf_prog_aux *aux = env->prog->aux; + int move; + + /* move fake 'exit' subprog as well */ + move = env->subprog_cnt + 1 - j; + + memmove(env->subprog_info + i, + env->subprog_info + j, + sizeof(*env->subprog_info) * move); + env->subprog_cnt -= j - i; + + /* remove func_info */ + if (aux->func_info) { + move = aux->func_info_cnt - j; + + memmove(aux->func_info + i, + aux->func_info + j, + sizeof(*aux->func_info) * move); + aux->func_info_cnt -= j - i; + /* func_info->insn_off is set after all code rewrites, + * in adjust_btf_func() - no need to adjust + */ + } + } else { + /* convert i from "first prog to remove" to "first to adjust" */ + if (env->subprog_info[i].start == off) + i++; + } + + /* update fake 'exit' subprog as well */ + for (; i <= env->subprog_cnt; i++) + env->subprog_info[i].start -= cnt; + + return 0; +} + +static int bpf_adj_linfo_after_remove(struct bpf_verifier_env *env, u32 off, + u32 cnt) +{ + struct bpf_prog *prog = env->prog; + u32 i, l_off, l_cnt, nr_linfo; + struct bpf_line_info *linfo; + + nr_linfo = prog->aux->nr_linfo; + if (!nr_linfo) + return 0; + + linfo = prog->aux->linfo; + + /* find first line info to remove, count lines to be removed */ + for (i = 0; i < nr_linfo; i++) + if (linfo[i].insn_off >= off) + break; + + l_off = i; + l_cnt = 0; + for (; i < nr_linfo; i++) + if (linfo[i].insn_off < off + cnt) + l_cnt++; + else + break; + + /* First live insn doesn't match first live linfo, it needs to "inherit" + * last removed linfo. prog is already modified, so prog->len == off + * means no live instructions after (tail of the program was removed). + */ + if (prog->len != off && l_cnt && + (i == nr_linfo || linfo[i].insn_off != off + cnt)) { + l_cnt--; + linfo[--i].insn_off = off + cnt; + } + + /* remove the line info which refer to the removed instructions */ + if (l_cnt) { + memmove(linfo + l_off, linfo + i, + sizeof(*linfo) * (nr_linfo - i)); + + prog->aux->nr_linfo -= l_cnt; + nr_linfo = prog->aux->nr_linfo; + } + + /* pull all linfo[i].insn_off >= off + cnt in by cnt */ + for (i = l_off; i < nr_linfo; i++) + linfo[i].insn_off -= cnt; + + /* fix up all subprogs (incl. 'exit') which start >= off */ + for (i = 0; i <= env->subprog_cnt; i++) + if (env->subprog_info[i].linfo_idx > l_off) { + /* program may have started in the removed region but + * may not be fully removed + */ + if (env->subprog_info[i].linfo_idx >= l_off + l_cnt) + env->subprog_info[i].linfo_idx -= l_cnt; + else + env->subprog_info[i].linfo_idx = l_off; + } + + return 0; +} + +static int verifier_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt) +{ + struct bpf_insn_aux_data *aux_data = env->insn_aux_data; + unsigned int orig_prog_len = env->prog->len; + int err; + + if (bpf_prog_is_dev_bound(env->prog->aux)) + bpf_prog_offload_remove_insns(env, off, cnt); + + err = bpf_remove_insns(env->prog, off, cnt); + if (err) + return err; + + err = adjust_subprog_starts_after_remove(env, off, cnt); + if (err) + return err; + + err = bpf_adj_linfo_after_remove(env, off, cnt); + if (err) + return err; + + memmove(aux_data + off, aux_data + off + cnt, + sizeof(*aux_data) * (orig_prog_len - off - cnt)); + + return 0; +} + +/* The verifier does more data flow analysis than llvm and will not + * explore branches that are dead at run time. Malicious programs can + * have dead code too. Therefore replace all dead at-run-time code + * with 'ja -1'. + * + * Just nops are not optimal, e.g. if they would sit at the end of the + * program and through another bug we would manage to jump there, then + * we'd execute beyond program memory otherwise. Returning exception + * code also wouldn't work since we can have subprogs where the dead + * code could be located. */ static void sanitize_dead_code(struct bpf_verifier_env *env) { struct bpf_insn_aux_data *aux_data = env->insn_aux_data; - struct bpf_insn nop = BPF_MOV64_REG(BPF_REG_0, BPF_REG_0); + struct bpf_insn trap = BPF_JMP_IMM(BPF_JA, 0, 0, -1); struct bpf_insn *insn = env->prog->insnsi; const int insn_cnt = env->prog->len; int i; @@ -6493,12 +8640,178 @@ static void sanitize_dead_code(struct bpf_verifier_env *env) for (i = 0; i < insn_cnt; i++) { if (aux_data[i].seen) continue; - memcpy(insn + i, &nop, sizeof(nop)); + memcpy(insn + i, &trap, sizeof(trap)); + aux_data[i].zext_dst = false; } } -/* convert load instructions that access fields of 'struct __sk_buff' - * into sequence of instructions that access fields of 'struct sk_buff' +static bool insn_is_cond_jump(u8 code) +{ + u8 op; + + if (BPF_CLASS(code) == BPF_JMP32) + return true; + + if (BPF_CLASS(code) != BPF_JMP) + return false; + + op = BPF_OP(code); + return op != BPF_JA && op != BPF_EXIT && op != BPF_CALL; +} + +static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env) +{ + struct bpf_insn_aux_data *aux_data = env->insn_aux_data; + struct bpf_insn ja = BPF_JMP_IMM(BPF_JA, 0, 0, 0); + struct bpf_insn *insn = env->prog->insnsi; + const int insn_cnt = env->prog->len; + int i; + + for (i = 0; i < insn_cnt; i++, insn++) { + if (!insn_is_cond_jump(insn->code)) + continue; + + if (!aux_data[i + 1].seen) + ja.off = insn->off; + else if (!aux_data[i + 1 + insn->off].seen) + ja.off = 0; + else + continue; + + if (bpf_prog_is_dev_bound(env->prog->aux)) + bpf_prog_offload_replace_insn(env, i, &ja); + + memcpy(insn, &ja, sizeof(ja)); + } +} + +static int opt_remove_dead_code(struct bpf_verifier_env *env) +{ + struct bpf_insn_aux_data *aux_data = env->insn_aux_data; + int insn_cnt = env->prog->len; + int i, err; + + for (i = 0; i < insn_cnt; i++) { + int j; + + j = 0; + while (i + j < insn_cnt && !aux_data[i + j].seen) + j++; + if (!j) + continue; + + err = verifier_remove_insns(env, i, j); + if (err) + return err; + insn_cnt = env->prog->len; + } + + return 0; +} + +static int opt_remove_nops(struct bpf_verifier_env *env) +{ + const struct bpf_insn ja = BPF_JMP_IMM(BPF_JA, 0, 0, 0); + struct bpf_insn *insn = env->prog->insnsi; + int insn_cnt = env->prog->len; + int i, err; + + for (i = 0; i < insn_cnt; i++) { + if (memcmp(&insn[i], &ja, sizeof(ja))) + continue; + + err = verifier_remove_insns(env, i, 1); + if (err) + return err; + insn_cnt--; + i--; + } + + return 0; +} + +static int opt_subreg_zext_lo32_rnd_hi32(struct bpf_verifier_env *env, + const union bpf_attr *attr) +{ + struct bpf_insn *patch, zext_patch[2], rnd_hi32_patch[4]; + struct bpf_insn_aux_data *aux = env->insn_aux_data; + int i, patch_len, delta = 0, len = env->prog->len; + struct bpf_insn *insns = env->prog->insnsi; + struct bpf_prog *new_prog; + bool rnd_hi32; + + rnd_hi32 = attr->prog_flags & BPF_F_TEST_RND_HI32; + zext_patch[1] = BPF_ZEXT_REG(0); + rnd_hi32_patch[1] = BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, 0); + rnd_hi32_patch[2] = BPF_ALU64_IMM(BPF_LSH, BPF_REG_AX, 32); + rnd_hi32_patch[3] = BPF_ALU64_REG(BPF_OR, 0, BPF_REG_AX); + for (i = 0; i < len; i++) { + int adj_idx = i + delta; + struct bpf_insn insn; + + insn = insns[adj_idx]; + if (!aux[adj_idx].zext_dst) { + u8 code, class; + u32 imm_rnd; + + if (!rnd_hi32) + continue; + + code = insn.code; + class = BPF_CLASS(code); + if (insn_no_def(&insn)) + continue; + + /* NOTE: arg "reg" (the fourth one) is only used for + * BPF_STX which has been ruled out in above + * check, it is safe to pass NULL here. + */ + if (is_reg64(env, &insn, insn.dst_reg, NULL, DST_OP)) { + if (class == BPF_LD && + BPF_MODE(code) == BPF_IMM) + i++; + continue; + } + + /* ctx load could be transformed into wider load. */ + if (class == BPF_LDX && + aux[adj_idx].ptr_type == PTR_TO_CTX) + continue; + + imm_rnd = get_random_int(); + rnd_hi32_patch[0] = insn; + rnd_hi32_patch[1].imm = imm_rnd; + rnd_hi32_patch[3].dst_reg = insn.dst_reg; + patch = rnd_hi32_patch; + patch_len = 4; + goto apply_patch_buffer; + } + + if (!bpf_jit_needs_zext()) + continue; + + zext_patch[0] = insn; + zext_patch[1].dst_reg = insn.dst_reg; + zext_patch[1].src_reg = insn.dst_reg; + patch = zext_patch; + patch_len = 2; +apply_patch_buffer: + new_prog = bpf_patch_insn_data(env, adj_idx, patch, patch_len); + if (!new_prog) + return -ENOMEM; + env->prog = new_prog; + insns = new_prog->insnsi; + aux = env->insn_aux_data; + delta += patch_len - 1; + } + + return 0; +} + +/* convert load instructions that access fields of a context type into a + * sequence of instructions that access fields of the underlying structure: + * struct __sk_buff -> struct sk_buff + * struct bpf_sock_ops -> struct sock */ static int convert_ctx_accesses(struct bpf_verifier_env *env) { @@ -6506,12 +8819,16 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) int i, cnt, size, ctx_field_size, delta = 0; const int insn_cnt = env->prog->len; struct bpf_insn insn_buf[16], *insn; + u32 target_size, size_default, off; struct bpf_prog *new_prog; enum bpf_access_type type; bool is_narrower_load; - u32 target_size; - if (ops->gen_prologue) { + if (ops->gen_prologue || env->seen_direct_write) { + if (!ops->gen_prologue) { + verbose(env, "bpf verifier is misconfigured\n"); + return -EINVAL; + } cnt = ops->gen_prologue(insn_buf, env->seen_direct_write, env->prog); if (cnt >= ARRAY_SIZE(insn_buf)) { @@ -6534,34 +8851,33 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) for (i = 0; i < insn_cnt; i++, insn++) { bpf_convert_ctx_access_t convert_ctx_access; + bool ctx_access; + if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) || insn->code == (BPF_LDX | BPF_MEM | BPF_H) || insn->code == (BPF_LDX | BPF_MEM | BPF_W) || - insn->code == (BPF_LDX | BPF_MEM | BPF_DW)) + insn->code == (BPF_LDX | BPF_MEM | BPF_DW)) { type = BPF_READ; - else if (insn->code == (BPF_STX | BPF_MEM | BPF_B) || - insn->code == (BPF_STX | BPF_MEM | BPF_H) || - insn->code == (BPF_STX | BPF_MEM | BPF_W) || - insn->code == (BPF_STX | BPF_MEM | BPF_DW)) + ctx_access = true; + } else if (insn->code == (BPF_STX | BPF_MEM | BPF_B) || + insn->code == (BPF_STX | BPF_MEM | BPF_H) || + insn->code == (BPF_STX | BPF_MEM | BPF_W) || + insn->code == (BPF_STX | BPF_MEM | BPF_DW) || + insn->code == (BPF_ST | BPF_MEM | BPF_B) || + insn->code == (BPF_ST | BPF_MEM | BPF_H) || + insn->code == (BPF_ST | BPF_MEM | BPF_W) || + insn->code == (BPF_ST | BPF_MEM | BPF_DW)) { type = BPF_WRITE; - else + ctx_access = BPF_CLASS(insn->code) == BPF_STX; + } else { continue; + } if (type == BPF_WRITE && - env->insn_aux_data[i + delta].sanitize_stack_off) { + env->insn_aux_data[i + delta].sanitize_stack_spill) { struct bpf_insn patch[] = { - /* Sanitize suspicious stack slot with zero. - * There are no memory dependencies for this store, - * since it's only using frame pointer and immediate - * constant of zero - */ - BPF_ST_MEM(BPF_DW, BPF_REG_FP, - env->insn_aux_data[i + delta].sanitize_stack_off, - 0), - /* the original STX instruction will immediately - * overwrite the same stack slot with appropriate value - */ *insn, + BPF_ST_NOSPEC(), }; cnt = ARRAY_SIZE(patch); @@ -6575,6 +8891,9 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) continue; } + if (!ctx_access) + continue; + switch (env->insn_aux_data[i + delta].ptr_type) { case PTR_TO_CTX: if (!ops->convert_ctx_access) @@ -6588,9 +8907,13 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) case PTR_TO_TCP_SOCK: convert_ctx_access = bpf_tcp_sock_convert_ctx_access; break; + case PTR_TO_XDP_SOCK: + convert_ctx_access = bpf_xdp_sock_convert_ctx_access; + break; default: continue; } + ctx_field_size = env->insn_aux_data[i + delta].ctx_field_size; size = BPF_LDST_BYTES(insn); @@ -6600,8 +8923,9 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) * we will apply proper mask to the result. */ is_narrower_load = size < ctx_field_size; + size_default = bpf_ctx_off_adjust_machine(ctx_field_size); + off = insn->off; if (is_narrower_load) { - u32 off = insn->off; u8 size_code; if (type == BPF_WRITE) { @@ -6615,7 +8939,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) else if (ctx_field_size == 8) size_code = BPF_DW; - insn->off = off & ~(ctx_field_size - 1); + insn->off = off & ~(size_default - 1); insn->code = BPF_LDX | BPF_MEM | size_code; } @@ -6629,12 +8953,27 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) } if (is_narrower_load && size < target_size) { - if (ctx_field_size <= 4) + u8 shift = bpf_ctx_narrow_access_offset( + off, size, size_default) * 8; + if (shift && cnt + 1 >= ARRAY_SIZE(insn_buf)) { + verbose(env, "bpf verifier narrow ctx load misconfigured\n"); + return -EINVAL; + } + if (ctx_field_size <= 4) { + if (shift) + insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH, + insn->dst_reg, + shift); insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg, (1 << size * 8) - 1); - else - insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg, - (1 << size * 8) - 1); + } else { + if (shift) + insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH, + insn->dst_reg, + shift); + insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg, + (1ULL << size * 8) - 1); + } } new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); @@ -6666,6 +9005,10 @@ static int jit_subprogs(struct bpf_verifier_env *env) if (insn->code != (BPF_JMP | BPF_CALL) || insn->src_reg != BPF_PSEUDO_CALL) continue; + /* Upon error here we cannot fall back to interpreter but + * need a hard reject of the program. Thus -EFAULT is + * propagated in any case. + */ subprog = find_subprog(env, i + insn->imm + 1); if (subprog < 0) { WARN_ONCE(1, "verifier bug. No program starts at insn %d\n", @@ -6687,28 +9030,31 @@ static int jit_subprogs(struct bpf_verifier_env *env) err = bpf_prog_alloc_jited_linfo(prog); if (err) goto out_undo_insn; - err = -ENOMEM; - func = kzalloc(sizeof(prog) * env->subprog_cnt, GFP_KERNEL); + err = -ENOMEM; + func = kcalloc(env->subprog_cnt, sizeof(prog), GFP_KERNEL); if (!func) goto out_undo_insn; for (i = 0; i < env->subprog_cnt; i++) { subprog_start = subprog_end; - if (env->subprog_cnt == i + 1) - subprog_end = prog->len; - else - subprog_end = env->subprog_info[i + 1].start; + subprog_end = env->subprog_info[i + 1].start; len = subprog_end - subprog_start; - func[i] = bpf_prog_alloc(bpf_prog_size(len), GFP_USER); - + /* BPF_PROG_RUN doesn't call subprogs directly, + * hence main prog stats include the runtime of subprogs. + * subprogs don't have IDs and not reachable via prog_get_next_id + * func[i]->aux->stats will never be accessed and stays NULL + */ + func[i] = bpf_prog_alloc_no_stats(bpf_prog_size(len), GFP_USER); if (!func[i]) goto out_free; - memcpy(func[i]->insnsi, &prog->insnsi[subprog_start], len * sizeof(struct bpf_insn)); + func[i]->type = prog->type; func[i]->len = len; + if (bpf_prog_calc_tag(func[i])) + goto out_free; func[i]->is_func = 1; func[i]->aux->func_idx = i; /* the btf and func_info will be freed only at prog->aux */ @@ -6743,11 +9089,23 @@ static int jit_subprogs(struct bpf_verifier_env *env) insn->src_reg != BPF_PSEUDO_CALL) continue; subprog = insn->off; - insn->off = 0; - insn->imm = (u64 (*)(u64, u64, u64, u64, u64)) - func[subprog]->bpf_func - - __bpf_call_base; + insn->imm = BPF_CAST_CALL(func[subprog]->bpf_func) - + __bpf_call_base; } + + /* we use the aux data to keep a list of the start addresses + * of the JITed images for each function in the program + * + * for some architectures, such as powerpc64, the imm field + * might not be large enough to hold the offset of the start + * address of the callee's JITed image from __bpf_call_base + * + * in such cases, we can lookup the start address of a callee + * by using its subprog id, available from the off field of + * the call instruction, as an index for this list + */ + func[i]->aux->func = func; + func[i]->aux->func_cnt = env->subprog_cnt; } for (i = 0; i < env->subprog_cnt; i++) { old_bpf_func = func[i]->bpf_func; @@ -6759,6 +9117,7 @@ static int jit_subprogs(struct bpf_verifier_env *env) } cond_resched(); } + /* finally lock prog and jit images for all functions and * populate kallsysm */ @@ -6776,10 +9135,6 @@ static int jit_subprogs(struct bpf_verifier_env *env) insn->src_reg != BPF_PSEUDO_CALL) continue; insn->off = env->insn_aux_data[i].call_imm; - /* Upon error here we cannot fall back to interpreter but - * need a hard reject of the program. Thus -EFAULT is - * propagated in any case. - */ subprog = find_subprog(env, i + insn->off + 1); insn->imm = subprog; } @@ -6852,6 +9207,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) struct bpf_insn *insn = prog->insnsi; const struct bpf_func_proto *fn; const int insn_cnt = prog->len; + const struct bpf_map_ops *ops; struct bpf_insn_aux_data *aux; struct bpf_insn insn_buf[16]; struct bpf_prog *new_prog; @@ -6864,31 +9220,30 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) insn->code == (BPF_ALU | BPF_MOD | BPF_X) || insn->code == (BPF_ALU | BPF_DIV | BPF_X)) { bool is64 = BPF_CLASS(insn->code) == BPF_ALU64; - struct bpf_insn mask_and_div[] = { - BPF_MOV_REG(BPF_CLASS(insn->code), BPF_REG_AX, insn->src_reg), + bool isdiv = BPF_OP(insn->code) == BPF_DIV; + struct bpf_insn *patchlet; + struct bpf_insn chk_and_div[] = { /* [R,W]x div 0 -> 0 */ - BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, 2), - BPF_RAW_REG(*insn, insn->dst_reg, BPF_REG_AX), + BPF_RAW_INSN((is64 ? BPF_JMP : BPF_JMP32) | + BPF_JNE | BPF_K, insn->src_reg, + 0, 2, 0), + BPF_ALU32_REG(BPF_XOR, insn->dst_reg, insn->dst_reg), BPF_JMP_IMM(BPF_JA, 0, 0, 1), - BPF_ALU_REG(BPF_CLASS(insn->code), BPF_XOR, insn->dst_reg, insn->dst_reg), + *insn, }; - struct bpf_insn mask_and_mod[] = { - BPF_MOV_REG(BPF_CLASS(insn->code), BPF_REG_AX, insn->src_reg), - BPF_JMP_IMM(BPF_JEQ, BPF_REG_AX, 0, 1 + (is64 ? 0 : 1)), - BPF_RAW_REG(*insn, insn->dst_reg, BPF_REG_AX), + struct bpf_insn chk_and_mod[] = { + /* [R,W]x mod 0 -> [R,W]x */ + BPF_RAW_INSN((is64 ? BPF_JMP : BPF_JMP32) | + BPF_JEQ | BPF_K, insn->src_reg, + 0, 1 + (is64 ? 0 : 1), 0), + *insn, BPF_JMP_IMM(BPF_JA, 0, 0, 1), BPF_MOV32_REG(insn->dst_reg, insn->dst_reg), }; - struct bpf_insn *patchlet; - if (insn->code == (BPF_ALU64 | BPF_DIV | BPF_X) || - insn->code == (BPF_ALU | BPF_DIV | BPF_X)) { - patchlet = mask_and_div; - cnt = ARRAY_SIZE(mask_and_div); - } else { - patchlet = mask_and_mod; - cnt = ARRAY_SIZE(mask_and_mod) - (is64 ? 2 : 0); - } + patchlet = isdiv ? chk_and_div : chk_and_mod; + cnt = isdiv ? ARRAY_SIZE(chk_and_div) : + ARRAY_SIZE(chk_and_mod) - (is64 ? 2 : 0); new_prog = bpf_patch_insn_data(env, i + delta, patchlet, cnt); if (!new_prog) @@ -6900,6 +9255,25 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) continue; } + if (BPF_CLASS(insn->code) == BPF_LD && + (BPF_MODE(insn->code) == BPF_ABS || + BPF_MODE(insn->code) == BPF_IND)) { + cnt = env->ops->gen_ld_abs(insn, insn_buf); + if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) { + verbose(env, "bpf verifier is misconfigured\n"); + return -EINVAL; + } + + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); + if (!new_prog) + return -ENOMEM; + + delta += cnt - 1; + env->prog = prog = new_prog; + insn = new_prog->insnsi + i + delta; + continue; + } + if (insn->code == (BPF_ALU64 | BPF_ADD | BPF_X) || insn->code == (BPF_ALU64 | BPF_SUB | BPF_X)) { const u8 code_add = BPF_ALU64 | BPF_ADD | BPF_X; @@ -6955,7 +9329,6 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) if (insn->code != (BPF_JMP | BPF_CALL)) continue; - if (insn->src_reg == BPF_PSEUDO_CALL) continue; @@ -6963,6 +9336,8 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) prog->dst_needed = 1; if (insn->imm == BPF_FUNC_get_prandom_u32) bpf_user_rnd_init_once(); + if (insn->imm == BPF_FUNC_override_return) + prog->kprobe_override = 1; if (insn->imm == BPF_FUNC_tail_call) { /* If we tail call into other programs, we * cannot make any assumptions since they can @@ -6971,6 +9346,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) */ prog->cb_access = 1; env->prog->aux->stack_depth = MAX_BPF_STACK; + env->prog->aux->max_pkt_offset = MAX_PACKET_OFF; /* mark bpf_tail_call as different opcode to avoid * conditional branch in the interpeter for every normal @@ -6991,11 +9367,11 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) * to avoid out-of-bounds cpu speculation */ if (bpf_map_ptr_poisoned(aux)) { - verbose(env, "tail_call obusing map_ptr\n"); + verbose(env, "tail_call abusing map_ptr\n"); return -EINVAL; - - map_ptr = BPF_MAP_PTR(aux->map_state);} + } + map_ptr = BPF_MAP_PTR(aux->map_state); insn_buf[0] = BPF_JMP_IMM(BPF_JGE, BPF_REG_3, map_ptr->max_entries, 2); insn_buf[1] = BPF_ALU32_IMM(BPF_AND, BPF_REG_3, @@ -7015,64 +9391,94 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env) } /* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup - * handlers are currently limited to 64 bit only. + * and other inlining handlers are currently limited to 64 bit + * only. */ if (prog->jit_requested && BITS_PER_LONG == 64 && - insn->imm == BPF_FUNC_map_lookup_elem) { + (insn->imm == BPF_FUNC_map_lookup_elem || + insn->imm == BPF_FUNC_map_update_elem || + insn->imm == BPF_FUNC_map_delete_elem || + insn->imm == BPF_FUNC_map_push_elem || + insn->imm == BPF_FUNC_map_pop_elem || + insn->imm == BPF_FUNC_map_peek_elem)) { aux = &env->insn_aux_data[i + delta]; if (bpf_map_ptr_poisoned(aux)) goto patch_call_imm; - map_ptr = BPF_MAP_PTR(aux->map_state); - if (!map_ptr->ops->map_gen_lookup) - goto patch_call_imm; - cnt = map_ptr->ops->map_gen_lookup(map_ptr, insn_buf); - if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) { - verbose(env, "bpf verifier is misconfigured\n"); - return -EINVAL; + map_ptr = BPF_MAP_PTR(aux->map_state); + ops = map_ptr->ops; + if (insn->imm == BPF_FUNC_map_lookup_elem && + ops->map_gen_lookup) { + cnt = ops->map_gen_lookup(map_ptr, insn_buf); + if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf)) { + verbose(env, "bpf verifier is misconfigured\n"); + return -EINVAL; + } + + new_prog = bpf_patch_insn_data(env, i + delta, + insn_buf, cnt); + if (!new_prog) + return -ENOMEM; + + delta += cnt - 1; + env->prog = prog = new_prog; + insn = new_prog->insnsi + i + delta; + continue; } - new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, - cnt); - if (!new_prog) - return -ENOMEM; + BUILD_BUG_ON(!__same_type(ops->map_lookup_elem, + (void *(*)(struct bpf_map *map, void *key))NULL)); + BUILD_BUG_ON(!__same_type(ops->map_delete_elem, + (int (*)(struct bpf_map *map, void *key))NULL)); + BUILD_BUG_ON(!__same_type(ops->map_update_elem, + (int (*)(struct bpf_map *map, void *key, void *value, + u64 flags))NULL)); + BUILD_BUG_ON(!__same_type(ops->map_push_elem, + (int (*)(struct bpf_map *map, void *value, + u64 flags))NULL)); + BUILD_BUG_ON(!__same_type(ops->map_pop_elem, + (int (*)(struct bpf_map *map, void *value))NULL)); + BUILD_BUG_ON(!__same_type(ops->map_peek_elem, + (int (*)(struct bpf_map *map, void *value))NULL)); - delta += cnt - 1; + switch (insn->imm) { + case BPF_FUNC_map_lookup_elem: + insn->imm = BPF_CAST_CALL(ops->map_lookup_elem) - + __bpf_call_base; + continue; + case BPF_FUNC_map_update_elem: + insn->imm = BPF_CAST_CALL(ops->map_update_elem) - + __bpf_call_base; + continue; + case BPF_FUNC_map_delete_elem: + insn->imm = BPF_CAST_CALL(ops->map_delete_elem) - + __bpf_call_base; + continue; + case BPF_FUNC_map_push_elem: + insn->imm = BPF_CAST_CALL(ops->map_push_elem) - + __bpf_call_base; + continue; + case BPF_FUNC_map_pop_elem: + insn->imm = BPF_CAST_CALL(ops->map_pop_elem) - + __bpf_call_base; + continue; + case BPF_FUNC_map_peek_elem: + insn->imm = BPF_CAST_CALL(ops->map_peek_elem) - + __bpf_call_base; + continue; + } - /* keep walking new program and skip insns we just inserted */ - env->prog = prog = new_prog; - insn = new_prog->insnsi + i + delta; - continue; + goto patch_call_imm; } - if (insn->imm == BPF_FUNC_redirect_map) { - /* Note, we cannot use prog directly as imm as subsequent - * rewrites would still change the prog pointer. The only - * stable address we can use is aux, which also works with - * prog clones during blinding. - */ - u64 addr = (unsigned long)prog->aux; - struct bpf_insn r4_ld[] = { - BPF_LD_IMM64(BPF_REG_4, addr), - *insn, - }; - cnt = ARRAY_SIZE(r4_ld); - - new_prog = bpf_patch_insn_data(env, i + delta, r4_ld, cnt); - if (!new_prog) - return -ENOMEM; - - delta += cnt - 1; - env->prog = prog = new_prog; - insn = new_prog->insnsi + i + delta; - } patch_call_imm: fn = env->ops->get_func_proto(insn->imm, env->prog); /* all functions that have prototype and verifier allowed * programs to call them, must be real in-kernel functions */ if (!fn->func) { - verbose(env, "kernel subsystem misconfigured func %s#%d\n", + verbose(env, + "kernel subsystem misconfigured func %s#%d\n", func_id_name(insn->imm), insn->imm); return -EFAULT; } @@ -7087,49 +9493,91 @@ static void free_states(struct bpf_verifier_env *env) struct bpf_verifier_state_list *sl, *sln; int i; + sl = env->free_list; + while (sl) { + sln = sl->next; + free_verifier_state(&sl->state, false); + kfree(sl); + sl = sln; + } + if (!env->explored_states) return; - for (i = 0; i < env->prog->len; i++) { + for (i = 0; i < state_htab_size(env); i++) { sl = env->explored_states[i]; - if (sl) - while (sl != STATE_LIST_MARK) { - sln = sl->next; - free_verifier_state(&sl->state, false); - kfree(sl); - sl = sln; - } + while (sl) { + sln = sl->next; + free_verifier_state(&sl->state, false); + kfree(sl); + sl = sln; + } } - kfree(env->explored_states); + kvfree(env->explored_states); +} + +static void print_verification_stats(struct bpf_verifier_env *env) +{ + int i; + + if (env->log.level & BPF_LOG_STATS) { + verbose(env, "verification time %lld usec\n", + div_u64(env->verification_time, 1000)); + verbose(env, "stack depth "); + for (i = 0; i < env->subprog_cnt; i++) { + u32 depth = env->subprog_info[i].stack_depth; + + verbose(env, "%d", depth); + if (i + 1 < env->subprog_cnt) + verbose(env, "+"); + } + verbose(env, "\n"); + } + verbose(env, "processed %d insns (limit %d) max_states_per_insn %d " + "total_states %d peak_states %d mark_read %d\n", + env->insn_processed, BPF_COMPLEXITY_LIMIT_INSNS, + env->max_states_per_insn, env->total_states, + env->peak_states, env->longest_mark_read_walk); } int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, union bpf_attr __user *uattr) { - struct bpf_verifier_log *log; + u64 start_time = ktime_get_ns(); struct bpf_verifier_env *env; - int ret = -EINVAL; + struct bpf_verifier_log *log; + int i, len, ret = -EINVAL; + bool is_priv; + + /* no program is valid */ + if (ARRAY_SIZE(bpf_verifier_ops) == 0) + return -EINVAL; /* 'struct bpf_verifier_env' can be global, but since it's not small, * allocate/free it every time bpf_check() is called */ - env = kzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL); + env = kvzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL); if (!env) return -ENOMEM; log = &env->log; - env->insn_aux_data = vzalloc(sizeof(struct bpf_insn_aux_data) * - (*prog)->len); + len = (*prog)->len; + env->insn_aux_data = + vzalloc(array_size(sizeof(struct bpf_insn_aux_data), len)); ret = -ENOMEM; if (!env->insn_aux_data) goto err_free_env; + for (i = 0; i < len; i++) + env->insn_aux_data[i].orig_idx = i; env->prog = *prog; env->ops = bpf_verifier_ops[env->prog->type]; + is_priv = capable(CAP_SYS_ADMIN); /* grab the mutex to protect few globals used by verifier */ - mutex_lock(&bpf_verifier_lock); + if (!is_priv) + mutex_lock(&bpf_verifier_lock); if (attr->log_level || attr->log_buf || attr->log_size) { /* user requested verbose verifier output @@ -7141,37 +9589,40 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, ret = -EINVAL; /* log attributes have to be sane */ - if (log->len_total < 128 || log->len_total > UINT_MAX >> 8 || - !log->level || !log->ubuf) + if (log->len_total < 128 || log->len_total > UINT_MAX >> 2 || + !log->level || !log->ubuf || log->level & ~BPF_LOG_MASK) goto err_unlock; - - ret = -ENOMEM; } env->strict_alignment = !!(attr->prog_flags & BPF_F_STRICT_ALIGNMENT); if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) env->strict_alignment = true; + if (attr->prog_flags & BPF_F_ANY_ALIGNMENT) + env->strict_alignment = false; - if (bpf_prog_is_dev_bound(env->prog->aux)) { - ret = bpf_prog_offload_verifier_prep(env); - if (ret) - goto err_unlock; - } + env->allow_ptr_leaks = is_priv; + + if (is_priv) + env->test_state_freq = attr->prog_flags & BPF_F_TEST_STATE_FREQ; ret = replace_map_fd_with_map_ptr(env); if (ret < 0) goto skip_full_check; - env->explored_states = kcalloc(env->prog->len, + if (bpf_prog_is_dev_bound(env->prog->aux)) { + ret = bpf_prog_offload_verifier_prep(env->prog); + if (ret) + goto skip_full_check; + } + + env->explored_states = kvcalloc(state_htab_size(env), sizeof(struct bpf_verifier_state_list *), GFP_USER); ret = -ENOMEM; if (!env->explored_states) goto skip_full_check; - env->allow_ptr_leaks = capable(CAP_SYS_ADMIN); - - ret = check_cfg(env); + ret = check_subprogs(env); if (ret < 0) goto skip_full_check; @@ -7179,22 +9630,39 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, if (ret < 0) goto skip_full_check; + ret = check_cfg(env); + if (ret < 0) + goto skip_full_check; + ret = do_check(env); if (env->cur_state) { free_verifier_state(env->cur_state, true); env->cur_state = NULL; } + if (ret == 0 && bpf_prog_is_dev_bound(env->prog->aux)) + ret = bpf_prog_offload_finalize(env); + skip_full_check: while (!pop_stack(env, NULL, NULL)); free_states(env); - if (ret == 0) - sanitize_dead_code(env); - if (ret == 0) ret = check_max_stack_depth(env); + /* instruction rewrites happen after this point */ + if (is_priv) { + if (ret == 0) + opt_hard_wire_dead_code_branches(env); + if (ret == 0) + ret = opt_remove_dead_code(env); + if (ret == 0) + ret = opt_remove_nops(env); + } else { + if (ret == 0) + sanitize_dead_code(env); + } + if (ret == 0) /* program is valid, convert *(u32*)(ctx + off) accesses */ ret = convert_ctx_accesses(env); @@ -7202,13 +9670,23 @@ skip_full_check: if (ret == 0) ret = fixup_bpf_calls(env); + /* do 32-bit optimization after insn patching has done so those patched + * insns could be handled correctly. + */ + if (ret == 0 && !bpf_prog_is_dev_bound(env->prog->aux)) { + ret = opt_subreg_zext_lo32_rnd_hi32(env, attr); + env->prog->aux->verifier_zext = bpf_jit_needs_zext() ? !ret + : false; + } + if (ret == 0) ret = fixup_call_args(env); - if (log->level && bpf_verifier_log_full(log)) { - ret = -ENOSPC; - } + env->verification_time = ktime_get_ns() - start_time; + print_verification_stats(env); + if (log->level && bpf_verifier_log_full(log)) + ret = -ENOSPC; if (log->level && !log->ubuf) { ret = -EFAULT; goto err_release_maps; @@ -7246,67 +9724,10 @@ err_release_maps: release_maps(env); *prog = env->prog; err_unlock: - mutex_unlock(&bpf_verifier_lock); + if (!is_priv) + mutex_unlock(&bpf_verifier_lock); vfree(env->insn_aux_data); err_free_env: - kfree(env); + kvfree(env); return ret; } - -int bpf_analyzer(struct bpf_prog *prog, const struct bpf_ext_analyzer_ops *ops, - void *priv) -{ - struct bpf_verifier_env *env; - int ret; - - env = kzalloc(sizeof(struct bpf_verifier_env), GFP_KERNEL); - if (!env) - return -ENOMEM; - - env->insn_aux_data = vzalloc(sizeof(struct bpf_insn_aux_data) * - prog->len); - ret = -ENOMEM; - if (!env->insn_aux_data) - goto err_free_env; - env->prog = prog; - env->ops = bpf_verifier_ops[env->prog->type]; - env->analyzer_ops = ops; - env->analyzer_priv = priv; - - /* grab the mutex to protect few globals used by verifier */ - mutex_lock(&bpf_verifier_lock); - - env->strict_alignment = false; - if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) - env->strict_alignment = true; - - env->explored_states = kcalloc(env->prog->len, - sizeof(struct bpf_verifier_state_list *), - GFP_KERNEL); - ret = -ENOMEM; - if (!env->explored_states) - goto skip_full_check; - - ret = check_cfg(env); - if (ret < 0) - goto skip_full_check; - - env->allow_ptr_leaks = capable(CAP_SYS_ADMIN); - - ret = do_check(env); - if (env->cur_state) { - free_verifier_state(env->cur_state, true); - env->cur_state = NULL; - } - -skip_full_check: - while (!pop_stack(env, NULL, NULL)); - free_states(env); - - mutex_unlock(&bpf_verifier_lock); - vfree(env->insn_aux_data); -err_free_env: - kfree(env); - return ret; -} -EXPORT_SYMBOL_GPL(bpf_analyzer); diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c new file mode 100644 index 000000000000..82a1ffe15dfa --- /dev/null +++ b/kernel/bpf/xskmap.c @@ -0,0 +1,319 @@ +// SPDX-License-Identifier: GPL-2.0 +/* XSKMAP used for AF_XDP sockets + * Copyright(c) 2018 Intel Corporation. + */ + +#include +#include +#include +#include +#include + +struct xsk_map { + struct bpf_map map; + struct xdp_sock **xsk_map; + struct list_head __percpu *flush_list; + spinlock_t lock; /* Synchronize map updates */ +}; + +int xsk_map_inc(struct xsk_map *map) +{ + struct bpf_map *m = &map->map; + + m = bpf_map_inc(m, false); + return PTR_ERR_OR_ZERO(m); +} + +void xsk_map_put(struct xsk_map *map) +{ + bpf_map_put(&map->map); +} + +static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map, + struct xdp_sock **map_entry) +{ + struct xsk_map_node *node; + int err; + + node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN); + if (!node) + return ERR_PTR(-ENOMEM); + + err = xsk_map_inc(map); + if (err) { + kfree(node); + return ERR_PTR(err); + } + + node->map = map; + node->map_entry = map_entry; + return node; +} + +static void xsk_map_node_free(struct xsk_map_node *node) +{ + xsk_map_put(node->map); + kfree(node); +} + +static void xsk_map_sock_add(struct xdp_sock *xs, struct xsk_map_node *node) +{ + spin_lock_bh(&xs->map_list_lock); + list_add_tail(&node->node, &xs->map_list); + spin_unlock_bh(&xs->map_list_lock); +} + +static void xsk_map_sock_delete(struct xdp_sock *xs, + struct xdp_sock **map_entry) +{ + struct xsk_map_node *n, *tmp; + + spin_lock_bh(&xs->map_list_lock); + list_for_each_entry_safe(n, tmp, &xs->map_list, node) { + if (map_entry == n->map_entry) { + list_del(&n->node); + xsk_map_node_free(n); + } + } + spin_unlock_bh(&xs->map_list_lock); +} + +static struct bpf_map *xsk_map_alloc(union bpf_attr *attr) +{ + struct xsk_map *m; + int cpu, err; + u64 cost; + + if (!capable(CAP_NET_ADMIN)) + return ERR_PTR(-EPERM); + + if (attr->max_entries == 0 || attr->key_size != 4 || + attr->value_size != 4 || + attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)) + return ERR_PTR(-EINVAL); + + m = kzalloc(sizeof(*m), GFP_USER); + if (!m) + return ERR_PTR(-ENOMEM); + + bpf_map_init_from_attr(&m->map, attr); + spin_lock_init(&m->lock); + + cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *); + cost += sizeof(struct list_head) * num_possible_cpus(); + + /* Notice returns -EPERM on if map size is larger than memlock limit */ + err = bpf_map_charge_init(&m->map.memory, cost); + if (err) + goto free_m; + + err = -ENOMEM; + + m->flush_list = alloc_percpu(struct list_head); + if (!m->flush_list) + goto free_charge; + + for_each_possible_cpu(cpu) + INIT_LIST_HEAD(per_cpu_ptr(m->flush_list, cpu)); + + m->xsk_map = bpf_map_area_alloc(m->map.max_entries * + sizeof(struct xdp_sock *), + m->map.numa_node); + if (!m->xsk_map) + goto free_percpu; + return &m->map; + +free_percpu: + free_percpu(m->flush_list); +free_charge: + bpf_map_charge_finish(&m->map.memory); +free_m: + kfree(m); + return ERR_PTR(err); +} + +static void xsk_map_free(struct bpf_map *map) +{ + struct xsk_map *m = container_of(map, struct xsk_map, map); + + bpf_clear_redirect_map(map); + synchronize_net(); + free_percpu(m->flush_list); + bpf_map_area_free(m->xsk_map); + kfree(m); +} + +static int xsk_map_get_next_key(struct bpf_map *map, void *key, void *next_key) +{ + struct xsk_map *m = container_of(map, struct xsk_map, map); + u32 index = key ? *(u32 *)key : U32_MAX; + u32 *next = next_key; + + if (index >= m->map.max_entries) { + *next = 0; + return 0; + } + + if (index == m->map.max_entries - 1) + return -ENOENT; + *next = index + 1; + return 0; +} + +struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key) +{ + struct xsk_map *m = container_of(map, struct xsk_map, map); + struct xdp_sock *xs; + + if (key >= map->max_entries) + return NULL; + + xs = READ_ONCE(m->xsk_map[key]); + return xs; +} + +int __xsk_map_redirect(struct bpf_map *map, struct xdp_buff *xdp, + struct xdp_sock *xs) +{ + struct xsk_map *m = container_of(map, struct xsk_map, map); + struct list_head *flush_list = this_cpu_ptr(m->flush_list); + int err; + + err = xsk_rcv(xs, xdp); + if (err) + return err; + + if (!xs->flush_node.prev) + list_add(&xs->flush_node, flush_list); + + return 0; +} + +void __xsk_map_flush(struct bpf_map *map) +{ + struct xsk_map *m = container_of(map, struct xsk_map, map); + struct list_head *flush_list = this_cpu_ptr(m->flush_list); + struct xdp_sock *xs, *tmp; + + list_for_each_entry_safe(xs, tmp, flush_list, flush_node) { + xsk_flush(xs); + __list_del_clearprev(&xs->flush_node); + } +} + +static void *xsk_map_lookup_elem(struct bpf_map *map, void *key) +{ + WARN_ON_ONCE(!rcu_read_lock_held()); + return __xsk_map_lookup_elem(map, *(u32 *)key); +} + +static void *xsk_map_lookup_elem_sys_only(struct bpf_map *map, void *key) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value, + u64 map_flags) +{ + struct xsk_map *m = container_of(map, struct xsk_map, map); + struct xdp_sock *xs, *old_xs, **map_entry; + u32 i = *(u32 *)key, fd = *(u32 *)value; + struct xsk_map_node *node; + struct socket *sock; + int err; + + if (unlikely(map_flags > BPF_EXIST)) + return -EINVAL; + if (unlikely(i >= m->map.max_entries)) + return -E2BIG; + + sock = sockfd_lookup(fd, &err); + if (!sock) + return err; + + if (sock->sk->sk_family != PF_XDP) { + sockfd_put(sock); + return -EOPNOTSUPP; + } + + xs = (struct xdp_sock *)sock->sk; + + if (!xsk_is_setup_for_bpf_map(xs)) { + sockfd_put(sock); + return -EOPNOTSUPP; + } + + map_entry = &m->xsk_map[i]; + node = xsk_map_node_alloc(m, map_entry); + if (IS_ERR(node)) { + sockfd_put(sock); + return PTR_ERR(node); + } + + spin_lock_bh(&m->lock); + old_xs = READ_ONCE(*map_entry); + if (old_xs == xs) { + err = 0; + goto out; + } else if (old_xs && map_flags == BPF_NOEXIST) { + err = -EEXIST; + goto out; + } else if (!old_xs && map_flags == BPF_EXIST) { + err = -ENOENT; + goto out; + } + xsk_map_sock_add(xs, node); + WRITE_ONCE(*map_entry, xs); + if (old_xs) + xsk_map_sock_delete(old_xs, map_entry); + spin_unlock_bh(&m->lock); + sockfd_put(sock); + return 0; + +out: + spin_unlock_bh(&m->lock); + sockfd_put(sock); + xsk_map_node_free(node); + return err; +} + +static int xsk_map_delete_elem(struct bpf_map *map, void *key) +{ + struct xsk_map *m = container_of(map, struct xsk_map, map); + struct xdp_sock *old_xs, **map_entry; + int k = *(u32 *)key; + + if (k >= map->max_entries) + return -EINVAL; + + spin_lock_bh(&m->lock); + map_entry = &m->xsk_map[k]; + old_xs = xchg(map_entry, NULL); + if (old_xs) + xsk_map_sock_delete(old_xs, map_entry); + spin_unlock_bh(&m->lock); + + return 0; +} + +void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs, + struct xdp_sock **map_entry) +{ + spin_lock_bh(&map->lock); + if (READ_ONCE(*map_entry) == xs) { + WRITE_ONCE(*map_entry, NULL); + xsk_map_sock_delete(xs, map_entry); + } + spin_unlock_bh(&map->lock); +} + +const struct bpf_map_ops xsk_map_ops = { + .map_alloc = xsk_map_alloc, + .map_free = xsk_map_free, + .map_get_next_key = xsk_map_get_next_key, + .map_lookup_elem = xsk_map_lookup_elem, + .map_lookup_elem_sys_only = xsk_map_lookup_elem_sys_only, + .map_update_elem = xsk_map_update_elem, + .map_delete_elem = xsk_map_delete_elem, + .map_check_btf = map_check_no_btf, +}; diff --git a/kernel/capability.c b/kernel/capability.c index 1e1c0236f55b..7718d7dcadc7 100644 --- a/kernel/capability.c +++ b/kernel/capability.c @@ -299,7 +299,7 @@ bool has_ns_capability(struct task_struct *t, int ret; rcu_read_lock(); - ret = security_capable(__task_cred(t), ns, cap); + ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NONE); rcu_read_unlock(); return (ret == 0); @@ -340,7 +340,7 @@ bool has_ns_capability_noaudit(struct task_struct *t, int ret; rcu_read_lock(); - ret = security_capable_noaudit(__task_cred(t), ns, cap); + ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NOAUDIT); rcu_read_unlock(); return (ret == 0); @@ -363,7 +363,9 @@ bool has_capability_noaudit(struct task_struct *t, int cap) return has_ns_capability_noaudit(t, &init_user_ns, cap); } -static bool ns_capable_common(struct user_namespace *ns, int cap, bool audit) +static bool ns_capable_common(struct user_namespace *ns, + int cap, + unsigned int opts) { int capable; @@ -372,8 +374,7 @@ static bool ns_capable_common(struct user_namespace *ns, int cap, bool audit) BUG(); } - capable = audit ? security_capable(current_cred(), ns, cap) : - security_capable_noaudit(current_cred(), ns, cap); + capable = security_capable(current_cred(), ns, cap, opts); if (capable == 0) { current->flags |= PF_SUPERPRIV; return true; @@ -394,7 +395,7 @@ static bool ns_capable_common(struct user_namespace *ns, int cap, bool audit) */ bool ns_capable(struct user_namespace *ns, int cap) { - return ns_capable_common(ns, cap, true); + return ns_capable_common(ns, cap, CAP_OPT_NONE); } EXPORT_SYMBOL(ns_capable); @@ -412,7 +413,7 @@ EXPORT_SYMBOL(ns_capable); */ bool ns_capable_noaudit(struct user_namespace *ns, int cap) { - return ns_capable_common(ns, cap, false); + return ns_capable_common(ns, cap, CAP_OPT_NOAUDIT); } EXPORT_SYMBOL(ns_capable_noaudit); @@ -448,10 +449,11 @@ EXPORT_SYMBOL(capable); bool file_ns_capable(const struct file *file, struct user_namespace *ns, int cap) { + if (WARN_ON_ONCE(!cap_valid(cap))) return false; - if (security_capable(file->f_cred, ns, cap) == 0) + if (security_capable(file->f_cred, ns, cap, CAP_OPT_NONE) == 0) return true; return false; @@ -500,10 +502,12 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns) { int ret = 0; /* An absent tracer adds no restrictions */ const struct cred *cred; + rcu_read_lock(); cred = rcu_dereference(tsk->ptracer_cred); if (cred) - ret = security_capable_noaudit(cred, ns, CAP_SYS_PTRACE); + ret = security_capable(cred, ns, CAP_SYS_PTRACE, + CAP_OPT_NOAUDIT); rcu_read_unlock(); return (ret == 0); } diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 604773a53d8b..d9368aec5fa2 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -4859,8 +4859,6 @@ static void css_release_work_fn(struct work_struct *work) if (cgrp->kn) RCU_INIT_POINTER(*(void __rcu __force **)&cgrp->kn->priv, NULL); - - cgroup_bpf_put(cgrp); } mutex_unlock(&cgroup_mutex); @@ -5348,6 +5346,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp) cgroup1_check_for_release(parent); + cgroup_bpf_offline(cgrp); + /* put the base reference */ percpu_ref_kill(&cgrp->self.refcnt); @@ -6071,6 +6071,7 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd) cset = task_css_set(current); if (likely(cgroup_tryget(cset->dfl_cgrp))) { skcd->val = (unsigned long)cset->dfl_cgrp; + cgroup_bpf_get(cset->dfl_cgrp); break; } cpu_relax(); @@ -6096,10 +6097,12 @@ void cgroup_sk_clone(struct sock_cgroup_data *skcd) void cgroup_sk_free(struct sock_cgroup_data *skcd) { + struct cgroup *cgrp = sock_cgroup_ptr(skcd); + if (skcd->no_refcnt) return; - - cgroup_put(sock_cgroup_ptr(skcd)); + cgroup_bpf_put(cgrp); + cgroup_put(cgrp); } #endif /* CONFIG_SOCK_CGROUP_DATA */ @@ -6121,7 +6124,7 @@ int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, int ret; mutex_lock(&cgroup_mutex); - ret = __cgroup_bpf_detach(cgrp, prog, type, flags); + ret = __cgroup_bpf_detach(cgrp, prog, type); mutex_unlock(&cgroup_mutex); return ret; } diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 940d2e8db776..138059eb730d 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Process number limiting controller for cgroups. * @@ -25,10 +26,6 @@ * a superset of parent/child/pids.current. * * Copyright (C) 2015 Aleksa Sarai - * - * This file is subject to the terms and conditions of version 2 of the GNU - * General Public License. See the file COPYING in the main directory of the - * Linux distribution for more details. */ #include diff --git a/kernel/cgroup/rdma.c b/kernel/cgroup/rdma.c index defad3c5e7dc..8db8eb22c218 100644 --- a/kernel/cgroup/rdma.c +++ b/kernel/cgroup/rdma.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * RDMA resource limiting controller for cgroups. * @@ -5,10 +6,6 @@ * additional RDMA resources after a certain limit is reached. * * Copyright (C) 2016 Parav Pandit - * - * This file is subject to the terms and conditions of version 2 of the GNU - * General Public License. See the file COPYING in the main directory of the - * Linux distribution for more details. */ #include diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 2d90996dbe77..1001b581ae8f 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -457,6 +457,7 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_NUMBER(PG_hwpoison); #endif VMCOREINFO_NUMBER(PG_head_mask); +#define PAGE_BUDDY_MAPCOUNT_VALUE (~PG_buddy) VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE); #ifdef CONFIG_HUGETLB_PAGE VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR); diff --git a/kernel/events/core.c b/kernel/events/core.c index 55845a733f5c..d655eae6c912 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -419,6 +419,8 @@ static atomic_t nr_namespaces_events __read_mostly; static atomic_t nr_task_events __read_mostly; static atomic_t nr_freq_events __read_mostly; static atomic_t nr_switch_events __read_mostly; +static atomic_t nr_ksymbol_events __read_mostly; +static atomic_t nr_bpf_events __read_mostly; static LIST_HEAD(pmus); static DEFINE_MUTEX(pmus_lock); @@ -3841,12 +3843,15 @@ static inline u64 perf_event_count(struct perf_event *event) * will not be local and we cannot read them atomically * - must not have a pmu::count method */ -int perf_event_read_local(struct perf_event *event, u64 *value) +int perf_event_read_local(struct perf_event *event, u64 *value, + u64 *enabled, u64 *running) { unsigned long flags; int ret = 0; int local_cpu = smp_processor_id(); bool readable = cpumask_test_cpu(local_cpu, &event->readable_on_cpus); + u64 now; + /* * Disabling interrupts avoids all counter scheduling (context * switches, timer based rotation and IPIs). @@ -3883,13 +3888,21 @@ int perf_event_read_local(struct perf_event *event, u64 *value) goto out; } + now = event->shadow_ctx_time + perf_clock(); + if (enabled) + *enabled = now - event->tstamp_enabled; /* * If the event is currently on this CPU, its either a per-task event, * or local to this CPU. Furthermore it means its ACTIVE (otherwise * oncpu == -1). */ - if (event->oncpu == smp_processor_id() || readable) + if (event->oncpu == smp_processor_id() || readable) { event->pmu->read(event); + if (running) + *running = now - event->tstamp_running; + } else if (running) { + *running = event->total_time_running; + } *value = local64_read(&event->count); out: @@ -4172,7 +4185,7 @@ static bool is_sb_event(struct perf_event *event) if (attr->mmap || attr->mmap_data || attr->mmap2 || attr->comm || attr->comm_exec || - attr->task || + attr->task || attr->ksymbol || attr->context_switch) return true; return false; @@ -4242,6 +4255,10 @@ static void unaccount_event(struct perf_event *event) dec = true; if (has_branch_stack(event)) dec = true; + if (event->attr.ksymbol) + atomic_dec(&nr_ksymbol_events); + if (event->attr.bpf_event) + atomic_dec(&nr_bpf_events); if (dec) { if (!atomic_add_unless(&perf_sched_count, -1, 1)) @@ -5104,6 +5121,7 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon rcu_read_unlock(); return 0; } + case PERF_EVENT_IOC_QUERY_BPF: return perf_event_query_prog_array(event, (void __user *)arg); default: @@ -7107,7 +7125,7 @@ static void perf_fill_ns_link_info(struct perf_ns_link_info *ns_link_info, { struct path ns_path; struct inode *ns_inode; - int error; + void *error; error = ns_get_path(&ns_path, task, ns_ops); if (!error) { @@ -7716,6 +7734,207 @@ static void perf_log_throttle(struct perf_event *event, int enable) perf_output_end(&handle); } +/* + * ksymbol register/unregister tracking + */ + +struct perf_ksymbol_event { + const char *name; + int name_len; + struct { + struct perf_event_header header; + u64 addr; + u32 len; + u16 ksym_type; + u16 flags; + } event_id; +}; + +static int perf_event_ksymbol_match(struct perf_event *event) +{ + return event->attr.ksymbol; +} + +static void perf_event_ksymbol_output(struct perf_event *event, void *data) +{ + struct perf_ksymbol_event *ksymbol_event = data; + struct perf_output_handle handle; + struct perf_sample_data sample; + int ret; + + if (!perf_event_ksymbol_match(event)) + return; + + perf_event_header__init_id(&ksymbol_event->event_id.header, + &sample, event); + ret = perf_output_begin(&handle, event, + ksymbol_event->event_id.header.size); + if (ret) + return; + + perf_output_put(&handle, ksymbol_event->event_id); + __output_copy(&handle, ksymbol_event->name, ksymbol_event->name_len); + perf_event__output_id_sample(event, &handle, &sample); + + perf_output_end(&handle); +} + +void perf_event_ksymbol(u16 ksym_type, u64 addr, u32 len, bool unregister, + const char *sym) +{ + struct perf_ksymbol_event ksymbol_event; + char name[KSYM_NAME_LEN]; + u16 flags = 0; + int name_len; + + if (!atomic_read(&nr_ksymbol_events)) + return; + + if (ksym_type >= PERF_RECORD_KSYMBOL_TYPE_MAX || + ksym_type == PERF_RECORD_KSYMBOL_TYPE_UNKNOWN) + goto err; + + strlcpy(name, sym, KSYM_NAME_LEN); + name_len = strlen(name) + 1; + while (!IS_ALIGNED(name_len, sizeof(u64))) + name[name_len++] = '\0'; + BUILD_BUG_ON(KSYM_NAME_LEN % sizeof(u64)); + + if (unregister) + flags |= PERF_RECORD_KSYMBOL_FLAGS_UNREGISTER; + + ksymbol_event = (struct perf_ksymbol_event){ + .name = name, + .name_len = name_len, + .event_id = { + .header = { + .type = PERF_RECORD_KSYMBOL, + .size = sizeof(ksymbol_event.event_id) + + name_len, + }, + .addr = addr, + .len = len, + .ksym_type = ksym_type, + .flags = flags, + }, + }; + + perf_iterate_sb(perf_event_ksymbol_output, &ksymbol_event, NULL); + return; +err: + WARN_ONCE(1, "%s: Invalid KSYMBOL type 0x%x\n", __func__, ksym_type); +} + +/* + * bpf program load/unload tracking + */ + +struct perf_bpf_event { + struct bpf_prog *prog; + struct { + struct perf_event_header header; + u16 type; + u16 flags; + u32 id; + u8 tag[BPF_TAG_SIZE]; + } event_id; +}; + +static int perf_event_bpf_match(struct perf_event *event) +{ + return event->attr.bpf_event; +} + +static void perf_event_bpf_output(struct perf_event *event, void *data) +{ + struct perf_bpf_event *bpf_event = data; + struct perf_output_handle handle; + struct perf_sample_data sample; + int ret; + + if (!perf_event_bpf_match(event)) + return; + + perf_event_header__init_id(&bpf_event->event_id.header, + &sample, event); + ret = perf_output_begin(&handle, event, + bpf_event->event_id.header.size); + if (ret) + return; + + perf_output_put(&handle, bpf_event->event_id); + perf_event__output_id_sample(event, &handle, &sample); + + perf_output_end(&handle); +} + +static void perf_event_bpf_emit_ksymbols(struct bpf_prog *prog, + enum perf_bpf_event_type type) +{ + bool unregister = type == PERF_BPF_EVENT_PROG_UNLOAD; + char sym[KSYM_NAME_LEN]; + int i; + + if (prog->aux->func_cnt == 0) { + bpf_get_prog_name(prog, sym); + perf_event_ksymbol(PERF_RECORD_KSYMBOL_TYPE_BPF, + (u64)(unsigned long)prog->bpf_func, + prog->jited_len, unregister, sym); + } else { + for (i = 0; i < prog->aux->func_cnt; i++) { + struct bpf_prog *subprog = prog->aux->func[i]; + + bpf_get_prog_name(subprog, sym); + perf_event_ksymbol( + PERF_RECORD_KSYMBOL_TYPE_BPF, + (u64)(unsigned long)subprog->bpf_func, + subprog->jited_len, unregister, sym); + } + } +} + +void perf_event_bpf_event(struct bpf_prog *prog, + enum perf_bpf_event_type type, + u16 flags) +{ + struct perf_bpf_event bpf_event; + + if (type <= PERF_BPF_EVENT_UNKNOWN || + type >= PERF_BPF_EVENT_MAX) + return; + + switch (type) { + case PERF_BPF_EVENT_PROG_LOAD: + case PERF_BPF_EVENT_PROG_UNLOAD: + if (atomic_read(&nr_ksymbol_events)) + perf_event_bpf_emit_ksymbols(prog, type); + break; + default: + break; + } + + if (!atomic_read(&nr_bpf_events)) + return; + + bpf_event = (struct perf_bpf_event){ + .prog = prog, + .event_id = { + .header = { + .type = PERF_RECORD_BPF_EVENT, + .size = sizeof(bpf_event.event_id), + }, + .type = type, + .flags = flags, + .id = prog->aux->id, + }, + }; + + BUILD_BUG_ON(BPF_TAG_SIZE % sizeof(u64)); + + memcpy(bpf_event.event_id.tag, prog->tag, BPF_TAG_SIZE); + perf_iterate_sb(perf_event_bpf_output, &bpf_event, NULL); +} + void perf_event_itrace_started(struct perf_event *event) { event->attach_state |= PERF_ATTACH_ITRACE; @@ -8442,9 +8661,119 @@ static struct pmu perf_tracepoint = { .events_across_hotplug = 1, }; +#if defined(CONFIG_KPROBE_EVENTS) || defined(CONFIG_UPROBE_EVENTS) +/* + * Flags in config, used by dynamic PMU kprobe and uprobe + * The flags should match following PMU_FORMAT_ATTR(). + * + * PERF_PROBE_CONFIG_IS_RETPROBE if set, create kretprobe/uretprobe + * if not set, create kprobe/uprobe + */ +enum perf_probe_config { + PERF_PROBE_CONFIG_IS_RETPROBE = 1U << 0, /* [k,u]retprobe */ +}; + +PMU_FORMAT_ATTR(retprobe, "config:0"); + +static struct attribute *probe_attrs[] = { + &format_attr_retprobe.attr, + NULL, +}; + +static struct attribute_group probe_format_group = { + .name = "format", + .attrs = probe_attrs, +}; + +static const struct attribute_group *probe_attr_groups[] = { + &probe_format_group, + NULL, +}; +#endif + +#ifdef CONFIG_KPROBE_EVENTS +static int perf_kprobe_event_init(struct perf_event *event); +static struct pmu perf_kprobe = { + .task_ctx_nr = perf_sw_context, + .event_init = perf_kprobe_event_init, + .add = perf_trace_add, + .del = perf_trace_del, + .start = perf_swevent_start, + .stop = perf_swevent_stop, + .read = perf_swevent_read, + .attr_groups = probe_attr_groups, +}; + +static int perf_kprobe_event_init(struct perf_event *event) +{ + int err; + bool is_retprobe; + + if (event->attr.type != perf_kprobe.type) + return -ENOENT; + /* + * no branch sampling for probe events + */ + if (has_branch_stack(event)) + return -EOPNOTSUPP; + + is_retprobe = event->attr.config & PERF_PROBE_CONFIG_IS_RETPROBE; + err = perf_kprobe_init(event, is_retprobe); + if (err) + return err; + + event->destroy = perf_kprobe_destroy; + + return 0; +} +#endif /* CONFIG_KPROBE_EVENTS */ + +#ifdef CONFIG_UPROBE_EVENTS +static int perf_uprobe_event_init(struct perf_event *event); +static struct pmu perf_uprobe = { + .task_ctx_nr = perf_sw_context, + .event_init = perf_uprobe_event_init, + .add = perf_trace_add, + .del = perf_trace_del, + .start = perf_swevent_start, + .stop = perf_swevent_stop, + .read = perf_swevent_read, + .attr_groups = probe_attr_groups, +}; + +static int perf_uprobe_event_init(struct perf_event *event) +{ + int err; + bool is_retprobe; + + if (event->attr.type != perf_uprobe.type) + return -ENOENT; + /* + * no branch sampling for probe events + */ + if (has_branch_stack(event)) + return -EOPNOTSUPP; + + is_retprobe = event->attr.config & PERF_PROBE_CONFIG_IS_RETPROBE; + err = perf_uprobe_init(event, is_retprobe); + if (err) + return err; + + event->destroy = perf_uprobe_destroy; + + return 0; +} +#endif /* CONFIG_UPROBE_EVENTS */ + static inline void perf_tp_register(void) { perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT); +#ifdef CONFIG_KPROBE_EVENTS + perf_pmu_register(&perf_kprobe, "kprobe", -1); +#endif +#ifdef CONFIG_UPROBE_EVENTS + perf_pmu_register(&perf_uprobe, "uprobe", -1); +#endif } static void perf_event_free_filter(struct perf_event *event) @@ -8460,6 +8789,7 @@ static void bpf_overflow_handler(struct perf_event *event, struct bpf_perf_event_data_kern ctx = { .data = data, .regs = regs, + .event = event, }; int ret = 0; @@ -8520,13 +8850,32 @@ static void perf_event_free_bpf_handler(struct perf_event *event) } #endif +/* + * returns true if the event is a tracepoint, or a kprobe/upprobe created + * with perf_event_open() + */ +static inline bool perf_event_is_tracing(struct perf_event *event) +{ + if (event->pmu == &perf_tracepoint) + return true; +#ifdef CONFIG_KPROBE_EVENTS + if (event->pmu == &perf_kprobe) + return true; +#endif +#ifdef CONFIG_UPROBE_EVENTS + if (event->pmu == &perf_uprobe) + return true; +#endif + return false; +} + static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd) { bool is_kprobe, is_tracepoint, is_syscall_tp; struct bpf_prog *prog; int ret; - if (event->attr.type != PERF_TYPE_TRACEPOINT) + if (!perf_event_is_tracing(event)) return perf_event_set_bpf_handler(event, prog_fd); is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_UKPROBE; @@ -8548,6 +8897,13 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd) return -EINVAL; } + /* Kprobe override only works for kprobes, not uprobes. */ + if (prog->kprobe_override && + !(event->tp_event->flags & TRACE_EVENT_FL_KPROBE)) { + bpf_prog_put(prog); + return -EINVAL; + } + if (is_tracepoint || is_syscall_tp) { int off = trace_event_get_offsets(event->tp_event); @@ -8565,7 +8921,7 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd) static void perf_event_free_bpf_prog(struct perf_event *event) { - if (event->attr.type != PERF_TYPE_TRACEPOINT) { + if (!perf_event_is_tracing(event)) { perf_event_free_bpf_handler(event); return; } @@ -8985,23 +9341,34 @@ fail_clear_files: static int perf_event_set_filter(struct perf_event *event, void __user *arg) { - char *filter_str; int ret = -EINVAL; - - if ((event->attr.type != PERF_TYPE_TRACEPOINT || - !IS_ENABLED(CONFIG_EVENT_TRACING)) && - !has_addr_filter(event)) - return -EINVAL; + char *filter_str; filter_str = strndup_user(arg, PAGE_SIZE); if (IS_ERR(filter_str)) return PTR_ERR(filter_str); - if (IS_ENABLED(CONFIG_EVENT_TRACING) && - event->attr.type == PERF_TYPE_TRACEPOINT) - ret = ftrace_profile_set_filter(event, event->attr.config, - filter_str); - else if (has_addr_filter(event)) +#ifdef CONFIG_EVENT_TRACING + if (perf_event_is_tracing(event)) { + struct perf_event_context *ctx = event->ctx; + + /* + * Beware, here be dragons!! + * + * the tracepoint muck will deadlock against ctx->mutex, but + * the tracepoint stuff does not actually need it. So + * temporarily drop ctx->mutex. As per perf_event_ctx_lock() we + * already have a reference on ctx. + * + * This can result in event getting moved to a different ctx, + * but that does not affect the tracepoint state. + */ + mutex_unlock(&ctx->mutex); + ret = ftrace_profile_set_filter(event, event->attr.config, filter_str); + mutex_lock(&ctx->mutex); + } else +#endif + if (has_addr_filter(event)) ret = perf_event_set_addr_filter(event, filter_str); kfree(filter_str); @@ -9785,6 +10152,10 @@ static void account_event(struct perf_event *event) inc = true; if (is_cgroup_event(event)) inc = true; + if (event->attr.ksymbol) + atomic_inc(&nr_ksymbol_events); + if (event->attr.bpf_event) + atomic_inc(&nr_bpf_events); if (inc) { if (atomic_inc_not_zero(&perf_sched_count)) @@ -10397,11 +10768,11 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id) break; case CLOCK_BOOTTIME: - event->clock = &ktime_get_boot_ns; + event->clock = &ktime_get_boottime_ns; break; case CLOCK_TAI: - event->clock = &ktime_get_tai_ns; + event->clock = &ktime_get_clocktai_ns; break; default: @@ -11356,6 +11727,14 @@ struct file *perf_event_get(unsigned int fd) return file; } +const struct perf_event *perf_get_event(struct file *file) +{ + if (file->f_op != &perf_fops) + return ERR_PTR(-EINVAL); + + return file->private_data; +} + const struct perf_event_attr *perf_event_attrs(struct perf_event *event) { if (!event) diff --git a/kernel/exit.c b/kernel/exit.c index 46a0871fb2ff..c082bdca0189 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -939,6 +939,7 @@ void __noreturn do_exit(long code) exit_task_namespaces(tsk); exit_task_work(tsk); exit_thread(tsk); + exit_umh(tsk); /* * Flush inherited counters to the parent - before the parent diff --git a/kernel/fork.c b/kernel/fork.c index c086a57371be..6e3637bc035a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -838,8 +838,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, init_rwsem(&mm->mmap_sem); INIT_LIST_HEAD(&mm->mmlist); mm->core_state = NULL; - atomic_long_set(&mm->nr_ptes, 0); - mm_nr_pmds_init(mm); + mm_pgtables_bytes_init(mm); mm->map_count = 0; mm->locked_vm = 0; mm->pinned_vm = 0; @@ -901,12 +900,9 @@ static void check_mm(struct mm_struct *mm) "mm:%p idx:%d val:%ld\n", mm, i, x); } - if (atomic_long_read(&mm->nr_ptes)) - pr_alert("BUG: non-zero nr_ptes on freeing mm: %ld\n", - atomic_long_read(&mm->nr_ptes)); - if (mm_nr_pmds(mm)) - pr_alert("BUG: non-zero nr_pmds on freeing mm: %ld\n", - mm_nr_pmds(mm)); + if (mm_pgtables_bytes(mm)) + pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n", + mm_pgtables_bytes(mm)); #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS VM_BUG_ON_MM(mm->pmd_huge_pte, mm); @@ -2149,7 +2145,7 @@ static __latent_entropy struct task_struct *copy_process( */ p->start_time = ktime_get_ns(); - p->real_start_time = ktime_get_boot_ns(); + p->real_start_time = ktime_get_boottime_ns(); /* * Make it visible to the rest of the system, but dont wake it up yet. diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index 971c3cc540ba..59d5dbf7f534 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -12,7 +12,6 @@ * compression (see scripts/kallsyms.c for a more complete description) */ #include -#include #include #include #include @@ -20,14 +19,11 @@ #include #include #include /* for cond_resched */ -#include #include #include #include #include -#include - #include #ifdef CONFIG_KALLSYMS_ALL @@ -89,37 +85,6 @@ void sec_debug_summary_set_kallsyms_info( } #endif -static inline int is_kernel_inittext(unsigned long addr) -{ - if (addr >= (unsigned long)_sinittext - && addr <= (unsigned long)_einittext) - return 1; - return 0; -} - -static inline int is_kernel_text(unsigned long addr) -{ - if ((addr >= (unsigned long)_stext && addr <= (unsigned long)_etext) || - arch_is_kernel_text(addr)) - return 1; - return in_gate_area_no_mm(addr); -} - -static inline int is_kernel(unsigned long addr) -{ - if (addr >= (unsigned long)_stext && addr <= (unsigned long)_end) - return 1; - return in_gate_area_no_mm(addr); -} - -static int is_ksym_addr(unsigned long addr) -{ - if (IS_ENABLED(CONFIG_KALLSYMS_ALL)) - return is_kernel(addr); - - return is_kernel_text(addr) || is_kernel_inittext(addr); -} - /* * Expand a compressed symbol data into the resulting uncompressed string, * if uncompressed string is too long (>= maxlen), it will be truncated, @@ -727,19 +692,20 @@ static inline int kallsyms_for_perf(void) * Otherwise, require CAP_SYSLOG (assuming kptr_restrict isn't set to * block even that). */ -int kallsyms_show_value(void) +bool kallsyms_show_value(const struct cred *cred) { switch (kptr_restrict) { case 0: if (kallsyms_for_perf()) - return 1; + return true; /* fallthrough */ case 1: - if (has_capability_noaudit(current, CAP_SYSLOG)) - return 1; + if (security_capable(cred, &init_user_ns, CAP_SYSLOG, + CAP_OPT_NOAUDIT) == 0) + return true; /* fallthrough */ default: - return 0; + return false; } } @@ -756,7 +722,11 @@ static int kallsyms_open(struct inode *inode, struct file *file) return -ENOMEM; reset_iter(iter, 0); - iter->show_value = kallsyms_show_value(); + /* + * Instead of checking this on every s_show() call, cache + * the result here at open time. + */ + iter->show_value = kallsyms_show_value(file->f_cred); return 0; } diff --git a/kernel/kprobes.c b/kernel/kprobes.c index e86bbcb849ac..ee058734fb43 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -2375,6 +2375,7 @@ static void report_probe(struct seq_file *pi, struct kprobe *p, const char *sym, int offset, char *modname, struct kprobe *pp) { char *kprobe_type; + void *addr = p->addr; if (p->pre_handler == pre_handler_kretprobe) kprobe_type = "r"; @@ -2383,13 +2384,16 @@ static void report_probe(struct seq_file *pi, struct kprobe *p, else kprobe_type = "k"; + if (!kallsyms_show_value(current_cred())) + addr = NULL; + if (sym) - seq_printf(pi, "%p %s %s+0x%x %s ", - p->addr, kprobe_type, sym, offset, + seq_printf(pi, "%px %s %s+0x%x %s ", + addr, kprobe_type, sym, offset, (modname ? modname : " ")); - else - seq_printf(pi, "%p %s %p ", - p->addr, kprobe_type, p->addr); + else /* try to use %pS */ + seq_printf(pi, "%px %s %pS ", + addr, kprobe_type, p->addr); if (!pp) pp = p; @@ -2477,8 +2481,16 @@ static int kprobe_blacklist_seq_show(struct seq_file *m, void *v) struct kprobe_blacklist_entry *ent = list_entry(v, struct kprobe_blacklist_entry, list); - seq_printf(m, "0x%px-0x%px\t%ps\n", (void *)ent->start_addr, - (void *)ent->end_addr, (void *)ent->start_addr); + /* + * If /proc/kallsyms is not showing kernel address, we won't + * show them here either. + */ + if (!kallsyms_show_value(current_cred())) + seq_printf(m, "0x%px-0x%px\t%ps\n", NULL, NULL, + (void *)ent->start_addr); + else + seq_printf(m, "0x%px-0x%px\t%ps\n", (void *)ent->start_addr, + (void *)ent->end_addr, (void *)ent->start_addr); return 0; } diff --git a/kernel/module.c b/kernel/module.c index 275dc6ccc6f3..9e80cb64751c 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3241,6 +3241,11 @@ static int find_module_sections(struct module *mod, struct load_info *info) sizeof(*mod->tracepoints_ptrs), &mod->num_tracepoints); #endif +#ifdef CONFIG_BPF_EVENTS + mod->bpf_raw_events = section_objs(info, "__bpf_raw_tp_map", + sizeof(*mod->bpf_raw_events), + &mod->num_bpf_raw_events); +#endif #ifdef HAVE_JUMP_LABEL mod->jump_entries = section_objs(info, "__jump_table", sizeof(*mod->jump_entries), @@ -3265,7 +3270,11 @@ static int find_module_sections(struct module *mod, struct load_info *info) sizeof(*mod->ftrace_callsites), &mod->num_ftrace_callsites); #endif - +#ifdef CONFIG_FUNCTION_ERROR_INJECTION + mod->ei_funcs = section_objs(info, "_error_injection_whitelist", + sizeof(*mod->ei_funcs), + &mod->num_ei_funcs); +#endif mod->extable = section_objs(info, "__ex_table", sizeof(*mod->extable), &mod->num_exentries); @@ -4423,7 +4432,7 @@ static int modules_open(struct inode *inode, struct file *file) if (!err) { struct seq_file *m = file->private_data; - m->private = kallsyms_show_value() ? NULL : (void *)8ul; + m->private = kallsyms_show_value(current_cred()) ? NULL : (void *)8ul; } return 0; diff --git a/kernel/ptrace.c b/kernel/ptrace.c index de0f4e93ce32..7269b0b1105b 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -275,17 +275,11 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) return ret; } -static bool ptrace_has_cap(const struct cred *cred, struct user_namespace *ns, - unsigned int mode) +static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode) { - int ret; - if (mode & PTRACE_MODE_NOAUDIT) - ret = security_capable(cred, ns, CAP_SYS_PTRACE); - else - ret = security_capable(cred, ns, CAP_SYS_PTRACE); - - return ret == 0; + return ns_capable_noaudit(ns, CAP_SYS_PTRACE); + return ns_capable(ns, CAP_SYS_PTRACE); } /* Returns 0 on success, -errno on denial. */ @@ -337,7 +331,7 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode) gid_eq(caller_gid, tcred->sgid) && gid_eq(caller_gid, tcred->gid)) goto ok; - if (ptrace_has_cap(cred, tcred->user_ns, mode)) + if (ptrace_has_cap(tcred->user_ns, mode)) goto ok; rcu_read_unlock(); return -EPERM; @@ -356,7 +350,7 @@ ok: mm = task->mm; if (mm && ((get_dumpable(mm) != SUID_DUMP_USER) && - !ptrace_has_cap(cred, mm->user_ns, mode))) + !ptrace_has_cap(mm->user_ns, mode))) return -EPERM; return security_ptrace_access_check(task, mode); diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index e1a49c6e58ea..542e52f6cd86 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -7,6 +7,17 @@ menu "RCU Debugging" config PROVE_RCU def_bool PROVE_LOCKING +config PROVE_RCU_LIST + bool "RCU list lockdep debugging" + depends on PROVE_RCU && RCU_EXPERT + default n + help + Enable RCU lockdep checking for list usages. By default it is + turned off since there are several list RCU users that still + need to be converted to pass a lockdep expression. To prevent + false-positive splats, we keep it default disabled but once all + users are converted, we can remove this config option. + config TORTURE_TEST tristate default n diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 692506c8fe1a..5a28ff21b3c5 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -254,6 +254,13 @@ void rcu_bh_qs(void) } } +void rcu_softirq_qs(void) +{ + rcu_sched_qs(); + rcu_preempt_qs(); + rcu_preempt_deferred_qs(current); +} + /* * Steal a bit from the bottom of ->dynticks for idle entry/exit * control. Initially this is for TLB flushing. @@ -443,6 +450,7 @@ static void rcu_momentary_dyntick_idle(void) { raw_cpu_write(rcu_dynticks.rcu_need_heavy_qs, false); rcu_dynticks_momentary_idle(); + rcu_preempt_deferred_qs(current); } /* @@ -791,6 +799,7 @@ static void rcu_eqs_enter_common(bool user) do_nocb_deferred_wakeup(rdp); } rcu_prepare_for_idle(); + rcu_preempt_deferred_qs(current); __this_cpu_inc(disable_rcu_irq_enter); rdtp->dynticks_nesting = 0; /* Breaks tracing momentarily. */ rcu_dynticks_eqs_enter(); /* After this, tracing works again. */ @@ -2909,6 +2918,12 @@ __rcu_process_callbacks(struct rcu_state *rsp) WARN_ON_ONCE(!rdp->beenonline); + /* Report any deferred quiescent states if preemption enabled. */ + if (!(preempt_count() & PREEMPT_MASK)) + rcu_preempt_deferred_qs(current); + else if (rcu_preempt_need_deferred_qs(current)) + resched_cpu(rdp->cpu); /* Provoke future context switch. */ + /* Update RCU state based on any recent quiescent states. */ rcu_check_quiescent_state(rsp, rdp); @@ -3876,6 +3891,7 @@ void rcu_report_dead(unsigned int cpu) rcu_report_exp_rdp(&rcu_sched_state, this_cpu_ptr(rcu_sched_state.rda), true); preempt_enable(); + rcu_preempt_deferred_qs(current); for_each_rcu_flavor(rsp) rcu_cleanup_dying_idle_cpu(cpu, rsp); } diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 8e1f285f0a70..f25b2e52ae08 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -204,6 +204,7 @@ struct rcu_data { bool core_needs_qs; /* Core waits for quiesc state. */ bool beenonline; /* CPU online at least once. */ bool gpwrap; /* Possible gpnum/completed wrap. */ + bool deferred_qs; /* This CPU awaiting a deferred QS? */ struct rcu_node *mynode; /* This CPU's leaf of hierarchy */ unsigned long grpmask; /* Mask to apply to leaf qsmask. */ unsigned long ticks_this_gp; /* The number of scheduling-clock */ @@ -447,6 +448,7 @@ DECLARE_PER_CPU(char, rcu_cpu_has_work); /* Forward declarations for rcutree_plugin.h */ static void rcu_bootup_announce(void); +static void rcu_preempt_qs(void); static void rcu_preempt_note_context_switch(bool preempt); static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp); #ifdef CONFIG_HOTPLUG_CPU @@ -474,6 +476,8 @@ static void rcu_cleanup_after_idle(void); static void rcu_prepare_for_idle(void); static void rcu_idle_count_callbacks_posted(void); static bool rcu_preempt_has_tasks(struct rcu_node *rnp); +static bool rcu_preempt_need_deferred_qs(struct task_struct *t); +static void rcu_preempt_deferred_qs(struct task_struct *t); static void print_cpu_stall_info_begin(void); static void print_cpu_stall_info(struct rcu_state *rsp, int cpu); static void print_cpu_stall_info_end(void); diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 81b4d4fd1277..161e31de2c61 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -235,6 +235,7 @@ static void rcu_report_exp_cpu_mult(struct rcu_state *rsp, struct rcu_node *rnp, static void rcu_report_exp_rdp(struct rcu_state *rsp, struct rcu_data *rdp, bool wake) { + WRITE_ONCE(rdp->deferred_qs, false); rcu_report_exp_cpu_mult(rsp, rdp->mynode, rdp->grpmask, wake); } @@ -666,32 +667,70 @@ EXPORT_SYMBOL_GPL(synchronize_sched_expedited); */ static void sync_rcu_exp_handler(void *info) { - struct rcu_data *rdp; + unsigned long flags; struct rcu_state *rsp = info; + struct rcu_data *rdp = this_cpu_ptr(rsp->rda); + struct rcu_node *rnp = rdp->mynode; struct task_struct *t = current; /* - * Within an RCU read-side critical section, request that the next - * rcu_read_unlock() report. Unless this RCU read-side critical - * section has already blocked, in which case it is already set - * up for the expedited grace period to wait on it. + * First, the common case of not being in an RCU read-side + * critical section. If also enabled or idle, immediately + * report the quiescent state, otherwise defer. */ - if (t->rcu_read_lock_nesting > 0 && - !t->rcu_read_unlock_special.b.blocked) { - t->rcu_read_unlock_special.b.exp_need_qs = true; + if (!t->rcu_read_lock_nesting) { + if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) || + rcu_dynticks_curr_cpu_in_eqs()) { + rcu_report_exp_rdp(rsp, rdp, true); + } else { + rdp->deferred_qs = true; + resched_cpu(rdp->cpu); + } return; } /* - * We are either exiting an RCU read-side critical section (negative - * values of t->rcu_read_lock_nesting) or are not in one at all - * (zero value of t->rcu_read_lock_nesting). Or we are in an RCU - * read-side critical section that blocked before this expedited - * grace period started. Either way, we can immediately report - * the quiescent state. + * Second, the less-common case of being in an RCU read-side + * critical section. In this case we can count on a future + * rcu_read_unlock(). However, this rcu_read_unlock() might + * execute on some other CPU, but in that case there will be + * a future context switch. Either way, if the expedited + * grace period is still waiting on this CPU, set ->deferred_qs + * so that the eventual quiescent state will be reported. + * Note that there is a large group of race conditions that + * can have caused this quiescent state to already have been + * reported, so we really do need to check ->expmask. */ - rdp = this_cpu_ptr(rsp->rda); - rcu_report_exp_rdp(rsp, rdp, true); + if (t->rcu_read_lock_nesting > 0) { + raw_spin_lock_irqsave_rcu_node(rnp, flags); + if (rnp->expmask & rdp->grpmask) + rdp->deferred_qs = true; + raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + } + + /* + * The final and least likely case is where the interrupted + * code was just about to or just finished exiting the RCU-preempt + * read-side critical section, and no, we can't tell which. + * So either way, set ->deferred_qs to flag later code that + * a quiescent state is required. + * + * If the CPU is fully enabled (or if some buggy RCU-preempt + * read-side critical section is being used from idle), just + * invoke rcu_preempt_defer_qs() to immediately report the + * quiescent state. We cannot use rcu_read_unlock_special() + * because we are in an interrupt handler, which will cause that + * function to take an early exit without doing anything. + * + * Otherwise, use resched_cpu() to force a context switch after + * the CPU enables everything. + */ + rdp->deferred_qs = true; + if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) || + WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) + rcu_preempt_deferred_qs(t); + else + resched_cpu(rdp->cpu); } /** diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 8b3102d22823..cc1c4fcfbd0a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -358,6 +358,9 @@ static void rcu_preempt_note_context_switch(bool preempt) * behalf of preempted instance of __rcu_read_unlock(). */ rcu_read_unlock_special(t); + rcu_preempt_deferred_qs(t); + } else { + rcu_preempt_deferred_qs(t); } /* @@ -407,54 +410,51 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp) } /* - * Handle special cases during rcu_read_unlock(), such as needing to - * notify RCU core processing or task having blocked during the RCU - * read-side critical section. + * Report deferred quiescent states. The deferral time can + * be quite short, for example, in the case of the call from + * rcu_read_unlock_special(). */ -void rcu_read_unlock_special(struct task_struct *t) +static void +rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags) { bool empty_exp; bool empty_norm; bool empty_exp_now; - unsigned long flags; struct list_head *np; bool drop_boost_mutex = false; struct rcu_data *rdp; struct rcu_node *rnp; union rcu_special special; - /* NMI handlers cannot block and cannot safely manipulate state. */ - if (in_nmi()) - return; - - local_irq_save(flags); - /* * If RCU core is waiting for this CPU to exit its critical section, * report the fact that it has exited. Because irqs are disabled, * t->rcu_read_unlock_special cannot change. */ special = t->rcu_read_unlock_special; + rdp = this_cpu_ptr(rcu_state_p->rda); + if (!special.s && !rdp->deferred_qs) { + local_irq_restore(flags); + return; + } if (special.b.need_qs) { rcu_preempt_qs(); t->rcu_read_unlock_special.b.need_qs = false; - if (!t->rcu_read_unlock_special.s) { + if (!t->rcu_read_unlock_special.s && !rdp->deferred_qs) { local_irq_restore(flags); return; } } /* - * Respond to a request for an expedited grace period, but only if - * we were not preempted, meaning that we were running on the same - * CPU throughout. If we were preempted, the exp_need_qs flag - * would have been cleared at the time of the first preemption, - * and the quiescent state would be reported when we were dequeued. + * Respond to a request by an expedited grace period for a + * quiescent state from this CPU. Note that requests from + * tasks are handled when removing the task from the + * blocked-tasks list below. */ - if (special.b.exp_need_qs) { - WARN_ON_ONCE(special.b.blocked); + if (special.b.exp_need_qs || rdp->deferred_qs) { t->rcu_read_unlock_special.b.exp_need_qs = false; - rdp = this_cpu_ptr(rcu_state_p->rda); + rdp->deferred_qs = false; rcu_report_exp_rdp(rcu_state_p, rdp, true); if (!t->rcu_read_unlock_special.s) { local_irq_restore(flags); @@ -462,19 +462,6 @@ void rcu_read_unlock_special(struct task_struct *t) } } - /* Hardware IRQ handlers cannot block, complain if they get here. */ - if (in_irq() || in_serving_softirq()) { - lockdep_rcu_suspicious(__FILE__, __LINE__, - "rcu_read_unlock() from irq or softirq with blocking in critical section!!!\n"); - pr_alert("->rcu_read_unlock_special: %#x (b: %d, enq: %d nq: %d)\n", - t->rcu_read_unlock_special.s, - t->rcu_read_unlock_special.b.blocked, - t->rcu_read_unlock_special.b.exp_need_qs, - t->rcu_read_unlock_special.b.need_qs); - local_irq_restore(flags); - return; - } - /* Clean up if blocked during RCU read-side critical section. */ if (special.b.blocked) { t->rcu_read_unlock_special.b.blocked = false; @@ -543,6 +530,72 @@ void rcu_read_unlock_special(struct task_struct *t) } } +/* + * Is a deferred quiescent-state pending, and are we also not in + * an RCU read-side critical section? It is the caller's responsibility + * to ensure it is otherwise safe to report any deferred quiescent + * states. The reason for this is that it is safe to report a + * quiescent state during context switch even though preemption + * is disabled. This function cannot be expected to understand these + * nuances, so the caller must handle them. + */ +static bool rcu_preempt_need_deferred_qs(struct task_struct *t) +{ + return (this_cpu_ptr(&rcu_preempt_data)->deferred_qs || + READ_ONCE(t->rcu_read_unlock_special.s)) && + !t->rcu_read_lock_nesting; +} + +/* + * Report a deferred quiescent state if needed and safe to do so. + * As with rcu_preempt_need_deferred_qs(), "safe" involves only + * not being in an RCU read-side critical section. The caller must + * evaluate safety in terms of interrupt, softirq, and preemption + * disabling. + */ +static void rcu_preempt_deferred_qs(struct task_struct *t) +{ + unsigned long flags; + bool couldrecurse = t->rcu_read_lock_nesting >= 0; + + if (!rcu_preempt_need_deferred_qs(t)) + return; + if (couldrecurse) + t->rcu_read_lock_nesting -= INT_MIN; + local_irq_save(flags); + rcu_preempt_deferred_qs_irqrestore(t, flags); + if (couldrecurse) + t->rcu_read_lock_nesting += INT_MIN; +} + +/* + * Handle special cases during rcu_read_unlock(), such as needing to + * notify RCU core processing or task having blocked during the RCU + * read-side critical section. + */ +void rcu_read_unlock_special(struct task_struct *t) +{ + unsigned long flags; + bool preempt_bh_were_disabled = + !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)); + bool irqs_were_disabled; + + /* NMI handlers cannot block and cannot safely manipulate state. */ + if (in_nmi()) + return; + + local_irq_save(flags); + irqs_were_disabled = irqs_disabled_flags(flags); + if ((preempt_bh_were_disabled || irqs_were_disabled) && + t->rcu_read_unlock_special.b.blocked) { + /* Need to defer quiescent state until everything is enabled. */ + raise_softirq_irqoff(RCU_SOFTIRQ); + local_irq_restore(flags); + return; + } + rcu_preempt_deferred_qs_irqrestore(t, flags); +} + /* * Dump detailed information for all tasks blocking the current RCU * grace period on the specified rcu_node structure. @@ -674,10 +727,20 @@ static void rcu_preempt_check_callbacks(void) { struct task_struct *t = current; - if (t->rcu_read_lock_nesting == 0) { - rcu_preempt_qs(); + if (t->rcu_read_lock_nesting > 0 || + (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) { + /* No QS, force context switch if deferred. */ + if (rcu_preempt_need_deferred_qs(t)) + resched_cpu(smp_processor_id()); + } else if (rcu_preempt_need_deferred_qs(t)) { + rcu_preempt_deferred_qs(t); /* Report deferred QS. */ + return; + } else if (!t->rcu_read_lock_nesting) { + rcu_preempt_qs(); /* Report immediate QS. */ return; } + + /* If GP is oldish, ask for help from rcu_read_unlock_special(). */ if (t->rcu_read_lock_nesting > 0 && __this_cpu_read(rcu_data_p->core_needs_qs) && __this_cpu_read(rcu_data_p->cpu_no_qs.b.norm)) @@ -803,6 +866,7 @@ void exit_rcu(void) barrier(); t->rcu_read_unlock_special.b.blocked = true; __rcu_read_unlock(); + rcu_preempt_deferred_qs(current); } #else /* #ifdef CONFIG_PREEMPT_RCU */ @@ -818,6 +882,11 @@ static void __init rcu_bootup_announce(void) rcu_bootup_announce_oddness(); } +/* Because preemptible RCU does not exist, we can ignore its QSes. */ +static void rcu_preempt_qs(void) +{ +} + /* * Because preemptible RCU does not exist, we never have to check for * CPUs being in quiescent states. @@ -843,6 +912,16 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp) return false; } +/* + * Because there is no preemptible RCU, there can be no deferred quiescent + * states. + */ +static bool rcu_preempt_need_deferred_qs(struct task_struct *t) +{ + return false; +} +static void rcu_preempt_deferred_qs(struct task_struct *t) { } + /* * Because preemptible RCU does not exist, we never have to check for * tasks blocked within RCU read-side critical sections. diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 7a577bd989a4..497d45b3923c 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -72,9 +72,15 @@ module_param(rcu_normal_after_boot, int, 0); #ifdef CONFIG_DEBUG_LOCK_ALLOC /** - * rcu_read_lock_sched_held() - might we be in RCU-sched read-side critical section? + * rcu_read_lock_held_common() - might we be in RCU-sched read-side critical section? + * @ret: Best guess answer if lockdep cannot be relied on * - * If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an + * Returns true if lockdep must be ignored, in which case *ret contains + * the best guess described below. Otherwise returns false, in which + * case *ret tells the caller nothing and the caller should instead + * consult lockdep. + * + * If CONFIG_DEBUG_LOCK_ALLOC is selected, set *ret to nonzero iff in an * RCU-sched read-side critical section. In absence of * CONFIG_DEBUG_LOCK_ALLOC, this assumes we are in an RCU-sched read-side * critical section unless it can prove otherwise. Note that disabling @@ -86,35 +92,45 @@ module_param(rcu_normal_after_boot, int, 0); * Check debug_lockdep_rcu_enabled() to prevent false positives during boot * and while lockdep is disabled. * - * Note that if the CPU is in the idle loop from an RCU point of - * view (ie: that we are in the section between rcu_idle_enter() and - * rcu_idle_exit()) then rcu_read_lock_held() returns false even if the CPU - * did an rcu_read_lock(). The reason for this is that RCU ignores CPUs - * that are in such a section, considering these as in extended quiescent - * state, so such a CPU is effectively never in an RCU read-side critical - * section regardless of what RCU primitives it invokes. This state of - * affairs is required --- we need to keep an RCU-free window in idle - * where the CPU may possibly enter into low power mode. This way we can - * notice an extended quiescent state to other CPUs that started a grace - * period. Otherwise we would delay any grace period as long as we run in - * the idle task. + * Note that if the CPU is in the idle loop from an RCU point of view (ie: + * that we are in the section between rcu_idle_enter() and rcu_idle_exit()) + * then rcu_read_lock_held() sets *ret to false even if the CPU did an + * rcu_read_lock(). The reason for this is that RCU ignores CPUs that are + * in such a section, considering these as in extended quiescent state, + * so such a CPU is effectively never in an RCU read-side critical section + * regardless of what RCU primitives it invokes. This state of affairs is + * required --- we need to keep an RCU-free window in idle where the CPU may + * possibly enter into low power mode. This way we can notice an extended + * quiescent state to other CPUs that started a grace period. Otherwise + * we would delay any grace period as long as we run in the idle task. * - * Similarly, we avoid claiming an SRCU read lock held if the current + * Similarly, we avoid claiming an RCU read lock held if the current * CPU is offline. */ +static bool rcu_read_lock_held_common(bool *ret) +{ + if (!debug_lockdep_rcu_enabled()) { + *ret = 1; + return true; + } + if (!rcu_is_watching()) { + *ret = 0; + return true; + } + if (!rcu_lockdep_current_cpu_online()) { + *ret = 0; + return true; + } + return false; +} + int rcu_read_lock_sched_held(void) { - int lockdep_opinion = 0; + bool ret; - if (!debug_lockdep_rcu_enabled()) - return 1; - if (!rcu_is_watching()) - return 0; - if (!rcu_lockdep_current_cpu_online()) - return 0; - if (debug_locks) - lockdep_opinion = lock_is_held(&rcu_sched_lock_map); - return lockdep_opinion || !preemptible(); + if (rcu_read_lock_held_common(&ret)) + return ret; + return lock_is_held(&rcu_sched_lock_map) || !preemptible(); } EXPORT_SYMBOL(rcu_read_lock_sched_held); #endif @@ -323,12 +339,10 @@ EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled); */ int rcu_read_lock_held(void) { - if (!debug_lockdep_rcu_enabled()) - return 1; - if (!rcu_is_watching()) - return 0; - if (!rcu_lockdep_current_cpu_online()) - return 0; + bool ret; + + if (rcu_read_lock_held_common(&ret)) + return ret; return lock_is_held(&rcu_lock_map); } EXPORT_SYMBOL_GPL(rcu_read_lock_held); @@ -350,16 +364,28 @@ EXPORT_SYMBOL_GPL(rcu_read_lock_held); */ int rcu_read_lock_bh_held(void) { - if (!debug_lockdep_rcu_enabled()) - return 1; - if (!rcu_is_watching()) - return 0; - if (!rcu_lockdep_current_cpu_online()) - return 0; + bool ret; + + if (rcu_read_lock_held_common(&ret)) + return ret; return in_softirq() || irqs_disabled(); } EXPORT_SYMBOL_GPL(rcu_read_lock_bh_held); +int rcu_read_lock_any_held(void) +{ + bool ret; + + if (rcu_read_lock_held_common(&ret)) + return ret; + if (lock_is_held(&rcu_lock_map) || + lock_is_held(&rcu_bh_lock_map) || + lock_is_held(&rcu_sched_lock_map)) + return 1; + return !preemptible(); +} +EXPORT_SYMBOL_GPL(rcu_read_lock_any_held); + #endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */ /** diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ae615d26c61c..c85b870325c2 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6658,6 +6658,34 @@ void ___might_sleep(const char *file, int line, int preempt_offset) add_taint(TAINT_WARN, LOCKDEP_STILL_OK); } EXPORT_SYMBOL(___might_sleep); + +void __cant_sleep(const char *file, int line, int preempt_offset) +{ + static unsigned long prev_jiffy; + + if (irqs_disabled()) + return; + + if (!IS_ENABLED(CONFIG_PREEMPT_COUNT)) + return; + + if (preempt_count() > preempt_offset) + return; + + if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy) + return; + prev_jiffy = jiffies; + + printk(KERN_ERR "BUG: assuming atomic context at %s:%d\n", file, line); + printk(KERN_ERR "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n", + in_atomic(), irqs_disabled(), + current->pid, current->comm); + + debug_show_held_locks(current); + dump_stack(); + add_taint(TAINT_WARN, LOCKDEP_STILL_OK); +} +EXPORT_SYMBOL_GPL(__cant_sleep); #endif #ifdef CONFIG_MAGIC_SYSRQ diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 158ae456e8b1..642078429313 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -210,6 +210,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). */ + preempt_disable(); for (; f; f = f->prev) { u32 cur_ret = BPF_PROG_RUN(f->prog, sd); @@ -218,6 +219,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, *match = f; } } + preempt_enable(); return ret; } #endif /* CONFIG_SECCOMP_FILTER */ @@ -386,8 +388,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) * behavior of privileged children. */ if (!task_no_new_privs(current) && - security_capable_noaudit(current_cred(), current_user_ns(), - CAP_SYS_ADMIN) != 0) + security_capable(current_cred(), current_user_ns(), + CAP_SYS_ADMIN, CAP_OPT_NOAUDIT) != 0) return ERR_PTR(-EACCES); /* Allocate a new seccomp_filter */ diff --git a/kernel/softirq.c b/kernel/softirq.c index d48995a6da29..cf5ee5146221 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -310,6 +310,8 @@ restart: __this_cpu_write(active_softirqs, 0); rcu_bh_qs(); + if (__this_cpu_read(ksoftirqd) == current) + rcu_softirq_qs(); local_irq_disable(); pending = local_softirq_pending(); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 8af069d2fc53..2afaec84bbc0 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1428,6 +1428,13 @@ static struct ctl_table kern_table[] = { .extra2 = &two, }, #endif + { + .procname = "bpf_stats_enabled", + .data = &bpf_stats_enabled_key.key, + .maxlen = sizeof(bpf_stats_enabled_key), + .mode = 0644, + .proc_handler = proc_do_static_key, + }, #if defined(CONFIG_TREE_RCU) || defined(CONFIG_PREEMPT_RCU) { .procname = "panic_on_rcu_stall", @@ -3533,6 +3540,39 @@ int proc_douintvec_capacity(struct ctl_table *table, int write, #endif /* CONFIG_PROC_SYSCTL */ +#if defined(CONFIG_SYSCTL) +int proc_do_static_key(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + struct static_key *key = (struct static_key *)table->data; + static DEFINE_MUTEX(static_key_mutex); + int val, ret; + struct ctl_table tmp = { + .data = &val, + .maxlen = sizeof(val), + .mode = table->mode, + .extra1 = &zero, + .extra2 = &one, + }; + + if (write && !capable(CAP_SYS_ADMIN)) + return -EPERM; + + mutex_lock(&static_key_mutex); + val = static_key_enabled(key); + ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); + if (write && !ret) { + if (val) + static_key_enable(key); + else + static_key_disable(key); + } + mutex_unlock(&static_key_mutex); + return ret; +} +#endif + /* * No sense putting this after each symbol definition, twice, * exception granted :-) diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c index 64443ce1d28e..b5fd65049aa6 100644 --- a/kernel/time/ntp.c +++ b/kernel/time/ntp.c @@ -655,67 +655,6 @@ static inline void process_adjtimex_modes(struct timex *txc, } - -/** - * ntp_validate_timex - Ensures the timex is ok for use in do_adjtimex - */ -int ntp_validate_timex(struct timex *txc) -{ - if (txc->modes & ADJ_ADJTIME) { - /* singleshot must not be used with any other mode bits */ - if (!(txc->modes & ADJ_OFFSET_SINGLESHOT)) - return -EINVAL; - if (!(txc->modes & ADJ_OFFSET_READONLY) && - !capable(CAP_SYS_TIME)) - return -EPERM; - } else { - /* In order to modify anything, you gotta be super-user! */ - if (txc->modes && !capable(CAP_SYS_TIME)) - return -EPERM; - /* - * if the quartz is off by more than 10% then - * something is VERY wrong! - */ - if (txc->modes & ADJ_TICK && - (txc->tick < 900000/USER_HZ || - txc->tick > 1100000/USER_HZ)) - return -EINVAL; - } - - if (txc->modes & ADJ_SETOFFSET) { - /* In order to inject time, you gotta be super-user! */ - if (!capable(CAP_SYS_TIME)) - return -EPERM; - - if (txc->modes & ADJ_NANO) { - struct timespec ts; - - ts.tv_sec = txc->time.tv_sec; - ts.tv_nsec = txc->time.tv_usec; - if (!timespec_inject_offset_valid(&ts)) - return -EINVAL; - - } else { - if (!timeval_inject_offset_valid(&txc->time)) - return -EINVAL; - } - } - - /* - * Check for potential multiplication overflows that can - * only happen on 64-bit systems: - */ - if ((txc->modes & ADJ_FREQUENCY) && (BITS_PER_LONG == 64)) { - if (LLONG_MIN / PPM_SCALE > txc->freq) - return -EINVAL; - if (LLONG_MAX / PPM_SCALE < txc->freq) - return -EINVAL; - } - - return 0; -} - - /* * adjtimex mainly allows reading (and writing, if superuser) of * kernel time-keeping variables. used by xntpd. diff --git a/kernel/time/ntp_internal.h b/kernel/time/ntp_internal.h index 0a53e6ea47b1..909bd1f1bfb1 100644 --- a/kernel/time/ntp_internal.h +++ b/kernel/time/ntp_internal.h @@ -8,7 +8,6 @@ extern void ntp_clear(void); extern u64 ntp_tick_length(void); extern ktime_t ntp_get_next_leap(void); extern int second_overflow(time64_t secs); -extern int ntp_validate_timex(struct timex *); extern int __do_adjtimex(struct timex *, struct timespec64 *, s32 *); extern void __hardpps(const struct timespec64 *, const struct timespec64 *); #endif /* _LINUX_NTP_INTERNAL_H */ diff --git a/kernel/time/time.c b/kernel/time/time.c index 319935af02fb..a92241f410e0 100644 --- a/kernel/time/time.c +++ b/kernel/time/time.c @@ -158,40 +158,6 @@ SYSCALL_DEFINE2(gettimeofday, struct timeval __user *, tv, return 0; } -/* - * Indicates if there is an offset between the system clock and the hardware - * clock/persistent clock/rtc. - */ -int persistent_clock_is_local; - -/* - * Adjust the time obtained from the CMOS to be UTC time instead of - * local time. - * - * This is ugly, but preferable to the alternatives. Otherwise we - * would either need to write a program to do it in /etc/rc (and risk - * confusion if the program gets run more than once; it would also be - * hard to make the program warp the clock precisely n hours) or - * compile in the timezone information into the kernel. Bad, bad.... - * - * - TYT, 1992-01-01 - * - * The best thing to do is to keep the CMOS clock in universal time (UTC) - * as real UNIX machines always do it. This avoids all headaches about - * daylight saving times and warping kernel clocks. - */ -static inline void warp_clock(void) -{ - if (sys_tz.tz_minuteswest != 0) { - struct timespec adjust; - - persistent_clock_is_local = 1; - adjust.tv_sec = sys_tz.tz_minuteswest * 60; - adjust.tv_nsec = 0; - timekeeping_inject_offset(&adjust); - } -} - /* * In case for some reason the CMOS clock has not already been running * in UTC, but in some local time: The first time we set the timezone, @@ -225,7 +191,7 @@ int do_sys_settimeofday64(const struct timespec64 *tv, const struct timezone *tz if (firsttime) { firsttime = 0; if (!tv) - warp_clock(); + timekeeping_warp_clock(); } } if (tv) @@ -522,7 +488,18 @@ struct timeval ns_to_timeval(const s64 nsec) } EXPORT_SYMBOL(ns_to_timeval); -#if BITS_PER_LONG == 32 +struct __kernel_old_timeval ns_to_kernel_old_timeval(const s64 nsec) +{ + struct timespec64 ts = ns_to_timespec64(nsec); + struct __kernel_old_timeval tv; + + tv.tv_sec = ts.tv_sec; + tv.tv_usec = (suseconds_t)ts.tv_nsec / 1000; + + return tv; +} +EXPORT_SYMBOL(ns_to_kernel_old_timeval); + /** * set_normalized_timespec - set timespec sec and nsec parts and normalize * @@ -583,7 +560,7 @@ struct timespec64 ns_to_timespec64(const s64 nsec) return ts; } EXPORT_SYMBOL(ns_to_timespec64); -#endif + /** * msecs_to_jiffies: - convert milliseconds to jiffies * @m: time in milliseconds @@ -854,24 +831,6 @@ unsigned long nsecs_to_jiffies(u64 n) } EXPORT_SYMBOL_GPL(nsecs_to_jiffies); -/* - * Add two timespec values and do a safety check for overflow. - * It's assumed that both values are valid (>= 0) - */ -struct timespec timespec_add_safe(const struct timespec lhs, - const struct timespec rhs) -{ - struct timespec res; - - set_normalized_timespec(&res, lhs.tv_sec + rhs.tv_sec, - lhs.tv_nsec + rhs.tv_nsec); - - if (res.tv_sec < lhs.tv_sec || res.tv_sec < rhs.tv_sec) - res.tv_sec = TIME_T_MAX; - - return res; -} - /* * Add two timespec64 values and do a safety check for overflow. * It's assumed that both values are valid (>= 0). diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 6c14e0389ddb..6365d1131ed5 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -67,8 +67,27 @@ struct tk_fast { struct tk_read_base base[2]; }; -static struct tk_fast tk_fast_mono ____cacheline_aligned; -static struct tk_fast tk_fast_raw ____cacheline_aligned; +/* Suspend-time cycles value for halted fast timekeeper. */ +static u64 cycles_at_suspend; + +static u64 dummy_clock_read(struct clocksource *cs) +{ + return cycles_at_suspend; +} + +static struct clocksource dummy_clock = { + .read = dummy_clock_read, +}; + +static struct tk_fast tk_fast_mono ____cacheline_aligned = { + .base[0] = { .clock = &dummy_clock, }, + .base[1] = { .clock = &dummy_clock, }, +}; + +static struct tk_fast tk_fast_raw ____cacheline_aligned = { + .base[0] = { .clock = &dummy_clock, }, + .base[1] = { .clock = &dummy_clock, }, +}; /* flag for if timekeeping is suspended */ int __read_mostly timekeeping_suspended; @@ -484,17 +503,39 @@ u64 notrace ktime_get_boot_fast_ns(void) } EXPORT_SYMBOL_GPL(ktime_get_boot_fast_ns); -/* Suspend-time cycles value for halted fast timekeeper. */ -static u64 cycles_at_suspend; -static u64 dummy_clock_read(struct clocksource *cs) +/* + * See comment for __ktime_get_fast_ns() vs. timestamp ordering + */ +static __always_inline u64 __ktime_get_real_fast_ns(struct tk_fast *tkf) { - return cycles_at_suspend; + struct tk_read_base *tkr; + unsigned int seq; + u64 now; + + do { + seq = raw_read_seqcount_latch(&tkf->seq); + tkr = tkf->base + (seq & 0x01); + now = ktime_to_ns(tkr->base_real); + + now += timekeeping_delta_to_ns(tkr, + clocksource_delta( + tk_clock_read(tkr), + tkr->cycle_last, + tkr->mask)); + } while (read_seqcount_retry(&tkf->seq, seq)); + + return now; } -static struct clocksource dummy_clock = { - .read = dummy_clock_read, -}; +/** + * ktime_get_real_fast_ns: - NMI safe and fast access to clock realtime. + */ +u64 ktime_get_real_fast_ns(void) +{ + return __ktime_get_real_fast_ns(&tk_fast_mono); +} +EXPORT_SYMBOL_GPL(ktime_get_real_fast_ns); /** * halt_fast_timekeeper - Prevent fast timekeeper from accessing clocksource. @@ -514,6 +555,7 @@ static void halt_fast_timekeeper(struct timekeeper *tk) memcpy(&tkr_dummy, tkr, sizeof(tkr_dummy)); cycles_at_suspend = tk_clock_read(tkr); tkr_dummy.clock = &dummy_clock; + tkr_dummy.base_real = tkr->base + tk->offs_real; update_fast_timekeeper(&tkr_dummy, &tk_fast_mono); tkr = &tk->tkr_raw; @@ -661,6 +703,7 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action) update_vsyscall(tk); update_pvclock_gtod(tk, action & TK_CLOCK_WAS_SET); + tk->tkr_mono.base_real = tk->tkr_mono.base + tk->offs_real; update_fast_timekeeper(&tk->tkr_mono, &tk_fast_mono); update_fast_timekeeper(&tk->tkr_raw, &tk_fast_raw); @@ -707,18 +750,19 @@ static void timekeeping_forward_now(struct timekeeper *tk) } /** - * __getnstimeofday64 - Returns the time of day in a timespec64. + * ktime_get_real_ts64 - Returns the time of day in a timespec64. * @ts: pointer to the timespec to be set * - * Updates the time of day in the timespec. - * Returns 0 on success, or -ve when suspended (timespec will be undefined). + * Returns the time of day in a timespec64 (WARN if suspended). */ -int __getnstimeofday64(struct timespec64 *ts) +void ktime_get_real_ts64(struct timespec64 *ts) { struct timekeeper *tk = &tk_core.timekeeper; unsigned long seq; u64 nsecs; + WARN_ON(timekeeping_suspended); + do { seq = read_seqcount_begin(&tk_core.seq); @@ -729,28 +773,8 @@ int __getnstimeofday64(struct timespec64 *ts) ts->tv_nsec = 0; timespec64_add_ns(ts, nsecs); - - /* - * Do not bail out early, in case there were callers still using - * the value, even in the face of the WARN_ON. - */ - if (unlikely(timekeeping_suspended)) - return -EAGAIN; - return 0; } -EXPORT_SYMBOL(__getnstimeofday64); - -/** - * getnstimeofday64 - Returns the time of day in a timespec64. - * @ts: pointer to the timespec64 to be set - * - * Returns the time of day in a timespec64 (WARN if suspended). - */ -void getnstimeofday64(struct timespec64 *ts) -{ - WARN_ON(__getnstimeofday64(ts)); -} -EXPORT_SYMBOL(getnstimeofday64); +EXPORT_SYMBOL(ktime_get_real_ts64); ktime_t ktime_get(void) { @@ -816,6 +840,25 @@ ktime_t ktime_get_with_offset(enum tk_offsets offs) } EXPORT_SYMBOL_GPL(ktime_get_with_offset); +ktime_t ktime_get_coarse_with_offset(enum tk_offsets offs) +{ + struct timekeeper *tk = &tk_core.timekeeper; + unsigned int seq; + ktime_t base, *offset = offsets[offs]; + + WARN_ON(timekeeping_suspended); + + do { + seq = read_seqcount_begin(&tk_core.seq); + base = ktime_add(tk->tkr_mono.base, *offset); + + } while (read_seqcount_retry(&tk_core.seq, seq)); + + return base; + +} +EXPORT_SYMBOL_GPL(ktime_get_coarse_with_offset); + /** * ktime_mono_to_any() - convert mononotic time to any other time * @tmono: time to convert. @@ -1269,33 +1312,31 @@ EXPORT_SYMBOL(do_settimeofday64); * * Adds or subtracts an offset value from the current time. */ -int timekeeping_inject_offset(struct timespec *ts) +static int timekeeping_inject_offset(struct timespec64 *ts) { struct timekeeper *tk = &tk_core.timekeeper; unsigned long flags; - struct timespec64 ts64, tmp; + struct timespec64 tmp; int ret = 0; - if (!timespec_inject_offset_valid(ts)) + if (ts->tv_nsec < 0 || ts->tv_nsec >= NSEC_PER_SEC) return -EINVAL; - ts64 = timespec_to_timespec64(*ts); - raw_spin_lock_irqsave(&timekeeper_lock, flags); write_seqcount_begin(&tk_core.seq); timekeeping_forward_now(tk); /* Make sure the proposed value is valid */ - tmp = timespec64_add(tk_xtime(tk), ts64); - if (timespec64_compare(&tk->wall_to_monotonic, &ts64) > 0 || + tmp = timespec64_add(tk_xtime(tk), *ts); + if (timespec64_compare(&tk->wall_to_monotonic, ts) > 0 || !timespec64_valid_strict(&tmp)) { ret = -EINVAL; goto error; } - tk_xtime_add(tk, &ts64); - tk_set_wall_to_mono(tk, timespec64_sub(tk->wall_to_monotonic, ts64)); + tk_xtime_add(tk, ts); + tk_set_wall_to_mono(tk, timespec64_sub(tk->wall_to_monotonic, *ts)); error: /* even if we error out, we forwarded the time, so call update */ timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET); @@ -1308,7 +1349,40 @@ error: /* even if we error out, we forwarded the time, so call update */ return ret; } -EXPORT_SYMBOL(timekeeping_inject_offset); + +/* + * Indicates if there is an offset between the system clock and the hardware + * clock/persistent clock/rtc. + */ +int persistent_clock_is_local; + +/* + * Adjust the time obtained from the CMOS to be UTC time instead of + * local time. + * + * This is ugly, but preferable to the alternatives. Otherwise we + * would either need to write a program to do it in /etc/rc (and risk + * confusion if the program gets run more than once; it would also be + * hard to make the program warp the clock precisely n hours) or + * compile in the timezone information into the kernel. Bad, bad.... + * + * - TYT, 1992-01-01 + * + * The best thing to do is to keep the CMOS clock in universal time (UTC) + * as real UNIX machines always do it. This avoids all headaches about + * daylight saving times and warping kernel clocks. + */ +void timekeeping_warp_clock(void) +{ + if (sys_tz.tz_minuteswest != 0) { + struct timespec64 adjust; + + persistent_clock_is_local = 1; + adjust.tv_sec = sys_tz.tz_minuteswest * 60; + adjust.tv_nsec = 0; + timekeeping_inject_offset(&adjust); + } +} /** * __timekeeping_set_tai_offset - Sets the TAI offset from UTC and monotonic @@ -1379,12 +1453,12 @@ int timekeeping_notify(struct clocksource *clock) } /** - * getrawmonotonic64 - Returns the raw monotonic time in a timespec + * ktime_get_raw_ts64 - Returns the raw monotonic time in a timespec * @ts: pointer to the timespec64 to be set * * Returns the raw monotonic time (completely un-modified by ntp) */ -void getrawmonotonic64(struct timespec64 *ts) +void ktime_get_raw_ts64(struct timespec64 *ts) { struct timekeeper *tk = &tk_core.timekeeper; unsigned long seq; @@ -1400,7 +1474,7 @@ void getrawmonotonic64(struct timespec64 *ts) ts->tv_nsec = 0; timespec64_add_ns(ts, nsecs); } -EXPORT_SYMBOL(getrawmonotonic64); +EXPORT_SYMBOL(ktime_get_raw_ts64); /** @@ -2168,30 +2242,20 @@ unsigned long get_seconds(void) } EXPORT_SYMBOL(get_seconds); -struct timespec __current_kernel_time(void) +void ktime_get_coarse_real_ts64(struct timespec64 *ts) { struct timekeeper *tk = &tk_core.timekeeper; - - return timespec64_to_timespec(tk_xtime(tk)); -} - -struct timespec64 current_kernel_time64(void) -{ - struct timekeeper *tk = &tk_core.timekeeper; - struct timespec64 now; unsigned long seq; do { seq = read_seqcount_begin(&tk_core.seq); - now = tk_xtime(tk); + *ts = tk_xtime(tk); } while (read_seqcount_retry(&tk_core.seq, seq)); - - return now; } -EXPORT_SYMBOL(current_kernel_time64); +EXPORT_SYMBOL(ktime_get_coarse_real_ts64); -struct timespec64 get_monotonic_coarse64(void) +void ktime_get_coarse_ts64(struct timespec64 *ts) { struct timekeeper *tk = &tk_core.timekeeper; struct timespec64 now, mono; @@ -2204,12 +2268,10 @@ struct timespec64 get_monotonic_coarse64(void) mono = tk->wall_to_monotonic; } while (read_seqcount_retry(&tk_core.seq, seq)); - set_normalized_timespec64(&now, now.tv_sec + mono.tv_sec, + set_normalized_timespec64(ts, now.tv_sec + mono.tv_sec, now.tv_nsec + mono.tv_nsec); - - return now; } -EXPORT_SYMBOL(get_monotonic_coarse64); +EXPORT_SYMBOL(ktime_get_coarse_ts64); /* * Must hold jiffies_lock @@ -2264,6 +2326,71 @@ ktime_t ktime_get_update_offsets_now(unsigned int *cwsseq, ktime_t *offs_real, return base; } +/** + * timekeeping_validate_timex - Ensures the timex is ok for use in do_adjtimex + */ +static int timekeeping_validate_timex(struct timex *txc) +{ + if (txc->modes & ADJ_ADJTIME) { + /* singleshot must not be used with any other mode bits */ + if (!(txc->modes & ADJ_OFFSET_SINGLESHOT)) + return -EINVAL; + if (!(txc->modes & ADJ_OFFSET_READONLY) && + !capable(CAP_SYS_TIME)) + return -EPERM; + } else { + /* In order to modify anything, you gotta be super-user! */ + if (txc->modes && !capable(CAP_SYS_TIME)) + return -EPERM; + /* + * if the quartz is off by more than 10% then + * something is VERY wrong! + */ + if (txc->modes & ADJ_TICK && + (txc->tick < 900000/USER_HZ || + txc->tick > 1100000/USER_HZ)) + return -EINVAL; + } + + if (txc->modes & ADJ_SETOFFSET) { + /* In order to inject time, you gotta be super-user! */ + if (!capable(CAP_SYS_TIME)) + return -EPERM; + + /* + * Validate if a timespec/timeval used to inject a time + * offset is valid. Offsets can be postive or negative, so + * we don't check tv_sec. The value of the timeval/timespec + * is the sum of its fields,but *NOTE*: + * The field tv_usec/tv_nsec must always be non-negative and + * we can't have more nanoseconds/microseconds than a second. + */ + if (txc->time.tv_usec < 0) + return -EINVAL; + + if (txc->modes & ADJ_NANO) { + if (txc->time.tv_usec >= NSEC_PER_SEC) + return -EINVAL; + } else { + if (txc->time.tv_usec >= USEC_PER_SEC) + return -EINVAL; + } + } + + /* + * Check for potential multiplication overflows that can + * only happen on 64-bit systems: + */ + if ((txc->modes & ADJ_FREQUENCY) && (BITS_PER_LONG == 64)) { + if (LLONG_MIN / PPM_SCALE > txc->freq) + return -EINVAL; + if (LLONG_MAX / PPM_SCALE < txc->freq) + return -EINVAL; + } + + return 0; +} + /** * random_get_entropy_fallback - Returns the raw clock source value, * used by random.c for platforms with no valid random_get_entropy(). @@ -2291,12 +2418,12 @@ int do_adjtimex(struct timex *txc) int ret; /* Validate the data before disabling interrupts */ - ret = ntp_validate_timex(txc); + ret = timekeeping_validate_timex(txc); if (ret) return ret; if (txc->modes & ADJ_SETOFFSET) { - struct timespec delta; + struct timespec64 delta; delta.tv_sec = txc->time.tv_sec; delta.tv_nsec = txc->time.tv_usec; if (!(txc->modes & ADJ_NANO)) diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h index c9f9af339914..7a9b4eb7a1d5 100644 --- a/kernel/time/timekeeping.h +++ b/kernel/time/timekeeping.h @@ -11,7 +11,7 @@ extern ktime_t ktime_get_update_offsets_now(unsigned int *cwsseq, extern int timekeeping_valid_for_hres(void); extern u64 timekeeping_max_deferment(void); -extern int timekeeping_inject_offset(struct timespec *ts); +extern void timekeeping_warp_clock(void); extern int timekeeping_suspend(void); extern void timekeeping_resume(void); diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index e79ef5aa6224..aaffbfba4a33 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -561,6 +561,17 @@ config FUNCTION_PROFILER If in doubt, say N. +config BPF_KPROBE_OVERRIDE + bool "Enable BPF programs to override a kprobed function" + depends on BPF_EVENTS + depends on KPROBES_ON_FTRACE + depends on FUNCTION_ERROR_INJECTION + depends on DYNAMIC_FTRACE_WITH_REGS + default n + help + Allows BPF to override the execution of a probed function and + set a different return value. This is used for error injection. + config FTRACE_MCOUNT_RECORD def_bool y depends on DYNAMIC_FTRACE diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index d723b6d03356..6122e7c48688 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -13,8 +13,53 @@ #include #include #include +#include +#include +#include + +#include "trace_probe.h" #include "trace.h" +#ifdef CONFIG_MODULES +struct bpf_trace_module { + struct module *module; + struct list_head list; +}; + +static LIST_HEAD(bpf_trace_modules); +static DEFINE_MUTEX(bpf_module_mutex); + +static struct bpf_raw_event_map *bpf_get_raw_tracepoint_module(const char *name) +{ + struct bpf_raw_event_map *btp, *ret = NULL; + struct bpf_trace_module *btm; + unsigned int i; + + mutex_lock(&bpf_module_mutex); + list_for_each_entry(btm, &bpf_trace_modules, list) { + for (i = 0; i < btm->module->num_bpf_raw_events; ++i) { + btp = &btm->module->bpf_raw_events[i]; + if (!strcmp(btp->tp->name, name)) { + if (try_module_get(btm->module)) + ret = btp; + goto out; + } + } + } +out: + mutex_unlock(&bpf_module_mutex); + return ret; +} +#else +static struct bpf_raw_event_map *bpf_get_raw_tracepoint_module(const char *name) +{ + return NULL; +} +#endif /* CONFIG_MODULES */ + +u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); +u64 bpf_get_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); + /** * trace_call_bpf - invoke BPF program * @call: tracepoint event @@ -74,6 +119,24 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) } EXPORT_SYMBOL_GPL(trace_call_bpf); +#ifdef CONFIG_BPF_KPROBE_OVERRIDE +BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc) +{ + __this_cpu_write(bpf_kprobe_override, 1); + regs_set_return_value(regs, rc); + override_function_with_return(regs); + return 0; +} + +static const struct bpf_func_proto bpf_override_return_proto = { + .func = bpf_override_return, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, +}; +#endif + BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr) { int ret; @@ -270,14 +333,14 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void) return &bpf_trace_printk_proto; } -BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags) +static __always_inline int +get_map_perf_counter(struct bpf_map *map, u64 flags, + u64 *value, u64 *enabled, u64 *running) { struct bpf_array *array = container_of(map, struct bpf_array, map); unsigned int cpu = smp_processor_id(); u64 index = flags & BPF_F_INDEX_MASK; struct bpf_event_entry *ee; - u64 value = 0; - int err; if (unlikely(flags & ~(BPF_F_INDEX_MASK))) return -EINVAL; @@ -290,7 +353,15 @@ BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags) if (!ee) return -ENOENT; - err = perf_event_read_local(ee->event, &value); + return perf_event_read_local(ee->event, value, enabled, running); +} + +BPF_CALL_2(bpf_perf_event_read, struct bpf_map *, map, u64, flags) +{ + u64 value = 0; + int err; + + err = get_map_perf_counter(map, flags, &value, NULL, NULL); /* * this api is ugly since we miss [-22..-2] range of valid * counter values, but that's uapi @@ -308,6 +379,33 @@ static const struct bpf_func_proto bpf_perf_event_read_proto = { .arg2_type = ARG_ANYTHING, }; +BPF_CALL_4(bpf_perf_event_read_value, struct bpf_map *, map, u64, flags, + struct bpf_perf_event_value *, buf, u32, size) +{ + int err = -EINVAL; + + if (unlikely(size != sizeof(struct bpf_perf_event_value))) + goto clear; + err = get_map_perf_counter(map, flags, &buf->counter, &buf->enabled, + &buf->running); + if (unlikely(err)) + goto clear; + return 0; +clear: + memset(buf, 0, size); + return err; +} + +static const struct bpf_func_proto bpf_perf_event_read_value_proto = { + .func = bpf_perf_event_read_value, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_UNINIT_MEM, + .arg4_type = ARG_CONST_SIZE, +}; + static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd); static __always_inline u64 @@ -478,6 +576,12 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_map_update_elem_proto; case BPF_FUNC_map_delete_elem: return &bpf_map_delete_elem_proto; + case BPF_FUNC_map_push_elem: + return &bpf_map_push_elem_proto; + case BPF_FUNC_map_pop_elem: + return &bpf_map_pop_elem_proto; + case BPF_FUNC_map_peek_elem: + return &bpf_map_peek_elem_proto; case BPF_FUNC_probe_read: return &bpf_probe_read_proto; case BPF_FUNC_ktime_get_ns: @@ -525,6 +629,14 @@ kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_perf_event_output_proto; case BPF_FUNC_get_stackid: return &bpf_get_stackid_proto; + case BPF_FUNC_get_stack: + return &bpf_get_stack_proto; + case BPF_FUNC_perf_event_read_value: + return &bpf_perf_event_read_value_proto; +#ifdef CONFIG_BPF_KPROBE_OVERRIDE + case BPF_FUNC_override_return: + return &bpf_override_return_proto; +#endif default: return tracing_func_proto(func_id, prog); } @@ -606,6 +718,25 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = { .arg3_type = ARG_ANYTHING, }; +BPF_CALL_4(bpf_get_stack_tp, void *, tp_buff, void *, buf, u32, size, + u64, flags) +{ + struct pt_regs *regs = *(struct pt_regs **)tp_buff; + + return bpf_get_stack((unsigned long) regs, (unsigned long) buf, + (unsigned long) size, flags, 0); +} + +static const struct bpf_func_proto bpf_get_stack_proto_tp = { + .func = bpf_get_stack_tp, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_UNINIT_MEM, + .arg3_type = ARG_CONST_SIZE_OR_ZERO, + .arg4_type = ARG_ANYTHING, +}; + static const struct bpf_func_proto * tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@ -614,6 +745,8 @@ tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_perf_event_output_proto_tp; case BPF_FUNC_get_stackid: return &bpf_get_stackid_proto_tp; + case BPF_FUNC_get_stack: + return &bpf_get_stack_proto_tp; default: return tracing_func_proto(func_id, prog); } @@ -642,14 +775,57 @@ const struct bpf_verifier_ops tracepoint_verifier_ops = { const struct bpf_prog_ops tracepoint_prog_ops = { }; +BPF_CALL_3(bpf_perf_prog_read_value, struct bpf_perf_event_data_kern *, ctx, + struct bpf_perf_event_value *, buf, u32, size) +{ + int err = -EINVAL; + + if (unlikely(size != sizeof(struct bpf_perf_event_value))) + goto clear; + err = perf_event_read_local(ctx->event, &buf->counter, &buf->enabled, + &buf->running); + if (unlikely(err)) + goto clear; + return 0; +clear: + memset(buf, 0, size); + return err; +} + +static const struct bpf_func_proto bpf_perf_prog_read_value_proto = { + .func = bpf_perf_prog_read_value, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_UNINIT_MEM, + .arg3_type = ARG_CONST_SIZE, +}; + +static const struct bpf_func_proto * +pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_perf_event_output: + return &bpf_perf_event_output_proto_tp; + case BPF_FUNC_get_stackid: + return &bpf_get_stackid_proto_tp; + case BPF_FUNC_get_stack: + return &bpf_get_stack_proto_tp; + case BPF_FUNC_perf_prog_read_value: + return &bpf_perf_prog_read_value_proto; + default: + return tracing_func_proto(func_id, prog); + } +} + /* * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp * to avoid potential recursive reuse issue when/if tracepoints are added - * inside bpf_*_event_output and/or bpf_get_stack_id + * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack */ static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs); BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args, - struct bpf_map *, map, u64, flags, void *, data, u64, size) + struct bpf_map *, map, u64, flags, void *, data, u64, size) { struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs); @@ -658,43 +834,66 @@ BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args, } static const struct bpf_func_proto bpf_perf_event_output_proto_raw_tp = { - .func = bpf_perf_event_output_raw_tp, - .gpl_only = true, - .ret_type = RET_INTEGER, - .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_CONST_MAP_PTR, - .arg3_type = ARG_ANYTHING, - .arg4_type = ARG_PTR_TO_MEM, - .arg5_type = ARG_CONST_SIZE_OR_ZERO, + .func = bpf_perf_event_output_raw_tp, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_CONST_MAP_PTR, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_MEM, + .arg5_type = ARG_CONST_SIZE_OR_ZERO, }; BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args, - struct bpf_map *, map, u64, flags) + struct bpf_map *, map, u64, flags) { struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs); perf_fetch_caller_regs(regs); /* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */ return bpf_get_stackid((unsigned long) regs, (unsigned long) map, - flags, 0, 0); + flags, 0, 0); } static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = { - .func = bpf_get_stackid_raw_tp, - .gpl_only = true, - .ret_type = RET_INTEGER, - .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_CONST_MAP_PTR, - .arg3_type = ARG_ANYTHING, + .func = bpf_get_stackid_raw_tp, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_CONST_MAP_PTR, + .arg3_type = ARG_ANYTHING, }; -static const struct bpf_func_proto *raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args, + void *, buf, u32, size, u64, flags) +{ + struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs); + + perf_fetch_caller_regs(regs); + return bpf_get_stack((unsigned long) regs, (unsigned long) buf, + (unsigned long) size, flags, 0); +} + +static const struct bpf_func_proto bpf_get_stack_proto_raw_tp = { + .func = bpf_get_stack_raw_tp, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE_OR_ZERO, + .arg4_type = ARG_ANYTHING, +}; + +static const struct bpf_func_proto * +raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { switch (func_id) { case BPF_FUNC_perf_event_output: return &bpf_perf_event_output_proto_raw_tp; case BPF_FUNC_get_stackid: return &bpf_get_stackid_proto_raw_tp; + case BPF_FUNC_get_stack: + return &bpf_get_stack_proto_raw_tp; default: return tracing_func_proto(func_id, prog); } @@ -755,8 +954,14 @@ static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type return false; if (type != BPF_READ) return false; - if (off % size != 0) - return false; + if (off % size != 0) { + if (sizeof(unsigned long) != 4) + return false; + if (size != 8) + return false; + if (off % size != 4) + return false; + } switch (off) { case bpf_ctx_range(struct bpf_perf_event_data, sample_period): @@ -801,7 +1006,7 @@ static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, } const struct bpf_verifier_ops perf_event_verifier_ops = { - .get_func_proto = tp_prog_func_proto, + .get_func_proto = pe_prog_func_proto, .is_valid_access = pe_prog_is_valid_access, .convert_ctx_access = pe_prog_convert_ctx_access, }; @@ -820,17 +1025,28 @@ int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog_array *new_array; int ret = -EEXIST; + /* + * Kprobe override only works for ftrace based kprobes, and only if they + * are on the opt-in list. + */ + if (prog->kprobe_override && + (!trace_kprobe_ftrace(event->tp_event) || + !trace_kprobe_error_injectable(event->tp_event))) + return -EINVAL; + mutex_lock(&bpf_event_mutex); if (event->prog) goto unlock; old_array = event->tp_event->prog_array; + if (old_array && bpf_prog_array_length(old_array) >= BPF_TRACE_MAX_PROGS) { ret = -E2BIG; goto unlock; } + ret = bpf_prog_array_copy(old_array, NULL, prog, &new_array); if (ret < 0) goto unlock; @@ -858,6 +1074,8 @@ void perf_event_detach_bpf_prog(struct perf_event *event) old_array = event->tp_event->prog_array; ret = bpf_prog_array_copy(old_array, event->prog, NULL, &new_array); + if (ret == -ENOENT) + goto unlock; if (ret < 0) { bpf_prog_array_delete_safe(old_array, event->prog); } else { @@ -885,6 +1103,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info) return -EINVAL; if (copy_from_user(&query, uquery, sizeof(query))) return -EFAULT; + ids_len = query.ids_len; if (ids_len > BPF_TRACE_MAX_PROGS) return -E2BIG; @@ -897,6 +1116,7 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info) * There is no need to check for it since the case is handled * gracefully in bpf_prog_array_copy_info. */ + mutex_lock(&bpf_event_mutex); ret = bpf_prog_array_copy_info(event->tp_event->prog_array, ids, @@ -907,15 +1127,15 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info) if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) || copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) ret = -EFAULT; - kfree(ids); + kfree(ids); return ret; } extern struct bpf_raw_event_map __start__bpf_raw_tp[]; extern struct bpf_raw_event_map __stop__bpf_raw_tp[]; -struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name) +struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name) { struct bpf_raw_event_map *btp = __start__bpf_raw_tp; @@ -923,7 +1143,16 @@ struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name) if (!strcmp(btp->tp->name, name)) return btp; } - return NULL; + + return bpf_get_raw_tracepoint_module(name); +} + +void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp) +{ + struct module *mod = __module_address((unsigned long)btp); + + if (mod) + module_put(mod); } static __always_inline @@ -936,37 +1165,37 @@ void __bpf_trace_run(struct bpf_prog *prog, u64 *args) rcu_read_unlock(); } -#define UNPACK(...) __VA_ARGS__ -#define REPEAT_1(FN, DL, X, ...) FN(X) -#define REPEAT_2(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_1(FN, DL, __VA_ARGS__) -#define REPEAT_3(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_2(FN, DL, __VA_ARGS__) -#define REPEAT_4(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_3(FN, DL, __VA_ARGS__) -#define REPEAT_5(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_4(FN, DL, __VA_ARGS__) -#define REPEAT_6(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_5(FN, DL, __VA_ARGS__) -#define REPEAT_7(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_6(FN, DL, __VA_ARGS__) -#define REPEAT_8(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_7(FN, DL, __VA_ARGS__) -#define REPEAT_9(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_8(FN, DL, __VA_ARGS__) -#define REPEAT_10(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_9(FN, DL, __VA_ARGS__) -#define REPEAT_11(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_10(FN, DL, __VA_ARGS__) -#define REPEAT_12(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_11(FN, DL, __VA_ARGS__) -#define REPEAT(X, FN, DL, ...) REPEAT_##X(FN, DL, __VA_ARGS__) +#define UNPACK(...) __VA_ARGS__ +#define REPEAT_1(FN, DL, X, ...) FN(X) +#define REPEAT_2(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_1(FN, DL, __VA_ARGS__) +#define REPEAT_3(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_2(FN, DL, __VA_ARGS__) +#define REPEAT_4(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_3(FN, DL, __VA_ARGS__) +#define REPEAT_5(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_4(FN, DL, __VA_ARGS__) +#define REPEAT_6(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_5(FN, DL, __VA_ARGS__) +#define REPEAT_7(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_6(FN, DL, __VA_ARGS__) +#define REPEAT_8(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_7(FN, DL, __VA_ARGS__) +#define REPEAT_9(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_8(FN, DL, __VA_ARGS__) +#define REPEAT_10(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_9(FN, DL, __VA_ARGS__) +#define REPEAT_11(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_10(FN, DL, __VA_ARGS__) +#define REPEAT_12(FN, DL, X, ...) FN(X) UNPACK DL REPEAT_11(FN, DL, __VA_ARGS__) +#define REPEAT(X, FN, DL, ...) REPEAT_##X(FN, DL, __VA_ARGS__) -#define SARG(X) u64 arg##X -#define COPY(X) args[X] = arg##X +#define SARG(X) u64 arg##X +#define COPY(X) args[X] = arg##X -#define __DL_COM (,) -#define __DL_SEM (;) +#define __DL_COM (,) +#define __DL_SEM (;) -#define __SEQ_0_11 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 +#define __SEQ_0_11 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 -#define BPF_TRACE_DEFN_x(x) \ - void bpf_trace_run##x(struct bpf_prog *prog, \ - REPEAT(x, SARG, __DL_COM, __SEQ_0_11)) \ - { \ - u64 args[x]; \ - REPEAT(x, COPY, __DL_SEM, __SEQ_0_11); \ - __bpf_trace_run(prog, args); \ - } \ +#define BPF_TRACE_DEFN_x(x) \ + void bpf_trace_run##x(struct bpf_prog *prog, \ + REPEAT(x, SARG, __DL_COM, __SEQ_0_11)) \ + { \ + u64 args[x]; \ + REPEAT(x, COPY, __DL_SEM, __SEQ_0_11); \ + __bpf_trace_run(prog, args); \ + } \ EXPORT_SYMBOL_GPL(bpf_trace_run##x) BPF_TRACE_DEFN_x(1); BPF_TRACE_DEFN_x(2); @@ -983,19 +1212,21 @@ BPF_TRACE_DEFN_x(12); static int __bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog) { -struct tracepoint *tp = btp->tp; + struct tracepoint *tp = btp->tp; /* * check that program doesn't access arguments beyond what's * available in this tracepoint - */ + */ if (prog->aux->max_ctx_offset > btp->num_args * sizeof(u64)) return -EINVAL; if (prog->aux->max_tp_access > btp->writable_size) return -EINVAL; - return tracepoint_probe_register(tp, (void *)btp->bpf_func, prog); + return tracepoint_probe_register_may_exist(tp, (void *)btp->bpf_func, + prog); + } int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog) @@ -1017,3 +1248,99 @@ int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog) mutex_unlock(&bpf_event_mutex); return err; } + +int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, + u32 *fd_type, const char **buf, + u64 *probe_offset, u64 *probe_addr) +{ + bool is_tracepoint, is_syscall_tp; + struct bpf_prog *prog; + int flags, err = 0; + + prog = event->prog; + if (!prog) + return -ENOENT; + + /* not supporting BPF_PROG_TYPE_PERF_EVENT yet */ + if (prog->type == BPF_PROG_TYPE_PERF_EVENT) + return -EOPNOTSUPP; + + *prog_id = prog->aux->id; + flags = event->tp_event->flags; + is_tracepoint = flags & TRACE_EVENT_FL_TRACEPOINT; + is_syscall_tp = is_syscall_trace_event(event->tp_event); + + if (is_tracepoint || is_syscall_tp) { + *buf = is_tracepoint ? event->tp_event->tp->name + : event->tp_event->name; + *fd_type = BPF_FD_TYPE_TRACEPOINT; + *probe_offset = 0x0; + *probe_addr = 0x0; + } else { + /* kprobe/uprobe */ + err = -EOPNOTSUPP; +#ifdef CONFIG_KPROBE_EVENTS + if (flags & TRACE_EVENT_FL_KPROBE) + err = bpf_get_kprobe_info(event, fd_type, buf, + probe_offset, probe_addr, + event->attr.type == PERF_TYPE_TRACEPOINT); +#endif +#ifdef CONFIG_UPROBE_EVENTS + if (flags & TRACE_EVENT_FL_UPROBE) + err = bpf_get_uprobe_info(event, fd_type, buf, + probe_offset, + event->attr.type == PERF_TYPE_TRACEPOINT); +#endif + } + + return err; +} + +#ifdef CONFIG_MODULES +int bpf_event_notify(struct notifier_block *nb, unsigned long op, void *module) +{ + struct bpf_trace_module *btm, *tmp; + struct module *mod = module; + + if (mod->num_bpf_raw_events == 0 || + (op != MODULE_STATE_COMING && op != MODULE_STATE_GOING)) + return 0; + + mutex_lock(&bpf_module_mutex); + + switch (op) { + case MODULE_STATE_COMING: + btm = kzalloc(sizeof(*btm), GFP_KERNEL); + if (btm) { + btm->module = module; + list_add(&btm->list, &bpf_trace_modules); + } + break; + case MODULE_STATE_GOING: + list_for_each_entry_safe(btm, tmp, &bpf_trace_modules, list) { + if (btm->module == module) { + list_del(&btm->list); + kfree(btm); + break; + } + } + break; + } + + mutex_unlock(&bpf_module_mutex); + + return 0; +} + +static struct notifier_block bpf_module_nb = { + .notifier_call = bpf_event_notify, +}; + +int __init bpf_event_init(void) +{ + register_module_notifier(&bpf_module_nb); + return 0; +} + +fs_initcall(bpf_event_init); +#endif /* CONFIG_MODULES */ diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index e4dc2fae33d6..f08ef5a309af 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -9,6 +9,7 @@ #include #include #include "trace.h" +#include "trace_probe.h" static char __percpu *perf_trace_buf[PERF_NR_CONTEXTS]; @@ -242,6 +243,107 @@ void perf_trace_destroy(struct perf_event *p_event) mutex_unlock(&event_mutex); } +#ifdef CONFIG_KPROBE_EVENTS +int perf_kprobe_init(struct perf_event *p_event, bool is_retprobe) +{ + int ret; + char *func = NULL; + struct trace_event_call *tp_event; + + if (p_event->attr.kprobe_func) { + func = kzalloc(KSYM_NAME_LEN, GFP_KERNEL); + if (!func) + return -ENOMEM; + ret = strncpy_from_user( + func, u64_to_user_ptr(p_event->attr.kprobe_func), + KSYM_NAME_LEN); + if (ret < 0) + goto out; + + if (func[0] == '\0') { + kfree(func); + func = NULL; + } + } + + tp_event = create_local_trace_kprobe( + func, (void *)(unsigned long)(p_event->attr.kprobe_addr), + p_event->attr.probe_offset, is_retprobe); + if (IS_ERR(tp_event)) { + ret = PTR_ERR(tp_event); + goto out; + } + + ret = perf_trace_event_init(tp_event, p_event); + if (ret) + destroy_local_trace_kprobe(tp_event); +out: + kfree(func); + return ret; +} + +void perf_kprobe_destroy(struct perf_event *p_event) +{ + perf_trace_event_close(p_event); + perf_trace_event_unreg(p_event); + + destroy_local_trace_kprobe(p_event->tp_event); +} +#endif /* CONFIG_KPROBE_EVENTS */ + +#ifdef CONFIG_UPROBE_EVENTS +int perf_uprobe_init(struct perf_event *p_event, bool is_retprobe) +{ + int ret; + char *path = NULL; + struct trace_event_call *tp_event; + + if (!p_event->attr.uprobe_path) + return -EINVAL; + path = kzalloc(PATH_MAX, GFP_KERNEL); + if (!path) + return -ENOMEM; + ret = strncpy_from_user( + path, u64_to_user_ptr(p_event->attr.uprobe_path), PATH_MAX); + if (ret < 0) + goto out; + if (path[0] == '\0') { + ret = -EINVAL; + goto out; + } + + tp_event = create_local_trace_uprobe( + path, p_event->attr.probe_offset, is_retprobe); + if (IS_ERR(tp_event)) { + ret = PTR_ERR(tp_event); + goto out; + } + + /* + * local trace_uprobe need to hold event_mutex to call + * uprobe_buffer_enable() and uprobe_buffer_disable(). + * event_mutex is not required for local trace_kprobes. + */ + mutex_lock(&event_mutex); + ret = perf_trace_event_init(tp_event, p_event); + if (ret) + destroy_local_trace_uprobe(tp_event); + mutex_unlock(&event_mutex); +out: + kfree(path); + return ret; +} + +void perf_uprobe_destroy(struct perf_event *p_event) +{ + mutex_lock(&event_mutex); + perf_trace_event_close(p_event); + perf_trace_event_unreg(p_event); + mutex_unlock(&event_mutex); + destroy_local_trace_uprobe(p_event->tp_event); +} +#endif /* CONFIG_UPROBE_EVENTS */ + int perf_trace_add(struct perf_event *p_event, int flags) { struct trace_event_call *tp_event = p_event->tp_event; diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index b0db2e4cefa3..ff938f5b4b59 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -21,6 +21,7 @@ #include #include #include +#include #include "trace_probe.h" @@ -42,6 +43,7 @@ struct trace_kprobe { (offsetof(struct trace_kprobe, tp.args) + \ (sizeof(struct probe_arg) * (n))) +DEFINE_PER_CPU(int, bpf_kprobe_override); static nokprobe_inline bool trace_kprobe_is_return(struct trace_kprobe *tk) { @@ -87,6 +89,27 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk) return nhit; } +int trace_kprobe_ftrace(struct trace_event_call *call) +{ + struct trace_kprobe *tk = (struct trace_kprobe *)call->data; + return kprobe_ftrace(&tk->rp.kp); +} + +int trace_kprobe_error_injectable(struct trace_event_call *call) +{ + struct trace_kprobe *tk = (struct trace_kprobe *)call->data; + unsigned long addr; + + if (tk->symbol) { + addr = (unsigned long) + kallsyms_lookup_name(trace_kprobe_symbol(tk)); + addr += tk->rp.kp.offset; + } else { + addr = (unsigned long)tk->rp.kp.addr; + } + return within_error_injection_list(addr); +} + static int register_kprobe_event(struct trace_kprobe *tk); static int unregister_kprobe_event(struct trace_kprobe *tk); @@ -449,6 +472,14 @@ disable_trace_kprobe(struct trace_kprobe *tk, struct trace_event_file *file) disable_kprobe(&tk->rp.kp); wait = 1; } + + /* + * if tk is not added to any list, it must be a local trace_kprobe + * created with perf_event_open. We don't need to wait for these + * trace_kprobes + */ + if (list_empty(&tk->list)) + wait = 0; out: if (wait) { /* @@ -1183,7 +1214,7 @@ static int kretprobe_event_define_fields(struct trace_event_call *event_call) #ifdef CONFIG_PERF_EVENTS /* Kprobe profile handler */ -static void +static int kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs) { struct trace_event_call *call = &tk->tp.call; @@ -1192,12 +1223,29 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs) int size, __size, dsize; int rctx; - if (bpf_prog_array_valid(call) && !trace_call_bpf(call, regs)) - return; + if (bpf_prog_array_valid(call)) { + int ret; + + ret = trace_call_bpf(call, regs); + + /* + * We need to check and see if we modified the pc of the + * pt_regs, and if so clear the kprobe and return 1 so that we + * don't do the instruction skipping. Also reset our state so + * we are clean the next pass through. + */ + if (__this_cpu_read(bpf_kprobe_override)) { + __this_cpu_write(bpf_kprobe_override, 0); + reset_current_kprobe(); + return 1; + } + if (!ret) + return 0; + } head = this_cpu_ptr(call->perf_events); if (hlist_empty(head)) - return; + return 0; dsize = __get_data_size(&tk->tp, regs); __size = sizeof(*entry) + tk->tp.size + dsize; @@ -1206,13 +1254,14 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs) entry = perf_trace_buf_alloc(size, NULL, &rctx); if (!entry) - return; + return 0; entry->ip = (unsigned long)tk->rp.kp.addr; memset(&entry[1], 0, dsize); store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize); perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs, head, NULL, NULL); + return 0; } NOKPROBE_SYMBOL(kprobe_perf_func); @@ -1250,6 +1299,35 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri, head, NULL, NULL); } NOKPROBE_SYMBOL(kretprobe_perf_func); + +int bpf_get_kprobe_info(const struct perf_event *event, u32 *fd_type, + const char **symbol, u64 *probe_offset, + u64 *probe_addr, bool perf_type_tracepoint) +{ + const char *pevent = trace_event_name(event->tp_event); + const char *group = event->tp_event->class->system; + struct trace_kprobe *tk; + + if (perf_type_tracepoint) + tk = find_trace_kprobe(pevent, group); + else + tk = event->tp_event->data; + if (!tk) + return -EINVAL; + + *fd_type = trace_kprobe_is_return(tk) ? BPF_FD_TYPE_KRETPROBE + : BPF_FD_TYPE_KPROBE; + if (tk->symbol) { + *symbol = tk->symbol; + *probe_offset = tk->rp.kp.offset; + *probe_addr = 0; + } else { + *symbol = NULL; + *probe_offset = 0; + *probe_addr = (unsigned long)tk->rp.kp.addr; + } + return 0; +} #endif /* CONFIG_PERF_EVENTS */ /* @@ -1288,6 +1366,7 @@ static int kprobe_register(struct trace_event_call *event, static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs) { struct trace_kprobe *tk = container_of(kp, struct trace_kprobe, rp.kp); + int ret = 0; raw_cpu_inc(*tk->nhit); @@ -1295,9 +1374,9 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs) kprobe_trace_func(tk, regs); #ifdef CONFIG_PERF_EVENTS if (tk->tp.flags & TP_FLAG_PROFILE) - kprobe_perf_func(tk, regs); + ret = kprobe_perf_func(tk, regs); #endif - return 0; /* We don't tweek kernel, so just return 0 */ + return ret; } NOKPROBE_SYMBOL(kprobe_dispatcher); @@ -1326,12 +1405,9 @@ static struct trace_event_functions kprobe_funcs = { .trace = print_kprobe_event }; -static int register_kprobe_event(struct trace_kprobe *tk) +static inline void init_trace_event_call(struct trace_kprobe *tk, + struct trace_event_call *call) { - struct trace_event_call *call = &tk->tp.call; - int ret; - - /* Initialize trace_event_call */ INIT_LIST_HEAD(&call->class->fields); if (trace_kprobe_is_return(tk)) { call->event.funcs = &kretprobe_funcs; @@ -1340,6 +1416,19 @@ static int register_kprobe_event(struct trace_kprobe *tk) call->event.funcs = &kprobe_funcs; call->class->define_fields = kprobe_event_define_fields; } + + call->flags = TRACE_EVENT_FL_KPROBE; + call->class->reg = kprobe_register; + call->data = tk; +} + +static int register_kprobe_event(struct trace_kprobe *tk) +{ + struct trace_event_call *call = &tk->tp.call; + int ret = 0; + + init_trace_event_call(tk, call); + if (set_print_fmt(&tk->tp, trace_kprobe_is_return(tk)) < 0) return -ENOMEM; ret = register_trace_event(&call->event); @@ -1347,9 +1436,6 @@ static int register_kprobe_event(struct trace_kprobe *tk) kfree(call->print_fmt); return -ENODEV; } - call->flags = TRACE_EVENT_FL_KPROBE; - call->class->reg = kprobe_register; - call->data = tk; ret = trace_add_event_call(call); if (ret) { pr_info("Failed to register kprobe event: %s\n", @@ -1371,6 +1457,66 @@ static int unregister_kprobe_event(struct trace_kprobe *tk) return ret; } +#ifdef CONFIG_PERF_EVENTS +/* create a trace_kprobe, but don't add it to global lists */ +struct trace_event_call * +create_local_trace_kprobe(char *func, void *addr, unsigned long offs, + bool is_return) +{ + struct trace_kprobe *tk; + int ret; + char *event; + + /* + * local trace_kprobes are not added to probe_list, so they are never + * searched in find_trace_kprobe(). Therefore, there is no concern of + * duplicated name here. + */ + event = func ? func : "DUMMY_EVENT"; + + tk = alloc_trace_kprobe(KPROBE_EVENT_SYSTEM, event, (void *)addr, func, + offs, 0 /* maxactive */, 0 /* nargs */, + is_return); + + if (IS_ERR(tk)) { + pr_info("Failed to allocate trace_probe.(%d)\n", + (int)PTR_ERR(tk)); + return ERR_CAST(tk); + } + + init_trace_event_call(tk, &tk->tp.call); + + if (set_print_fmt(&tk->tp, trace_kprobe_is_return(tk)) < 0) { + ret = -ENOMEM; + goto error; + } + + ret = __register_trace_kprobe(tk); + if (ret < 0) + goto error; + + return &tk->tp.call; +error: + free_trace_kprobe(tk); + return ERR_PTR(ret); +} + +void destroy_local_trace_kprobe(struct trace_event_call *event_call) +{ + struct trace_kprobe *tk; + + tk = container_of(event_call, struct trace_kprobe, tp.call); + + if (trace_probe_is_enabled(&tk->tp)) { + WARN_ON(1); + return; + } + + __unregister_trace_kprobe(tk); + free_trace_kprobe(tk); +} +#endif /* CONFIG_PERF_EVENTS */ + /* Make a tracefs interface for controlling probe points */ static __init int init_kprobe_trace(void) { diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h index dc39472ca9e4..e990405a1c9d 100644 --- a/kernel/trace/trace_probe.h +++ b/kernel/trace/trace_probe.h @@ -253,6 +253,8 @@ struct symbol_cache; unsigned long update_symbol_cache(struct symbol_cache *sc); void free_symbol_cache(struct symbol_cache *sc); struct symbol_cache *alloc_symbol_cache(const char *sym, long offset); +int trace_kprobe_ftrace(struct trace_event_call *call); +int trace_kprobe_error_injectable(struct trace_event_call *call); #else /* uprobes do not support symbol fetch methods */ #define fetch_symbol_u8 NULL @@ -278,6 +280,16 @@ alloc_symbol_cache(const char *sym, long offset) { return NULL; } + +static inline int trace_kprobe_ftrace(struct trace_event_call *call) +{ + return 0; +} + +static inline int trace_kprobe_error_injectable(struct trace_event_call *call) +{ + return 0; +} #endif /* CONFIG_KPROBE_EVENTS */ struct probe_arg { @@ -411,3 +423,14 @@ store_trace_args(int ent_size, struct trace_probe *tp, struct pt_regs *regs, } extern int set_print_fmt(struct trace_probe *tp, bool is_return); + +#ifdef CONFIG_PERF_EVENTS +extern struct trace_event_call * +create_local_trace_kprobe(char *func, void *addr, unsigned long offs, + bool is_return); +extern void destroy_local_trace_kprobe(struct trace_event_call *event_call); + +extern struct trace_event_call * +create_local_trace_uprobe(char *name, unsigned long offs, bool is_return); +extern void destroy_local_trace_uprobe(struct trace_event_call *event_call); +#endif diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index b71d116db937..03e51bd9a07a 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -1183,6 +1183,28 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func, { __uprobe_perf_func(tu, func, regs, ucb, dsize); } + +int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type, + const char **filename, u64 *probe_offset, + bool perf_type_tracepoint) +{ + const char *pevent = trace_event_name(event->tp_event); + const char *group = event->tp_event->class->system; + struct trace_uprobe *tu; + + if (perf_type_tracepoint) + tu = find_probe_event(pevent, group); + else + tu = event->tp_event->data; + if (!tu) + return -EINVAL; + + *fd_type = is_ret_probe(tu) ? BPF_FD_TYPE_URETPROBE + : BPF_FD_TYPE_UPROBE; + *filename = tu->filename; + *probe_offset = tu->offset; + return 0; +} #endif /* CONFIG_PERF_EVENTS */ static int @@ -1297,16 +1319,25 @@ static struct trace_event_functions uprobe_funcs = { .trace = print_uprobe_event }; -static int register_uprobe_event(struct trace_uprobe *tu) +static inline void init_trace_event_call(struct trace_uprobe *tu, + struct trace_event_call *call) { - struct trace_event_call *call = &tu->tp.call; - int ret; - - /* Initialize trace_event_call */ INIT_LIST_HEAD(&call->class->fields); call->event.funcs = &uprobe_funcs; call->class->define_fields = uprobe_event_define_fields; + call->flags = TRACE_EVENT_FL_UPROBE; + call->class->reg = trace_uprobe_register; + call->data = tu; +} + +static int register_uprobe_event(struct trace_uprobe *tu) +{ + struct trace_event_call *call = &tu->tp.call; + int ret = 0; + + init_trace_event_call(tu, call); + if (set_print_fmt(&tu->tp, is_ret_probe(tu)) < 0) return -ENOMEM; @@ -1316,9 +1347,6 @@ static int register_uprobe_event(struct trace_uprobe *tu) return -ENODEV; } - call->flags = TRACE_EVENT_FL_UPROBE; - call->class->reg = trace_uprobe_register; - call->data = tu; ret = trace_add_event_call(call); if (ret) { @@ -1344,6 +1372,70 @@ static int unregister_uprobe_event(struct trace_uprobe *tu) return 0; } +#ifdef CONFIG_PERF_EVENTS +struct trace_event_call * +create_local_trace_uprobe(char *name, unsigned long offs, bool is_return) +{ + struct trace_uprobe *tu; + struct inode *inode; + struct path path; + int ret; + + ret = kern_path(name, LOOKUP_FOLLOW, &path); + if (ret) + return ERR_PTR(ret); + + inode = igrab(d_inode(path.dentry)); + path_put(&path); + + if (!inode || !S_ISREG(inode->i_mode)) { + iput(inode); + return ERR_PTR(-EINVAL); + } + + /* + * local trace_kprobes are not added to probe_list, so they are never + * searched in find_trace_kprobe(). Therefore, there is no concern of + * duplicated name "DUMMY_EVENT" here. + */ + tu = alloc_trace_uprobe(UPROBE_EVENT_SYSTEM, "DUMMY_EVENT", 0, + is_return); + + if (IS_ERR(tu)) { + pr_info("Failed to allocate trace_uprobe.(%d)\n", + (int)PTR_ERR(tu)); + return ERR_CAST(tu); + } + + tu->offset = offs; + tu->inode = inode; + tu->filename = kstrdup(name, GFP_KERNEL); + init_trace_event_call(tu, &tu->tp.call); + + if (set_print_fmt(&tu->tp, is_ret_probe(tu)) < 0) { + ret = -ENOMEM; + goto error; + } + + return &tu->tp.call; +error: + free_trace_uprobe(tu); + return ERR_PTR(ret); +} + +void destroy_local_trace_uprobe(struct trace_event_call *event_call) +{ + struct trace_uprobe *tu; + + tu = container_of(event_call, struct trace_uprobe, tp.call); + + kfree(tu->tp.call.print_fmt); + tu->tp.call.print_fmt = NULL; + + free_trace_uprobe(tu); +} +#endif /* CONFIG_PERF_EVENTS */ + /* Make a trace interface for controling probe points */ static __init int init_uprobe_trace(void) { diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c index 6354c5f24c7e..f822aaaa32a6 100644 --- a/kernel/trace/tracing_map.c +++ b/kernel/trace/tracing_map.c @@ -355,7 +355,7 @@ static struct tracing_map_elt *get_free_elt(struct tracing_map *map) struct tracing_map_elt *elt = NULL; int idx; - idx = __atomic_add_unless(&map->next_elt, 1, map->max_elts); + idx = atomic_fetch_add_unless(&map->next_elt, 1, map->max_elts); if (idx < map->max_elts) { elt = *(TRACING_MAP_ELT(map->elts, idx)); if (map->ops && map->ops->elt_init) diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c index b65b2e7fd850..8a8f0ea3bc6f 100644 --- a/kernel/tracepoint.c +++ b/kernel/tracepoint.c @@ -238,7 +238,8 @@ static void *func_remove(struct tracepoint_func **funcs, * Add the probe function to a tracepoint. */ static int tracepoint_add_func(struct tracepoint *tp, - struct tracepoint_func *func, int prio) + struct tracepoint_func *func, int prio, + bool warn) { struct tracepoint_func *old, *tp_funcs; int ret; @@ -253,7 +254,7 @@ static int tracepoint_add_func(struct tracepoint *tp, lockdep_is_held(&tracepoints_mutex)); old = func_add(&tp_funcs, func, prio); if (IS_ERR(old)) { - WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM); + WARN_ON_ONCE(warn && PTR_ERR(old) != -ENOMEM); return PTR_ERR(old); } @@ -305,6 +306,32 @@ static int tracepoint_remove_func(struct tracepoint *tp, return 0; } +/** + * tracepoint_probe_register_prio_may_exist - Connect a probe to a tracepoint with priority + * @tp: tracepoint + * @probe: probe handler + * @data: tracepoint data + * @prio: priority of this function over other registered functions + * + * Same as tracepoint_probe_register_prio() except that it will not warn + * if the tracepoint is already registered. + */ +int tracepoint_probe_register_prio_may_exist(struct tracepoint *tp, void *probe, + void *data, int prio) +{ + struct tracepoint_func tp_func; + int ret; + + mutex_lock(&tracepoints_mutex); + tp_func.func = probe; + tp_func.data = data; + tp_func.prio = prio; + ret = tracepoint_add_func(tp, &tp_func, prio, false); + mutex_unlock(&tracepoints_mutex); + return ret; +} +EXPORT_SYMBOL_GPL(tracepoint_probe_register_prio_may_exist); + /** * tracepoint_probe_register - Connect a probe to a tracepoint * @tp: tracepoint @@ -328,7 +355,7 @@ int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, tp_func.func = probe; tp_func.data = data; tp_func.prio = prio; - ret = tracepoint_add_func(tp, &tp_func, prio); + ret = tracepoint_add_func(tp, &tp_func, prio, true); mutex_unlock(&tracepoints_mutex); return ret; } diff --git a/kernel/umh.c b/kernel/umh.c index 9c41f05f969b..0e0f2738d920 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -26,6 +26,8 @@ #include #include #include +#include +#include #include @@ -36,6 +38,8 @@ static kernel_cap_t usermodehelper_bset = CAP_FULL_SET; static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET; static DEFINE_SPINLOCK(umh_sysctl_lock); static DECLARE_RWSEM(umhelper_sem); +static LIST_HEAD(umh_list); +static DEFINE_MUTEX(umh_list_lock); static void call_usermodehelper_freeinfo(struct subprocess_info *info) { @@ -106,9 +110,16 @@ static int call_usermodehelper_exec_async(void *data) commit_creds(new); - retval = do_execve(getname_kernel(sub_info->path), - (const char __user *const __user *)sub_info->argv, - (const char __user *const __user *)sub_info->envp); + sub_info->pid = task_pid_nr(current); + if (sub_info->file) { + retval = do_execve_file(sub_info->file, + sub_info->argv, sub_info->envp); + if (!retval) + current->flags |= PF_UMH; + } else + retval = do_execve(getname_kernel(sub_info->path), + (const char __user *const __user *)sub_info->argv, + (const char __user *const __user *)sub_info->envp); out: sub_info->retval = retval; /* @@ -402,6 +413,134 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv, } EXPORT_SYMBOL(call_usermodehelper_setup); +struct subprocess_info *call_usermodehelper_setup_file(struct file *file, + int (*init)(struct subprocess_info *info, struct cred *new), + void (*cleanup)(struct subprocess_info *info), void *data) +{ + struct subprocess_info *sub_info; + struct umh_info *info = data; + const char *cmdline = (info->cmdline) ? info->cmdline : "usermodehelper"; + + sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL); + if (!sub_info) + return NULL; + + sub_info->argv = argv_split(GFP_KERNEL, cmdline, NULL); + if (!sub_info->argv) { + kfree(sub_info); + return NULL; + } + + INIT_WORK(&sub_info->work, call_usermodehelper_exec_work); + sub_info->path = "none"; + sub_info->file = file; + sub_info->init = init; + sub_info->cleanup = cleanup; + sub_info->data = data; + return sub_info; +} + +static int umh_pipe_setup(struct subprocess_info *info, struct cred *new) +{ + struct umh_info *umh_info = info->data; + struct file *from_umh[2]; + struct file *to_umh[2]; + int err; + + /* create pipe to send data to umh */ + err = create_pipe_files(to_umh, 0); + if (err) + return err; + err = replace_fd(0, to_umh[0], 0); + fput(to_umh[0]); + if (err < 0) { + fput(to_umh[1]); + return err; + } + + /* create pipe to receive data from umh */ + err = create_pipe_files(from_umh, 0); + if (err) { + fput(to_umh[1]); + replace_fd(0, NULL, 0); + return err; + } + err = replace_fd(1, from_umh[1], 0); + fput(from_umh[1]); + if (err < 0) { + fput(to_umh[1]); + replace_fd(0, NULL, 0); + fput(from_umh[0]); + return err; + } + + umh_info->pipe_to_umh = to_umh[1]; + umh_info->pipe_from_umh = from_umh[0]; + return 0; +} + +static void umh_clean_and_save_pid(struct subprocess_info *info) +{ + struct umh_info *umh_info = info->data; + + argv_free(info->argv); + umh_info->pid = info->pid; +} + +/** + * fork_usermode_blob - fork a blob of bytes as a usermode process + * @data: a blob of bytes that can be do_execv-ed as a file + * @len: length of the blob + * @info: information about usermode process (shouldn't be NULL) + * + * If info->cmdline is set it will be used as command line for the + * user process, else "usermodehelper" is used. + * + * Returns either negative error or zero which indicates success + * in executing a blob of bytes as a usermode process. In such + * case 'struct umh_info *info' is populated with two pipes + * and a pid of the process. The caller is responsible for health + * check of the user process, killing it via pid, and closing the + * pipes when user process is no longer needed. + */ +int fork_usermode_blob(void *data, size_t len, struct umh_info *info) +{ + struct subprocess_info *sub_info; + struct file *file; + ssize_t written; + loff_t pos = 0; + int err; + + file = shmem_kernel_file_setup("", len, 0); + if (IS_ERR(file)) + return PTR_ERR(file); + + written = kernel_write(file, data, len, &pos); + if (written != len) { + err = written; + if (err >= 0) + err = -ENOMEM; + goto out; + } + + err = -ENOMEM; + sub_info = call_usermodehelper_setup_file(file, umh_pipe_setup, + umh_clean_and_save_pid, info); + if (!sub_info) + goto out; + + err = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC); + if (!err) { + mutex_lock(&umh_list_lock); + list_add(&info->list, &umh_list); + mutex_unlock(&umh_list_lock); + } +out: + fput(file); + return err; +} +EXPORT_SYMBOL_GPL(fork_usermode_blob); + /** * call_usermodehelper_exec - start a usermode application * @sub_info: information about the subprocessa @@ -563,6 +702,26 @@ static int proc_cap_handler(struct ctl_table *table, int write, return 0; } +void __exit_umh(struct task_struct *tsk) +{ + struct umh_info *info; + pid_t pid = tsk->pid; + + mutex_lock(&umh_list_lock); + list_for_each_entry(info, &umh_list, list) { + if (info->pid == pid) { + list_del(&info->list); + mutex_unlock(&umh_list_lock); + goto out; + } + } + mutex_unlock(&umh_list_lock); + return; +out: + if (info->cleanup) + info->cleanup(info); +} + struct ctl_table usermodehelper_table[] = { { .procname = "bset", diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 50c471e7282c..4109023e2953 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -208,6 +208,14 @@ config DEBUG_INFO_DWARF4 But it significantly improves the success of resolving variables in gdb on optimized code. +config DEBUG_INFO_BTF + bool "Generate BTF typeinfo" + depends on DEBUG_INFO + help + Generate deduplicated BTF type information from DWARF debug info. + Turning this on expects presence of pahole tool, which will convert + DWARF type info into equivalent deduplicated BTF type info. + config GDB_SCRIPTS bool "Provide GDB scripts for kernel debugging" depends on DEBUG_INFO @@ -1614,6 +1622,10 @@ config FAULT_INJECTION Provide fault-injection framework. For more details, see Documentation/fault-injection/. +config FUNCTION_ERROR_INJECTION + def_bool y + depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES + config FAILSLAB bool "Fault-injection capability for kmalloc" depends on FAULT_INJECTION diff --git a/lib/Makefile b/lib/Makefile index d78b53461f86..fdafdf6663d0 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -165,6 +165,7 @@ obj-$(CONFIG_NETDEV_NOTIFIER_ERROR_INJECT) += netdev-notifier-error-inject.o obj-$(CONFIG_MEMORY_NOTIFIER_ERROR_INJECT) += memory-notifier-error-inject.o obj-$(CONFIG_OF_RECONFIG_NOTIFIER_ERROR_INJECT) += \ of-reconfig-notifier-error-inject.o +obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o lib-$(CONFIG_GENERIC_BUG) += bug.o diff --git a/lib/error-inject.c b/lib/error-inject.c new file mode 100644 index 000000000000..bccadcf3c981 --- /dev/null +++ b/lib/error-inject.c @@ -0,0 +1,213 @@ +// SPDX-License-Identifier: GPL-2.0 +// error-inject.c: Function-level error injection table +#include +#include +#include +#include +#include +#include +#include +#include + +/* Whitelist of symbols that can be overridden for error injection. */ +static LIST_HEAD(error_injection_list); +static DEFINE_MUTEX(ei_mutex); +struct ei_entry { + struct list_head list; + unsigned long start_addr; + unsigned long end_addr; + void *priv; +}; + +bool within_error_injection_list(unsigned long addr) +{ + struct ei_entry *ent; + bool ret = false; + + mutex_lock(&ei_mutex); + list_for_each_entry(ent, &error_injection_list, list) { + if (addr >= ent->start_addr && addr < ent->end_addr) { + ret = true; + break; + } + } + mutex_unlock(&ei_mutex); + return ret; +} + +/* + * Lookup and populate the error_injection_list. + * + * For safety reasons we only allow certain functions to be overridden with + * bpf_error_injection, so we need to populate the list of the symbols that have + * been marked as safe for overriding. + */ +static void populate_error_injection_list(unsigned long *start, + unsigned long *end, void *priv) +{ + unsigned long *iter; + struct ei_entry *ent; + unsigned long entry, offset = 0, size = 0; + + mutex_lock(&ei_mutex); + for (iter = start; iter < end; iter++) { + entry = arch_deref_entry_point((void *)*iter); + + if (!kernel_text_address(entry) || + !kallsyms_lookup_size_offset(entry, &size, &offset)) { + pr_err("Failed to find error inject entry at %p\n", + (void *)entry); + continue; + } + + ent = kmalloc(sizeof(*ent), GFP_KERNEL); + if (!ent) + break; + ent->start_addr = entry; + ent->end_addr = entry + size; + ent->priv = priv; + INIT_LIST_HEAD(&ent->list); + list_add_tail(&ent->list, &error_injection_list); + } + mutex_unlock(&ei_mutex); +} + +/* Markers of the _error_inject_whitelist section */ +extern unsigned long __start_error_injection_whitelist[]; +extern unsigned long __stop_error_injection_whitelist[]; + +static void __init populate_kernel_ei_list(void) +{ + populate_error_injection_list(__start_error_injection_whitelist, + __stop_error_injection_whitelist, + NULL); +} + +#ifdef CONFIG_MODULES +static void module_load_ei_list(struct module *mod) +{ + if (!mod->num_ei_funcs) + return; + + populate_error_injection_list(mod->ei_funcs, + mod->ei_funcs + mod->num_ei_funcs, mod); +} + +static void module_unload_ei_list(struct module *mod) +{ + struct ei_entry *ent, *n; + + if (!mod->num_ei_funcs) + return; + + mutex_lock(&ei_mutex); + list_for_each_entry_safe(ent, n, &error_injection_list, list) { + if (ent->priv == mod) { + list_del_init(&ent->list); + kfree(ent); + } + } + mutex_unlock(&ei_mutex); +} + +/* Module notifier call back, checking error injection table on the module */ +static int ei_module_callback(struct notifier_block *nb, + unsigned long val, void *data) +{ + struct module *mod = data; + + if (val == MODULE_STATE_COMING) + module_load_ei_list(mod); + else if (val == MODULE_STATE_GOING) + module_unload_ei_list(mod); + + return NOTIFY_DONE; +} + +static struct notifier_block ei_module_nb = { + .notifier_call = ei_module_callback, + .priority = 0 +}; + +static __init int module_ei_init(void) +{ + return register_module_notifier(&ei_module_nb); +} +#else /* !CONFIG_MODULES */ +#define module_ei_init() (0) +#endif + +/* + * error_injection/whitelist -- shows which functions can be overridden for + * error injection. + */ +static void *ei_seq_start(struct seq_file *m, loff_t *pos) +{ + mutex_lock(&ei_mutex); + return seq_list_start(&error_injection_list, *pos); +} + +static void ei_seq_stop(struct seq_file *m, void *v) +{ + mutex_unlock(&ei_mutex); +} + +static void *ei_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + return seq_list_next(v, &error_injection_list, pos); +} + +static int ei_seq_show(struct seq_file *m, void *v) +{ + struct ei_entry *ent = list_entry(v, struct ei_entry, list); + + seq_printf(m, "%pf\n", (void *)ent->start_addr); + return 0; +} + +static const struct seq_operations ei_seq_ops = { + .start = ei_seq_start, + .next = ei_seq_next, + .stop = ei_seq_stop, + .show = ei_seq_show, +}; + +static int ei_open(struct inode *inode, struct file *filp) +{ + return seq_open(filp, &ei_seq_ops); +} + +static const struct file_operations debugfs_ei_ops = { + .open = ei_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + +static int __init ei_debugfs_init(void) +{ + struct dentry *dir, *file; + + dir = debugfs_create_dir("error_injection", NULL); + if (!dir) + return -ENOMEM; + + file = debugfs_create_file("list", 0444, dir, NULL, &debugfs_ei_ops); + if (!file) { + debugfs_remove(dir); + return -ENOMEM; + } + + return 0; +} + +static int __init init_error_injection(void) +{ + populate_kernel_ei_list(); + + if (!module_ei_init()) + ei_debugfs_init(); + + return 0; +} +late_initcall(init_error_injection); diff --git a/lib/test_bpf.c b/lib/test_bpf.c index 724674c421ca..a811848b5db4 100644 --- a/lib/test_bpf.c +++ b/lib/test_bpf.c @@ -1,16 +1,8 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Testsuite for BPF interpreter and BPF JIT compiler * * Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 60a95211efa6..7c6a8d6b777a 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -42,7 +42,6 @@ #include "../mm/internal.h" /* For the trace_print_flags arrays */ #include /* for PAGE_SIZE */ -#include /* for dereference_function_descriptor() */ #include /* cpu_to_le16 */ #include @@ -1891,10 +1890,10 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr, switch (*fmt) { case 'F': case 'f': - ptr = dereference_function_descriptor(ptr); - /* Fallthrough */ case 'S': case 's': + ptr = dereference_symbol_descriptor(ptr); + /* Fallthrough */ case 'B': return symbol_string(buf, end, ptr, spec, fmt); case 'R': diff --git a/mm/debug.c b/mm/debug.c index 97609290dd51..81442951281c 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -105,7 +105,7 @@ void dump_mm(const struct mm_struct *mm) "get_unmapped_area %px\n" #endif "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n" - "pgd %px mm_users %d mm_count %d nr_ptes %lu nr_pmds %lu map_count %d\n" + "pgd %px mm_users %d mm_count %d pgtables_bytes %lu map_count %d\n" "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n" "pinned_vm %lx data_vm %lx exec_vm %lx stack_vm %lx\n" "start_code %lx end_code %lx start_data %lx end_data %lx\n" @@ -135,8 +135,7 @@ void dump_mm(const struct mm_struct *mm) mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end, mm->pgd, atomic_read(&mm->mm_users), atomic_read(&mm->mm_count), - atomic_long_read((atomic_long_t *)&mm->nr_ptes), - mm_nr_pmds((struct mm_struct *)mm), + mm_pgtables_bytes(mm), mm->map_count, mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm, mm->pinned_vm, mm->data_vm, mm->exec_vm, mm->stack_vm, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 5932e9160606..9737b3975b05 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -610,7 +610,7 @@ static int __do_huge_pmd_anonymous_page(struct vm_fault *vmf, struct page *page, pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); - atomic_long_inc(&vma->vm_mm->nr_ptes); + mm_inc_nr_ptes(vma->vm_mm); spin_unlock(vmf->ptl); count_vm_event(THP_FAULT_ALLOC); } @@ -666,7 +666,7 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm, if (pgtable) pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, haddr, pmd, entry); - atomic_long_inc(&mm->nr_ptes); + mm_inc_nr_ptes(mm); return true; } @@ -750,7 +750,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, if (pgtable) { pgtable_trans_huge_deposit(mm, pmd, pgtable); - atomic_long_inc(&mm->nr_ptes); + mm_inc_nr_ptes(mm); } set_pmd_at(mm, addr, pmd, entry); @@ -938,7 +938,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, set_pmd_at(src_mm, addr, src_pmd, pmd); } add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); - atomic_long_inc(&dst_mm->nr_ptes); + mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); set_pmd_at(dst_mm, addr, dst_pmd, pmd); ret = 0; @@ -974,7 +974,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, get_page(src_page); page_dup_rmap(src_page, true); add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); - atomic_long_inc(&dst_mm->nr_ptes); + mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); pmdp_set_wrprotect(src_mm, addr, src_pmd); @@ -1666,7 +1666,7 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) pgtable = pgtable_trans_huge_withdraw(mm, pmd); pte_free(mm, pgtable); - atomic_long_dec(&mm->nr_ptes); + mm_dec_nr_ptes(mm); } int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 7869c4a63756..e3775096a338 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3209,7 +3209,7 @@ static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr) * hugegpage VMA. do_page_fault() is supposed to trap this, so BUG is we get * this far. */ -static int hugetlb_vm_op_fault(struct vm_fault *vmf) +static vm_fault_t hugetlb_vm_op_fault(struct vm_fault *vmf) { BUG(); return 0; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 644f0a9c8a55..f5205e7fb629 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1313,6 +1313,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) /* assume page table is clear */ _pmd = pmdp_collapse_flush(vma, addr, pmd); spin_unlock(ptl); + mm_dec_nr_ptes(vma->vm_mm); atomic_long_dec(&mm->nr_ptes); tlb_remove_table_sync_one(); pte_free(mm, pmd_pgtable(_pmd)); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 9c595fd53766..b4c92d212df0 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5974,6 +5974,10 @@ void mem_cgroup_sk_alloc(struct sock *sk) if (in_interrupt()) return; + /* Do not associate the sock with unrelated interrupted task's memcg. */ + if (in_interrupt()) + return; + rcu_read_lock(); memcg = mem_cgroup_from_task(current); if (memcg == root_mem_cgroup) diff --git a/mm/memory.c b/mm/memory.c index 14847ad0cbb0..d68a43b3d6d2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -474,7 +474,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, pgtable_t token = pmd_pgtable(*pmd); pmd_clear(pmd); pte_free_tlb(tlb, token, addr); - atomic_long_dec(&tlb->mm->nr_ptes); + mm_dec_nr_ptes(tlb->mm); } static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud, @@ -542,6 +542,7 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d, pud = pud_offset(p4d, start); p4d_clear(p4d); pud_free_tlb(tlb, pud, start); + mm_dec_nr_puds(tlb->mm); } static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd, @@ -701,7 +702,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address) ptl = pmd_lock(mm, pmd); if (likely(pmd_none(*pmd))) { /* Has another populated it ? */ - atomic_long_inc(&mm->nr_ptes); + mm_inc_nr_ptes(mm); pmd_populate(mm, pmd, new); new = NULL; } @@ -3407,7 +3408,7 @@ static int pte_alloc_one_map(struct vm_fault *vmf) goto map_pte; } - atomic_long_inc(&vma->vm_mm->nr_ptes); + mm_inc_nr_ptes(vma->vm_mm); pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte); spin_unlock(vmf->ptl); vmf->prealloc_pte = NULL; @@ -3466,7 +3467,7 @@ static void deposit_prealloc_pte(struct vm_fault *vmf) * We are going to consume the prealloc table, * count that as nr_ptes. */ - atomic_long_inc(&vma->vm_mm->nr_ptes); + mm_inc_nr_ptes(vma->vm_mm); vmf->prealloc_pte = NULL; } @@ -4362,15 +4363,17 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address) spin_lock(&mm->page_table_lock); #ifndef __ARCH_HAS_5LEVEL_HACK - if (p4d_present(*p4d)) /* Another has populated it */ - pud_free(mm, new); - else + if (!p4d_present(*p4d)) { + mm_inc_nr_puds(mm); p4d_populate(mm, p4d, new); -#else - if (pgd_present(*p4d)) /* Another has populated it */ + } else /* Another has populated it */ pud_free(mm, new); - else +#else + if (!pgd_present(*p4d)) { + mm_inc_nr_puds(mm); pgd_populate(mm, p4d, new); + } else /* Another has populated it */ + pud_free(mm, new); #endif /* __ARCH_HAS_5LEVEL_HACK */ spin_unlock(&mm->page_table_lock); return 0; diff --git a/mm/mmap.c b/mm/mmap.c index 40074c94c6b6..795a22380a2c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3305,7 +3305,7 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages) mm->data_vm += npages; } -static int special_mapping_fault(struct vm_fault *vmf); +static vm_fault_t special_mapping_fault(struct vm_fault *vmf); /* * Having a close hook prevents vma merging regardless of flags. @@ -3344,7 +3344,7 @@ static const struct vm_operations_struct legacy_special_mapping_vmops = { .fault = special_mapping_fault, }; -static int special_mapping_fault(struct vm_fault *vmf) +static vm_fault_t special_mapping_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; pgoff_t pgoff; diff --git a/mm/oom_kill.c b/mm/oom_kill.c index b67a1183700c..a478c5b8218f 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -206,7 +206,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg, * task's rss, pagetable and swap space use. */ points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) + - atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm); + mm_pgtables_bytes(p->mm) / PAGE_SIZE; task_unlock(p); /* Normalize to oom_score_adj units */ @@ -367,8 +367,8 @@ static void select_bad_process(struct oom_control *oc) * Dumps the current memory state of all eligible tasks. Tasks not in the same * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes * are not shown. - * State information includes task's pid, uid, tgid, vm size, rss, nr_ptes, - * swapents, oom_score_adj value, and name. + * State information includes task's pid, uid, tgid, vm size, rss, + * pgtables_bytes, swapents, oom_score_adj value, and name. */ void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) { @@ -388,7 +388,7 @@ void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) swap_comp_nrpages = get_swap_comp_pool_nrpages(); pr_info("[ pid ] uid tgid total_vm total_rss ( rss swap ) nr_ptes nr_pmds swapents oom_score_adj name\n"); #else - pr_info("[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name\n"); + pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n"); #endif rcu_read_lock(); for_each_process(p) { @@ -408,9 +408,9 @@ void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) #if defined(CONFIG_SWAP) task_swap = get_mm_counter(task->mm, MM_SWAPENTS) * swap_comp_nrpages / swap_orig_nrpages; - pr_info("[%5d] %5d %5d %8lu %8lu (%8lu %8lu) %7ld %7ld %8lu %5hd %s\n", + pr_info("[%5d] %5d %5d %8lu %8lu (%8lu %8lu) %8ld %5hd %s\n", #else - pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu %5hd %s\n", + pr_info("[%5d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n", #endif task->pid, from_kuid(&init_user_ns, task_uid(task)), task->tgid, task->mm->total_vm, @@ -420,8 +420,7 @@ void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask) #else get_mm_rss(task->mm), #endif - atomic_long_read(&task->mm->nr_ptes), - mm_nr_pmds(task->mm), + mm_pgtables_bytes(task->mm), get_mm_counter(task->mm, MM_SWAPENTS), task->signal->oom_score_adj, task->comm); cur_rss_sum = get_mm_rss(task->mm) + diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 13de09a6c709..93dddfd211dc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -803,16 +803,14 @@ static inline void rmv_page_order(struct page *page) /* * This function checks whether a page is free && is the buddy - * we can do coalesce a page and its buddy if + * we can coalesce a page and its buddy if * (a) the buddy is not in a hole (check before calling!) && * (b) the buddy is in the buddy system && * (c) a page and its buddy have the same order && * (d) a page and its buddy are in the same zone. * - * For recording whether a page is in the buddy system, we set ->_mapcount - * PAGE_BUDDY_MAPCOUNT_VALUE. - * Setting, clearing, and testing _mapcount PAGE_BUDDY_MAPCOUNT_VALUE is - * serialized by zone->lock. + * For recording whether a page is in the buddy system, we set PageBuddy. + * Setting, clearing, and testing PageBuddy is serialized by zone->lock. * * For recording page's order, we use page_private(page). */ @@ -857,9 +855,8 @@ static inline int page_is_buddy(struct page *page, struct page *buddy, * as necessary, plus some accounting needed to play nicely with other * parts of the VM system. * At each level, we keep a list of pages, which are heads of continuous - * free pages of length of (1 << order) and marked with _mapcount - * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page) - * field. + * free pages of length of (1 << order) and marked with PageBuddy. + * Page's order is recorded in page_private(page) field. * So when we are allocating or freeing one, we can derive the state of the * other. That is, if we allocate a small block, and both were * free, the remainder of the region must be split into blocks. @@ -1044,7 +1041,7 @@ static int free_tail_pages_check(struct page *head_page, struct page *page) } switch (page - head_page) { case 1: - /* the first tail page: ->mapping is compound_mapcount() */ + /* the first tail page: ->mapping may be compound_mapcount() */ if (unlikely(compound_mapcount(page))) { bad_page(page, "nonzero compound_mapcount", 0); goto out; diff --git a/mm/sec_mm/show_mem.c b/mm/sec_mm/show_mem.c index 3d48c2b7a876..a016a150978a 100644 --- a/mm/sec_mm/show_mem.c +++ b/mm/sec_mm/show_mem.c @@ -112,11 +112,10 @@ void mm_debug_dump_tasks(void) continue; } - pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu %5hd %s\n", + pr_info("[%5d] %5d %5d %8lu %8lu %7ld %8lu %5hd %s\n", task->pid, from_kuid(&init_user_ns, task_uid(task)), task->tgid, task->mm->total_vm, get_mm_rss(task->mm), - atomic_long_read(&task->mm->nr_ptes), - mm_nr_pmds(task->mm), + mm_pgtables_bytes(task->mm), get_mm_counter(task->mm, MM_SWAPENTS), task->signal->oom_score_adj, task->comm); cur_rss_sum = get_mm_rss(task->mm) + diff --git a/mm/slub.c b/mm/slub.c index b71c85f98566..1684c7fa0fa5 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -68,11 +68,11 @@ do { \ * and to synchronize major metadata changes to slab cache structures. * * The slab_lock is only used for debugging and on arches that do not - * have the ability to do a cmpxchg_double. It only protects the second - * double word in the page struct. Meaning + * have the ability to do a cmpxchg_double. It only protects: * A. page->freelist -> List of object free in a page - * B. page->counters -> Counters of objects - * C. page->frozen -> frozen state + * B. page->inuse -> Number of objects in use + * C. page->objects -> Number of objects in page + * D. page->frozen -> frozen state * * If a slab is frozen then it is exempt from list management. It is not * on any list. The processor that froze the slab is the one who can @@ -383,21 +383,6 @@ static __always_inline void slab_unlock(struct page *page) __bit_spin_unlock(PG_locked, &page->flags); } -static inline void set_page_slub_counters(struct page *page, unsigned long counters_new) -{ - struct page tmp; - tmp.counters = counters_new; - /* - * page->counters can cover frozen/inuse/objects as well - * as page->_refcount. If we assign to ->counters directly - * we run the risk of losing updates to page->_refcount, so - * be careful and only assign to the fields we need. - */ - page->frozen = tmp.frozen; - page->inuse = tmp.inuse; - page->objects = tmp.objects; -} - /* Interrupts must be disabled (for the fallback code to work right) */ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page, void *freelist_old, unsigned long counters_old, @@ -419,7 +404,7 @@ static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page if (page->freelist == freelist_old && page->counters == counters_old) { page->freelist = freelist_new; - set_page_slub_counters(page, counters_new); + page->counters = counters_new; slab_unlock(page); return true; } @@ -458,7 +443,7 @@ static inline bool cmpxchg_double_slab(struct kmem_cache *s, struct page *page, if (page->freelist == freelist_old && page->counters == counters_old) { page->freelist = freelist_new; - set_page_slub_counters(page, counters_new); + page->counters = counters_new; slab_unlock(page); local_irq_restore(flags); return true; @@ -1981,7 +1966,7 @@ static void __free_slab(struct kmem_cache *s, struct page *page) __ClearPageSlabPfmemalloc(page); __ClearPageSlab(page); - page_mapcount_reset(page); + page->mapping = NULL; if (current->reclaim_state) current->reclaim_state->reclaimed_slab += pages; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index b013740c8743..19fb83b1f6b4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -1130,24 +1131,9 @@ static void vb_free(const void *addr, unsigned long size) spin_unlock(&vb->lock); } -/** - * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer - * - * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily - * to amortize TLB flushing overheads. What this means is that any page you - * have now, may, in a former life, have been mapped into kernel virtual - * address by the vmap layer and so there might be some CPUs with TLB entries - * still referencing that page (additional to the regular 1:1 kernel mapping). - * - * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can - * be sure that none of the pages we have control over will have any aliases - * from the vmap layer. - */ -void vm_unmap_aliases(void) +static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush) { - unsigned long start = ULONG_MAX, end = 0; int cpu; - int flush = 0; if (unlikely(!vmap_initialized)) return; @@ -1184,6 +1170,27 @@ void vm_unmap_aliases(void) flush_tlb_kernel_range(start, end); mutex_unlock(&vmap_purge_lock); } + +/** + * vm_unmap_aliases - unmap outstanding lazy aliases in the vmap layer + * + * The vmap/vmalloc layer lazily flushes kernel virtual mappings primarily + * to amortize TLB flushing overheads. What this means is that any page you + * have now, may, in a former life, have been mapped into kernel virtual + * address by the vmap layer and so there might be some CPUs with TLB entries + * still referencing that page (additional to the regular 1:1 kernel mapping). + * + * vm_unmap_aliases flushes all such lazy mappings. After it returns, we can + * be sure that none of the pages we have control over will have any aliases + * from the vmap layer. + */ +void vm_unmap_aliases(void) +{ + unsigned long start = ULONG_MAX, end = 0; + int flush = 0; + + _vm_unmap_aliases(start, end, flush); +} EXPORT_SYMBOL_GPL(vm_unmap_aliases); /** @@ -1608,6 +1615,72 @@ struct vm_struct *remove_vm_area(const void *addr) return NULL; } +static inline void set_area_direct_map(const struct vm_struct *area, + int (*set_direct_map)(struct page *page)) +{ + int i; + + for (i = 0; i < area->nr_pages; i++) + if (page_address(area->pages[i])) + set_direct_map(area->pages[i]); +} + +/* Handle removing and resetting vm mappings related to the vm_struct. */ +static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages) +{ + unsigned long addr = (unsigned long)area->addr; + unsigned long start = ULONG_MAX, end = 0; + int flush_reset = area->flags & VM_FLUSH_RESET_PERMS; + int i; + + /* + * The below block can be removed when all architectures that have + * direct map permissions also have set_direct_map_() implementations. + * This is concerned with resetting the direct map any an vm alias with + * execute permissions, without leaving a RW+X window. + */ + if (flush_reset && !IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) { + set_memory_nx(addr, area->nr_pages); + set_memory_rw(addr, area->nr_pages); + } + + remove_vm_area(area->addr); + + /* If this is not VM_FLUSH_RESET_PERMS memory, no need for the below. */ + if (!flush_reset) + return; + + /* + * If not deallocating pages, just do the flush of the VM area and + * return. + */ + if (!deallocate_pages) { + vm_unmap_aliases(); + return; + } + + /* + * If execution gets here, flush the vm mapping and reset the direct + * map. Find the start and end range of the direct mappings to make sure + * the vm_unmap_aliases() flush includes the direct map. + */ + for (i = 0; i < area->nr_pages; i++) { + if (page_address(area->pages[i])) { + start = min(addr, start); + end = max(addr, end); + } + } + + /* + * Set direct map to something invalid so that it won't be cached if + * there are any accesses after the TLB flush, then flush the TLB and + * reset the direct map permissions to the default. + */ + set_area_direct_map(area, set_direct_map_invalid_noflush); + _vm_unmap_aliases(start, end, 1); + set_area_direct_map(area, set_direct_map_default_noflush); +} + static void __vunmap(const void *addr, int deallocate_pages) { struct vm_struct *area; @@ -1629,7 +1702,8 @@ static void __vunmap(const void *addr, int deallocate_pages) debug_check_no_locks_freed(addr, get_vm_area_size(area)); debug_check_no_obj_freed(addr, get_vm_area_size(area)); - remove_vm_area(addr); + vm_remove_mappings(area, deallocate_pages); + if (deallocate_pages) { int i; @@ -2050,8 +2124,9 @@ EXPORT_SYMBOL(vzalloc_node); void *vmalloc_exec(unsigned long size) { - return __vmalloc_node(size, 1, GFP_KERNEL, PAGE_KERNEL_EXEC, - NUMA_NO_NODE, __builtin_return_address(0)); + return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END, + GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, + NUMA_NO_NODE, __builtin_return_address(0)); } #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32) diff --git a/net/Kconfig b/net/Kconfig index d3b44dba74ec..cb26f0ddc435 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -51,9 +51,6 @@ config NET_INGRESS config NET_EGRESS bool -config SKB_EXTENSIONS - bool - menu "Networking options" source "net/packet/Kconfig" @@ -63,6 +60,7 @@ source "net/xfrm/Kconfig" source "net/iucv/Kconfig" source "net/ncm/Kconfig" source "net/smc/Kconfig" +source "net/xdp/Kconfig" config INET bool "TCP/IP networking" @@ -203,6 +201,8 @@ source "net/bridge/netfilter/Kconfig" endif +source "net/bpfilter/Kconfig" + source "net/dccp/Kconfig" source "net/sctp/Kconfig" source "net/rds/Kconfig" @@ -464,6 +464,9 @@ config MAY_USE_DEVLINK on MAY_USE_DEVLINK to ensure they do not cause link errors when devlink is a loadable module and the driver using it is built-in. +config PAGE_POOL + bool + endif # if NET # Used by archs to tell that they support BPF JIT compiler plus which flavour. diff --git a/net/Makefile b/net/Makefile index 42d3dcff921b..fa37df4ebe66 100644 --- a/net/Makefile +++ b/net/Makefile @@ -20,6 +20,7 @@ obj-$(CONFIG_TLS) += tls/ obj-$(CONFIG_XFRM) += xfrm/ obj-$(CONFIG_UNIX_SCM) += unix/ obj-$(CONFIG_NET) += ipv6/ +obj-$(CONFIG_BPFILTER) += bpfilter/ obj-$(CONFIG_PACKET) += packet/ obj-$(CONFIG_NET_KEY) += key/ obj-$(CONFIG_BRIDGE) += bridge/ @@ -85,6 +86,7 @@ obj-y += l3mdev/ endif obj-$(CONFIG_QRTR) += qrtr/ obj-$(CONFIG_NET_NCSI) += ncsi/ +obj-$(CONFIG_XDP_SOCKETS) += xdp/ obj-$(CONFIG_RMNET_DATA) += rmnet_data/ obj-$(CONFIG_RMNET_USB) += rmnet_usb/ obj-$(CONFIG_KNOX_NCM) += ncm/ diff --git a/net/bpf/Makefile b/net/bpf/Makefile index 27b2992a0692..b0ca361742e4 100644 --- a/net/bpf/Makefile +++ b/net/bpf/Makefile @@ -1 +1 @@ -obj-y := test_run.o +obj-$(CONFIG_BPF_SYSCALL) := test_run.o diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 9f942cccffad..68680f7e89f3 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -11,42 +11,68 @@ #include #include #include +#include +#include -static __always_inline u32 bpf_test_run_one(struct bpf_prog *prog, void *ctx) -{ - u32 ret; - - preempt_disable(); - rcu_read_lock(); - ret = BPF_PROG_RUN(prog, ctx); - rcu_read_unlock(); - preempt_enable(); - - return ret; -} - -static u32 bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *time) +#define CREATE_TRACE_POINTS +#include + +static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, + u32 *retval, u32 *time) { + struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = { NULL }; + enum bpf_cgroup_storage_type stype; u64 time_start, time_spent = 0; - u32 ret = 0, i; + int ret = 0; + u32 i; + + for_each_cgroup_storage_type(stype) { + storage[stype] = bpf_cgroup_storage_alloc(prog, stype); + if (IS_ERR(storage[stype])) { + storage[stype] = NULL; + for_each_cgroup_storage_type(stype) + bpf_cgroup_storage_free(storage[stype]); + return -ENOMEM; + } + } if (!repeat) repeat = 1; + + rcu_read_lock(); + preempt_disable(); time_start = ktime_get_ns(); for (i = 0; i < repeat; i++) { - ret = bpf_test_run_one(prog, ctx); + bpf_cgroup_storage_set(storage); + *retval = BPF_PROG_RUN(prog, ctx); + + if (signal_pending(current)) { + ret = -EINTR; + break; + } + if (need_resched()) { - if (signal_pending(current)) - break; time_spent += ktime_get_ns() - time_start; + preempt_enable(); + rcu_read_unlock(); + cond_resched(); + + rcu_read_lock(); + preempt_disable(); time_start = ktime_get_ns(); } } time_spent += ktime_get_ns() - time_start; + preempt_enable(); + rcu_read_unlock(); + do_div(time_spent, repeat); *time = time_spent > U32_MAX ? U32_MAX : (u32)time_spent; + for_each_cgroup_storage_type(stype) + bpf_cgroup_storage_free(storage[stype]); + return ret; } @@ -56,8 +82,18 @@ static int bpf_test_finish(const union bpf_attr *kattr, { void __user *data_out = u64_to_user_ptr(kattr->test.data_out); int err = -EFAULT; + u32 copy_size = size; - if (data_out && copy_to_user(data_out, data, size)) + /* Clamp copy if the user has provided a size hint, but copy the full + * buffer if not to retain old behaviour. + */ + if (kattr->test.data_size_out && + copy_size > kattr->test.data_size_out) { + copy_size = kattr->test.data_size_out; + err = -ENOSPC; + } + + if (data_out && copy_to_user(data_out, data, copy_size)) goto out; if (copy_to_user(&uattr->test.data_size_out, &size, sizeof(size))) goto out; @@ -65,8 +101,10 @@ static int bpf_test_finish(const union bpf_attr *kattr, goto out; if (copy_to_user(&uattr->test.duration, &duration, sizeof(duration))) goto out; - err = 0; + if (err != -ENOSPC) + err = 0; out: + trace_bpf_test_finish(&err); return err; } @@ -91,15 +129,130 @@ static void *bpf_test_init(const union bpf_attr *kattr, u32 size, return data; } +static void *bpf_ctx_init(const union bpf_attr *kattr, u32 max_size) +{ + void __user *data_in = u64_to_user_ptr(kattr->test.ctx_in); + void __user *data_out = u64_to_user_ptr(kattr->test.ctx_out); + u32 size = kattr->test.ctx_size_in; + void *data; + int err; + + if (!data_in && !data_out) + return NULL; + + data = kzalloc(max_size, GFP_USER); + if (!data) + return ERR_PTR(-ENOMEM); + + if (data_in) { + err = bpf_check_uarg_tail_zero(data_in, max_size, size); + if (err) { + kfree(data); + return ERR_PTR(err); + } + + size = min_t(u32, max_size, size); + if (copy_from_user(data, data_in, size)) { + kfree(data); + return ERR_PTR(-EFAULT); + } + } + return data; +} + +static int bpf_ctx_finish(const union bpf_attr *kattr, + union bpf_attr __user *uattr, const void *data, + u32 size) +{ + void __user *data_out = u64_to_user_ptr(kattr->test.ctx_out); + int err = -EFAULT; + u32 copy_size = size; + + if (!data || !data_out) + return 0; + + if (copy_size > kattr->test.ctx_size_out) { + copy_size = kattr->test.ctx_size_out; + err = -ENOSPC; + } + + if (copy_to_user(data_out, data, copy_size)) + goto out; + if (copy_to_user(&uattr->test.ctx_size_out, &size, sizeof(size))) + goto out; + if (err != -ENOSPC) + err = 0; +out: + return err; +} + +/** + * range_is_zero - test whether buffer is initialized + * @buf: buffer to check + * @from: check from this position + * @to: check up until (excluding) this position + * + * This function returns true if the there is a non-zero byte + * in the buf in the range [from,to). + */ +static inline bool range_is_zero(void *buf, size_t from, size_t to) +{ + return !memchr_inv((u8 *)buf + from, 0, to - from); +} + +static int convert___skb_to_skb(struct sk_buff *skb, struct __sk_buff *__skb) +{ + struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb; + + if (!__skb) + return 0; + + /* make sure the fields we don't use are zeroed */ + if (!range_is_zero(__skb, 0, offsetof(struct __sk_buff, priority))) + return -EINVAL; + + /* priority is allowed */ + + if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) + + FIELD_SIZEOF(struct __sk_buff, priority), + offsetof(struct __sk_buff, cb))) + return -EINVAL; + + /* cb is allowed */ + + if (!range_is_zero(__skb, offsetof(struct __sk_buff, cb) + + FIELD_SIZEOF(struct __sk_buff, cb), + sizeof(struct __sk_buff))) + return -EINVAL; + + skb->priority = __skb->priority; + memcpy(&cb->data, __skb->cb, QDISC_CB_PRIV_LEN); + + return 0; +} + +static void convert_skb_to___skb(struct sk_buff *skb, struct __sk_buff *__skb) +{ + struct qdisc_skb_cb *cb = (struct qdisc_skb_cb *)skb->cb; + + if (!__skb) + return; + + __skb->priority = skb->priority; + memcpy(__skb->cb, &cb->data, QDISC_CB_PRIV_LEN); +} + int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) { bool is_l2 = false, is_direct_pkt_access = false; u32 size = kattr->test.data_size_in; u32 repeat = kattr->test.repeat; + struct __sk_buff *ctx = NULL; u32 retval, duration; int hh_len = ETH_HLEN; struct sk_buff *skb; + struct sock *sk; void *data; int ret; @@ -108,6 +261,12 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, if (IS_ERR(data)) return PTR_ERR(data); + ctx = bpf_ctx_init(kattr, sizeof(struct __sk_buff)); + if (IS_ERR(ctx)) { + kfree(data); + return PTR_ERR(ctx); + } + switch (prog->type) { case BPF_PROG_TYPE_SCHED_CLS: case BPF_PROG_TYPE_SCHED_ACT: @@ -122,11 +281,23 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, break; } + sk = kzalloc(sizeof(struct sock), GFP_USER); + if (!sk) { + kfree(data); + kfree(ctx); + return -ENOMEM; + } + sock_net_set(sk, current->nsproxy->net_ns); + sock_init_data(NULL, sk); + skb = build_skb(data, 0); if (!skb) { kfree(data); + kfree(ctx); + kfree(sk); return -ENOMEM; } + skb->sk = sk; skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN); __skb_put(skb, size); @@ -137,25 +308,38 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, __skb_push(skb, hh_len); if (is_direct_pkt_access) bpf_compute_data_pointers(skb); - retval = bpf_test_run(prog, skb, repeat, &duration); + ret = convert___skb_to_skb(skb, ctx); + if (ret) + goto out; + ret = bpf_test_run(prog, skb, repeat, &retval, &duration); + if (ret) + goto out; if (!is_l2) { if (skb_headroom(skb) < hh_len) { int nhead = HH_DATA_ALIGN(hh_len - skb_headroom(skb)); if (pskb_expand_head(skb, nhead, 0, GFP_USER)) { - kfree_skb(skb); - return -ENOMEM; + ret = -ENOMEM; + goto out; } } memset(__skb_push(skb, hh_len), 0, hh_len); } + convert_skb_to___skb(skb, ctx); size = skb->len; /* bpf program can never convert linear skb to non-linear */ if (WARN_ON_ONCE(skb_is_nonlinear(skb))) size = skb_headlen(skb); ret = bpf_test_finish(kattr, uattr, skb->data, size, retval, duration); + if (!ret) + ret = bpf_ctx_finish(kattr, uattr, ctx, + sizeof(struct __sk_buff)); +out: kfree_skb(skb); + bpf_sk_storage_free(sk); + kfree(sk); + kfree(ctx); return ret; } @@ -164,11 +348,15 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, { u32 size = kattr->test.data_size_in; u32 repeat = kattr->test.repeat; + struct netdev_rx_queue *rxqueue; struct xdp_buff xdp = {}; u32 retval, duration; void *data; int ret; + if (kattr->test.ctx_in || kattr->test.ctx_out) + return -EINVAL; + data = bpf_test_init(kattr, size, XDP_PACKET_HEADROOM + NET_IP_ALIGN, 0); if (IS_ERR(data)) return PTR_ERR(data); @@ -178,10 +366,127 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, xdp.data_meta = xdp.data; xdp.data_end = xdp.data + size; - retval = bpf_test_run(prog, &xdp, repeat, &duration); - if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN) + rxqueue = __netif_get_rx_queue(current->nsproxy->net_ns->loopback_dev, 0); + xdp.rxq = &rxqueue->xdp_rxq; + + ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration); + if (ret) + goto out; + if (xdp.data != data + XDP_PACKET_HEADROOM + NET_IP_ALIGN || + xdp.data_end != xdp.data + size) size = xdp.data_end - xdp.data; ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration); +out: + kfree(data); + return ret; +} + +static int verify_user_bpf_flow_keys(struct bpf_flow_keys *ctx) +{ + /* make sure the fields we don't use are zeroed */ + if (!range_is_zero(ctx, 0, offsetof(struct bpf_flow_keys, flags))) + return -EINVAL; + + /* flags is allowed */ + + if (!range_is_zero(ctx, offsetof(struct bpf_flow_keys, flags) + + FIELD_SIZEOF(struct bpf_flow_keys, flags), + sizeof(struct bpf_flow_keys))) + return -EINVAL; + + return 0; +} + +int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + u32 size = kattr->test.data_size_in; + struct bpf_flow_dissector ctx = {}; + u32 repeat = kattr->test.repeat; + struct bpf_flow_keys *user_ctx; + struct bpf_flow_keys flow_keys; + u64 time_start, time_spent = 0; + const struct ethhdr *eth; + unsigned int flags = 0; + u32 retval, duration; + void *data; + int ret; + u32 i; + + if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR) + return -EINVAL; + + if (size < ETH_HLEN) + return -EINVAL; + + data = bpf_test_init(kattr, size, 0, 0); + if (IS_ERR(data)) + return PTR_ERR(data); + + eth = (struct ethhdr *)data; + + if (!repeat) + repeat = 1; + + user_ctx = bpf_ctx_init(kattr, sizeof(struct bpf_flow_keys)); + if (IS_ERR(user_ctx)) { + kfree(data); + return PTR_ERR(user_ctx); + } + if (user_ctx) { + ret = verify_user_bpf_flow_keys(user_ctx); + if (ret) + goto out; + flags = user_ctx->flags; + } + + ctx.flow_keys = &flow_keys; + ctx.data = data; + ctx.data_end = (__u8 *)data + size; + + rcu_read_lock(); + preempt_disable(); + time_start = ktime_get_ns(); + for (i = 0; i < repeat; i++) { + retval = bpf_flow_dissect(prog, &ctx, eth->h_proto, ETH_HLEN, + size, flags); + + if (signal_pending(current)) { + preempt_enable(); + rcu_read_unlock(); + + ret = -EINTR; + goto out; + } + + if (need_resched()) { + time_spent += ktime_get_ns() - time_start; + preempt_enable(); + rcu_read_unlock(); + + cond_resched(); + + rcu_read_lock(); + preempt_disable(); + time_start = ktime_get_ns(); + } + } + time_spent += ktime_get_ns() - time_start; + preempt_enable(); + rcu_read_unlock(); + + do_div(time_spent, repeat); + duration = time_spent > U32_MAX ? U32_MAX : (u32)time_spent; + + ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys), + retval, duration); + if (!ret) + ret = bpf_ctx_finish(kattr, uattr, user_ctx, + sizeof(struct bpf_flow_keys)); + +out: + kfree(user_ctx); kfree(data); return ret; } diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig new file mode 100644 index 000000000000..60725c5f79db --- /dev/null +++ b/net/bpfilter/Kconfig @@ -0,0 +1,16 @@ +menuconfig BPFILTER + bool "BPF based packet filtering framework (BPFILTER)" + default n + depends on NET && BPF + help + This builds experimental bpfilter framework that is aiming to + provide netfilter compatible functionality via BPF + +if BPFILTER +config BPFILTER_UMH + tristate "bpfilter kernel module with user mode helper" + default m + help + This builds bpfilter kernel module with embedded user mode helper +endif + diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile new file mode 100644 index 000000000000..bb2146d33bf3 --- /dev/null +++ b/net/bpfilter/Makefile @@ -0,0 +1,19 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Makefile for the Linux BPFILTER layer. +# + +hostprogs-y := bpfilter_umh +bpfilter_umh-objs := main.o +HOSTCFLAGS += -I. -Itools/include/ +ifeq ($(CONFIG_BPFILTER_UMH), y) +# builtin bpfilter_umh should be compiled with -static +# since rootfs isn't mounted at the time of __init +# function is called and do_execv won't find elf interpreter +HOSTLDFLAGS += -static +endif + +$(obj)/bpfilter_umh_blob.o: $(obj)/bpfilter_umh + +obj-$(CONFIG_BPFILTER_UMH) += bpfilter.o +bpfilter-objs += bpfilter_kern.o bpfilter_umh_blob.o diff --git a/net/bpfilter/bpfilter_kern.c b/net/bpfilter/bpfilter_kern.c new file mode 100644 index 000000000000..c0f0990f30b6 --- /dev/null +++ b/net/bpfilter/bpfilter_kern.c @@ -0,0 +1,128 @@ +// SPDX-License-Identifier: GPL-2.0 +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include +#include +#include +#include +#include +#include +#include +#include +#include "msgfmt.h" + +extern char bpfilter_umh_start; +extern char bpfilter_umh_end; + +static void shutdown_umh(void) +{ + struct task_struct *tsk; + + if (bpfilter_ops.stop) + return; + + tsk = get_pid_task(find_vpid(bpfilter_ops.info.pid), PIDTYPE_PID); + if (tsk) { + send_sig(SIGKILL, tsk, 1); + put_task_struct(tsk); + } +} + +static void __stop_umh(void) +{ + if (IS_ENABLED(CONFIG_INET)) + shutdown_umh(); +} + +static int __bpfilter_process_sockopt(struct sock *sk, int optname, + char __user *optval, + unsigned int optlen, bool is_set) +{ + struct mbox_request req; + struct mbox_reply reply; + loff_t pos; + ssize_t n; + int ret = -EFAULT; + + req.is_set = is_set; + req.pid = current->pid; + req.cmd = optname; + req.addr = (long __force __user)optval; + req.len = optlen; + if (!bpfilter_ops.info.pid) + goto out; + n = __kernel_write(bpfilter_ops.info.pipe_to_umh, &req, sizeof(req), + &pos); + if (n != sizeof(req)) { + pr_err("write fail %zd\n", n); + __stop_umh(); + ret = -EFAULT; + goto out; + } + pos = 0; + n = kernel_read(bpfilter_ops.info.pipe_from_umh, &reply, sizeof(reply), + &pos); + if (n != sizeof(reply)) { + pr_err("read fail %zd\n", n); + __stop_umh(); + ret = -EFAULT; + goto out; + } + ret = reply.status; +out: + return ret; +} + +static int start_umh(void) +{ + int err; + + /* fork usermode process */ + err = fork_usermode_blob(&bpfilter_umh_start, + &bpfilter_umh_end - &bpfilter_umh_start, + &bpfilter_ops.info); + if (err) + return err; + bpfilter_ops.stop = false; + pr_info("Loaded bpfilter_umh pid %d\n", bpfilter_ops.info.pid); + + /* health check that usermode process started correctly */ + if (__bpfilter_process_sockopt(NULL, 0, NULL, 0, 0) != 0) { + shutdown_umh(); + return -EFAULT; + } + + return 0; +} + +static int __init load_umh(void) +{ + int err; + + mutex_lock(&bpfilter_ops.lock); + if (!bpfilter_ops.stop) { + err = -EFAULT; + goto out; + } + err = start_umh(); + if (!err && IS_ENABLED(CONFIG_INET)) { + bpfilter_ops.sockopt = &__bpfilter_process_sockopt; + bpfilter_ops.start = &start_umh; + } +out: + mutex_unlock(&bpfilter_ops.lock); + return err; +} + +static void __exit fini_umh(void) +{ + mutex_lock(&bpfilter_ops.lock); + if (IS_ENABLED(CONFIG_INET)) { + shutdown_umh(); + bpfilter_ops.start = NULL; + bpfilter_ops.sockopt = NULL; + } + mutex_unlock(&bpfilter_ops.lock); +} +module_init(load_umh); +module_exit(fini_umh); +MODULE_LICENSE("GPL"); diff --git a/net/bpfilter/bpfilter_umh_blob.S b/net/bpfilter/bpfilter_umh_blob.S new file mode 100644 index 000000000000..7f1c521dcc2f --- /dev/null +++ b/net/bpfilter/bpfilter_umh_blob.S @@ -0,0 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + .section .bpfilter_umh, "a" + .global bpfilter_umh_start +bpfilter_umh_start: + .incbin "net/bpfilter/bpfilter_umh" + .global bpfilter_umh_end +bpfilter_umh_end: diff --git a/net/bpfilter/main.c b/net/bpfilter/main.c new file mode 100644 index 000000000000..81bbc1684896 --- /dev/null +++ b/net/bpfilter/main.c @@ -0,0 +1,63 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include "include/uapi/linux/bpf.h" +#include +#include "msgfmt.h" + +int debug_fd; + +static int handle_get_cmd(struct mbox_request *cmd) +{ + switch (cmd->cmd) { + case 0: + return 0; + default: + break; + } + return -ENOPROTOOPT; +} + +static int handle_set_cmd(struct mbox_request *cmd) +{ + return -ENOPROTOOPT; +} + +static void loop(void) +{ + while (1) { + struct mbox_request req; + struct mbox_reply reply; + int n; + + n = read(0, &req, sizeof(req)); + if (n != sizeof(req)) { + dprintf(debug_fd, "invalid request %d\n", n); + return; + } + + reply.status = req.is_set ? + handle_set_cmd(&req) : + handle_get_cmd(&req); + + n = write(1, &reply, sizeof(reply)); + if (n != sizeof(reply)) { + dprintf(debug_fd, "reply failed %d\n", n); + return; + } + } +} + +int main(void) +{ + debug_fd = open("/dev/console", 00000002 | 00000100); + dprintf(debug_fd, "Started bpfilter\n"); + loop(); + close(debug_fd); + return 0; +} diff --git a/net/bpfilter/msgfmt.h b/net/bpfilter/msgfmt.h new file mode 100644 index 000000000000..98d121c62945 --- /dev/null +++ b/net/bpfilter/msgfmt.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _NET_BPFILTER_MSGFMT_H +#define _NET_BPFILTER_MSGFMT_H + +struct mbox_request { + __u64 addr; + __u32 len; + __u32 is_set; + __u32 cmd; + __u32 pid; +}; + +struct mbox_reply { + __u32 status; +}; + +#endif diff --git a/net/bridge/br_nf_core.c b/net/bridge/br_nf_core.c index c217276bd76a..d88e724d5755 100644 --- a/net/bridge/br_nf_core.c +++ b/net/bridge/br_nf_core.c @@ -79,7 +79,6 @@ void br_netfilter_rtable_init(struct net_bridge *br) atomic_set(&rt->dst.__refcnt, 1); rt->dst.dev = br->dev; - rt->dst.path = &rt->dst; dst_init_metrics(&rt->dst, br_dst_default_metrics, true); rt->dst.flags = DST_NOXFRM | DST_FAKE_RTABLE; rt->dst.ops = &fake_dst_ops; diff --git a/net/core/Makefile b/net/core/Makefile index 2ce40e69aae0..2931e5c23359 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -14,6 +14,7 @@ obj-y += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \ fib_notifier.o net_ipc_log.o dev_monitor.o xdp.o obj-y += net-sysfs.o +obj-$(CONFIG_PAGE_POOL) += page_pool.o obj-$(CONFIG_PROC_FS) += net-procfs.o obj-$(CONFIG_NET_SOCK_MSG) += skmsg.o obj-$(CONFIG_NET_PKTGEN) += pktgen.o diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c index 3462809d3dbe..21cb21a6f899 100644 --- a/net/core/bpf_sk_storage.c +++ b/net/core/bpf_sk_storage.c @@ -10,16 +10,11 @@ #include #include -#include -#include -static inline void *__compat_kvcalloc(size_t n, size_t size, gfp_t flags) -{ - return kvmalloc_array(n, size, flags | __GFP_ZERO); -} -#define kvcalloc __compat_kvcalloc - static atomic_t cache_idx; +#define SK_STORAGE_CREATE_FLAG_MASK \ + (BPF_F_NO_PREALLOC | BPF_F_CLONE) + struct bucket { struct hlist_head list; raw_spinlock_t lock; @@ -217,7 +212,6 @@ static void selem_unlink_sk(struct bpf_sk_storage_elem *selem) kfree_rcu(sk_storage, rcu); } -/* sk_storage->lock must be held and sk_storage->list cannot be empty */ static void __selem_link_sk(struct bpf_sk_storage *sk_storage, struct bpf_sk_storage_elem *selem) { @@ -517,7 +511,7 @@ static int sk_storage_delete(struct sock *sk, struct bpf_map *map) return 0; } -/* Called by __sk_destruct() */ +/* Called by __sk_destruct() & bpf_sk_storage_clone() */ void bpf_sk_storage_free(struct sock *sk) { struct bpf_sk_storage_elem *selem; @@ -565,6 +559,11 @@ static void bpf_sk_storage_map_free(struct bpf_map *map) smap = (struct bpf_sk_storage_map *)map; + /* Note that this map might be concurrently cloned from + * bpf_sk_storage_clone. Wait for any existing bpf_sk_storage_clone + * RCU read section to finish before proceeding. New RCU + * read sections should be prevented via bpf_map_inc_not_zero. + */ synchronize_rcu(); /* bpf prog and the userspace can no longer access this map @@ -609,7 +608,9 @@ static void bpf_sk_storage_map_free(struct bpf_map *map) static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr) { - if (attr->map_flags != BPF_F_NO_PREALLOC || attr->max_entries || + if (attr->map_flags & ~SK_STORAGE_CREATE_FLAG_MASK || + !(attr->map_flags & BPF_F_NO_PREALLOC) || + attr->max_entries || attr->key_size != sizeof(int) || !attr->value_size || /* Enforce BTF for userspace sk dumping */ !attr->btf_key_type_id || !attr->btf_value_type_id) @@ -635,6 +636,7 @@ static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr) unsigned int i; u32 nbuckets; u64 cost; + int ret; smap = kzalloc(sizeof(*smap), GFP_USER | __GFP_NOWARN); if (!smap) @@ -643,13 +645,21 @@ static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr) smap->bucket_log = ilog2(roundup_pow_of_two(num_possible_cpus())); nbuckets = 1U << smap->bucket_log; + cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap); + + ret = bpf_map_charge_init(&smap->map.memory, cost); + if (ret < 0) { + kfree(smap); + return ERR_PTR(ret); + } + smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets, GFP_USER | __GFP_NOWARN); if (!smap->buckets) { + bpf_map_charge_finish(&smap->map.memory); kfree(smap); return ERR_PTR(-ENOMEM); } - cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap); for (i = 0; i < nbuckets; i++) { INIT_HLIST_HEAD(&smap->buckets[i].list); @@ -659,7 +669,6 @@ static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr) smap->elem_size = sizeof(struct bpf_sk_storage_elem) + attr->value_size; smap->cache_idx = (unsigned int)atomic_inc_return(&cache_idx) % BPF_SK_STORAGE_CACHE_SIZE; - smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; return &smap->map; } @@ -738,6 +747,95 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key) return err; } +static struct bpf_sk_storage_elem * +bpf_sk_storage_clone_elem(struct sock *newsk, + struct bpf_sk_storage_map *smap, + struct bpf_sk_storage_elem *selem) +{ + struct bpf_sk_storage_elem *copy_selem; + + copy_selem = selem_alloc(smap, newsk, NULL, true); + if (!copy_selem) + return NULL; + + if (map_value_has_spin_lock(&smap->map)) + copy_map_value_locked(&smap->map, SDATA(copy_selem)->data, + SDATA(selem)->data, true); + else + copy_map_value(&smap->map, SDATA(copy_selem)->data, + SDATA(selem)->data); + + return copy_selem; +} + +int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk) +{ + struct bpf_sk_storage *new_sk_storage = NULL; + struct bpf_sk_storage *sk_storage; + struct bpf_sk_storage_elem *selem; + int ret = 0; + + RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL); + + rcu_read_lock(); + sk_storage = rcu_dereference(sk->sk_bpf_storage); + + if (!sk_storage || hlist_empty(&sk_storage->list)) + goto out; + + hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) { + struct bpf_sk_storage_elem *copy_selem; + struct bpf_sk_storage_map *smap; + struct bpf_map *map; + + smap = rcu_dereference(SDATA(selem)->smap); + if (!(smap->map.map_flags & BPF_F_CLONE)) + continue; + + /* Note that for lockless listeners adding new element + * here can race with cleanup in bpf_sk_storage_map_free. + * Try to grab map refcnt to make sure that it's still + * alive and prevent concurrent removal. + */ + map = bpf_map_inc_not_zero(&smap->map, false); + if (IS_ERR(map)) + continue; + + copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem); + if (!copy_selem) { + ret = -ENOMEM; + bpf_map_put(map); + goto out; + } + + if (new_sk_storage) { + selem_link_map(smap, copy_selem); + __selem_link_sk(new_sk_storage, copy_selem); + } else { + ret = sk_storage_alloc(newsk, smap, copy_selem); + if (ret) { + kfree(copy_selem); + atomic_sub(smap->elem_size, + &newsk->sk_omem_alloc); + bpf_map_put(map); + goto out; + } + + new_sk_storage = rcu_dereference(copy_selem->sk_storage); + } + bpf_map_put(map); + } + +out: + rcu_read_unlock(); + + /* In case of an error, don't free anything explicitly here, the + * caller is responsible to call bpf_sk_storage_free. + */ + + return ret; +} + BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk, void *, value, u64, flags) { diff --git a/net/core/dev.c b/net/core/dev.c index 9dc4a585c8e8..9967f87af68c 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3085,7 +3085,7 @@ sw_checksum: } EXPORT_SYMBOL(skb_csum_hwoffload_help); -static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device *dev, bool *again) +static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device *dev) { netdev_features_t features; @@ -3113,6 +3113,9 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device __skb_linearize(skb)) goto out_kfree_skb; + if (validate_xmit_xfrm(skb, features)) + goto out_kfree_skb; + /* If packet is not checksummed and device does not * support checksumming for this protocol, complete * checksumming here. @@ -3129,8 +3132,6 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device } } - skb = validate_xmit_xfrm(skb, features, again); - return skb; out_kfree_skb: @@ -3140,7 +3141,7 @@ out_null: return NULL; } -struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev, bool *again) +struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev) { struct sk_buff *next, *head = NULL, *tail; @@ -3151,7 +3152,7 @@ struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *d /* in case skb wont be segmented, point to itself */ skb->prev = skb; - skb = validate_xmit_skb(skb, dev, again); + skb = validate_xmit_skb(skb, dev); if (!skb) continue; @@ -3296,9 +3297,6 @@ static void skb_update_prio(struct sk_buff *skb) #define skb_update_prio(skb) #endif -DEFINE_PER_CPU(int, xmit_recursion); -EXPORT_SYMBOL(xmit_recursion); - /** * dev_loopback_xmit - loop back @skb * @net: network namespace this loopback is happening in @@ -3481,9 +3479,9 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv) struct netdev_queue *txq; struct Qdisc *q; int rc = -ENOMEM; - bool again = false; skb_reset_mac_header(skb); + skb_assert_len(skb); if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP)) __skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED); @@ -3539,20 +3537,19 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv) int cpu = smp_processor_id(); /* ok because BHs are off */ if (txq->xmit_lock_owner != cpu) { - if (unlikely(__this_cpu_read(xmit_recursion) > - XMIT_RECURSION_LIMIT)) + if (dev_xmit_recursion()) goto recursion_alert; - skb = validate_xmit_skb(skb, dev, &again); + skb = validate_xmit_skb(skb, dev); if (!skb) goto out; HARD_TX_LOCK(dev, txq, cpu); if (!netif_xmit_stopped(txq)) { - __this_cpu_inc(xmit_recursion); + dev_xmit_recursion_inc(); skb = dev_hard_start_xmit(skb, dev, txq, &rc); - __this_cpu_dec(xmit_recursion); + dev_xmit_recursion_dec(); if (dev_xmit_complete(rc)) { HARD_TX_UNLOCK(dev, txq); goto out; @@ -3951,27 +3948,27 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb) if (skb_rx_queue_recorded(skb)) { u16 index = skb_get_rx_queue(skb); + if (unlikely(index >= dev->real_num_rx_queues)) { WARN_ONCE(dev->real_num_rx_queues > 1, "%s received packet on queue %u, but number " "of RX queues is %u\n", dev->name, index, dev->real_num_rx_queues); + return rxqueue; /* Return first rxqueue */ } - rxqueue += index; } - return rxqueue; } static u32 netif_receive_generic_xdp(struct sk_buff *skb, + struct xdp_buff *xdp, struct bpf_prog *xdp_prog) { struct netdev_rx_queue *rxqueue; + void *orig_data, *orig_data_end; u32 metalen, act = XDP_DROP; - struct xdp_buff xdp; - void *orig_data; int hlen, off; u32 mac_len; @@ -3989,6 +3986,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, skb_headroom(skb) < XDP_PACKET_HEADROOM) { int hroom = XDP_PACKET_HEADROOM - skb_headroom(skb); int troom = skb->tail + skb->data_len - skb->end; + /* In case we have to go down the path and also linearize, * then lets do the pskb_expand_head() work just once here. */ @@ -4005,35 +4003,45 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, */ mac_len = skb->data - skb_mac_header(skb); hlen = skb_headlen(skb) + mac_len; - xdp.data = skb->data - mac_len; - xdp.data_end = xdp.data + hlen; - xdp.data_meta = xdp.data; - xdp.data_hard_start = skb->data - skb_headroom(skb); - orig_data = xdp.data; + xdp->data = skb->data - mac_len; + xdp->data_meta = xdp->data; + xdp->data_end = xdp->data + hlen; + xdp->data_hard_start = skb->data - skb_headroom(skb); + orig_data_end = xdp->data_end; + orig_data = xdp->data; rxqueue = netif_get_rxqueue(skb); - xdp.rxq = &rxqueue->xdp_rxq; + xdp->rxq = &rxqueue->xdp_rxq; - act = bpf_prog_run_xdp(xdp_prog, &xdp); + act = bpf_prog_run_xdp(xdp_prog, xdp); - off = xdp.data - orig_data; + off = xdp->data - orig_data; if (off > 0) __skb_pull(skb, off); else if (off < 0) __skb_push(skb, -off); skb->mac_header += off; + /* check if bpf_xdp_adjust_tail was used. it can only "shrink" + * pckt. + */ + off = orig_data_end - xdp->data_end; + if (off != 0) { + skb_set_tail_pointer(skb, xdp->data_end - xdp->data); + skb->len -= off; + + } + switch (act) { case XDP_REDIRECT: case XDP_TX: __skb_push(skb, mac_len); break; case XDP_PASS: - metalen = xdp.data - xdp.data_meta; + metalen = xdp->data - xdp->data_meta; if (metalen) skb_metadata_set(skb, metalen); break; - default: bpf_warn_invalid_xdp_action(act); /* fall through */ @@ -4075,22 +4083,24 @@ void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog) } EXPORT_SYMBOL_GPL(generic_xdp_tx); -static struct static_key generic_xdp_needed __read_mostly; +static DEFINE_STATIC_KEY_FALSE(generic_xdp_needed_key); int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb) { if (xdp_prog) { - u32 act = netif_receive_generic_xdp(skb, xdp_prog); + struct xdp_buff xdp; + u32 act; int err; + act = netif_receive_generic_xdp(skb, &xdp, xdp_prog); if (act != XDP_PASS) { switch (act) { case XDP_REDIRECT: err = xdp_do_generic_redirect(skb->dev, skb, - xdp_prog); + &xdp, xdp_prog); if (err) goto out_redir; - /* fallthru to submit skb */ + break; case XDP_TX: generic_xdp_tx(skb, xdp_prog); break; @@ -4113,7 +4123,7 @@ static int netif_rx_internal(struct sk_buff *skb) trace_netif_rx(skb); - if (static_key_false(&generic_xdp_needed)) { + if (static_branch_unlikely(&generic_xdp_needed_key)) { int ret; preempt_disable(); @@ -4254,7 +4264,6 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) spin_unlock(root_lock); } } - xfrm_dev_backlog(sd); } #if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_ATM_LANE) @@ -4606,6 +4615,33 @@ out: return ret; } +/** + * netif_receive_skb_core - special purpose version of netif_receive_skb + * @skb: buffer to process + * + * More direct receive version of netif_receive_skb(). It should + * only be used by callers that have a need to skip RPS and Generic XDP. + * Caller must also take care of handling if (page_is_)pfmemalloc. + * + * This function may only be called from softirq context and interrupts + * should be enabled. + * + * Return values (usually ignored): + * NET_RX_SUCCESS: no congestion + * NET_RX_DROP: packet was dropped + */ +int netif_receive_skb_core(struct sk_buff *skb) +{ + int ret; + + rcu_read_lock(); + ret = __netif_receive_skb_core(skb, false); + rcu_read_unlock(); + + return ret; +} +EXPORT_SYMBOL(netif_receive_skb_core); + static int __netif_receive_skb(struct sk_buff *skb) { int ret; @@ -4644,15 +4680,14 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp) bpf_prog_put(old); if (old && !new) { - static_key_slow_dec(&generic_xdp_needed); + static_branch_dec(&generic_xdp_needed_key); } else if (new && !old) { - static_key_slow_inc(&generic_xdp_needed); + static_branch_inc(&generic_xdp_needed_key); dev_disable_lro(dev); } break; case XDP_QUERY_PROG: - xdp->prog_attached = !!old; xdp->prog_id = old ? old->aux->id : 0; break; @@ -4673,7 +4708,7 @@ static int netif_receive_skb_internal(struct sk_buff *skb) if (skb_defer_rx_timestamp(skb)) return NET_RX_SUCCESS; - if (static_key_false(&generic_xdp_needed)) { + if (static_branch_unlikely(&generic_xdp_needed_key)) { int ret; preempt_disable(); @@ -7191,19 +7226,21 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down) } EXPORT_SYMBOL(dev_change_proto_down); -u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t bpf_op, u32 *prog_id) +u32 __dev_xdp_query(struct net_device *dev, bpf_op_t bpf_op, + enum bpf_netdev_command cmd) { struct netdev_bpf xdp; + if (!bpf_op) + return 0; + memset(&xdp, 0, sizeof(xdp)); - xdp.command = XDP_QUERY_PROG; + xdp.command = cmd; /* Query must always succeed. */ - WARN_ON(bpf_op(dev, &xdp) < 0); - if (prog_id) - *prog_id = xdp.prog_id; + WARN_ON(bpf_op(dev, &xdp) < 0 && cmd == XDP_QUERY_PROG); - return xdp.prog_attached; + return xdp.prog_id; } static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op, @@ -7224,6 +7261,34 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op, return bpf_op(dev, &xdp); } +static void dev_xdp_uninstall(struct net_device *dev) +{ + struct netdev_bpf xdp; + bpf_op_t ndo_bpf; + + /* Remove generic XDP */ + WARN_ON(dev_xdp_install(dev, generic_xdp_install, NULL, 0, NULL)); + + /* Remove from the driver */ + ndo_bpf = dev->netdev_ops->ndo_bpf; + if (!ndo_bpf) + return; + + memset(&xdp, 0, sizeof(xdp)); + xdp.command = XDP_QUERY_PROG; + WARN_ON(ndo_bpf(dev, &xdp)); + if (xdp.prog_id) + WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags, + NULL)); + + /* Remove HW offload */ + memset(&xdp, 0, sizeof(xdp)); + xdp.command = XDP_QUERY_PROG_HW; + if (!ndo_bpf(dev, &xdp) && xdp.prog_id) + WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags, + NULL)); +} + /** * dev_change_xdp_fd - set or clear a bpf program for a device rx path * @dev: device @@ -7237,12 +7302,15 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, int fd, u32 flags) { const struct net_device_ops *ops = dev->netdev_ops; + enum bpf_netdev_command query; struct bpf_prog *prog = NULL; bpf_op_t bpf_op, bpf_chk; int err; ASSERT_RTNL(); + query = flags & XDP_FLAGS_HW_MODE ? XDP_QUERY_PROG_HW : XDP_QUERY_PROG; + bpf_op = bpf_chk = ops->ndo_bpf; if (!bpf_op && (flags & (XDP_FLAGS_DRV_MODE | XDP_FLAGS_HW_MODE))) return -EOPNOTSUPP; @@ -7252,13 +7320,15 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack, bpf_chk = generic_xdp_install; if (fd >= 0) { - if (bpf_chk && __dev_xdp_attached(dev, bpf_chk, NULL)) + if (__dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG) || + __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG_HW)) return -EEXIST; if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) && - __dev_xdp_attached(dev, bpf_op, NULL)) + __dev_xdp_query(dev, bpf_op, query)) return -EBUSY; - prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP); + prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP, + bpf_op == ops->ndo_bpf); if (IS_ERR(prog)) return PTR_ERR(prog); } @@ -7346,6 +7416,7 @@ static void rollback_registered_many(struct list_head *head) /* Shutdown queueing discipline. */ dev_shutdown(dev); + dev_xdp_uninstall(dev); /* Notify protocols, that we are about to destroy * this device. They should clean all the things. @@ -7674,14 +7745,15 @@ static void netif_free_rx_queues(struct net_device *dev) { unsigned int i, count = dev->num_rx_queues; struct netdev_rx_queue *rx; + /* netif_alloc_rx_queues alloc failed, resources have been unreg'ed */ if (!dev->_rx) return; rx = dev->_rx; + for (i = 0; i < count; i++) xdp_rxq_info_unreg(&rx[i].xdp_rxq); - } static void netdev_init_one_queue(struct net_device *dev, @@ -7693,7 +7765,9 @@ static void netdev_init_one_queue(struct net_device *dev, queue->xmit_lock_owner = -1; netdev_queue_numa_node_write(queue, NUMA_NO_NODE); queue->dev = dev; +#ifdef CONFIG_BQL dql_init(&queue->dql, HZ); +#endif } static void netif_free_tx_queues(struct net_device *dev) @@ -8264,12 +8338,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name, return NULL; } -#ifdef CONFIG_SYSFS if (rxqs < 1) { pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n"); return NULL; } -#endif alloc_size = sizeof(struct net_device); if (sizeof_priv) { @@ -8365,7 +8437,6 @@ EXPORT_SYMBOL(alloc_netdev_mqs); void free_netdev(struct net_device *dev) { struct napi_struct *p, *n; - struct bpf_prog *prog; might_sleep(); netif_free_tx_queues(dev); @@ -8382,12 +8453,6 @@ void free_netdev(struct net_device *dev) free_percpu(dev->pcpu_refcnt); dev->pcpu_refcnt = NULL; - prog = rcu_dereference_protected(dev->xdp_prog, 1); - if (prog) { - bpf_prog_put(prog); - static_key_slow_dec(&generic_xdp_needed); - } - /* Compatibility with error handling in drivers */ if (dev->reg_state == NETREG_UNINITIALIZED) { netdev_freemem(dev); @@ -8971,9 +9036,6 @@ static int __init net_dev_init(void) skb_queue_head_init(&sd->input_pkt_queue); skb_queue_head_init(&sd->process_queue); -#ifdef CONFIG_XFRM_OFFLOAD - skb_queue_head_init(&sd->xfrm_backlog); -#endif INIT_LIST_HEAD(&sd->poll_list); sd->output_queue_tailp = &sd->output_queue; #ifdef CONFIG_RPS diff --git a/net/core/dst.c b/net/core/dst.c index 2d121958d5b0..88f7ddc05013 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include @@ -57,20 +58,18 @@ const struct dst_metrics dst_default_metrics = { */ .refcnt = REFCOUNT_INIT(1), }; +EXPORT_SYMBOL(dst_default_metrics); void dst_init(struct dst_entry *dst, struct dst_ops *ops, struct net_device *dev, int initial_ref, int initial_obsolete, unsigned short flags) { - dst->child = NULL; dst->dev = dev; if (dev) dev_hold(dev); dst->ops = ops; dst_init_metrics(dst, dst_default_metrics.metrics, true); dst->expires = 0UL; - dst->path = dst; - dst->from = NULL; #ifdef CONFIG_XFRM dst->xfrm = NULL; #endif @@ -116,12 +115,17 @@ EXPORT_SYMBOL(dst_alloc); struct dst_entry *dst_destroy(struct dst_entry * dst) { - struct dst_entry *child; + struct dst_entry *child = NULL; smp_rmb(); - child = dst->child; +#ifdef CONFIG_XFRM + if (dst->xfrm) { + struct xfrm_dst *xdst = (struct xfrm_dst *) dst; + child = xdst->child; + } +#endif if (!(dst->flags & DST_NOCOUNT)) dst_entries_add(dst->ops, -1); @@ -322,3 +326,19 @@ metadata_dst_alloc_percpu(u8 optslen, enum metadata_type type, gfp_t flags) return md_dst; } EXPORT_SYMBOL_GPL(metadata_dst_alloc_percpu); + +void metadata_dst_free_percpu(struct metadata_dst __percpu *md_dst) +{ + int cpu; + +#ifdef CONFIG_DST_CACHE + for_each_possible_cpu(cpu) { + struct metadata_dst *one_md_dst = per_cpu_ptr(md_dst, cpu); + + if (one_md_dst->type == METADATA_IP_TUNNEL) + dst_cache_destroy(&one_md_dst->u.tun_info.dst_cache); + } +#endif + free_percpu(md_dst); +} +EXPORT_SYMBOL_GPL(metadata_dst_free_percpu); diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 76c3f602ee15..aac6f41e3691 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -33,6 +33,10 @@ bool fib_rule_matchall(const struct fib_rule *rule) if (!uid_eq(rule->uid_range.start, fib_kuid_range_unset.start) || !uid_eq(rule->uid_range.end, fib_kuid_range_unset.end)) return false; + if (fib_rule_port_range_set(&rule->sport_range)) + return false; + if (fib_rule_port_range_set(&rule->dport_range)) + return false; return true; } EXPORT_SYMBOL_GPL(fib_rule_matchall); @@ -51,6 +55,7 @@ int fib_default_rule_add(struct fib_rules_ops *ops, r->pref = pref; r->table = table; r->flags = flags; + r->proto = RTPROT_KERNEL; r->fr_net = ops->fro_net; r->uid_range = fib_kuid_range_unset; @@ -220,6 +225,26 @@ static int nla_put_uid_range(struct sk_buff *skb, struct fib_kuid_range *range) return nla_put(skb, FRA_UID_RANGE, sizeof(out), &out); } +static int nla_get_port_range(struct nlattr *pattr, + struct fib_rule_port_range *port_range) +{ + const struct fib_rule_port_range *pr = nla_data(pattr); + + if (!fib_rule_port_range_valid(pr)) + return -EINVAL; + + port_range->start = pr->start; + port_range->end = pr->end; + + return 0; +} + +static int nla_put_port_range(struct sk_buff *skb, int attrtype, + struct fib_rule_port_range *range) +{ + return nla_put(skb, attrtype, sizeof(*range), range); +} + static int fib_rule_match(struct fib_rule *rule, struct fib_rules_ops *ops, struct flowi *fl, int flags, struct fib_lookup_arg *arg) @@ -314,10 +339,12 @@ static int call_fib_rule_notifier(struct notifier_block *nb, struct net *net, static int call_fib_rule_notifiers(struct net *net, enum fib_event_type event_type, struct fib_rule *rule, - struct fib_rules_ops *ops) + struct fib_rules_ops *ops, + struct netlink_ext_ack *extack) { struct fib_rule_notifier_info info = { .info.family = ops->family, + .info.extack = extack, .rule = rule, }; @@ -422,6 +449,17 @@ static int rule_exists(struct fib_rules_ops *ops, struct fib_rule_hdr *frh, !uid_eq(r->uid_range.end, rule->uid_range.end)) continue; + if (r->ip_proto != rule->ip_proto) + continue; + + if (!fib_rule_port_range_compare(&r->sport_range, + &rule->sport_range)) + continue; + + if (!fib_rule_port_range_compare(&r->dport_range, + &rule->dport_range)) + continue; + if (!ops->compare(r, frh, tb)) continue; return 1; @@ -467,6 +505,9 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh, rule->pref = tb[FRA_PRIORITY] ? nla_get_u32(tb[FRA_PRIORITY]) : fib_default_rule_pref(ops); + rule->proto = tb[FRA_PROTOCOL] ? + nla_get_u8(tb[FRA_PROTOCOL]) : RTPROT_UNSPEC; + if (tb[FRA_IIFNAME]) { struct net_device *dev; @@ -563,6 +604,23 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh, rule->uid_range = fib_kuid_range_unset; } + if (tb[FRA_IP_PROTO]) + rule->ip_proto = nla_get_u8(tb[FRA_IP_PROTO]); + + if (tb[FRA_SPORT_RANGE]) { + err = nla_get_port_range(tb[FRA_SPORT_RANGE], + &rule->sport_range); + if (err) + goto errout_free; + } + + if (tb[FRA_DPORT_RANGE]) { + err = nla_get_port_range(tb[FRA_DPORT_RANGE], + &rule->dport_range); + if (err) + goto errout_free; + } + if ((nlh->nlmsg_flags & NLM_F_EXCL) && rule_exists(ops, frh, tb, rule)) { err = -EEXIST; @@ -609,7 +667,7 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr *nlh, if (rule->tun_id) ip_tunnel_need_metadata(); - call_fib_rule_notifiers(net, FIB_EVENT_RULE_ADD, rule, ops); + call_fib_rule_notifiers(net, FIB_EVENT_RULE_ADD, rule, ops, extack); notify_rule_change(RTM_NEWRULE, rule, ops, nlh, NETLINK_CB(skb).portid); flush_route_cache(ops); rules_ops_put(ops); @@ -628,6 +686,8 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr *nlh, { struct net *net = sock_net(skb->sk); struct fib_rule_hdr *frh = nlmsg_data(nlh); + struct fib_rule_port_range sprange = {0, 0}; + struct fib_rule_port_range dprange = {0, 0}; struct fib_rules_ops *ops = NULL; struct fib_rule *rule, *r; struct nlattr *tb[FRA_MAX+1]; @@ -661,7 +721,25 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr *nlh, range = fib_kuid_range_unset; } + if (tb[FRA_SPORT_RANGE]) { + err = nla_get_port_range(tb[FRA_SPORT_RANGE], + &sprange); + if (err) + goto errout; + } + + if (tb[FRA_DPORT_RANGE]) { + err = nla_get_port_range(tb[FRA_DPORT_RANGE], + &dprange); + if (err) + goto errout; + } + list_for_each_entry(rule, &ops->rules_list, list) { + if (tb[FRA_PROTOCOL] && + (rule->proto != nla_get_u8(tb[FRA_PROTOCOL]))) + continue; + if (frh->action && (frh->action != rule->action)) continue; @@ -702,6 +780,18 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr *nlh, !uid_eq(rule->uid_range.end, range.end))) continue; + if (tb[FRA_IP_PROTO] && + (rule->ip_proto != nla_get_u8(tb[FRA_IP_PROTO]))) + continue; + + if (fib_rule_port_range_set(&sprange) && + !fib_rule_port_range_compare(&rule->sport_range, &sprange)) + continue; + + if (fib_rule_port_range_set(&dprange) && + !fib_rule_port_range_compare(&rule->dport_range, &dprange)) + continue; + if (!ops->compare(rule, frh, tb)) continue; @@ -749,7 +839,8 @@ int fib_nl_delrule(struct sk_buff *skb, struct nlmsghdr *nlh, } } - call_fib_rule_notifiers(net, FIB_EVENT_RULE_DEL, rule, ops); + call_fib_rule_notifiers(net, FIB_EVENT_RULE_DEL, rule, ops, + NULL); notify_rule_change(RTM_DELRULE, rule, ops, nlh, NETLINK_CB(skb).portid); fib_rule_put(rule); @@ -778,7 +869,11 @@ static inline size_t fib_rule_nlmsg_size(struct fib_rules_ops *ops, + nla_total_size(4) /* FRA_FWMARK */ + nla_total_size(4) /* FRA_FWMASK */ + nla_total_size_64bit(8) /* FRA_TUN_ID */ - + nla_total_size(sizeof(struct fib_kuid_range)); + + nla_total_size(sizeof(struct fib_kuid_range)) + + nla_total_size(1) /* FRA_PROTOCOL */ + + nla_total_size(1) /* FRA_IP_PROTO */ + + nla_total_size(sizeof(struct fib_rule_port_range)) /* FRA_SPORT_RANGE */ + + nla_total_size(sizeof(struct fib_rule_port_range)); /* FRA_DPORT_RANGE */ if (ops->nlmsg_payload) payload += ops->nlmsg_payload(rule); @@ -809,6 +904,9 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule, frh->action = rule->action; frh->flags = rule->flags; + if (nla_put_u8(skb, FRA_PROTOCOL, rule->proto)) + goto nla_put_failure; + if (rule->action == FR_ACT_GOTO && rcu_access_pointer(rule->ctarget) == NULL) frh->flags |= FIB_RULE_UNRESOLVED; @@ -840,7 +938,12 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule, (rule->l3mdev && nla_put_u8(skb, FRA_L3MDEV, rule->l3mdev)) || (uid_range_set(&rule->uid_range) && - nla_put_uid_range(skb, &rule->uid_range))) + nla_put_uid_range(skb, &rule->uid_range)) || + (fib_rule_port_range_set(&rule->sport_range) && + nla_put_port_range(skb, FRA_SPORT_RANGE, &rule->sport_range)) || + (fib_rule_port_range_set(&rule->dport_range) && + nla_put_port_range(skb, FRA_DPORT_RANGE, &rule->dport_range)) || + (rule->ip_proto && nla_put_u8(skb, FRA_IP_PROTO, rule->ip_proto))) goto nla_put_failure; if (rule->suppress_ifgroup != -1) { diff --git a/net/core/filter.c b/net/core/filter.c index 3f47792532be..5856f07f4d99 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -45,6 +45,7 @@ #include #include #include +#include #include #include #include @@ -57,11 +58,22 @@ #include #include #include -#include -#include +#include #include +#include +#include +#include #include #include +#include +#include +#include +#include +#include +#include +#include +#include +#include #include /** @@ -116,12 +128,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap) } EXPORT_SYMBOL(sk_filter_trim_cap); -BPF_CALL_1(__skb_get_pay_offset, struct sk_buff *, skb) +BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb) { return skb_get_poff(skb); } -BPF_CALL_3(__skb_get_nlattr, struct sk_buff *, skb, u32, a, u32, x) +BPF_CALL_3(bpf_skb_get_nlattr, struct sk_buff *, skb, u32, a, u32, x) { struct nlattr *nla; @@ -141,7 +153,7 @@ BPF_CALL_3(__skb_get_nlattr, struct sk_buff *, skb, u32, a, u32, x) return 0; } -BPF_CALL_3(__skb_get_nlattr_nest, struct sk_buff *, skb, u32, a, u32, x) +BPF_CALL_3(bpf_skb_get_nlattr_nest, struct sk_buff *, skb, u32, a, u32, x) { struct nlattr *nla; @@ -165,13 +177,102 @@ BPF_CALL_3(__skb_get_nlattr_nest, struct sk_buff *, skb, u32, a, u32, x) return 0; } -BPF_CALL_0(__get_raw_cpu_id) +static int bpf_skb_load_helper_convert_offset(const struct sk_buff *skb, int offset) +{ + if (likely(offset >= 0)) + return offset; + + if (offset >= SKF_NET_OFF) + return offset - SKF_NET_OFF + skb_network_offset(skb); + + if (offset >= SKF_LL_OFF && skb_mac_header_was_set(skb)) + return offset - SKF_LL_OFF + skb_mac_offset(skb); + + return INT_MIN; +} + +BPF_CALL_4(bpf_skb_load_helper_8, const struct sk_buff *, skb, const void *, + data, int, headlen, int, offset) +{ + u8 tmp; + const int len = sizeof(tmp); + + offset = bpf_skb_load_helper_convert_offset(skb, offset); + if (offset == INT_MIN) + return -EFAULT; + + if (headlen - offset >= len) + return *(u8 *)(data + offset); + if (!skb_copy_bits(skb, offset, &tmp, sizeof(tmp))) + return tmp; + else + return -EFAULT; +} + +BPF_CALL_2(bpf_skb_load_helper_8_no_cache, const struct sk_buff *, skb, + int, offset) +{ + return ____bpf_skb_load_helper_8(skb, skb->data, skb->len - skb->data_len, + offset); +} + +BPF_CALL_4(bpf_skb_load_helper_16, const struct sk_buff *, skb, const void *, + data, int, headlen, int, offset) +{ + __be16 tmp; + const int len = sizeof(tmp); + + offset = bpf_skb_load_helper_convert_offset(skb, offset); + if (offset == INT_MIN) + return -EFAULT; + + if (headlen - offset >= len) + return get_unaligned_be16(data + offset); + if (!skb_copy_bits(skb, offset, &tmp, sizeof(tmp))) + return be16_to_cpu(tmp); + else + return -EFAULT; +} + +BPF_CALL_2(bpf_skb_load_helper_16_no_cache, const struct sk_buff *, skb, + int, offset) +{ + return ____bpf_skb_load_helper_16(skb, skb->data, skb->len - skb->data_len, + offset); +} + +BPF_CALL_4(bpf_skb_load_helper_32, const struct sk_buff *, skb, const void *, + data, int, headlen, int, offset) +{ + __be32 tmp; + const int len = sizeof(tmp); + + offset = bpf_skb_load_helper_convert_offset(skb, offset); + if (offset == INT_MIN) + return -EFAULT; + + if (headlen - offset >= len) + return get_unaligned_be32(data + offset); + if (!skb_copy_bits(skb, offset, &tmp, sizeof(tmp))) + return be32_to_cpu(tmp); + else + return -EFAULT; +} + +BPF_CALL_2(bpf_skb_load_helper_32_no_cache, const struct sk_buff *, skb, + int, offset) +{ + return ____bpf_skb_load_helper_32(skb, skb->data, skb->len - skb->data_len, + offset); +} + +BPF_CALL_0(bpf_get_raw_cpu_id) { return raw_smp_processor_id(); } static const struct bpf_func_proto bpf_get_raw_smp_processor_id_proto = { - .func = __get_raw_cpu_id, + .func = bpf_get_raw_cpu_id, .gpl_only = false, .ret_type = RET_INTEGER, }; @@ -321,16 +422,16 @@ static bool convert_bpf_extensions(struct sock_filter *fp, /* Emit call(arg1=CTX, arg2=A, arg3=X) */ switch (fp->k) { case SKF_AD_OFF + SKF_AD_PAY_OFFSET: - *insn = BPF_EMIT_CALL(__skb_get_pay_offset); + *insn = BPF_EMIT_CALL(bpf_skb_get_pay_offset); break; case SKF_AD_OFF + SKF_AD_NLATTR: - *insn = BPF_EMIT_CALL(__skb_get_nlattr); + *insn = BPF_EMIT_CALL(bpf_skb_get_nlattr); break; case SKF_AD_OFF + SKF_AD_NLATTR_NEST: - *insn = BPF_EMIT_CALL(__skb_get_nlattr_nest); + *insn = BPF_EMIT_CALL(bpf_skb_get_nlattr_nest); break; case SKF_AD_OFF + SKF_AD_CPU: - *insn = BPF_EMIT_CALL(__get_raw_cpu_id); + *insn = BPF_EMIT_CALL(bpf_get_raw_cpu_id); break; case SKF_AD_OFF + SKF_AD_RANDOM: *insn = BPF_EMIT_CALL(bpf_user_rnd_u32); @@ -357,26 +458,98 @@ static bool convert_bpf_extensions(struct sock_filter *fp, return true; } +static bool convert_bpf_ld_abs(struct sock_filter *fp, struct bpf_insn **insnp) +{ + const bool unaligned_ok = IS_BUILTIN(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS); + int size = bpf_size_to_bytes(BPF_SIZE(fp->code)); + bool endian = BPF_SIZE(fp->code) == BPF_H || + BPF_SIZE(fp->code) == BPF_W; + bool indirect = BPF_MODE(fp->code) == BPF_IND; + const int ip_align = NET_IP_ALIGN; + struct bpf_insn *insn = *insnp; + int offset = fp->k; + + if (!indirect && + ((unaligned_ok && offset >= 0) || + (!unaligned_ok && offset >= 0 && + offset + ip_align >= 0 && + offset + ip_align % size == 0))) { + bool ldx_off_ok = offset <= S16_MAX; + + *insn++ = BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_H); + if (offset) + *insn++ = BPF_ALU64_IMM(BPF_SUB, BPF_REG_TMP, offset); + *insn++ = BPF_JMP_IMM(BPF_JSLT, BPF_REG_TMP, + size, 2 + endian + (!ldx_off_ok * 2)); + if (ldx_off_ok) { + *insn++ = BPF_LDX_MEM(BPF_SIZE(fp->code), BPF_REG_A, + BPF_REG_D, offset); + } else { + *insn++ = BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_D); + *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_TMP, offset); + *insn++ = BPF_LDX_MEM(BPF_SIZE(fp->code), BPF_REG_A, + BPF_REG_TMP, 0); + } + if (endian) + *insn++ = BPF_ENDIAN(BPF_FROM_BE, BPF_REG_A, size * 8); + *insn++ = BPF_JMP_A(8); + } + + *insn++ = BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_CTX); + *insn++ = BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_D); + *insn++ = BPF_MOV64_REG(BPF_REG_ARG3, BPF_REG_H); + if (!indirect) { + *insn++ = BPF_MOV64_IMM(BPF_REG_ARG4, offset); + } else { + *insn++ = BPF_MOV64_REG(BPF_REG_ARG4, BPF_REG_X); + if (fp->k) + *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG4, offset); + } + + switch (BPF_SIZE(fp->code)) { + case BPF_B: + *insn++ = BPF_EMIT_CALL(bpf_skb_load_helper_8); + break; + case BPF_H: + *insn++ = BPF_EMIT_CALL(bpf_skb_load_helper_16); + break; + case BPF_W: + *insn++ = BPF_EMIT_CALL(bpf_skb_load_helper_32); + break; + default: + return false; + } + + *insn++ = BPF_JMP_IMM(BPF_JSGE, BPF_REG_A, 0, 2); + *insn++ = BPF_ALU32_REG(BPF_XOR, BPF_REG_A, BPF_REG_A); + *insn = BPF_EXIT_INSN(); + + *insnp = insn; + return true; +} + /** * bpf_convert_filter - convert filter program * @prog: the user passed filter program * @len: the length of the user passed filter program * @new_prog: allocated 'struct bpf_prog' or NULL * @new_len: pointer to store length of converted program + * @seen_ld_abs: bool whether we've seen ld_abs/ind * * Remap 'sock_filter' style classic BPF (cBPF) instruction set to 'bpf_insn' * style extended BPF (eBPF). * Conversion workflow: * * 1) First pass for calculating the new program length: - * bpf_convert_filter(old_prog, old_len, NULL, &new_len) + * bpf_convert_filter(old_prog, old_len, NULL, &new_len, &seen_ld_abs) * * 2) 2nd pass to remap in two passes: 1st pass finds new * jump offsets, 2nd pass remapping: - * bpf_convert_filter(old_prog, old_len, new_prog, &new_len); + * bpf_convert_filter(old_prog, old_len, new_prog, &new_len, &seen_ld_abs) */ static int bpf_convert_filter(struct sock_filter *prog, int len, - struct bpf_prog *new_prog, int *new_len) + struct bpf_prog *new_prog, int *new_len, + bool *seen_ld_abs) { int new_flen = 0, pass = 0, target, i, stack_off; struct bpf_insn *new_insn, *first_insn = NULL; @@ -407,20 +580,35 @@ do_pass: /* Classic BPF expects A and X to be reset first. These need * to be guaranteed to be the first two instructions. */ - *new_insn++ = BPF_ALU64_REG(BPF_XOR, BPF_REG_A, BPF_REG_A); - *new_insn++ = BPF_ALU64_REG(BPF_XOR, BPF_REG_X, BPF_REG_X); + *new_insn++ = BPF_ALU32_REG(BPF_XOR, BPF_REG_A, BPF_REG_A); + *new_insn++ = BPF_ALU32_REG(BPF_XOR, BPF_REG_X, BPF_REG_X); /* All programs must keep CTX in callee saved BPF_REG_CTX. * In eBPF case it's done by the compiler, here we need to * do this ourself. Initial CTX is present in BPF_REG_ARG1. */ *new_insn++ = BPF_MOV64_REG(BPF_REG_CTX, BPF_REG_ARG1); + if (*seen_ld_abs) { + /* For packet access in classic BPF, cache skb->data + * in callee-saved BPF R8 and skb->len - skb->data_len + * (headlen) in BPF R9. Since classic BPF is read-only + * on CTX, we only need to cache it once. + */ + *new_insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data), + BPF_REG_D, BPF_REG_CTX, + offsetof(struct sk_buff, data)); + *new_insn++ = BPF_LDX_MEM(BPF_W, BPF_REG_H, BPF_REG_CTX, + offsetof(struct sk_buff, len)); + *new_insn++ = BPF_LDX_MEM(BPF_W, BPF_REG_TMP, BPF_REG_CTX, + offsetof(struct sk_buff, data_len)); + *new_insn++ = BPF_ALU32_REG(BPF_SUB, BPF_REG_H, BPF_REG_TMP); + } } else { new_insn += 3; } for (i = 0; i < len; fp++, i++) { - struct bpf_insn tmp_insns[6] = { }; + struct bpf_insn tmp_insns[32] = { }; struct bpf_insn *insn = tmp_insns; if (addrs) @@ -463,6 +651,11 @@ do_pass: BPF_MODE(fp->code) == BPF_ABS && convert_bpf_extensions(fp, &insn)) break; + if (BPF_CLASS(fp->code) == BPF_LD && + convert_bpf_ld_abs(fp, &insn)) { + *seen_ld_abs = true; + break; + } if (fp->code == (BPF_ALU | BPF_DIV | BPF_X) || fp->code == (BPF_ALU | BPF_MOD | BPF_X)) { @@ -572,21 +765,31 @@ jmp_rest: break; /* ldxb 4 * ([14] & 0xf) is remaped into 6 insns. */ - case BPF_LDX | BPF_MSH | BPF_B: - /* tmp = A */ - *insn++ = BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_A); + case BPF_LDX | BPF_MSH | BPF_B: { + struct sock_filter tmp = { + .code = BPF_LD | BPF_ABS | BPF_B, + .k = fp->k, + }; + + *seen_ld_abs = true; + + /* X = A */ + *insn++ = BPF_MOV64_REG(BPF_REG_X, BPF_REG_A); /* A = BPF_R0 = *(u8 *) (skb->data + K) */ - *insn++ = BPF_LD_ABS(BPF_B, fp->k); + convert_bpf_ld_abs(&tmp, &insn); + insn++; /* A &= 0xf */ *insn++ = BPF_ALU32_IMM(BPF_AND, BPF_REG_A, 0xf); /* A <<= 2 */ *insn++ = BPF_ALU32_IMM(BPF_LSH, BPF_REG_A, 2); + /* tmp = X */ + *insn++ = BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_X); /* X = A */ *insn++ = BPF_MOV64_REG(BPF_REG_X, BPF_REG_A); /* A = tmp */ *insn = BPF_MOV64_REG(BPF_REG_A, BPF_REG_TMP); break; - + } /* RET_K is remaped into 2 insns. RET_A case doesn't need an * extra mov as BPF_REG_0 is already mapped into BPF_REG_A. */ @@ -668,6 +871,8 @@ jmp_rest: if (!new_prog) { /* Only calculating new length. */ *new_len = new_insn - first_insn; + if (*seen_ld_abs) + *new_len += 4; /* Prologue bits. */ return 0; } @@ -1029,6 +1234,7 @@ static struct bpf_prog *bpf_migrate_filter(struct bpf_prog *fp) struct sock_filter *old_prog; struct bpf_prog *old_fp; int err, new_len, old_len = fp->len; + bool seen_ld_abs = false; /* We are free to overwrite insns et al right here as it * won't be used at this point in time anymore internally @@ -1050,7 +1256,8 @@ static struct bpf_prog *bpf_migrate_filter(struct bpf_prog *fp) } /* 1st pass: calculate the new program length. */ - err = bpf_convert_filter(old_prog, old_len, NULL, &new_len); + err = bpf_convert_filter(old_prog, old_len, NULL, &new_len, + &seen_ld_abs); if (err) goto out_err_free; @@ -1069,7 +1276,8 @@ static struct bpf_prog *bpf_migrate_filter(struct bpf_prog *fp) fp->len = new_len; /* 2nd pass: remap sock_filter insns into bpf_insn insns. */ - err = bpf_convert_filter(old_prog, old_len, fp, &new_len); + err = bpf_convert_filter(old_prog, old_len, fp, &new_len, + &seen_ld_abs); if (err) /* 2nd bpf_convert_filter() can fail only if it fails * to allocate memory, remapping must succeed. Note, @@ -1261,30 +1469,6 @@ static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk) return 0; } -static int __reuseport_attach_prog(struct bpf_prog *prog, struct sock *sk) -{ - struct bpf_prog *old_prog; - int err; - - if (bpf_prog_size(prog->len) > sysctl_optmem_max) - return -ENOMEM; - - if (sk_unhashed(sk) && sk->sk_reuseport) { - err = reuseport_alloc(sk); - if (err) - return err; - } else if (!rcu_access_pointer(sk->sk_reuseport_cb)) { - /* The socket wasn't bound with SO_REUSEPORT */ - return -EINVAL; - } - - old_prog = reuseport_attach_prog(sk, prog); - if (old_prog) - bpf_prog_destroy(old_prog); - - return 0; -} - static struct bpf_prog *__get_filter(struct sock_fprog *fprog, struct sock *sk) { @@ -1358,13 +1542,15 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk) if (IS_ERR(prog)) return PTR_ERR(prog); - err = __reuseport_attach_prog(prog, sk); - if (err < 0) { - __bpf_prog_release(prog); - return err; - } + if (bpf_prog_size(prog->len) > sysctl_optmem_max) + err = -ENOMEM; + else + err = reuseport_attach_prog(sk, prog); - return 0; + if (err) + __bpf_prog_release(prog); + + return err; } static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk) @@ -1394,19 +1580,58 @@ int sk_attach_bpf(u32 ufd, struct sock *sk) int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk) { - struct bpf_prog *prog = __get_bpf(ufd, sk); + struct bpf_prog *prog; int err; + if (sock_flag(sk, SOCK_FILTER_LOCKED)) + return -EPERM; + + prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER); + if (IS_ERR(prog) && PTR_ERR(prog) == -EINVAL) + prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SK_REUSEPORT); if (IS_ERR(prog)) return PTR_ERR(prog); - err = __reuseport_attach_prog(prog, sk); - if (err < 0) { - bpf_prog_put(prog); - return err; + if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) { + /* Like other non BPF_PROG_TYPE_SOCKET_FILTER + * bpf prog (e.g. sockmap). It depends on the + * limitation imposed by bpf_prog_load(). + * Hence, sysctl_optmem_max is not checked. + */ + if ((sk->sk_type != SOCK_STREAM && + sk->sk_type != SOCK_DGRAM) || + (sk->sk_protocol != IPPROTO_UDP && + sk->sk_protocol != IPPROTO_TCP) || + (sk->sk_family != AF_INET && + sk->sk_family != AF_INET6)) { + err = -ENOTSUPP; + goto err_prog_put; + } + } else { + /* BPF_PROG_TYPE_SOCKET_FILTER */ + if (bpf_prog_size(prog->len) > sysctl_optmem_max) { + err = -ENOMEM; + goto err_prog_put; + } } - return 0; + err = reuseport_attach_prog(sk, prog); +err_prog_put: + if (err) + bpf_prog_put(prog); + + return err; +} + +void sk_reuseport_prog_free(struct bpf_prog *prog) +{ + if (!prog) + return; + + if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) + bpf_prog_put(prog); + else + bpf_prog_destroy(prog); } struct bpf_scratchpad { @@ -1517,29 +1742,65 @@ static const struct bpf_func_proto bpf_skb_load_bytes_proto = { .arg4_type = ARG_CONST_SIZE, }; +BPF_CALL_4(bpf_flow_dissector_load_bytes, + const struct bpf_flow_dissector *, ctx, u32, offset, + void *, to, u32, len) +{ + void *ptr; + + if (unlikely(offset > 0xffff)) + goto err_clear; + + if (unlikely(!ctx->skb)) + goto err_clear; + + ptr = skb_header_pointer(ctx->skb, offset, len, to); + if (unlikely(!ptr)) + goto err_clear; + if (ptr != to) + memcpy(to, ptr, len); + + return 0; +err_clear: + memset(to, 0, len); + return -EFAULT; +} + +static const struct bpf_func_proto bpf_flow_dissector_load_bytes_proto = { + .func = bpf_flow_dissector_load_bytes, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_UNINIT_MEM, + .arg4_type = ARG_CONST_SIZE, +}; + BPF_CALL_5(bpf_skb_load_bytes_relative, const struct sk_buff *, skb, u32, offset, void *, to, u32, len, u32, start_header) { u8 *end = skb_tail_pointer(skb); - u8 *net = skb_network_header(skb); - u8 *mac = skb_mac_header(skb); - u8 *ptr; + u8 *start, *ptr; - if (unlikely(offset > 0xffff || len > (end - mac))) + if (unlikely(offset > 0xffff)) goto err_clear; switch (start_header) { case BPF_HDR_START_MAC: - ptr = mac + offset; + if (unlikely(!skb_mac_header_was_set(skb))) + goto err_clear; + start = skb_mac_header(skb); break; case BPF_HDR_START_NET: - ptr = net + offset; + start = skb_network_header(skb); break; default: goto err_clear; } - if (likely(ptr >= mac && ptr + len <= end)) { + ptr = start + offset; + + if (likely(ptr + len <= end)) { memcpy(to, ptr, len); return 0; } @@ -1582,10 +1843,23 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = { .arg2_type = ARG_ANYTHING, }; +BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk) +{ + return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL; +} + +static const struct bpf_func_proto bpf_sk_fullsock_proto = { + .func = bpf_sk_fullsock, + .gpl_only = false, + .ret_type = RET_PTR_TO_SOCKET_OR_NULL, + .arg1_type = ARG_PTR_TO_SOCK_COMMON, +}; + static inline int sk_skb_try_make_writable(struct sk_buff *skb, unsigned int write_len) { int err = __bpf_try_make_writable(skb, write_len); + bpf_compute_data_end_sk_skb(skb); return err; } @@ -1612,19 +1886,6 @@ static const struct bpf_func_proto sk_skb_pull_data_proto = { .arg2_type = ARG_ANYTHING, }; -BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk) -{ - sk = sk_to_full_sk(sk); - return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL; -} - -static const struct bpf_func_proto bpf_sk_fullsock_proto = { - .func = bpf_sk_fullsock, - .gpl_only = false, - .ret_type = RET_PTR_TO_SOCKET_OR_NULL, - .arg1_type = ARG_PTR_TO_SOCK_COMMON, -}; - BPF_CALL_5(bpf_l3_csum_replace, struct sk_buff *, skb, u32, offset, u64, from, u64, to, u64, flags) { @@ -1754,9 +2015,9 @@ static const struct bpf_func_proto bpf_csum_diff_proto = { .gpl_only = false, .pkt_access = true, .ret_type = RET_INTEGER, - .arg1_type = ARG_PTR_TO_MEM, + .arg1_type = ARG_PTR_TO_MEM_OR_NULL, .arg2_type = ARG_CONST_SIZE_OR_ZERO, - .arg3_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_PTR_TO_MEM_OR_NULL, .arg4_type = ARG_CONST_SIZE_OR_ZERO, .arg5_type = ARG_ANYTHING, }; @@ -1803,17 +2064,18 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb) { int ret; - if (unlikely(__this_cpu_read(xmit_recursion) > XMIT_RECURSION_LIMIT)) { + if (dev_xmit_recursion()) { net_crit_ratelimited("bpf: recursion limit reached on datapath, buggy bpf program?\n"); kfree_skb(skb); return -ENETDOWN; } skb->dev = dev; + skb->tstamp = 0; - __this_cpu_inc(xmit_recursion); + dev_xmit_recursion_inc(); ret = dev_queue_xmit(skb); - __this_cpu_dec(xmit_recursion); + dev_xmit_recursion_dec(); return ret; } @@ -1823,6 +2085,11 @@ static int __bpf_redirect_no_mac(struct sk_buff *skb, struct net_device *dev, { unsigned int mlen = skb_network_offset(skb); + if (unlikely(skb->len <= mlen)) { + kfree_skb(skb); + return -ERANGE; + } + if (mlen) { __skb_pull(skb, mlen); if (unlikely(!skb->len)) { @@ -1848,7 +2115,7 @@ static int __bpf_redirect_common(struct sk_buff *skb, struct net_device *dev, u32 flags) { /* Verify that a link layer header is carried */ - if (unlikely(skb->mac_header >= skb->network_header)) { + if (unlikely(skb->mac_header >= skb->network_header || skb->len == 0)) { kfree_skb(skb); return -ERANGE; } @@ -1907,36 +2174,29 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = { .arg3_type = ARG_ANYTHING, }; -struct redirect_info { - u32 ifindex; - u32 flags; - struct bpf_map *map; - struct bpf_map *map_to_flush; - unsigned long map_owner; -}; - -static DEFINE_PER_CPU(struct redirect_info, redirect_info); +DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info); +EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info); BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags) { - struct redirect_info *ri = this_cpu_ptr(&redirect_info); + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); if (unlikely(flags & ~(BPF_F_INGRESS))) return TC_ACT_SHOT; - ri->ifindex = ifindex; ri->flags = flags; + ri->tgt_index = ifindex; return TC_ACT_REDIRECT; } int skb_do_redirect(struct sk_buff *skb) { - struct redirect_info *ri = this_cpu_ptr(&redirect_info); + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); struct net_device *dev; - dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex); - ri->ifindex = 0; + dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->tgt_index); + ri->tgt_index = 0; if (unlikely(!dev)) { kfree_skb(skb); return -EINVAL; @@ -1953,6 +2213,493 @@ static const struct bpf_func_proto bpf_redirect_proto = { .arg2_type = ARG_ANYTHING, }; +BPF_CALL_2(bpf_msg_apply_bytes, struct sk_msg *, msg, u32, bytes) +{ + msg->apply_bytes = bytes; + return 0; +} + +static const struct bpf_func_proto bpf_msg_apply_bytes_proto = { + .func = bpf_msg_apply_bytes, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, +}; + +BPF_CALL_2(bpf_msg_cork_bytes, struct sk_msg *, msg, u32, bytes) +{ + msg->cork_bytes = bytes; + return 0; +} + +static void sk_msg_reset_curr(struct sk_msg *msg) +{ + if (!msg->sg.size) { + msg->sg.curr = msg->sg.start; + msg->sg.copybreak = 0; + } else { + u32 i = msg->sg.end; + + sk_msg_iter_var_prev(i); + msg->sg.curr = i; + msg->sg.copybreak = msg->sg.data[i].length; + } +} + +static const struct bpf_func_proto bpf_msg_cork_bytes_proto = { + .func = bpf_msg_cork_bytes, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, +}; + +BPF_CALL_4(bpf_msg_pull_data, struct sk_msg *, msg, u32, start, + u32, end, u64, flags) +{ + u32 len = 0, offset = 0, copy = 0, poffset = 0, bytes = end - start; + u32 first_sge, last_sge, i, shift, bytes_sg_total; + struct scatterlist *sge; + u8 *raw, *to, *from; + struct page *page; + + if (unlikely(flags || end <= start)) + return -EINVAL; + + /* First find the starting scatterlist element */ + i = msg->sg.start; + do { + offset += len; + len = sk_msg_elem(msg, i)->length; + if (start < offset + len) + break; + sk_msg_iter_var_next(i); + } while (i != msg->sg.end); + + if (unlikely(start >= offset + len)) + return -EINVAL; + + first_sge = i; + /* The start may point into the sg element so we need to also + * account for the headroom. + */ + bytes_sg_total = start - offset + bytes; + if (!msg->sg.copy[i] && bytes_sg_total <= len) + goto out; + + /* At this point we need to linearize multiple scatterlist + * elements or a single shared page. Either way we need to + * copy into a linear buffer exclusively owned by BPF. Then + * place the buffer in the scatterlist and fixup the original + * entries by removing the entries now in the linear buffer + * and shifting the remaining entries. For now we do not try + * to copy partial entries to avoid complexity of running out + * of sg_entry slots. The downside is reading a single byte + * will copy the entire sg entry. + */ + do { + copy += sk_msg_elem(msg, i)->length; + sk_msg_iter_var_next(i); + if (bytes_sg_total <= copy) + break; + } while (i != msg->sg.end); + last_sge = i; + + if (unlikely(bytes_sg_total > copy)) + return -EINVAL; + + page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP, + get_order(copy)); + if (unlikely(!page)) + return -ENOMEM; + + raw = page_address(page); + i = first_sge; + do { + sge = sk_msg_elem(msg, i); + from = sg_virt(sge); + len = sge->length; + to = raw + poffset; + + memcpy(to, from, len); + poffset += len; + sge->length = 0; + put_page(sg_page(sge)); + + sk_msg_iter_var_next(i); + } while (i != last_sge); + + sg_set_page(&msg->sg.data[first_sge], page, copy, 0); + + /* To repair sg ring we need to shift entries. If we only + * had a single entry though we can just replace it and + * be done. Otherwise walk the ring and shift the entries. + */ + WARN_ON_ONCE(last_sge == first_sge); + shift = last_sge > first_sge ? + last_sge - first_sge - 1 : + NR_MSG_FRAG_IDS - first_sge + last_sge - 1; + if (!shift) + goto out; + + i = first_sge; + sk_msg_iter_var_next(i); + do { + u32 move_from; + + if (i + shift >= NR_MSG_FRAG_IDS) + move_from = i + shift - NR_MSG_FRAG_IDS; + else + move_from = i + shift; + if (move_from == msg->sg.end) + break; + + msg->sg.data[i] = msg->sg.data[move_from]; + msg->sg.data[move_from].length = 0; + msg->sg.data[move_from].page_link = 0; + msg->sg.data[move_from].offset = 0; + sk_msg_iter_var_next(i); + } while (1); + + msg->sg.end = msg->sg.end - shift > msg->sg.end ? + msg->sg.end - shift + NR_MSG_FRAG_IDS : + msg->sg.end - shift; +out: + sk_msg_reset_curr(msg); + msg->data = sg_virt(&msg->sg.data[first_sge]) + start - offset; + msg->data_end = msg->data + bytes; + return 0; +} + +static const struct bpf_func_proto bpf_msg_pull_data_proto = { + .func = bpf_msg_pull_data, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_ANYTHING, +}; + +BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start, + u32, len, u64, flags) +{ + struct scatterlist sge, nsge, nnsge, rsge = {0}, *psge; + u32 new, i = 0, l = 0, space, copy = 0, offset = 0; + u8 *raw, *to, *from; + struct page *page; + + if (unlikely(flags)) + return -EINVAL; + + /* First find the starting scatterlist element */ + i = msg->sg.start; + do { + offset += l; + l = sk_msg_elem(msg, i)->length; + + if (start < offset + l) + break; + sk_msg_iter_var_next(i); + } while (i != msg->sg.end); + + if (start > offset + l) + return -EINVAL; + + space = MAX_MSG_FRAGS - sk_msg_elem_used(msg); + + /* If no space available will fallback to copy, we need at + * least one scatterlist elem available to push data into + * when start aligns to the beginning of an element or two + * when it falls inside an element. We handle the start equals + * offset case because its the common case for inserting a + * header. + */ + if (!space || (space == 1 && start != offset)) + copy = msg->sg.data[i].length; + + page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP, + get_order(copy + len)); + if (unlikely(!page)) + return -ENOMEM; + + if (copy) { + int front, back; + + raw = page_address(page); + + if (i == msg->sg.end) + sk_msg_iter_var_prev(i); + psge = sk_msg_elem(msg, i); + front = start - offset; + back = psge->length - front; + from = sg_virt(psge); + + if (front) + memcpy(raw, from, front); + + if (back) { + from += front; + to = raw + front + len; + + memcpy(to, from, back); + } + + put_page(sg_page(psge)); + new = i; + goto place_new; + } + + if (start - offset) { + if (i == msg->sg.end) + sk_msg_iter_var_prev(i); + psge = sk_msg_elem(msg, i); + rsge = sk_msg_elem_cpy(msg, i); + + psge->length = start - offset; + rsge.length -= psge->length; + rsge.offset += start; + + sk_msg_iter_var_next(i); + sg_unmark_end(psge); + sg_unmark_end(&rsge); + } + + /* Slot(s) to place newly allocated data */ + sk_msg_iter_next(msg, end); + new = i; + sk_msg_iter_var_next(i); + + if (i == msg->sg.end) { + if (!rsge.length) + goto place_new; + sk_msg_iter_next(msg, end); + goto place_new; + } + + /* Shift one or two slots as needed */ + sge = sk_msg_elem_cpy(msg, new); + sg_unmark_end(&sge); + + nsge = sk_msg_elem_cpy(msg, i); + if (rsge.length) { + sk_msg_iter_var_next(i); + nnsge = sk_msg_elem_cpy(msg, i); + sk_msg_iter_next(msg, end); + } + + while (i != msg->sg.end) { + msg->sg.data[i] = sge; + sge = nsge; + sk_msg_iter_var_next(i); + if (rsge.length) { + nsge = nnsge; + nnsge = sk_msg_elem_cpy(msg, i); + } else { + nsge = sk_msg_elem_cpy(msg, i); + } + } + +place_new: + /* Place newly allocated data buffer */ + sk_mem_charge(msg->sk, len); + msg->sg.size += len; + msg->sg.copy[new] = false; + sg_set_page(&msg->sg.data[new], page, len + copy, 0); + if (rsge.length) { + get_page(sg_page(&rsge)); + sk_msg_iter_var_next(new); + msg->sg.data[new] = rsge; + } + + sk_msg_reset_curr(msg); + sk_msg_compute_data_pointers(msg); + return 0; +} + +static const struct bpf_func_proto bpf_msg_push_data_proto = { + .func = bpf_msg_push_data, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_ANYTHING, +}; + +static void sk_msg_shift_left(struct sk_msg *msg, int i) +{ + struct scatterlist *sge = sk_msg_elem(msg, i); + int prev; + + put_page(sg_page(sge)); + do { + prev = i; + sk_msg_iter_var_next(i); + msg->sg.data[prev] = msg->sg.data[i]; + } while (i != msg->sg.end); + + sk_msg_iter_prev(msg, end); +} + +static void sk_msg_shift_right(struct sk_msg *msg, int i) +{ + struct scatterlist tmp, sge; + + sk_msg_iter_next(msg, end); + sge = sk_msg_elem_cpy(msg, i); + sk_msg_iter_var_next(i); + tmp = sk_msg_elem_cpy(msg, i); + + while (i != msg->sg.end) { + msg->sg.data[i] = sge; + sk_msg_iter_var_next(i); + sge = tmp; + tmp = sk_msg_elem_cpy(msg, i); + } +} + +BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start, + u32, len, u64, flags) +{ + u32 i = 0, l = 0, space, offset = 0; + u64 last = start + len; + int pop; + + if (unlikely(flags)) + return -EINVAL; + + if (unlikely(len == 0)) + return 0; + + /* First find the starting scatterlist element */ + i = msg->sg.start; + do { + offset += l; + l = sk_msg_elem(msg, i)->length; + + if (start < offset + l) + break; + sk_msg_iter_var_next(i); + } while (i != msg->sg.end); + + /* Bounds checks: start and pop must be inside message */ + if (start >= offset + l || last > msg->sg.size) + return -EINVAL; + + space = MAX_MSG_FRAGS - sk_msg_elem_used(msg); + + pop = len; + /* --------------| offset + * -| start |-------- len -------| + * + * |----- a ----|-------- pop -------|----- b ----| + * |______________________________________________| length + * + * + * a: region at front of scatter element to save + * b: region at back of scatter element to save when length > A + pop + * pop: region to pop from element, same as input 'pop' here will be + * decremented below per iteration. + * + * Two top-level cases to handle when start != offset, first B is non + * zero and second B is zero corresponding to when a pop includes more + * than one element. + * + * Then if B is non-zero AND there is no space allocate space and + * compact A, B regions into page. If there is space shift ring to + * the rigth free'ing the next element in ring to place B, leaving + * A untouched except to reduce length. + */ + if (start != offset) { + struct scatterlist *nsge, *sge = sk_msg_elem(msg, i); + int a = start - offset; + int b = sge->length - pop - a; + + sk_msg_iter_var_next(i); + + if (b > 0) { + if (space) { + sge->length = a; + sk_msg_shift_right(msg, i); + nsge = sk_msg_elem(msg, i); + get_page(sg_page(sge)); + sg_set_page(nsge, + sg_page(sge), + b, sge->offset + pop + a); + } else { + struct page *page, *orig; + u8 *to, *from; + + page = alloc_pages(__GFP_NOWARN | + __GFP_COMP | GFP_ATOMIC, + get_order(a + b)); + if (unlikely(!page)) + return -ENOMEM; + + orig = sg_page(sge); + from = sg_virt(sge); + to = page_address(page); + memcpy(to, from, a); + memcpy(to + a, from + a + pop, b); + sg_set_page(sge, page, a + b, 0); + put_page(orig); + } + pop = 0; + } else { + pop -= (sge->length - a); + sge->length = a; + } + } + + /* From above the current layout _must_ be as follows, + * + * -| offset + * -| start + * + * |---- pop ---|---------------- b ------------| + * |____________________________________________| length + * + * Offset and start of the current msg elem are equal because in the + * previous case we handled offset != start and either consumed the + * entire element and advanced to the next element OR pop == 0. + * + * Two cases to handle here are first pop is less than the length + * leaving some remainder b above. Simply adjust the element's layout + * in this case. Or pop >= length of the element so that b = 0. In this + * case advance to next element decrementing pop. + */ + while (pop) { + struct scatterlist *sge = sk_msg_elem(msg, i); + + if (pop < sge->length) { + sge->length -= pop; + sge->offset += pop; + pop = 0; + } else { + pop -= sge->length; + sk_msg_shift_left(msg, i); + } + } + + sk_mem_uncharge(msg->sk, len - pop); + msg->sg.size -= (len - pop); + sk_msg_reset_curr(msg); + sk_msg_compute_data_pointers(msg); + return 0; +} + +static const struct bpf_func_proto bpf_msg_pop_data_proto = { + .func = bpf_msg_pop_data, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_ANYTHING, +}; + BPF_CALL_1(bpf_get_cgroup_classid, const struct sk_buff *, skb) { return task_get_classid(skb); @@ -2045,7 +2792,7 @@ BPF_CALL_3(bpf_skb_vlan_push, struct sk_buff *, skb, __be16, vlan_proto, return ret; } -const struct bpf_func_proto bpf_skb_vlan_push_proto = { +static const struct bpf_func_proto bpf_skb_vlan_push_proto = { .func = bpf_skb_vlan_push, .gpl_only = false, .ret_type = RET_INTEGER, @@ -2053,7 +2800,6 @@ const struct bpf_func_proto bpf_skb_vlan_push_proto = { .arg2_type = ARG_ANYTHING, .arg3_type = ARG_ANYTHING, }; -EXPORT_SYMBOL_GPL(bpf_skb_vlan_push_proto); BPF_CALL_1(bpf_skb_vlan_pop, struct sk_buff *, skb) { @@ -2067,13 +2813,12 @@ BPF_CALL_1(bpf_skb_vlan_pop, struct sk_buff *, skb) return ret; } -const struct bpf_func_proto bpf_skb_vlan_pop_proto = { +static const struct bpf_func_proto bpf_skb_vlan_pop_proto = { .func = bpf_skb_vlan_pop, .gpl_only = false, .ret_type = RET_INTEGER, .arg1_type = ARG_PTR_TO_CTX, }; -EXPORT_SYMBOL_GPL(bpf_skb_vlan_pop_proto); static int bpf_skb_generic_push(struct sk_buff *skb, u32 off, u32 len) { @@ -2153,8 +2898,7 @@ static int bpf_skb_proto_4_to_6(struct sk_buff *skb) u32 off = skb_mac_header_len(skb); int ret; - /* SCTP uses GSO_BY_FRAGS, thus cannot adjust it. */ - if (skb_is_gso(skb) && unlikely(skb_is_gso_sctp(skb))) + if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) return -ENOTSUPP; ret = skb_cow(skb, len_diff); @@ -2193,8 +2937,7 @@ static int bpf_skb_proto_6_to_4(struct sk_buff *skb) u32 off = skb_mac_header_len(skb); int ret; - /* SCTP uses GSO_BY_FRAGS, thus cannot adjust it. */ - if (skb_is_gso(skb) && unlikely(skb_is_gso_sctp(skb))) + if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) return -ENOTSUPP; ret = skb_unclone(skb, GFP_ATOMIC); @@ -2229,7 +2972,7 @@ static int bpf_skb_proto_6_to_4(struct sk_buff *skb) static int bpf_skb_proto_xlat(struct sk_buff *skb, __be16 to_proto) { - __be16 from_proto = skb->protocol; + __be16 from_proto = skb_protocol(skb, true); if (from_proto == htons(ETH_P_IP) && to_proto == htons(ETH_P_IPV6)) @@ -2302,7 +3045,7 @@ static const struct bpf_func_proto bpf_skb_change_type_proto = { static u32 bpf_skb_net_base_len(const struct sk_buff *skb) { - switch (skb->protocol) { + switch (skb_protocol(skb, true)) { case htons(ETH_P_IP): return sizeof(struct iphdr); case htons(ETH_P_IPV6): @@ -2312,44 +3055,135 @@ static u32 bpf_skb_net_base_len(const struct sk_buff *skb) } } -static int bpf_skb_net_grow(struct sk_buff *skb, u32 len_diff) +#define BPF_F_ADJ_ROOM_ENCAP_L3_MASK (BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 | \ + BPF_F_ADJ_ROOM_ENCAP_L3_IPV6) + +#define BPF_F_ADJ_ROOM_MASK (BPF_F_ADJ_ROOM_FIXED_GSO | \ + BPF_F_ADJ_ROOM_ENCAP_L3_MASK | \ + BPF_F_ADJ_ROOM_ENCAP_L4_GRE | \ + BPF_F_ADJ_ROOM_ENCAP_L4_UDP | \ + BPF_F_ADJ_ROOM_ENCAP_L2( \ + BPF_ADJ_ROOM_ENCAP_L2_MASK)) + +static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff, + u64 flags) { - u32 off = skb_mac_header_len(skb) + bpf_skb_net_base_len(skb); + u8 inner_mac_len = flags >> BPF_ADJ_ROOM_ENCAP_L2_SHIFT; + bool encap = flags & BPF_F_ADJ_ROOM_ENCAP_L3_MASK; + u16 mac_len = 0, inner_net = 0, inner_trans = 0; + unsigned int gso_type = SKB_GSO_DODGY; int ret; - /* SCTP uses GSO_BY_FRAGS, thus cannot adjust it. */ - if (skb_is_gso(skb) && unlikely(skb_is_gso_sctp(skb))) - return -ENOTSUPP; + if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) { + /* udp gso_size delineates datagrams, only allow if fixed */ + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) || + !(flags & BPF_F_ADJ_ROOM_FIXED_GSO)) + return -ENOTSUPP; + } - ret = skb_cow(skb, len_diff); + ret = skb_cow_head(skb, len_diff); if (unlikely(ret < 0)) return ret; + if (encap) { + if (skb->protocol != htons(ETH_P_IP) && + skb->protocol != htons(ETH_P_IPV6)) + return -ENOTSUPP; + + if (flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 && + flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6) + return -EINVAL; + + if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_GRE && + flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP) + return -EINVAL; + + if (skb->encapsulation) + return -EALREADY; + + mac_len = skb->network_header - skb->mac_header; + inner_net = skb->network_header; + if (inner_mac_len > len_diff) + return -EINVAL; + inner_trans = skb->transport_header; + } + ret = bpf_skb_net_hdr_push(skb, off, len_diff); if (unlikely(ret < 0)) return ret; + if (encap) { + skb->inner_mac_header = inner_net - inner_mac_len; + skb->inner_network_header = inner_net; + skb->inner_transport_header = inner_trans; + skb_set_inner_protocol(skb, skb->protocol); + + skb->encapsulation = 1; + skb_set_network_header(skb, mac_len); + + if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP) + gso_type |= SKB_GSO_UDP_TUNNEL; + else if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_GRE) + gso_type |= SKB_GSO_GRE; + else if (flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6) + gso_type |= SKB_GSO_IPXIP6; + else if (flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV4) + gso_type |= SKB_GSO_IPXIP4; + + if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_GRE || + flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP) { + int nh_len = flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6 ? + sizeof(struct ipv6hdr) : + sizeof(struct iphdr); + + skb_set_transport_header(skb, mac_len + nh_len); + } + + /* Match skb->protocol to new outer l3 protocol */ + if (skb->protocol == htons(ETH_P_IP) && + flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV6) + skb->protocol = htons(ETH_P_IPV6); + else if (skb->protocol == htons(ETH_P_IPV6) && + flags & BPF_F_ADJ_ROOM_ENCAP_L3_IPV4) + skb->protocol = htons(ETH_P_IP); + } + if (skb_is_gso(skb)) { struct skb_shared_info *shinfo = skb_shinfo(skb); - /* Due to header grow, MSS needs to be downgraded. */ - skb_decrease_gso_size(shinfo, len_diff); /* Header must be checked, and gso_segs recomputed. */ - shinfo->gso_type |= SKB_GSO_DODGY; + shinfo->gso_type |= gso_type; shinfo->gso_segs = 0; + + /* Due to header growth, MSS needs to be downgraded. + * There is a BUG_ON() when segmenting the frag_list with + * head_frag true, so linearize the skb after downgrading + * the MSS. + */ + if (!(flags & BPF_F_ADJ_ROOM_FIXED_GSO)) { + skb_decrease_gso_size(shinfo, len_diff); + if (shinfo->frag_list) + return skb_linearize(skb); + } } return 0; } -static int bpf_skb_net_shrink(struct sk_buff *skb, u32 len_diff) +static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff, + u64 flags) { - u32 off = skb_mac_header_len(skb) + bpf_skb_net_base_len(skb); int ret; - /* SCTP uses GSO_BY_FRAGS, thus cannot adjust it. */ - if (skb_is_gso(skb) && unlikely(skb_is_gso_sctp(skb))) - return -ENOTSUPP; + if (flags & ~BPF_F_ADJ_ROOM_FIXED_GSO) + return -EINVAL; + + if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) { + /* udp gso_size delineates datagrams, only allow if fixed */ + if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) || + !(flags & BPF_F_ADJ_ROOM_FIXED_GSO)) + return -ENOTSUPP; + } ret = skb_unclone(skb, GFP_ATOMIC); if (unlikely(ret < 0)) @@ -2363,7 +3197,9 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 len_diff) struct skb_shared_info *shinfo = skb_shinfo(skb); /* Due to header shrink, MSS can be upgraded. */ - skb_increase_gso_size(shinfo, len_diff); + if (!(flags & BPF_F_ADJ_ROOM_FIXED_GSO)) + skb_increase_gso_size(shinfo, len_diff); + /* Header must be checked, and gso_segs recomputed. */ shinfo->gso_type |= SKB_GSO_DODGY; shinfo->gso_segs = 0; @@ -2374,49 +3210,50 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 len_diff) #define BPF_SKB_MAX_LEN SKB_MAX_ALLOC -static int bpf_skb_adjust_net(struct sk_buff *skb, s32 len_diff) +BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, + u32, mode, u64, flags) { - bool trans_same = skb->transport_header == skb->network_header; u32 len_cur, len_diff_abs = abs(len_diff); u32 len_min = bpf_skb_net_base_len(skb); u32 len_max = BPF_SKB_MAX_LEN; - __be16 proto = skb->protocol; + __be16 proto = skb_protocol(skb, true); bool shrink = len_diff < 0; + u32 off; int ret; + if (unlikely(flags & ~BPF_F_ADJ_ROOM_MASK)) + return -EINVAL; if (unlikely(len_diff_abs > 0xfffU)) return -EFAULT; if (unlikely(proto != htons(ETH_P_IP) && proto != htons(ETH_P_IPV6))) return -ENOTSUPP; + off = skb_mac_header_len(skb); + switch (mode) { + case BPF_ADJ_ROOM_NET: + off += bpf_skb_net_base_len(skb); + break; + case BPF_ADJ_ROOM_MAC: + break; + default: + return -ENOTSUPP; + } + len_cur = skb->len - skb_network_offset(skb); - if (skb_transport_header_was_set(skb) && !trans_same) - len_cur = skb_network_header_len(skb); if ((shrink && (len_diff_abs >= len_cur || len_cur - len_diff_abs < len_min)) || (!shrink && (skb->len + len_diff_abs > len_max && !skb_is_gso(skb)))) return -ENOTSUPP; - ret = shrink ? bpf_skb_net_shrink(skb, len_diff_abs) : - bpf_skb_net_grow(skb, len_diff_abs); + ret = shrink ? bpf_skb_net_shrink(skb, off, len_diff_abs, flags) : + bpf_skb_net_grow(skb, off, len_diff_abs, flags); bpf_compute_data_pointers(skb); return ret; } -BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff, - u32, mode, u64, flags) -{ - if (unlikely(flags)) - return -EINVAL; - if (likely(mode == BPF_ADJ_ROOM_NET)) - return bpf_skb_adjust_net(skb, len_diff); - - return -ENOTSUPP; -} - static const struct bpf_func_proto bpf_skb_adjust_room_proto = { .func = bpf_skb_adjust_room, .gpl_only = false, @@ -2429,13 +3266,22 @@ static const struct bpf_func_proto bpf_skb_adjust_room_proto = { static u32 __bpf_skb_min_len(const struct sk_buff *skb) { - u32 min_len = skb_network_offset(skb); + int offset = skb_network_offset(skb); + u32 min_len = 0; - if (skb_transport_header_was_set(skb)) - min_len = skb_transport_offset(skb); - if (skb->ip_summed == CHECKSUM_PARTIAL) - min_len = skb_checksum_start_offset(skb) + - skb->csum_offset + sizeof(__sum16); + if (offset > 0) + min_len = offset; + if (skb_transport_header_was_set(skb)) { + offset = skb_transport_offset(skb); + if (offset > 0) + min_len = offset; + } + if (skb->ip_summed == CHECKSUM_PARTIAL) { + offset = skb_checksum_start_offset(skb) + + skb->csum_offset + sizeof(__sum16); + if (offset > 0) + min_len = offset; + } return min_len; } @@ -2499,6 +3345,7 @@ BPF_CALL_3(bpf_skb_change_tail, struct sk_buff *, skb, u32, new_len, u64, flags) { int ret = __bpf_skb_change_tail(skb, new_len, flags); + bpf_compute_data_pointers(skb); return ret; } @@ -2516,6 +3363,7 @@ BPF_CALL_3(sk_skb_change_tail, struct sk_buff *, skb, u32, new_len, u64, flags) { int ret = __bpf_skb_change_tail(skb, new_len, flags); + bpf_compute_data_end_sk_skb(skb); return ret; } @@ -2559,6 +3407,7 @@ static inline int __bpf_skb_change_head(struct sk_buff *skb, u32 head_room, return ret; } + BPF_CALL_3(bpf_skb_change_head, struct sk_buff *, skb, u32, head_room, u64, flags) { @@ -2581,6 +3430,7 @@ BPF_CALL_3(sk_skb_change_head, struct sk_buff *, skb, u32, head_room, u64, flags) { int ret = __bpf_skb_change_head(skb, head_room, flags); + bpf_compute_data_end_sk_skb(skb); return ret; } @@ -2593,7 +3443,6 @@ static const struct bpf_func_proto sk_skb_change_head_proto = { .arg2_type = ARG_ANYTHING, .arg3_type = ARG_ANYTHING, }; - static unsigned long xdp_get_metalen(const struct xdp_buff *xdp) { return xdp_data_meta_unsupported(xdp) ? 0 : @@ -2602,8 +3451,9 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp) BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset) { + void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame); unsigned long metalen = xdp_get_metalen(xdp); - void *data_start = xdp->data_hard_start + metalen; + void *data_start = xdp_frame_end + metalen; void *data = xdp->data + offset; if (unlikely(data < data_start || @@ -2614,6 +3464,7 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset) memmove(xdp->data_meta + offset, xdp->data_meta, metalen); xdp->data_meta += offset; + xdp->data = data; return 0; } @@ -2626,21 +3477,50 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = { .arg2_type = ARG_ANYTHING, }; +BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset) +{ + void *data_end = xdp->data_end + offset; + + /* only shrinking is allowed for now. */ + if (unlikely(offset >= 0)) + return -EINVAL; + + if (unlikely(data_end < xdp->data + ETH_HLEN)) + return -EINVAL; + + xdp->data_end = data_end; + + return 0; +} + +static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = { + .func = bpf_xdp_adjust_tail, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, +}; + BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset) { + void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame); void *meta = xdp->data_meta + offset; unsigned long metalen = xdp->data - meta; + if (xdp_data_meta_unsupported(xdp)) return -ENOTSUPP; - if (unlikely(meta < xdp->data_hard_start || + if (unlikely(meta < xdp_frame_end || meta > xdp->data)) return -EINVAL; if (unlikely((metalen & (sizeof(__u32) - 1)) || (metalen > 32))) return -EACCES; + xdp->data_meta = meta; + return 0; } + static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = { .func = bpf_xdp_adjust_meta, .gpl_only = false, @@ -2654,95 +3534,37 @@ static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp, u32 index) { - int err; + struct xdp_frame *xdpf; + int err, sent; if (!dev->netdev_ops->ndo_xdp_xmit) { return -EOPNOTSUPP; } - err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp); - if (err) - return err; - if (map) - __dev_map_insert_ctx(map, index); - else - dev->netdev_ops->ndo_xdp_flush(dev); - return 0; -} - -void xdp_do_flush_map(void) -{ - struct redirect_info *ri = this_cpu_ptr(&redirect_info); - struct bpf_map *map = ri->map_to_flush; - - ri->map_to_flush = NULL; - if (map) - __dev_map_flush(map); -} -EXPORT_SYMBOL_GPL(xdp_do_flush_map); - -static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog, - unsigned long aux) -{ - return (unsigned long)xdp_prog->aux != aux; -} - -static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp, - struct bpf_prog *xdp_prog) -{ - struct redirect_info *ri = this_cpu_ptr(&redirect_info); - unsigned long map_owner = ri->map_owner; - struct bpf_map *map = ri->map; - struct net_device *fwd = NULL; - u32 index = ri->ifindex; - int err; - - ri->ifindex = 0; - ri->map = NULL; - ri->map_owner = 0; - - if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) { - err = -EFAULT; - map = NULL; - goto err; - } - - if (map->map_type == BPF_MAP_TYPE_DEVMAP) - fwd = __dev_map_lookup_elem(map, index); - else if (map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) - fwd = __dev_map_hash_lookup_elem(map, index); - if (!fwd) { - err = -EINVAL; - goto err; - } - if (ri->map_to_flush && ri->map_to_flush != map) - xdp_do_flush_map(); - - err = __bpf_tx_xdp(fwd, map, xdp, index); + err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data); if (unlikely(err)) - goto err; + return err; - ri->map_to_flush = map; - _trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index); + xdpf = convert_to_xdp_frame(xdp); + if (unlikely(!xdpf)) + return -EOVERFLOW; + + sent = dev->netdev_ops->ndo_xdp_xmit(dev, 1, &xdpf, XDP_XMIT_FLUSH); + if (sent <= 0) + return sent; return 0; -err: - _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err); - return err; } -int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, - struct bpf_prog *xdp_prog) +static noinline int +xdp_do_redirect_slow(struct net_device *dev, struct xdp_buff *xdp, + struct bpf_prog *xdp_prog, struct bpf_redirect_info *ri) { - struct redirect_info *ri = this_cpu_ptr(&redirect_info); struct net_device *fwd; - u32 index = ri->ifindex; + u32 index = ri->tgt_index; int err; - if (ri->map) - return xdp_do_redirect_map(dev, xdp, xdp_prog); - fwd = dev_get_by_index_rcu(dev_net(dev), index); - ri->ifindex = 0; + ri->tgt_index = 0; if (unlikely(!fwd)) { err = -EINVAL; goto err; @@ -2758,74 +3580,228 @@ err: _trace_xdp_redirect_err(dev, xdp_prog, index, err); return err; } + +static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd, + struct bpf_map *map, + struct xdp_buff *xdp, + u32 index) +{ + int err; + + switch (map->map_type) { + case BPF_MAP_TYPE_DEVMAP: + case BPF_MAP_TYPE_DEVMAP_HASH: { + struct bpf_dtab_netdev *dst = fwd; + + err = dev_map_enqueue(dst, xdp, dev_rx); + if (unlikely(err)) + return err; + break; + } + case BPF_MAP_TYPE_CPUMAP: { + struct bpf_cpu_map_entry *rcpu = fwd; + + err = cpu_map_enqueue(rcpu, xdp, dev_rx); + if (unlikely(err)) + return err; + break; + } + case BPF_MAP_TYPE_XSKMAP: { + struct xdp_sock *xs = fwd; + + err = __xsk_map_redirect(map, xdp, xs); + return err; + } + default: + return -EBADRQC; + } + return 0; +} + +void xdp_do_flush_map(void) +{ + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_map *map = ri->map_to_flush; + + ri->map_to_flush = NULL; + if (map) { + switch (map->map_type) { + case BPF_MAP_TYPE_DEVMAP: + case BPF_MAP_TYPE_DEVMAP_HASH: + __dev_map_flush(map); + break; + case BPF_MAP_TYPE_CPUMAP: + __cpu_map_flush(map); + break; + case BPF_MAP_TYPE_XSKMAP: + __xsk_map_flush(map); + break; + default: + break; + } + } +} +EXPORT_SYMBOL_GPL(xdp_do_flush_map); + +static inline void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index) +{ + switch (map->map_type) { + case BPF_MAP_TYPE_DEVMAP: + return __dev_map_lookup_elem(map, index); + case BPF_MAP_TYPE_DEVMAP_HASH: + return __dev_map_hash_lookup_elem(map, index); + case BPF_MAP_TYPE_CPUMAP: + return __cpu_map_lookup_elem(map, index); + case BPF_MAP_TYPE_XSKMAP: + return __xsk_map_lookup_elem(map, index); + default: + return NULL; + } +} + +void bpf_clear_redirect_map(struct bpf_map *map) +{ + struct bpf_redirect_info *ri; + int cpu; + + for_each_possible_cpu(cpu) { + ri = per_cpu_ptr(&bpf_redirect_info, cpu); + /* Avoid polluting remote cacheline due to writes if + * not needed. Once we pass this test, we need the + * cmpxchg() to make sure it hasn't been changed in + * the meantime by remote CPU. + */ + if (unlikely(READ_ONCE(ri->map) == map)) + cmpxchg(&ri->map, map, NULL); + } +} + +static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp, + struct bpf_prog *xdp_prog, struct bpf_map *map, + struct bpf_redirect_info *ri) +{ + u32 index = ri->tgt_index; + void *fwd = ri->tgt_value; + int err; + + ri->tgt_index = 0; + ri->tgt_value = NULL; + WRITE_ONCE(ri->map, NULL); + + if (ri->map_to_flush && unlikely(ri->map_to_flush != map)) + xdp_do_flush_map(); + + err = __bpf_tx_xdp_map(dev, fwd, map, xdp, index); + if (unlikely(err)) + goto err; + + ri->map_to_flush = map; + _trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index); + return 0; +err: + _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err); + return err; +} + +int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, + struct bpf_prog *xdp_prog) +{ + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_map *map = READ_ONCE(ri->map); + + if (likely(map)) + return xdp_do_redirect_map(dev, xdp, xdp_prog, map, ri); + + return xdp_do_redirect_slow(dev, xdp, xdp_prog, ri); +} EXPORT_SYMBOL_GPL(xdp_do_redirect); -int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb, - struct bpf_prog *xdp_prog) +static int xdp_do_generic_redirect_map(struct net_device *dev, + struct sk_buff *skb, + struct xdp_buff *xdp, + struct bpf_prog *xdp_prog, + struct bpf_map *map) { - struct redirect_info *ri = this_cpu_ptr(&redirect_info); - unsigned long map_owner = ri->map_owner; - struct bpf_map *map = ri->map; - struct net_device *fwd = NULL; - u32 index = ri->ifindex; - unsigned int len; + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + u32 index = ri->tgt_index; + void *fwd = ri->tgt_value; int err = 0; - ri->ifindex = 0; - ri->map = NULL; - ri->map_owner = 0; + ri->tgt_index = 0; + ri->tgt_value = NULL; + WRITE_ONCE(ri->map, NULL); - if (map) { - if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) { - err = -EFAULT; - map = NULL; + if (map->map_type == BPF_MAP_TYPE_DEVMAP || + map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) { + struct bpf_dtab_netdev *dst = fwd; + + err = dev_map_generic_redirect(dst, skb, xdp_prog); + if (unlikely(err)) goto err; - } - if (map->map_type == BPF_MAP_TYPE_DEVMAP) - fwd = __dev_map_lookup_elem(map, index); - else if (map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) - fwd = __dev_map_hash_lookup_elem(map, index); + } else if (map->map_type == BPF_MAP_TYPE_XSKMAP) { + struct xdp_sock *xs = fwd; + + err = xsk_generic_rcv(xs, xdp); + if (err) + goto err; + consume_skb(skb); } else { - fwd = dev_get_by_index_rcu(dev_net(dev), index); + /* TODO: Handle BPF_MAP_TYPE_CPUMAP */ + err = -EBADRQC; + goto err; } + + _trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index); + return 0; +err: + _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err); + return err; +} + +int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb, + struct xdp_buff *xdp, struct bpf_prog *xdp_prog) +{ + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); + struct bpf_map *map = READ_ONCE(ri->map); + u32 index = ri->tgt_index; + struct net_device *fwd; + int err = 0; + + if (map) + return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog, + map); + ri->tgt_index = 0; + fwd = dev_get_by_index_rcu(dev_net(dev), index); if (unlikely(!fwd)) { err = -EINVAL; goto err; } - if (unlikely(!(fwd->flags & IFF_UP))) { - err = -ENETDOWN; + err = xdp_ok_fwd_dev(fwd, skb->len); + if (unlikely(err)) goto err; - } - - len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN; - if (skb->len > len) { - err = -EMSGSIZE; - goto err; - } skb->dev = fwd; - map ? _trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index) - : _trace_xdp_redirect(dev, xdp_prog, index); + _trace_xdp_redirect(dev, xdp_prog, index); + generic_xdp_tx(skb, xdp_prog); return 0; err: - map ? _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err) - : _trace_xdp_redirect_err(dev, xdp_prog, index, err); + _trace_xdp_redirect_err(dev, xdp_prog, index, err); return err; } EXPORT_SYMBOL_GPL(xdp_do_generic_redirect); BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags) { - struct redirect_info *ri = this_cpu_ptr(&redirect_info); + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); if (unlikely(flags)) return XDP_ABORTED; - ri->ifindex = ifindex; ri->flags = flags; - ri->map = NULL; - ri->map_owner = 0; + ri->tgt_index = ifindex; + ri->tgt_value = NULL; + WRITE_ONCE(ri->map, NULL); return XDP_REDIRECT; } @@ -2838,25 +3814,33 @@ static const struct bpf_func_proto bpf_xdp_redirect_proto = { .arg2_type = ARG_ANYTHING, }; -BPF_CALL_4(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex, u64, flags, - unsigned long, map_owner) +BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex, + u64, flags) { - struct redirect_info *ri = this_cpu_ptr(&redirect_info); + struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info); - if (unlikely(flags)) + /* Lower bits of the flags are used as return code on lookup failure */ + if (unlikely(flags > XDP_TX)) return XDP_ABORTED; - ri->ifindex = ifindex; + ri->tgt_value = __xdp_map_lookup_elem(map, ifindex); + if (unlikely(!ri->tgt_value)) { + /* If the lookup fails we want to clear out the state in the + * redirect_info struct completely, so that if an eBPF program + * performs multiple lookups, the last one always takes + * precedence. + */ + WRITE_ONCE(ri->map, NULL); + return flags; + } + ri->flags = flags; - ri->map = map; - ri->map_owner = map_owner; + ri->tgt_index = ifindex; + WRITE_ONCE(ri->map, map); return XDP_REDIRECT; } -/* Note, arg4 is hidden from users and populated by the verifier - * with the right pointer. - */ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = { .func = bpf_xdp_redirect_map, .gpl_only = false, @@ -2866,372 +3850,6 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = { .arg3_type = ARG_ANYTHING, }; -#ifdef CONFIG_INET -struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple, - int dif, int sdif, u8 family, u8 proto) -{ - bool refcounted = false; - struct sock *sk = NULL; - if (family == AF_INET) { - __be32 src4 = tuple->ipv4.saddr; - __be32 dst4 = tuple->ipv4.daddr; - - if (proto == IPPROTO_TCP) - sk = __inet_lookup(net, &tcp_hashinfo, NULL, 0, - src4, tuple->ipv4.sport, - dst4, tuple->ipv4.dport, - dif, sdif, &refcounted); - else - sk = __udp4_lib_lookup(net, src4, tuple->ipv4.sport, - dst4, tuple->ipv4.dport, - dif, sdif, &udp_table, NULL); -#if IS_ENABLED(CONFIG_IPV6) - } else { - struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr; - struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr; - - if (proto == IPPROTO_TCP) - sk = __inet6_lookup(net, &tcp_hashinfo, NULL, 0, - src6, tuple->ipv6.sport, - dst6, tuple->ipv6.dport, - dif, sdif, &refcounted); - else - sk = __udp6_lib_lookup(net, src6, tuple->ipv6.sport, - dst6, tuple->ipv6.dport, - dif, sdif, &udp_table, NULL); -#endif - } - if (unlikely(sk && !refcounted && !sock_flag(sk, SOCK_RCU_FREE))) { - WARN_ONCE(1, "Found non-RCU, unreferenced socket!"); - sk = NULL; - } - return sk; -} -/* bpf_sk_lookup performs the core lookup for different types of sockets, - * taking a reference on the socket if it doesn't have the flag SOCK_RCU_FREE. - * Returns the socket as an 'unsigned long' to simplify the casting in the - * callers to satisfy BPF_CALL declarations. - */ -static unsigned long -__bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len, - struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id, - u64 flags) -{ - struct sock *sk = NULL; - u8 family = AF_UNSPEC; - struct net *net; - int sdif; - - family = len == sizeof(tuple->ipv4) ? AF_INET : AF_INET6; - - if (unlikely(family == AF_UNSPEC || netns_id > U32_MAX || flags)) - goto out; - - if (family == AF_INET) - sdif = inet_sdif(skb); - else - sdif = inet6_sdif(skb); - - if (netns_id) { - net = get_net_ns_by_id(caller_net, netns_id); - if (unlikely(!net)) - goto out; - sk = sk_lookup(net, tuple, ifindex, sdif, family, proto); - put_net(net); - } else { - net = caller_net; - sk = sk_lookup(net, tuple, ifindex, sdif, family, proto); - } - if (sk) - sk = sk_to_full_sk(sk); -out: - return (unsigned long) sk; -} - -static unsigned long -bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len, - u8 proto, u64 netns_id, u64 flags) -{ - struct net *caller_net; - int ifindex; - - if (skb->dev) { - caller_net = dev_net(skb->dev); - ifindex = skb->dev->ifindex; - } else { - caller_net = sock_net(skb->sk); - ifindex = 0; - } - - return __bpf_sk_lookup(skb, tuple, len, caller_net, ifindex, - proto, netns_id, flags); -} - -BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb, - struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) -{ - return bpf_sk_lookup(skb, tuple, len, IPPROTO_TCP, netns_id, flags); -} -static const struct bpf_func_proto bpf_sk_lookup_tcp_proto = { - .func = bpf_sk_lookup_tcp, - .gpl_only = false, - .pkt_access = true, - .ret_type = RET_PTR_TO_SOCKET_OR_NULL, - .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_PTR_TO_MEM, - .arg3_type = ARG_CONST_SIZE, - .arg4_type = ARG_ANYTHING, - .arg5_type = ARG_ANYTHING, -}; -BPF_CALL_5(bpf_sk_lookup_udp, struct sk_buff *, skb, - struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) -{ - return bpf_sk_lookup(skb, tuple, len, IPPROTO_UDP, netns_id, flags); -} -static const struct bpf_func_proto bpf_sk_lookup_udp_proto = { - .func = bpf_sk_lookup_udp, - .gpl_only = false, - .pkt_access = true, - .ret_type = RET_PTR_TO_SOCKET_OR_NULL, - .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_PTR_TO_MEM, - .arg3_type = ARG_CONST_SIZE, - .arg4_type = ARG_ANYTHING, - .arg5_type = ARG_ANYTHING, -}; -BPF_CALL_1(bpf_sk_release, struct sock *, sk) -{ - if (!sock_flag(sk, SOCK_RCU_FREE)) - sock_gen_put(sk); - return 0; -} -static const struct bpf_func_proto bpf_sk_release_proto = { - .func = bpf_sk_release, - .gpl_only = false, - .ret_type = RET_INTEGER, - .arg1_type = ARG_PTR_TO_SOCKET, -}; - -BPF_CALL_5(bpf_xdp_sk_lookup_udp, struct xdp_buff *, ctx, - struct bpf_sock_tuple *, tuple, u32, len, u32, netns_id, u64, flags) -{ - struct net *caller_net = dev_net(ctx->rxq->dev); - int ifindex = ctx->rxq->dev->ifindex; - - return __bpf_sk_lookup(NULL, tuple, len, caller_net, ifindex, - IPPROTO_UDP, netns_id, flags); -} - -static const struct bpf_func_proto bpf_xdp_sk_lookup_udp_proto = { - .func = bpf_xdp_sk_lookup_udp, - .gpl_only = false, - .pkt_access = true, - .ret_type = RET_PTR_TO_SOCKET_OR_NULL, - .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_PTR_TO_MEM, - .arg3_type = ARG_CONST_SIZE, - .arg4_type = ARG_ANYTHING, - .arg5_type = ARG_ANYTHING, -}; - -BPF_CALL_5(bpf_xdp_sk_lookup_tcp, struct xdp_buff *, ctx, - struct bpf_sock_tuple *, tuple, u32, len, u32, netns_id, u64, flags) -{ - struct net *caller_net = dev_net(ctx->rxq->dev); - int ifindex = ctx->rxq->dev->ifindex; - - return __bpf_sk_lookup(NULL, tuple, len, caller_net, ifindex, - IPPROTO_TCP, netns_id, flags); -} - -static const struct bpf_func_proto bpf_xdp_sk_lookup_tcp_proto = { - .func = bpf_xdp_sk_lookup_tcp, - .gpl_only = false, - .pkt_access = true, - .ret_type = RET_PTR_TO_SOCKET_OR_NULL, - .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_PTR_TO_MEM, - .arg3_type = ARG_CONST_SIZE, - .arg4_type = ARG_ANYTHING, - .arg5_type = ARG_ANYTHING, -}; - -BPF_CALL_5(bpf_sock_addr_sk_lookup_tcp, struct bpf_sock_addr_kern *, ctx, - struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) -{ - return __bpf_sk_lookup(NULL, tuple, len, sock_net(ctx->sk), 0, - IPPROTO_TCP, netns_id, flags); -} - -static const struct bpf_func_proto bpf_sock_addr_sk_lookup_tcp_proto = { - .func = bpf_sock_addr_sk_lookup_tcp, - .gpl_only = false, - .ret_type = RET_PTR_TO_SOCKET_OR_NULL, - .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_PTR_TO_MEM, - .arg3_type = ARG_CONST_SIZE, - .arg4_type = ARG_ANYTHING, - .arg5_type = ARG_ANYTHING, -}; - -BPF_CALL_5(bpf_sock_addr_sk_lookup_udp, struct bpf_sock_addr_kern *, ctx, - struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) -{ - return __bpf_sk_lookup(NULL, tuple, len, sock_net(ctx->sk), 0, - IPPROTO_UDP, netns_id, flags); -} - -static const struct bpf_func_proto bpf_sock_addr_sk_lookup_udp_proto = { - .func = bpf_sock_addr_sk_lookup_udp, - .gpl_only = false, - .ret_type = RET_PTR_TO_SOCKET_OR_NULL, - .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_PTR_TO_MEM, - .arg3_type = ARG_CONST_SIZE, - .arg4_type = ARG_ANYTHING, - .arg5_type = ARG_ANYTHING, -}; - -bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, - struct bpf_insn_access_aux *info) -{ - if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, bytes_acked)) - return false; - if (off % size != 0) - return false; - switch (off) { - case offsetof(struct bpf_tcp_sock, bytes_received): - case offsetof(struct bpf_tcp_sock, bytes_acked): - return size == sizeof(__u64); - default: - return size == sizeof(__u32); - } -} - -#define CONVERT_COMMON_TCP_SOCK_FIELDS(md_type, CONVERT) \ -do { \ - switch (si->off) { \ - case offsetof(md_type, snd_cwnd): \ - CONVERT(snd_cwnd); break; \ - case offsetof(md_type, srtt_us): \ - CONVERT(srtt_us); break; \ - case offsetof(md_type, snd_ssthresh): \ - CONVERT(snd_ssthresh); break; \ - case offsetof(md_type, rcv_nxt): \ - CONVERT(rcv_nxt); break; \ - case offsetof(md_type, snd_nxt): \ - CONVERT(snd_nxt); break; \ - case offsetof(md_type, snd_una): \ - CONVERT(snd_una); break; \ - case offsetof(md_type, mss_cache): \ - CONVERT(mss_cache); break; \ - case offsetof(md_type, ecn_flags): \ - CONVERT(ecn_flags); break; \ - case offsetof(md_type, rate_delivered): \ - CONVERT(rate_delivered); break; \ - case offsetof(md_type, rate_interval_us): \ - CONVERT(rate_interval_us); break; \ - case offsetof(md_type, packets_out): \ - CONVERT(packets_out); break; \ - case offsetof(md_type, retrans_out): \ - CONVERT(retrans_out); break; \ - case offsetof(md_type, total_retrans): \ - CONVERT(total_retrans); break; \ - case offsetof(md_type, segs_in): \ - CONVERT(segs_in); break; \ - case offsetof(md_type, data_segs_in): \ - CONVERT(data_segs_in); break; \ - case offsetof(md_type, segs_out): \ - CONVERT(segs_out); break; \ - case offsetof(md_type, data_segs_out): \ - CONVERT(data_segs_out); break; \ - case offsetof(md_type, lost_out): \ - CONVERT(lost_out); break; \ - case offsetof(md_type, sacked_out): \ - CONVERT(sacked_out); break; \ - case offsetof(md_type, bytes_received): \ - CONVERT(bytes_received); break; \ - case offsetof(md_type, bytes_acked): \ - CONVERT(bytes_acked); break; \ - } \ -} while (0) - -u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type, - const struct bpf_insn *si, - struct bpf_insn *insn_buf, - struct bpf_prog *prog, u32 *target_size) -{ - struct bpf_insn *insn = insn_buf; - -#define BPF_TCP_SOCK_GET_COMMON(FIELD) \ - do { \ - BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD) > \ - FIELD_SIZEOF(struct bpf_tcp_sock, FIELD)); \ - *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_sock, FIELD),\ - si->dst_reg, si->src_reg, \ - offsetof(struct tcp_sock, FIELD)); \ - } while (0) - CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_tcp_sock, - BPF_TCP_SOCK_GET_COMMON); - if (insn > insn_buf) - return insn - insn_buf; - - switch (si->off) { - case offsetof(struct bpf_tcp_sock, rtt_min): - BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) != - sizeof(struct minmax)); - BUILD_BUG_ON(sizeof(struct minmax) < - sizeof(struct minmax_sample)); - *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, - offsetof(struct tcp_sock, rtt_min) + - offsetof(struct minmax_sample, v)); - break; - } - return insn - insn_buf; -} - -BPF_CALL_1(bpf_tcp_sock, struct sock *, sk) -{ - sk = sk_to_full_sk(sk); - - if (sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP) - return (unsigned long)sk; - - return (unsigned long)NULL; -} - -const struct bpf_func_proto bpf_tcp_sock_proto = { - .func = bpf_tcp_sock, - .gpl_only = false, - .ret_type = RET_PTR_TO_TCP_SOCK_OR_NULL, - .arg1_type = ARG_PTR_TO_SOCK_COMMON, -}; - -#endif /* CONFIG_INET */ - -bool bpf_helper_changes_pkt_data(void *func) -{ - if (func == bpf_skb_vlan_push || - func == bpf_skb_vlan_pop || - func == bpf_skb_store_bytes || - func == bpf_skb_change_proto || - func == bpf_skb_change_head || - func == sk_skb_change_head || - func == bpf_skb_change_tail || - func == sk_skb_change_tail || - func == bpf_skb_adjust_room || - func == bpf_skb_pull_data || - func == sk_skb_pull_data || - func == bpf_clone_redirect || - func == bpf_l3_csum_replace || - func == bpf_l4_csum_replace || - func == bpf_xdp_adjust_head || - func == bpf_xdp_adjust_meta) - return true; - - return false; -} - static unsigned long bpf_skb_copy(void *dst_buff, const void *skb, unsigned long off, unsigned long len) { @@ -3267,7 +3885,7 @@ static const struct bpf_func_proto bpf_skb_event_output_proto = { .arg2_type = ARG_CONST_MAP_PTR, .arg3_type = ARG_ANYTHING, .arg4_type = ARG_PTR_TO_MEM, - .arg5_type = ARG_CONST_SIZE, + .arg5_type = ARG_CONST_SIZE_OR_ZERO, }; static unsigned short bpf_tunnel_key_af(u64 flags) @@ -3314,6 +3932,7 @@ set_compat: to->tunnel_id = be64_to_cpu(info->key.tun_id); to->tunnel_tos = info->key.tos; to->tunnel_ttl = info->key.ttl; + to->tunnel_ext = 0; if (flags & BPF_F_TUNINFO_IPV6) { memcpy(to->remote_ipv6, &info->key.u.ipv6.src, @@ -3321,6 +3940,8 @@ set_compat: to->tunnel_label = be32_to_cpu(info->key.label); } else { to->remote_ipv4 = be32_to_cpu(info->key.u.ipv4.src); + memset(&to->remote_ipv6[1], 0, sizeof(__u32) * 3); + to->tunnel_label = 0; } if (unlikely(size != sizeof(struct bpf_tunnel_key))) @@ -3386,7 +4007,7 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb, struct ip_tunnel_info *info; if (unlikely(flags & ~(BPF_F_TUNINFO_IPV6 | BPF_F_ZERO_CSUM_TX | - BPF_F_DONT_FRAGMENT))) + BPF_F_DONT_FRAGMENT | BPF_F_SEQ_NUMBER))) return -EINVAL; if (unlikely(size != sizeof(struct bpf_tunnel_key))) { switch (size) { @@ -3413,11 +4034,16 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb, skb_dst_set(skb, (struct dst_entry *) md); info = &md->u.tun_info; + memset(info, 0, sizeof(*info)); info->mode = IP_TUNNEL_INFO_TX; info->key.tun_flags = TUNNEL_KEY | TUNNEL_CSUM | TUNNEL_NOCACHE; if (flags & BPF_F_DONT_FRAGMENT) info->key.tun_flags |= TUNNEL_DONT_FRAGMENT; + if (flags & BPF_F_ZERO_CSUM_TX) + info->key.tun_flags &= ~TUNNEL_CSUM; + if (flags & BPF_F_SEQ_NUMBER) + info->key.tun_flags |= TUNNEL_SEQ; info->key.tun_id = cpu_to_be64(from->tunnel_id); info->key.tos = from->tunnel_tos; @@ -3431,8 +4057,6 @@ BPF_CALL_4(bpf_skb_set_tunnel_key, struct sk_buff *, skb, IPV6_FLOWLABEL_MASK; } else { info->key.u.ipv4.dst = cpu_to_be32(from->remote_ipv4); - if (flags & BPF_F_ZERO_CSUM_TX) - info->key.tun_flags &= ~TUNNEL_CSUM; } return 0; @@ -3459,7 +4083,7 @@ BPF_CALL_3(bpf_skb_set_tunnel_opt, struct sk_buff *, skb, if (unlikely(size > IP_TUNNEL_OPTS_MAX)) return -ENOMEM; - ip_tunnel_info_opts_set(info, from, size); + ip_tunnel_info_opts_set(info, from, size, TUNNEL_OPTIONS_PRESENT); return 0; } @@ -3477,14 +4101,15 @@ static const struct bpf_func_proto * bpf_get_skb_set_tunnel_proto(enum bpf_func_id which) { if (!md_dst) { - /* Race is not possible, since it's called from verifier - * that is holding verifier mutex. - */ - md_dst = metadata_dst_alloc_percpu(IP_TUNNEL_OPTS_MAX, - METADATA_IP_TUNNEL, - GFP_KERNEL); - if (!md_dst) + struct metadata_dst __percpu *tmp; + + tmp = metadata_dst_alloc_percpu(IP_TUNNEL_OPTS_MAX, + METADATA_IP_TUNNEL, + GFP_KERNEL); + if (!tmp) return NULL; + if (cmpxchg(&md_dst, NULL, tmp)) + metadata_dst_free_percpu(tmp); } switch (which) { @@ -3526,6 +4151,53 @@ static const struct bpf_func_proto bpf_skb_under_cgroup_proto = { .arg3_type = ARG_ANYTHING, }; +#ifdef CONFIG_SOCK_CGROUP_DATA +BPF_CALL_1(bpf_skb_cgroup_id, const struct sk_buff *, skb) +{ + struct sock *sk = skb_to_full_sk(skb); + struct cgroup *cgrp; + + if (!sk || !sk_fullsock(sk)) + return 0; + + cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data); + return cgrp->kn->id.id; +} + +static const struct bpf_func_proto bpf_skb_cgroup_id_proto = { + .func = bpf_skb_cgroup_id, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + +BPF_CALL_2(bpf_skb_ancestor_cgroup_id, const struct sk_buff *, skb, int, + ancestor_level) +{ + struct sock *sk = skb_to_full_sk(skb); + struct cgroup *ancestor; + struct cgroup *cgrp; + + if (!sk || !sk_fullsock(sk)) + return 0; + + cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data); + ancestor = cgroup_ancestor(cgrp, ancestor_level); + if (!ancestor) + return 0; + + return ancestor->kn->id.id; +} + +static const struct bpf_func_proto bpf_skb_ancestor_cgroup_id_proto = { + .func = bpf_skb_ancestor_cgroup_id, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, +}; +#endif + static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff, unsigned long off, unsigned long len) { @@ -3555,7 +4227,7 @@ static const struct bpf_func_proto bpf_xdp_event_output_proto = { .arg2_type = ARG_CONST_MAP_PTR, .arg3_type = ARG_ANYTHING, .arg4_type = ARG_PTR_TO_MEM, - .arg5_type = ARG_CONST_SIZE, + .arg5_type = ARG_CONST_SIZE_OR_ZERO, }; BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb) @@ -3570,6 +4242,30 @@ static const struct bpf_func_proto bpf_get_socket_cookie_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +BPF_CALL_1(bpf_get_socket_cookie_sock_addr, struct bpf_sock_addr_kern *, ctx) +{ + return sock_gen_cookie(ctx->sk); +} + +static const struct bpf_func_proto bpf_get_socket_cookie_sock_addr_proto = { + .func = bpf_get_socket_cookie_sock_addr, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + +BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx) +{ + return sock_gen_cookie(ctx->sk); +} + +static const struct bpf_func_proto bpf_get_socket_cookie_sock_ops_proto = { + .func = bpf_get_socket_cookie_sock_ops, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb) { struct sock *sk = sk_to_full_sk(skb->sk); @@ -3588,6 +4284,26 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +BPF_CALL_5(bpf_sockopt_event_output, struct bpf_sock_ops_kern *, bpf_sock, + struct bpf_map *, map, u64, flags, void *, data, u64, size) +{ + if (unlikely(flags & ~(BPF_F_INDEX_MASK))) + return -EINVAL; + + return bpf_event_output(map, flags, data, size, NULL, 0, NULL); +} + +static const struct bpf_func_proto bpf_sockopt_event_output_proto = { + .func = bpf_sockopt_event_output, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_CONST_MAP_PTR, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_MEM, + .arg5_type = ARG_CONST_SIZE_OR_ZERO, +}; + BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, int, level, int, optname, char *, optval, int, optlen) { @@ -3607,15 +4323,21 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, switch (optname) { case SO_RCVBUF: val = min_t(u32, val, sysctl_rmem_max); + val = min_t(int, val, INT_MAX / 2); sk->sk_userlocks |= SOCK_RCVBUF_LOCK; sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF); break; case SO_SNDBUF: val = min_t(u32, val, sysctl_wmem_max); + val = min_t(int, val, INT_MAX / 2); sk->sk_userlocks |= SOCK_SNDBUF_LOCK; sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF); break; case SO_MAX_PACING_RATE: + if (val != ~0U) + cmpxchg(&sk->sk_pacing_status, + SK_PACING_NONE, + SK_PACING_NEEDED); sk->sk_max_pacing_rate = val; sk->sk_pacing_rate = min(sk->sk_pacing_rate, sk->sk_max_pacing_rate); @@ -3638,6 +4360,50 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, ret = -EINVAL; } #ifdef CONFIG_INET + } else if (level == SOL_IP) { + if (optlen != sizeof(int) || sk->sk_family != AF_INET) + return -EINVAL; + + val = *((int *)optval); + /* Only some options are supported */ + switch (optname) { + case IP_TOS: + if (val < -1 || val > 0xff) { + ret = -EINVAL; + } else { + struct inet_sock *inet = inet_sk(sk); + + if (val == -1) + val = 0; + inet->tos = val; + } + break; + default: + ret = -EINVAL; + } +#if IS_ENABLED(CONFIG_IPV6) + } else if (level == SOL_IPV6) { + if (optlen != sizeof(int) || sk->sk_family != AF_INET6) + return -EINVAL; + + val = *((int *)optval); + /* Only some options are supported */ + switch (optname) { + case IPV6_TCLASS: + if (val < -1 || val > 0xff) { + ret = -EINVAL; + } else { + struct ipv6_pinfo *np = inet6_sk(sk); + + if (val == -1) + val = 0; + np->tclass = val; + } + break; + default: + ret = -EINVAL; + } +#endif } else if (level == SOL_TCP && sk->sk_prot->setsockopt == tcp_setsockopt) { if (optname == TCP_CONGESTION) { @@ -3672,6 +4438,12 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, tp->snd_ssthresh = val; } break; + case TCP_SAVE_SYN: + if (val < 0 || val > 1) + ret = -EINVAL; + else + tp->save_syn = val; + break; default: ret = -EINVAL; } @@ -3685,7 +4457,7 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, static const struct bpf_func_proto bpf_setsockopt_proto = { .func = bpf_setsockopt, - .gpl_only = true, + .gpl_only = false, .ret_type = RET_INTEGER, .arg1_type = ARG_PTR_TO_CTX, .arg2_type = ARG_ANYTHING, @@ -3694,6 +4466,111 @@ static const struct bpf_func_proto bpf_setsockopt_proto = { .arg5_type = ARG_CONST_SIZE, }; +BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, + int, level, int, optname, char *, optval, int, optlen) +{ + struct sock *sk = bpf_sock->sk; + + if (!sk_fullsock(sk)) + goto err_clear; +#ifdef CONFIG_INET + if (level == SOL_TCP && sk->sk_prot->getsockopt == tcp_getsockopt) { + struct inet_connection_sock *icsk; + struct tcp_sock *tp; + + switch (optname) { + case TCP_CONGESTION: + icsk = inet_csk(sk); + + if (!icsk->icsk_ca_ops || optlen <= 1) + goto err_clear; + strncpy(optval, icsk->icsk_ca_ops->name, optlen); + optval[optlen - 1] = 0; + break; + case TCP_SAVED_SYN: + tp = tcp_sk(sk); + + if (optlen <= 0 || !tp->saved_syn || + optlen > tp->saved_syn[0]) + goto err_clear; + memcpy(optval, tp->saved_syn + 1, optlen); + break; + default: + goto err_clear; + } + } else if (level == SOL_IP) { + struct inet_sock *inet = inet_sk(sk); + + if (optlen != sizeof(int) || sk->sk_family != AF_INET) + goto err_clear; + + /* Only some options are supported */ + switch (optname) { + case IP_TOS: + *((int *)optval) = (int)inet->tos; + break; + default: + goto err_clear; + } +#if IS_ENABLED(CONFIG_IPV6) + } else if (level == SOL_IPV6) { + struct ipv6_pinfo *np = inet6_sk(sk); + + if (optlen != sizeof(int) || sk->sk_family != AF_INET6) + goto err_clear; + + /* Only some options are supported */ + switch (optname) { + case IPV6_TCLASS: + *((int *)optval) = (int)np->tclass; + break; + default: + goto err_clear; + } +#endif + } else { + goto err_clear; + } + return 0; +#endif +err_clear: + memset(optval, 0, optlen); + return -EINVAL; +} + +static const struct bpf_func_proto bpf_getsockopt_proto = { + .func = bpf_getsockopt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_UNINIT_MEM, + .arg5_type = ARG_CONST_SIZE, +}; + +BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock, + int, argval) +{ + struct sock *sk = bpf_sock->sk; + int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS; + + if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk)) + return -EINVAL; + + tcp_sk(sk)->bpf_sock_ops_cb_flags = val; + + return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS); +} + +static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = { + .func = bpf_sock_ops_cb_flags_set, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, +}; + const struct ipv6_bpf_stub *ipv6_bpf_stub __read_mostly; EXPORT_SYMBOL_GPL(ipv6_bpf_stub); @@ -3708,6 +4585,8 @@ BPF_CALL_3(bpf_bind, struct bpf_sock_addr_kern *, ctx, struct sockaddr *, addr, * Only binding to IP is supported. */ err = -EINVAL; + if (addr_len < offsetofend(struct sockaddr, sa_family)) + return err; if (addr->sa_family == AF_INET) { if (addr_len < sizeof(struct sockaddr_in)) return err; @@ -3740,6 +4619,1451 @@ static const struct bpf_func_proto bpf_bind_proto = { .arg3_type = ARG_CONST_SIZE, }; +#ifdef CONFIG_XFRM +BPF_CALL_5(bpf_skb_get_xfrm_state, struct sk_buff *, skb, u32, index, + struct bpf_xfrm_state *, to, u32, size, u64, flags) +{ + const struct sec_path *sp = skb_sec_path(skb); + const struct xfrm_state *x; + + if (!sp || unlikely(index >= sp->len || flags)) + goto err_clear; + + x = sp->xvec[index]; + + if (unlikely(size != sizeof(struct bpf_xfrm_state))) + goto err_clear; + + to->reqid = x->props.reqid; + to->spi = x->id.spi; + to->family = x->props.family; + to->ext = 0; + + if (to->family == AF_INET6) { + memcpy(to->remote_ipv6, x->props.saddr.a6, + sizeof(to->remote_ipv6)); + } else { + to->remote_ipv4 = x->props.saddr.a4; + memset(&to->remote_ipv6[1], 0, sizeof(__u32) * 3); + } + + return 0; +err_clear: + memset(to, 0, size); + return -EINVAL; +} + +static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = { + .func = bpf_skb_get_xfrm_state, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_UNINIT_MEM, + .arg4_type = ARG_CONST_SIZE, + .arg5_type = ARG_ANYTHING, +}; +#endif + +#if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6) +static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params, + const struct neighbour *neigh, + const struct net_device *dev) +{ + memcpy(params->dmac, neigh->ha, ETH_ALEN); + memcpy(params->smac, dev->dev_addr, ETH_ALEN); + params->h_vlan_TCI = 0; + params->h_vlan_proto = 0; + + return 0; +} +#endif + +#if IS_ENABLED(CONFIG_INET) +static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params, + u32 flags, bool check_mtu) +{ + struct in_device *in_dev; + struct neighbour *neigh; + struct net_device *dev; + struct fib_result res; + struct fib_nh *nh; + struct flowi4 fl4; + int err; + u32 mtu; + + dev = dev_get_by_index_rcu(net, params->ifindex); + if (unlikely(!dev)) + return -ENODEV; + + /* verify forwarding is enabled on this interface */ + in_dev = __in_dev_get_rcu(dev); + if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev))) + return BPF_FIB_LKUP_RET_FWD_DISABLED; + + if (flags & BPF_FIB_LOOKUP_OUTPUT) { + fl4.flowi4_iif = 1; + fl4.flowi4_oif = params->ifindex; + } else { + fl4.flowi4_iif = params->ifindex; + fl4.flowi4_oif = 0; + } + fl4.flowi4_tos = params->tos & IPTOS_RT_MASK; + fl4.flowi4_scope = RT_SCOPE_UNIVERSE; + fl4.flowi4_flags = 0; + + fl4.flowi4_proto = params->l4_protocol; + fl4.daddr = params->ipv4_dst; + fl4.saddr = params->ipv4_src; + fl4.fl4_sport = params->sport; + fl4.fl4_dport = params->dport; + + if (flags & BPF_FIB_LOOKUP_DIRECT) { + u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN; + struct fib_table *tb; + + tb = fib_get_table(net, tbid); + if (unlikely(!tb)) + return BPF_FIB_LKUP_RET_NOT_FWDED; + + err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF); + } else { + fl4.flowi4_mark = 0; + fl4.flowi4_secid = 0; + fl4.flowi4_tun_key.tun_id = 0; + fl4.flowi4_uid = sock_net_uid(net, NULL); + + err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF); + } + + if (err) { + /* map fib lookup errors to RTN_ type */ + if (err == -EINVAL) + return BPF_FIB_LKUP_RET_BLACKHOLE; + if (err == -EHOSTUNREACH) + return BPF_FIB_LKUP_RET_UNREACHABLE; + if (err == -EACCES) + return BPF_FIB_LKUP_RET_PROHIBIT; + + return BPF_FIB_LKUP_RET_NOT_FWDED; + } + + if (res.type != RTN_UNICAST) + return BPF_FIB_LKUP_RET_NOT_FWDED; + + if (res.fi->fib_nhs > 1) + fib_select_path(net, &res, &fl4, NULL); + + if (check_mtu) { + mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst); + if (params->tot_len > mtu) + return BPF_FIB_LKUP_RET_FRAG_NEEDED; + } + + nh = &res.fi->fib_nh[res.nh_sel]; + + /* do not handle lwt encaps right now */ + if (nh->nh_lwtstate) + return BPF_FIB_LKUP_RET_UNSUPP_LWT; + + dev = nh->nh_dev; + if (nh->nh_gw) + params->ipv4_dst = nh->nh_gw; + + params->rt_metric = res.fi->fib_priority; + params->ifindex = dev->ifindex; + + /* xdp and cls_bpf programs are run in RCU-bh so + * rcu_read_lock_bh is not needed here + */ + neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst); + if (!neigh || !(neigh->nud_state & NUD_VALID)) + return BPF_FIB_LKUP_RET_NO_NEIGH; + + return bpf_fib_set_fwd_params(params, neigh, dev); +} +#endif + +#if IS_ENABLED(CONFIG_IPV6) +static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params, + u32 flags, bool check_mtu) +{ + struct in6_addr *src = (struct in6_addr *) params->ipv6_src; + struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst; + struct neighbour *neigh; + struct net_device *dev; + struct inet6_dev *idev; + struct fib6_info *f6i; + struct flowi6 fl6; + int strict = 0; + int oif; + u32 mtu; + + /* link local addresses are never forwarded */ + if (rt6_need_strict(dst) || rt6_need_strict(src)) + return BPF_FIB_LKUP_RET_NOT_FWDED; + + dev = dev_get_by_index_rcu(net, params->ifindex); + if (unlikely(!dev)) + return -ENODEV; + + idev = __in6_dev_get_safely(dev); + if (unlikely(!idev || !idev->cnf.forwarding)) + return BPF_FIB_LKUP_RET_FWD_DISABLED; + + if (flags & BPF_FIB_LOOKUP_OUTPUT) { + fl6.flowi6_iif = 1; + oif = fl6.flowi6_oif = params->ifindex; + } else { + oif = fl6.flowi6_iif = params->ifindex; + fl6.flowi6_oif = 0; + strict = RT6_LOOKUP_F_HAS_SADDR; + } + fl6.flowlabel = params->flowinfo; + fl6.flowi6_scope = 0; + fl6.flowi6_flags = 0; + fl6.mp_hash = 0; + + fl6.flowi6_proto = params->l4_protocol; + fl6.daddr = *dst; + fl6.saddr = *src; + fl6.fl6_sport = params->sport; + fl6.fl6_dport = params->dport; + + if (flags & BPF_FIB_LOOKUP_DIRECT) { + u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN; + struct fib6_table *tb; + + tb = ipv6_stub->fib6_get_table(net, tbid); + if (unlikely(!tb)) + return BPF_FIB_LKUP_RET_NOT_FWDED; + + f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict); + } else { + fl6.flowi6_mark = 0; + fl6.flowi6_secid = 0; + fl6.flowi6_tun_key.tun_id = 0; + fl6.flowi6_uid = sock_net_uid(net, NULL); + + f6i = ipv6_stub->fib6_lookup(net, oif, &fl6, strict); + } + + if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry)) + return BPF_FIB_LKUP_RET_NOT_FWDED; + + if (unlikely(f6i->fib6_flags & RTF_REJECT)) { + switch (f6i->fib6_type) { + case RTN_BLACKHOLE: + return BPF_FIB_LKUP_RET_BLACKHOLE; + case RTN_UNREACHABLE: + return BPF_FIB_LKUP_RET_UNREACHABLE; + case RTN_PROHIBIT: + return BPF_FIB_LKUP_RET_PROHIBIT; + default: + return BPF_FIB_LKUP_RET_NOT_FWDED; + } + } + + if (f6i->fib6_type != RTN_UNICAST) + return BPF_FIB_LKUP_RET_NOT_FWDED; + + if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0) + f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6, + fl6.flowi6_oif, NULL, + strict); + + if (check_mtu) { + mtu = ipv6_stub->ip6_mtu_from_fib6(f6i, dst, src); + if (params->tot_len > mtu) + return BPF_FIB_LKUP_RET_FRAG_NEEDED; + } + + if (f6i->fib6_nh.nh_lwtstate) + return BPF_FIB_LKUP_RET_UNSUPP_LWT; + + if (f6i->fib6_flags & RTF_GATEWAY) + *dst = f6i->fib6_nh.nh_gw; + + dev = f6i->fib6_nh.nh_dev; + params->rt_metric = f6i->fib6_metric; + params->ifindex = dev->ifindex; + + /* xdp and cls_bpf programs are run in RCU-bh so rcu_read_lock_bh is + * not needed here. Can not use __ipv6_neigh_lookup_noref here + * because we need to get nd_tbl via the stub + */ + neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128, + ndisc_hashfn, dst, dev); + if (!neigh || !(neigh->nud_state & NUD_VALID)) + return BPF_FIB_LKUP_RET_NO_NEIGH; + + return bpf_fib_set_fwd_params(params, neigh, dev); +} +#endif + +BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx, + struct bpf_fib_lookup *, params, int, plen, u32, flags) +{ + if (plen < sizeof(*params)) + return -EINVAL; + + if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT)) + return -EINVAL; + + switch (params->family) { +#if IS_ENABLED(CONFIG_INET) + case AF_INET: + return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params, + flags, true); +#endif +#if IS_ENABLED(CONFIG_IPV6) + case AF_INET6: + return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params, + flags, true); +#endif + } + return -EAFNOSUPPORT; +} + +static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = { + .func = bpf_xdp_fib_lookup, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, +}; + +BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb, + struct bpf_fib_lookup *, params, int, plen, u32, flags) +{ + struct net *net = dev_net(skb->dev); + int rc = -EAFNOSUPPORT; + bool check_mtu = false; + + if (plen < sizeof(*params)) + return -EINVAL; + + if (flags & ~(BPF_FIB_LOOKUP_DIRECT | BPF_FIB_LOOKUP_OUTPUT)) + return -EINVAL; + + if (params->tot_len) + check_mtu = true; + + switch (params->family) { +#if IS_ENABLED(CONFIG_INET) + case AF_INET: + rc = bpf_ipv4_fib_lookup(net, params, flags, check_mtu); + break; +#endif +#if IS_ENABLED(CONFIG_IPV6) + case AF_INET6: + rc = bpf_ipv6_fib_lookup(net, params, flags, check_mtu); + break; +#endif + } + + if (rc == BPF_FIB_LKUP_RET_SUCCESS && !check_mtu) { + struct net_device *dev; + + /* When tot_len isn't provided by user, check skb + * against MTU of FIB lookup resulting net_device + */ + dev = dev_get_by_index_rcu(net, params->ifindex); + if (!is_skb_forwardable(dev, skb)) + rc = BPF_FIB_LKUP_RET_FRAG_NEEDED; + } + + return rc; +} + +static const struct bpf_func_proto bpf_skb_fib_lookup_proto = { + .func = bpf_skb_fib_lookup, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, +}; + +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF) +static int bpf_push_seg6_encap(struct sk_buff *skb, u32 type, void *hdr, u32 len) +{ + int err; + struct ipv6_sr_hdr *srh = (struct ipv6_sr_hdr *)hdr; + + if (!seg6_validate_srh(srh, len)) + return -EINVAL; + + switch (type) { + case BPF_LWT_ENCAP_SEG6_INLINE: + if (skb_protocol(skb, true) != htons(ETH_P_IPV6)) + return -EBADMSG; + + err = seg6_do_srh_inline(skb, srh); + break; + case BPF_LWT_ENCAP_SEG6: + skb_reset_inner_headers(skb); + skb->encapsulation = 1; + err = seg6_do_srh_encap(skb, srh, IPPROTO_IPV6); + break; + default: + return -EINVAL; + } + + bpf_compute_data_pointers(skb); + if (err) + return err; + + skb_set_transport_header(skb, sizeof(struct ipv6hdr)); + + return seg6_lookup_nexthop(skb, NULL, 0); +} +#endif /* CONFIG_IPV6_SEG6_BPF */ + +#if IS_ENABLED(CONFIG_LWTUNNEL_BPF) +static int bpf_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, + bool ingress) +{ + return bpf_lwt_push_ip_encap(skb, hdr, len, ingress); +} +#endif + +BPF_CALL_4(bpf_lwt_in_push_encap, struct sk_buff *, skb, u32, type, void *, hdr, + u32, len) +{ + switch (type) { +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF) + case BPF_LWT_ENCAP_SEG6: + case BPF_LWT_ENCAP_SEG6_INLINE: + return bpf_push_seg6_encap(skb, type, hdr, len); +#endif +#if IS_ENABLED(CONFIG_LWTUNNEL_BPF) + case BPF_LWT_ENCAP_IP: + return bpf_push_ip_encap(skb, hdr, len, true /* ingress */); +#endif + default: + return -EINVAL; + } +} + +BPF_CALL_4(bpf_lwt_xmit_push_encap, struct sk_buff *, skb, u32, type, + void *, hdr, u32, len) +{ + switch (type) { +#if IS_ENABLED(CONFIG_LWTUNNEL_BPF) + case BPF_LWT_ENCAP_IP: + return bpf_push_ip_encap(skb, hdr, len, false /* egress */); +#endif + default: + return -EINVAL; + } +} + +static const struct bpf_func_proto bpf_lwt_in_push_encap_proto = { + .func = bpf_lwt_in_push_encap, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_MEM, + .arg4_type = ARG_CONST_SIZE +}; + +static const struct bpf_func_proto bpf_lwt_xmit_push_encap_proto = { + .func = bpf_lwt_xmit_push_encap, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_MEM, + .arg4_type = ARG_CONST_SIZE +}; + +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF) +BPF_CALL_4(bpf_lwt_seg6_store_bytes, struct sk_buff *, skb, u32, offset, + const void *, from, u32, len) +{ + struct seg6_bpf_srh_state *srh_state = + this_cpu_ptr(&seg6_bpf_srh_states); + struct ipv6_sr_hdr *srh = srh_state->srh; + void *srh_tlvs, *srh_end, *ptr; + int srhoff = 0; + + if (srh == NULL) + return -EINVAL; + + srh_tlvs = (void *)((char *)srh + ((srh->first_segment + 1) << 4)); + srh_end = (void *)((char *)srh + sizeof(*srh) + srh_state->hdrlen); + + ptr = skb->data + offset; + if (ptr >= srh_tlvs && ptr + len <= srh_end) + srh_state->valid = false; + else if (ptr < (void *)&srh->flags || + ptr + len > (void *)&srh->segments) + return -EFAULT; + + if (unlikely(bpf_try_make_writable(skb, offset + len))) + return -EFAULT; + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0) + return -EINVAL; + srh_state->srh = (struct ipv6_sr_hdr *)(skb->data + srhoff); + + memcpy(skb->data + offset, from, len); + return 0; +} + +static const struct bpf_func_proto bpf_lwt_seg6_store_bytes_proto = { + .func = bpf_lwt_seg6_store_bytes, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_MEM, + .arg4_type = ARG_CONST_SIZE +}; + +static void bpf_update_srh_state(struct sk_buff *skb) +{ + struct seg6_bpf_srh_state *srh_state = + this_cpu_ptr(&seg6_bpf_srh_states); + int srhoff = 0; + + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0) { + srh_state->srh = NULL; + } else { + srh_state->srh = (struct ipv6_sr_hdr *)(skb->data + srhoff); + srh_state->hdrlen = srh_state->srh->hdrlen << 3; + srh_state->valid = true; + } +} + +BPF_CALL_4(bpf_lwt_seg6_action, struct sk_buff *, skb, + u32, action, void *, param, u32, param_len) +{ + struct seg6_bpf_srh_state *srh_state = + this_cpu_ptr(&seg6_bpf_srh_states); + int hdroff = 0; + int err; + + switch (action) { + case SEG6_LOCAL_ACTION_END_X: + if (!seg6_bpf_has_valid_srh(skb)) + return -EBADMSG; + if (param_len != sizeof(struct in6_addr)) + return -EINVAL; + return seg6_lookup_nexthop(skb, (struct in6_addr *)param, 0); + case SEG6_LOCAL_ACTION_END_T: + if (!seg6_bpf_has_valid_srh(skb)) + return -EBADMSG; + if (param_len != sizeof(int)) + return -EINVAL; + return seg6_lookup_nexthop(skb, NULL, *(int *)param); + case SEG6_LOCAL_ACTION_END_DT6: + if (!seg6_bpf_has_valid_srh(skb)) + return -EBADMSG; + if (param_len != sizeof(int)) + return -EINVAL; + + if (ipv6_find_hdr(skb, &hdroff, IPPROTO_IPV6, NULL, NULL) < 0) + return -EBADMSG; + if (!pskb_pull(skb, hdroff)) + return -EBADMSG; + + skb_postpull_rcsum(skb, skb_network_header(skb), hdroff); + skb_reset_network_header(skb); + skb_reset_transport_header(skb); + skb->encapsulation = 0; + + bpf_compute_data_pointers(skb); + bpf_update_srh_state(skb); + return seg6_lookup_nexthop(skb, NULL, *(int *)param); + case SEG6_LOCAL_ACTION_END_B6: + if (srh_state->srh && !seg6_bpf_has_valid_srh(skb)) + return -EBADMSG; + err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6_INLINE, + param, param_len); + if (!err) + bpf_update_srh_state(skb); + + return err; + case SEG6_LOCAL_ACTION_END_B6_ENCAP: + if (srh_state->srh && !seg6_bpf_has_valid_srh(skb)) + return -EBADMSG; + err = bpf_push_seg6_encap(skb, BPF_LWT_ENCAP_SEG6, + param, param_len); + if (!err) + bpf_update_srh_state(skb); + + return err; + default: + return -EINVAL; + } +} + +static const struct bpf_func_proto bpf_lwt_seg6_action_proto = { + .func = bpf_lwt_seg6_action, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_MEM, + .arg4_type = ARG_CONST_SIZE +}; + +BPF_CALL_3(bpf_lwt_seg6_adjust_srh, struct sk_buff *, skb, u32, offset, + s32, len) +{ + struct seg6_bpf_srh_state *srh_state = + this_cpu_ptr(&seg6_bpf_srh_states); + struct ipv6_sr_hdr *srh = srh_state->srh; + void *srh_end, *srh_tlvs, *ptr; + struct ipv6hdr *hdr; + int srhoff = 0; + int ret; + + if (unlikely(srh == NULL)) + return -EINVAL; + + srh_tlvs = (void *)((unsigned char *)srh + sizeof(*srh) + + ((srh->first_segment + 1) << 4)); + srh_end = (void *)((unsigned char *)srh + sizeof(*srh) + + srh_state->hdrlen); + ptr = skb->data + offset; + + if (unlikely(ptr < srh_tlvs || ptr > srh_end)) + return -EFAULT; + if (unlikely(len < 0 && (void *)((char *)ptr - len) > srh_end)) + return -EFAULT; + + if (len > 0) { + ret = skb_cow_head(skb, len); + if (unlikely(ret < 0)) + return ret; + + ret = bpf_skb_net_hdr_push(skb, offset, len); + } else { + ret = bpf_skb_net_hdr_pop(skb, offset, -1 * len); + } + + bpf_compute_data_pointers(skb); + if (unlikely(ret < 0)) + return ret; + + hdr = (struct ipv6hdr *)skb->data; + hdr->payload_len = htons(skb->len - sizeof(struct ipv6hdr)); + + if (ipv6_find_hdr(skb, &srhoff, IPPROTO_ROUTING, NULL, NULL) < 0) + return -EINVAL; + srh_state->srh = (struct ipv6_sr_hdr *)(skb->data + srhoff); + srh_state->hdrlen += len; + srh_state->valid = false; + return 0; +} + +static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = { + .func = bpf_lwt_seg6_adjust_srh, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, +}; +#endif /* CONFIG_IPV6_SEG6_BPF */ + +#ifdef CONFIG_INET +static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple, + int dif, int sdif, u8 family, u8 proto) +{ + bool refcounted = false; + struct sock *sk = NULL; + + if (family == AF_INET) { + __be32 src4 = tuple->ipv4.saddr; + __be32 dst4 = tuple->ipv4.daddr; + + if (proto == IPPROTO_TCP) + sk = __inet_lookup(net, &tcp_hashinfo, NULL, 0, + src4, tuple->ipv4.sport, + dst4, tuple->ipv4.dport, + dif, sdif, &refcounted); + else + sk = __udp4_lib_lookup(net, src4, tuple->ipv4.sport, + dst4, tuple->ipv4.dport, + dif, sdif, &udp_table, NULL); +#if IS_ENABLED(CONFIG_IPV6) + } else { + struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr; + struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr; + + if (proto == IPPROTO_TCP) + sk = __inet6_lookup(net, &tcp_hashinfo, NULL, 0, + src6, tuple->ipv6.sport, + dst6, ntohs(tuple->ipv6.dport), + dif, sdif, &refcounted); + else if (likely(ipv6_bpf_stub)) + sk = ipv6_bpf_stub->udp6_lib_lookup(net, + src6, tuple->ipv6.sport, + dst6, tuple->ipv6.dport, + dif, sdif, + &udp_table, NULL); +#endif + } + + if (unlikely(sk && !refcounted && !sock_flag(sk, SOCK_RCU_FREE))) { + WARN_ONCE(1, "Found non-RCU, unreferenced socket!"); + sk = NULL; + } + return sk; +} + +/* bpf_skc_lookup performs the core lookup for different types of sockets, + * taking a reference on the socket if it doesn't have the flag SOCK_RCU_FREE. + * Returns the socket as an 'unsigned long' to simplify the casting in the + * callers to satisfy BPF_CALL declarations. + */ +static struct sock * +__bpf_skc_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len, + struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id, + u64 flags) +{ + struct sock *sk = NULL; + u8 family = AF_UNSPEC; + struct net *net; + int sdif; + + if (len == sizeof(tuple->ipv4)) + family = AF_INET; + else if (len == sizeof(tuple->ipv6)) + family = AF_INET6; + else + return NULL; + + if (unlikely(family == AF_UNSPEC || flags || + !((s32)netns_id < 0 || netns_id <= S32_MAX))) + goto out; + + if (family == AF_INET) + sdif = inet_sdif(skb); + else + sdif = inet6_sdif(skb); + + if ((s32)netns_id < 0) { + net = caller_net; + sk = sk_lookup(net, tuple, ifindex, sdif, family, proto); + } else { + net = get_net_ns_by_id(caller_net, netns_id); + if (unlikely(!net)) + goto out; + sk = sk_lookup(net, tuple, ifindex, sdif, family, proto); + put_net(net); + } + +out: + return sk; +} + +static struct sock * +__bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len, + struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id, + u64 flags) +{ + struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net, + ifindex, proto, netns_id, flags); + + if (sk) { + struct sock *sk2 = sk_to_full_sk(sk); + + /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk + * sock refcnt is decremented to prevent a request_sock leak. + */ + if (!sk_fullsock(sk2)) + sk2 = NULL; + if (sk2 != sk) { + sock_gen_put(sk); + /* Ensure there is no need to bump sk2 refcnt */ + if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) { + WARN_ONCE(1, "Found non-RCU, unreferenced socket!"); + return NULL; + } + sk = sk2; + } + } + + return sk; +} + +static struct sock * +bpf_skc_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len, + u8 proto, u64 netns_id, u64 flags) +{ + struct net *caller_net; + int ifindex; + + if (skb->dev) { + caller_net = dev_net(skb->dev); + ifindex = skb->dev->ifindex; + } else { + caller_net = sock_net(skb->sk); + ifindex = 0; + } + + return __bpf_skc_lookup(skb, tuple, len, caller_net, ifindex, proto, + netns_id, flags); +} + +static struct sock * +bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len, + u8 proto, u64 netns_id, u64 flags) +{ + struct sock *sk = bpf_skc_lookup(skb, tuple, len, proto, netns_id, + flags); + + if (sk) { + struct sock *sk2 = sk_to_full_sk(sk); + + /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk + * sock refcnt is decremented to prevent a request_sock leak. + */ + if (!sk_fullsock(sk2)) + sk2 = NULL; + if (sk2 != sk) { + sock_gen_put(sk); + /* Ensure there is no need to bump sk2 refcnt */ + if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) { + WARN_ONCE(1, "Found non-RCU, unreferenced socket!"); + return NULL; + } + sk = sk2; + } + } + + return sk; +} + +BPF_CALL_5(bpf_skc_lookup_tcp, struct sk_buff *, skb, + struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) +{ + return (unsigned long)bpf_skc_lookup(skb, tuple, len, IPPROTO_TCP, + netns_id, flags); +} + +static const struct bpf_func_proto bpf_skc_lookup_tcp_proto = { + .func = bpf_skc_lookup_tcp, + .gpl_only = false, + .pkt_access = true, + .ret_type = RET_PTR_TO_SOCK_COMMON_OR_NULL, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +BPF_CALL_5(bpf_sk_lookup_tcp, struct sk_buff *, skb, + struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) +{ + return (unsigned long)bpf_sk_lookup(skb, tuple, len, IPPROTO_TCP, + netns_id, flags); +} + +static const struct bpf_func_proto bpf_sk_lookup_tcp_proto = { + .func = bpf_sk_lookup_tcp, + .gpl_only = false, + .pkt_access = true, + .ret_type = RET_PTR_TO_SOCKET_OR_NULL, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +BPF_CALL_5(bpf_sk_lookup_udp, struct sk_buff *, skb, + struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) +{ + return (unsigned long)bpf_sk_lookup(skb, tuple, len, IPPROTO_UDP, + netns_id, flags); +} + +static const struct bpf_func_proto bpf_sk_lookup_udp_proto = { + .func = bpf_sk_lookup_udp, + .gpl_only = false, + .pkt_access = true, + .ret_type = RET_PTR_TO_SOCKET_OR_NULL, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +BPF_CALL_1(bpf_sk_release, struct sock *, sk) +{ + /* Only full sockets have sk->sk_flags. */ + if (!sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE)) + sock_gen_put(sk); + return 0; +} + +static const struct bpf_func_proto bpf_sk_release_proto = { + .func = bpf_sk_release, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_SOCK_COMMON, +}; + +BPF_CALL_5(bpf_xdp_sk_lookup_udp, struct xdp_buff *, ctx, + struct bpf_sock_tuple *, tuple, u32, len, u32, netns_id, u64, flags) +{ + struct net *caller_net = dev_net(ctx->rxq->dev); + int ifindex = ctx->rxq->dev->ifindex; + + return (unsigned long)__bpf_sk_lookup(NULL, tuple, len, caller_net, + ifindex, IPPROTO_UDP, netns_id, + flags); +} + +static const struct bpf_func_proto bpf_xdp_sk_lookup_udp_proto = { + .func = bpf_xdp_sk_lookup_udp, + .gpl_only = false, + .pkt_access = true, + .ret_type = RET_PTR_TO_SOCKET_OR_NULL, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +BPF_CALL_5(bpf_xdp_skc_lookup_tcp, struct xdp_buff *, ctx, + struct bpf_sock_tuple *, tuple, u32, len, u32, netns_id, u64, flags) +{ + struct net *caller_net = dev_net(ctx->rxq->dev); + int ifindex = ctx->rxq->dev->ifindex; + + return (unsigned long)__bpf_skc_lookup(NULL, tuple, len, caller_net, + ifindex, IPPROTO_TCP, netns_id, + flags); +} + +static const struct bpf_func_proto bpf_xdp_skc_lookup_tcp_proto = { + .func = bpf_xdp_skc_lookup_tcp, + .gpl_only = false, + .pkt_access = true, + .ret_type = RET_PTR_TO_SOCK_COMMON_OR_NULL, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +BPF_CALL_5(bpf_xdp_sk_lookup_tcp, struct xdp_buff *, ctx, + struct bpf_sock_tuple *, tuple, u32, len, u32, netns_id, u64, flags) +{ + struct net *caller_net = dev_net(ctx->rxq->dev); + int ifindex = ctx->rxq->dev->ifindex; + + return (unsigned long)__bpf_sk_lookup(NULL, tuple, len, caller_net, + ifindex, IPPROTO_TCP, netns_id, + flags); +} + +static const struct bpf_func_proto bpf_xdp_sk_lookup_tcp_proto = { + .func = bpf_xdp_sk_lookup_tcp, + .gpl_only = false, + .pkt_access = true, + .ret_type = RET_PTR_TO_SOCKET_OR_NULL, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +BPF_CALL_5(bpf_sock_addr_skc_lookup_tcp, struct bpf_sock_addr_kern *, ctx, + struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) +{ + return (unsigned long)__bpf_skc_lookup(NULL, tuple, len, + sock_net(ctx->sk), 0, + IPPROTO_TCP, netns_id, flags); +} + +static const struct bpf_func_proto bpf_sock_addr_skc_lookup_tcp_proto = { + .func = bpf_sock_addr_skc_lookup_tcp, + .gpl_only = false, + .ret_type = RET_PTR_TO_SOCK_COMMON_OR_NULL, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +BPF_CALL_5(bpf_sock_addr_sk_lookup_tcp, struct bpf_sock_addr_kern *, ctx, + struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) +{ + return (unsigned long)__bpf_sk_lookup(NULL, tuple, len, + sock_net(ctx->sk), 0, IPPROTO_TCP, + netns_id, flags); +} + +static const struct bpf_func_proto bpf_sock_addr_sk_lookup_tcp_proto = { + .func = bpf_sock_addr_sk_lookup_tcp, + .gpl_only = false, + .ret_type = RET_PTR_TO_SOCKET_OR_NULL, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +BPF_CALL_5(bpf_sock_addr_sk_lookup_udp, struct bpf_sock_addr_kern *, ctx, + struct bpf_sock_tuple *, tuple, u32, len, u64, netns_id, u64, flags) +{ + return (unsigned long)__bpf_sk_lookup(NULL, tuple, len, + sock_net(ctx->sk), 0, IPPROTO_UDP, + netns_id, flags); +} + +static const struct bpf_func_proto bpf_sock_addr_sk_lookup_udp_proto = { + .func = bpf_sock_addr_sk_lookup_udp, + .gpl_only = false, + .ret_type = RET_PTR_TO_SOCKET_OR_NULL, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, + .arg5_type = ARG_ANYTHING, +}; + +bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, + icsk_retransmits)) + return false; + + if (off % size != 0) + return false; + + switch (off) { + case offsetof(struct bpf_tcp_sock, bytes_received): + case offsetof(struct bpf_tcp_sock, bytes_acked): + return size == sizeof(__u64); + default: + return size == sizeof(__u32); + } +} + +u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, u32 *target_size) +{ + struct bpf_insn *insn = insn_buf; + +#define BPF_TCP_SOCK_GET_COMMON(FIELD) \ + do { \ + BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD) > \ + FIELD_SIZEOF(struct bpf_tcp_sock, FIELD)); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_sock, FIELD),\ + si->dst_reg, si->src_reg, \ + offsetof(struct tcp_sock, FIELD)); \ + } while (0) + +#define BPF_INET_SOCK_GET_COMMON(FIELD) \ + do { \ + BUILD_BUG_ON(FIELD_SIZEOF(struct inet_connection_sock, \ + FIELD) > \ + FIELD_SIZEOF(struct bpf_tcp_sock, FIELD)); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ + struct inet_connection_sock, \ + FIELD), \ + si->dst_reg, si->src_reg, \ + offsetof( \ + struct inet_connection_sock, \ + FIELD)); \ + } while (0) + + if (insn > insn_buf) + return insn - insn_buf; + + switch (si->off) { + case offsetof(struct bpf_tcp_sock, rtt_min): + BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) != + sizeof(struct minmax)); + BUILD_BUG_ON(sizeof(struct minmax) < + sizeof(struct minmax_sample)); + + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, + offsetof(struct tcp_sock, rtt_min) + + offsetof(struct minmax_sample, v)); + break; + case offsetof(struct bpf_tcp_sock, snd_cwnd): + BPF_TCP_SOCK_GET_COMMON(snd_cwnd); + break; + case offsetof(struct bpf_tcp_sock, srtt_us): + BPF_TCP_SOCK_GET_COMMON(srtt_us); + break; + case offsetof(struct bpf_tcp_sock, snd_ssthresh): + BPF_TCP_SOCK_GET_COMMON(snd_ssthresh); + break; + case offsetof(struct bpf_tcp_sock, rcv_nxt): + BPF_TCP_SOCK_GET_COMMON(rcv_nxt); + break; + case offsetof(struct bpf_tcp_sock, snd_nxt): + BPF_TCP_SOCK_GET_COMMON(snd_nxt); + break; + case offsetof(struct bpf_tcp_sock, snd_una): + BPF_TCP_SOCK_GET_COMMON(snd_una); + break; + case offsetof(struct bpf_tcp_sock, mss_cache): + BPF_TCP_SOCK_GET_COMMON(mss_cache); + break; + case offsetof(struct bpf_tcp_sock, ecn_flags): + BPF_TCP_SOCK_GET_COMMON(ecn_flags); + break; + case offsetof(struct bpf_tcp_sock, rate_delivered): + BPF_TCP_SOCK_GET_COMMON(rate_delivered); + break; + case offsetof(struct bpf_tcp_sock, rate_interval_us): + BPF_TCP_SOCK_GET_COMMON(rate_interval_us); + break; + case offsetof(struct bpf_tcp_sock, packets_out): + BPF_TCP_SOCK_GET_COMMON(packets_out); + break; + case offsetof(struct bpf_tcp_sock, retrans_out): + BPF_TCP_SOCK_GET_COMMON(retrans_out); + break; + case offsetof(struct bpf_tcp_sock, total_retrans): + BPF_TCP_SOCK_GET_COMMON(total_retrans); + break; + case offsetof(struct bpf_tcp_sock, segs_in): + BPF_TCP_SOCK_GET_COMMON(segs_in); + break; + case offsetof(struct bpf_tcp_sock, data_segs_in): + BPF_TCP_SOCK_GET_COMMON(data_segs_in); + break; + case offsetof(struct bpf_tcp_sock, segs_out): + BPF_TCP_SOCK_GET_COMMON(segs_out); + break; + case offsetof(struct bpf_tcp_sock, data_segs_out): + BPF_TCP_SOCK_GET_COMMON(data_segs_out); + break; + case offsetof(struct bpf_tcp_sock, lost_out): + BPF_TCP_SOCK_GET_COMMON(lost_out); + break; + case offsetof(struct bpf_tcp_sock, sacked_out): + BPF_TCP_SOCK_GET_COMMON(sacked_out); + break; + case offsetof(struct bpf_tcp_sock, bytes_received): + BPF_TCP_SOCK_GET_COMMON(bytes_received); + break; + case offsetof(struct bpf_tcp_sock, bytes_acked): + BPF_TCP_SOCK_GET_COMMON(bytes_acked); + break; + case offsetof(struct bpf_tcp_sock, dsack_dups): + BPF_TCP_SOCK_GET_COMMON(dsack_dups); + break; + case offsetof(struct bpf_tcp_sock, delivered): + BPF_TCP_SOCK_GET_COMMON(delivered); + break; + case offsetof(struct bpf_tcp_sock, delivered_ce): + BPF_TCP_SOCK_GET_COMMON(delivered_ce); + break; + case offsetof(struct bpf_tcp_sock, icsk_retransmits): + BPF_INET_SOCK_GET_COMMON(icsk_retransmits); + break; + } + + return insn - insn_buf; +} + +BPF_CALL_1(bpf_tcp_sock, struct sock *, sk) +{ + if (sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP) + return (unsigned long)sk; + + return (unsigned long)NULL; +} + +const struct bpf_func_proto bpf_tcp_sock_proto = { + .func = bpf_tcp_sock, + .gpl_only = false, + .ret_type = RET_PTR_TO_TCP_SOCK_OR_NULL, + .arg1_type = ARG_PTR_TO_SOCK_COMMON, +}; + +BPF_CALL_1(bpf_get_listener_sock, struct sock *, sk) +{ + sk = sk_to_full_sk(sk); + + if (sk->sk_state == TCP_LISTEN && sock_flag(sk, SOCK_RCU_FREE)) + return (unsigned long)sk; + + return (unsigned long)NULL; +} + +static const struct bpf_func_proto bpf_get_listener_sock_proto = { + .func = bpf_get_listener_sock, + .gpl_only = false, + .ret_type = RET_PTR_TO_SOCKET_OR_NULL, + .arg1_type = ARG_PTR_TO_SOCK_COMMON, +}; + +BPF_CALL_1(bpf_skb_ecn_set_ce, struct sk_buff *, skb) +{ + unsigned int iphdr_len; + + if (skb->protocol == cpu_to_be16(ETH_P_IP)) + iphdr_len = sizeof(struct iphdr); + else if (skb->protocol == cpu_to_be16(ETH_P_IPV6)) + iphdr_len = sizeof(struct ipv6hdr); + else + return 0; + + if (skb_headlen(skb) < iphdr_len) + return 0; + + if (skb_cloned(skb) && !skb_clone_writable(skb, iphdr_len)) + return 0; + + return INET_ECN_set_ce(skb); +} + +bool bpf_xdp_sock_is_valid_access(int off, int size, enum bpf_access_type type, + struct bpf_insn_access_aux *info) +{ + if (off < 0 || off >= offsetofend(struct bpf_xdp_sock, queue_id)) + return false; + + if (off % size != 0) + return false; + + switch (off) { + default: + return size == sizeof(__u32); + } +} + +u32 bpf_xdp_sock_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, u32 *target_size) +{ + struct bpf_insn *insn = insn_buf; + +#define BPF_XDP_SOCK_GET(FIELD) \ + do { \ + BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_sock, FIELD) > \ + FIELD_SIZEOF(struct bpf_xdp_sock, FIELD)); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_sock, FIELD),\ + si->dst_reg, si->src_reg, \ + offsetof(struct xdp_sock, FIELD)); \ + } while (0) + + switch (si->off) { + case offsetof(struct bpf_xdp_sock, queue_id): + BPF_XDP_SOCK_GET(queue_id); + break; + } + + return insn - insn_buf; +} + +static const struct bpf_func_proto bpf_skb_ecn_set_ce_proto = { + .func = bpf_skb_ecn_set_ce, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + +BPF_CALL_5(bpf_tcp_check_syncookie, struct sock *, sk, void *, iph, u32, iph_len, + struct tcphdr *, th, u32, th_len) +{ +#ifdef CONFIG_SYN_COOKIES + u32 cookie; + int ret; + + if (unlikely(th_len < sizeof(*th))) + return -EINVAL; + + /* sk_listener() allows TCP_NEW_SYN_RECV, which makes no sense here. */ + if (sk->sk_protocol != IPPROTO_TCP || sk->sk_state != TCP_LISTEN) + return -EINVAL; + + if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies) + return -EINVAL; + + if (!th->ack || th->rst || th->syn) + return -ENOENT; + + if (unlikely(iph_len < sizeof(struct iphdr))) + return -EINVAL; + + if (tcp_synq_no_recent_overflow(sk)) + return -ENOENT; + + cookie = ntohl(th->ack_seq) - 1; + + /* Both struct iphdr and struct ipv6hdr have the version field at the + * same offset so we can cast to the shorter header (struct iphdr). + */ + switch (((struct iphdr *)iph)->version) { + case 4: + if (sk->sk_family == AF_INET6 && ipv6_only_sock(sk)) + return -EINVAL; + + ret = __cookie_v4_check((struct iphdr *)iph, th, cookie); + break; + +#if IS_BUILTIN(CONFIG_IPV6) + case 6: + if (unlikely(iph_len < sizeof(struct ipv6hdr))) + return -EINVAL; + + if (sk->sk_family != AF_INET6) + return -EINVAL; + + ret = __cookie_v6_check((struct ipv6hdr *)iph, th, cookie); + break; +#endif /* CONFIG_IPV6 */ + + default: + return -EPROTONOSUPPORT; + } + + if (ret > 0) + return 0; + + return -ENOENT; +#else + return -ENOTSUPP; +#endif +} + +static const struct bpf_func_proto bpf_tcp_check_syncookie_proto = { + .func = bpf_tcp_check_syncookie, + .gpl_only = true, + .pkt_access = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_SOCK_COMMON, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_PTR_TO_MEM, + .arg5_type = ARG_CONST_SIZE, +}; + +BPF_CALL_5(bpf_tcp_gen_syncookie, struct sock *, sk, void *, iph, u32, iph_len, + struct tcphdr *, th, u32, th_len) +{ +#ifdef CONFIG_SYN_COOKIES + u32 cookie; + u16 mss; + + if (unlikely(th_len < sizeof(*th) || th_len != th->doff * 4)) + return -EINVAL; + + if (sk->sk_protocol != IPPROTO_TCP || sk->sk_state != TCP_LISTEN) + return -EINVAL; + + if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies) + return -ENOENT; + + if (!th->syn || th->ack || th->fin || th->rst) + return -EINVAL; + + if (unlikely(iph_len < sizeof(struct iphdr))) + return -EINVAL; + + /* Both struct iphdr and struct ipv6hdr have the version field at the + * same offset so we can cast to the shorter header (struct iphdr). + */ + switch (((struct iphdr *)iph)->version) { + case 4: + if (sk->sk_family == AF_INET6 && sk->sk_ipv6only) + return -EINVAL; + + mss = tcp_v4_get_syncookie(sk, iph, th, &cookie); + break; + +#if IS_BUILTIN(CONFIG_IPV6) + case 6: + if (unlikely(iph_len < sizeof(struct ipv6hdr))) + return -EINVAL; + + if (sk->sk_family != AF_INET6) + return -EINVAL; + + mss = tcp_v6_get_syncookie(sk, iph, th, &cookie); + break; +#endif /* CONFIG_IPV6 */ + + default: + return -EPROTONOSUPPORT; + } + if (mss == 0) + return -ENOENT; + + return cookie | ((u64)mss << 32); +#else + return -EOPNOTSUPP; +#endif /* CONFIG_SYN_COOKIES */ +} + +static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = { + .func = bpf_tcp_gen_syncookie, + .gpl_only = true, /* __cookie_v*_init_sequence() is GPL */ + .pkt_access = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_SOCK_COMMON, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_PTR_TO_MEM, + .arg5_type = ARG_CONST_SIZE, +}; + +#endif /* CONFIG_INET */ + +bool bpf_helper_changes_pkt_data(void *func) +{ + if (func == bpf_skb_vlan_push || + func == bpf_skb_vlan_pop || + func == bpf_skb_store_bytes || + func == bpf_skb_change_proto || + func == bpf_skb_change_head || + func == sk_skb_change_head || + func == bpf_skb_change_tail || + func == sk_skb_change_tail || + func == bpf_skb_adjust_room || + func == bpf_skb_pull_data || + func == sk_skb_pull_data || + func == bpf_clone_redirect || + func == bpf_l3_csum_replace || + func == bpf_l4_csum_replace || + func == bpf_xdp_adjust_head || + func == bpf_xdp_adjust_meta || + func == bpf_msg_pull_data || + func == bpf_msg_push_data || + func == bpf_msg_pop_data || + func == bpf_xdp_adjust_tail || +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF) + func == bpf_lwt_seg6_store_bytes || + func == bpf_lwt_seg6_adjust_srh || + func == bpf_lwt_seg6_action || +#endif + func == bpf_lwt_in_push_encap || + func == bpf_lwt_xmit_push_encap) + return true; + + return false; +} + static const struct bpf_func_proto * bpf_base_func_proto(enum bpf_func_id func_id) { @@ -3750,6 +6074,12 @@ bpf_base_func_proto(enum bpf_func_id func_id) return &bpf_map_update_elem_proto; case BPF_FUNC_map_delete_elem: return &bpf_map_delete_elem_proto; + case BPF_FUNC_map_push_elem: + return &bpf_map_push_elem_proto; + case BPF_FUNC_map_pop_elem: + return &bpf_map_pop_elem_proto; + case BPF_FUNC_map_peek_elem: + return &bpf_map_peek_elem_proto; case BPF_FUNC_get_prandom_u32: return &bpf_get_prandom_u32_proto; case BPF_FUNC_get_smp_processor_id: @@ -3765,8 +6095,10 @@ bpf_base_func_proto(enum bpf_func_id func_id) default: break; } + if (!capable(CAP_SYS_ADMIN)) return NULL; + switch (func_id) { case BPF_FUNC_spin_lock: return &bpf_spin_lock_proto; @@ -3804,8 +6136,6 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) */ case BPF_FUNC_get_current_uid_gid: return &bpf_get_current_uid_gid_proto; - case BPF_FUNC_get_local_storage: - return &bpf_get_local_storage_proto; case BPF_FUNC_bind: switch (prog->expected_attach_type) { case BPF_CGROUP_INET4_CONNECT: @@ -3814,6 +6144,10 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) default: return NULL; } + case BPF_FUNC_get_socket_cookie: + return &bpf_get_socket_cookie_sock_addr_proto; + case BPF_FUNC_get_local_storage: + return &bpf_get_local_storage_proto; #ifdef CONFIG_INET case BPF_FUNC_sk_lookup_tcp: return &bpf_sock_addr_sk_lookup_tcp_proto; @@ -3821,7 +6155,13 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_sock_addr_sk_lookup_udp_proto; case BPF_FUNC_sk_release: return &bpf_sk_release_proto; + case BPF_FUNC_skc_lookup_tcp: + return &bpf_sock_addr_skc_lookup_tcp_proto; #endif /* CONFIG_INET */ + case BPF_FUNC_sk_storage_get: + return &bpf_sk_storage_get_proto; + case BPF_FUNC_sk_storage_delete: + return &bpf_sk_storage_delete_proto; default: return bpf_base_func_proto(func_id); } @@ -3839,6 +6179,8 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_socket_cookie_proto; case BPF_FUNC_get_socket_uid: return &bpf_get_socket_uid_proto; + case BPF_FUNC_perf_event_output: + return &bpf_skb_event_output_proto; default: return bpf_base_func_proto(func_id); } @@ -3859,9 +6201,19 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_sk_storage_get_proto; case BPF_FUNC_sk_storage_delete: return &bpf_sk_storage_delete_proto; + case BPF_FUNC_perf_event_output: + return &bpf_skb_event_output_proto; +#ifdef CONFIG_SOCK_CGROUP_DATA + case BPF_FUNC_skb_cgroup_id: + return &bpf_skb_cgroup_id_proto; +#endif #ifdef CONFIG_INET case BPF_FUNC_tcp_sock: return &bpf_tcp_sock_proto; + case BPF_FUNC_get_listener_sock: + return &bpf_get_listener_sock_proto; + case BPF_FUNC_skb_ecn_set_ce: + return &bpf_skb_ecn_set_ce_proto; #endif default: return sk_filter_func_proto(func_id, prog); @@ -3934,12 +6286,25 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_socket_cookie_proto; case BPF_FUNC_get_socket_uid: return &bpf_get_socket_uid_proto; + case BPF_FUNC_fib_lookup: + return &bpf_skb_fib_lookup_proto; case BPF_FUNC_sk_fullsock: return &bpf_sk_fullsock_proto; case BPF_FUNC_sk_storage_get: return &bpf_sk_storage_get_proto; case BPF_FUNC_sk_storage_delete: return &bpf_sk_storage_delete_proto; +#ifdef CONFIG_XFRM + case BPF_FUNC_skb_get_xfrm_state: + return &bpf_skb_get_xfrm_state_proto; +#endif +#ifdef CONFIG_SOCK_CGROUP_DATA + case BPF_FUNC_skb_cgroup_id: + return &bpf_skb_cgroup_id_proto; + case BPF_FUNC_skb_ancestor_cgroup_id: + return &bpf_skb_ancestor_cgroup_id_proto; +#endif +#ifdef CONFIG_INET case BPF_FUNC_sk_lookup_tcp: return &bpf_sk_lookup_tcp_proto; case BPF_FUNC_sk_lookup_udp: @@ -3948,6 +6313,17 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_sk_release_proto; case BPF_FUNC_tcp_sock: return &bpf_tcp_sock_proto; + case BPF_FUNC_get_listener_sock: + return &bpf_get_listener_sock_proto; + case BPF_FUNC_skc_lookup_tcp: + return &bpf_skc_lookup_tcp_proto; + case BPF_FUNC_tcp_check_syncookie: + return &bpf_tcp_check_syncookie_proto; + case BPF_FUNC_skb_ecn_set_ce: + return &bpf_skb_ecn_set_ce_proto; + case BPF_FUNC_tcp_gen_syncookie: + return &bpf_tcp_gen_syncookie_proto; +#endif default: return bpf_base_func_proto(func_id); } @@ -3961,6 +6337,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_xdp_event_output_proto; case BPF_FUNC_get_smp_processor_id: return &bpf_get_smp_processor_id_proto; + case BPF_FUNC_csum_diff: + return &bpf_csum_diff_proto; case BPF_FUNC_xdp_adjust_head: return &bpf_xdp_adjust_head_proto; case BPF_FUNC_xdp_adjust_meta: @@ -3969,6 +6347,24 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_xdp_redirect_proto; case BPF_FUNC_redirect_map: return &bpf_xdp_redirect_map_proto; + case BPF_FUNC_xdp_adjust_tail: + return &bpf_xdp_adjust_tail_proto; + case BPF_FUNC_fib_lookup: + return &bpf_xdp_fib_lookup_proto; +#ifdef CONFIG_INET + case BPF_FUNC_sk_lookup_udp: + return &bpf_xdp_sk_lookup_udp_proto; + case BPF_FUNC_sk_lookup_tcp: + return &bpf_xdp_sk_lookup_tcp_proto; + case BPF_FUNC_sk_release: + return &bpf_sk_release_proto; + case BPF_FUNC_skc_lookup_tcp: + return &bpf_xdp_skc_lookup_tcp_proto; + case BPF_FUNC_tcp_check_syncookie: + return &bpf_tcp_check_syncookie_proto; + case BPF_FUNC_tcp_gen_syncookie: + return &bpf_tcp_gen_syncookie_proto; +#endif default: return bpf_base_func_proto(func_id); } @@ -3977,45 +6373,34 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) const struct bpf_func_proto bpf_sock_map_update_proto __weak; const struct bpf_func_proto bpf_sock_hash_update_proto __weak; -static const struct bpf_func_proto * -lwt_inout_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) -{ - switch (func_id) { - case BPF_FUNC_skb_load_bytes: - return &bpf_skb_load_bytes_proto; - case BPF_FUNC_skb_pull_data: - return &bpf_skb_pull_data_proto; - case BPF_FUNC_csum_diff: - return &bpf_csum_diff_proto; - case BPF_FUNC_get_cgroup_classid: - return &bpf_get_cgroup_classid_proto; - case BPF_FUNC_get_route_realm: - return &bpf_get_route_realm_proto; - case BPF_FUNC_get_hash_recalc: - return &bpf_get_hash_recalc_proto; - case BPF_FUNC_perf_event_output: - return &bpf_skb_event_output_proto; - case BPF_FUNC_get_smp_processor_id: - return &bpf_get_smp_processor_id_proto; - case BPF_FUNC_skb_under_cgroup: - return &bpf_skb_under_cgroup_proto; - default: - return bpf_base_func_proto(func_id); - } -} - static const struct bpf_func_proto * sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { switch (func_id) { case BPF_FUNC_setsockopt: return &bpf_setsockopt_proto; + case BPF_FUNC_getsockopt: + return &bpf_getsockopt_proto; + case BPF_FUNC_sock_ops_cb_flags_set: + return &bpf_sock_ops_cb_flags_set_proto; case BPF_FUNC_sock_map_update: return &bpf_sock_map_update_proto; case BPF_FUNC_sock_hash_update: return &bpf_sock_hash_update_proto; + case BPF_FUNC_get_socket_cookie: + return &bpf_get_socket_cookie_sock_ops_proto; case BPF_FUNC_get_local_storage: return &bpf_get_local_storage_proto; + case BPF_FUNC_perf_event_output: + return &bpf_sockopt_event_output_proto; + case BPF_FUNC_sk_storage_get: + return &bpf_sk_storage_get_proto; + case BPF_FUNC_sk_storage_delete: + return &bpf_sk_storage_delete_proto; +#ifdef CONFIG_INET + case BPF_FUNC_tcp_sock: + return &bpf_tcp_sock_proto; +#endif /* CONFIG_INET */ default: return bpf_base_func_proto(func_id); } @@ -4024,15 +6409,24 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) const struct bpf_func_proto bpf_msg_redirect_map_proto __weak; const struct bpf_func_proto bpf_msg_redirect_hash_proto __weak; -static const struct bpf_func_proto *sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +static const struct bpf_func_proto * +sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { switch (func_id) { case BPF_FUNC_msg_redirect_map: return &bpf_msg_redirect_map_proto; case BPF_FUNC_msg_redirect_hash: return &bpf_msg_redirect_hash_proto; - case BPF_FUNC_get_local_storage: - return &bpf_get_local_storage_proto; + case BPF_FUNC_msg_apply_bytes: + return &bpf_msg_apply_bytes_proto; + case BPF_FUNC_msg_cork_bytes: + return &bpf_msg_cork_bytes_proto; + case BPF_FUNC_msg_pull_data: + return &bpf_msg_pull_data_proto; + case BPF_FUNC_msg_push_data: + return &bpf_msg_push_data_proto; + case BPF_FUNC_msg_pop_data: + return &bpf_msg_pop_data_proto; default: return bpf_base_func_proto(func_id); } @@ -4063,14 +6457,18 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_sk_redirect_map_proto; case BPF_FUNC_sk_redirect_hash: return &bpf_sk_redirect_hash_proto; - case BPF_FUNC_get_local_storage: - return &bpf_get_local_storage_proto; + case BPF_FUNC_perf_event_output: + return &bpf_skb_event_output_proto; +#ifdef CONFIG_INET case BPF_FUNC_sk_lookup_tcp: return &bpf_sk_lookup_tcp_proto; case BPF_FUNC_sk_lookup_udp: return &bpf_sk_lookup_udp_proto; case BPF_FUNC_sk_release: return &bpf_sk_release_proto; + case BPF_FUNC_skc_lookup_tcp: + return &bpf_skc_lookup_tcp_proto; +#endif default: return bpf_base_func_proto(func_id); } @@ -4081,12 +6479,50 @@ flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { switch (func_id) { case BPF_FUNC_skb_load_bytes: - return &bpf_skb_load_bytes_proto; + return &bpf_flow_dissector_load_bytes_proto; default: return bpf_base_func_proto(func_id); } } +static const struct bpf_func_proto * +lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_skb_load_bytes: + return &bpf_skb_load_bytes_proto; + case BPF_FUNC_skb_pull_data: + return &bpf_skb_pull_data_proto; + case BPF_FUNC_csum_diff: + return &bpf_csum_diff_proto; + case BPF_FUNC_get_cgroup_classid: + return &bpf_get_cgroup_classid_proto; + case BPF_FUNC_get_route_realm: + return &bpf_get_route_realm_proto; + case BPF_FUNC_get_hash_recalc: + return &bpf_get_hash_recalc_proto; + case BPF_FUNC_perf_event_output: + return &bpf_skb_event_output_proto; + case BPF_FUNC_get_smp_processor_id: + return &bpf_get_smp_processor_id_proto; + case BPF_FUNC_skb_under_cgroup: + return &bpf_skb_under_cgroup_proto; + default: + return bpf_base_func_proto(func_id); + } +} + +static const struct bpf_func_proto * +lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_lwt_push_encap: + return &bpf_lwt_in_push_encap_proto; + default: + return lwt_out_func_proto(func_id, prog); + } +} + static const struct bpf_func_proto * lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@ -4117,16 +6553,27 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_l4_csum_replace_proto; case BPF_FUNC_set_hash_invalid: return &bpf_set_hash_invalid_proto; -#ifdef CONFIG_INET - case BPF_FUNC_sk_lookup_udp: - return &bpf_xdp_sk_lookup_udp_proto; - case BPF_FUNC_sk_lookup_tcp: - return &bpf_xdp_sk_lookup_tcp_proto; - case BPF_FUNC_sk_release: - return &bpf_sk_release_proto; + case BPF_FUNC_lwt_push_encap: + return &bpf_lwt_xmit_push_encap_proto; + default: + return lwt_out_func_proto(func_id, prog); + } +} + +static const struct bpf_func_proto * +lwt_seg6local_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { +#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF) + case BPF_FUNC_lwt_seg6_store_bytes: + return &bpf_lwt_seg6_store_bytes_proto; + case BPF_FUNC_lwt_seg6_action: + return &bpf_lwt_seg6_action_proto; + case BPF_FUNC_lwt_seg6_adjust_srh: + return &bpf_lwt_seg6_adjust_srh_proto; #endif default: - return lwt_inout_func_proto(func_id, prog); + return lwt_out_func_proto(func_id, prog); } } @@ -4159,6 +6606,8 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type return false; break; case bpf_ctx_range_ptr(struct __sk_buff, flow_keys): + return false; + case bpf_ctx_range(struct __sk_buff, tstamp): if (size != sizeof(__u64)) return false; break; @@ -4166,9 +6615,6 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type if (type == BPF_WRITE || size != sizeof(__u64)) return false; info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL; - case bpf_ctx_range(struct __sk_buff, tstamp): - if (size != sizeof(__u64)) - return false; break; default: /* Only narrow read access allowed for now. */ @@ -4195,9 +6641,9 @@ static bool sk_filter_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, data): case bpf_ctx_range(struct __sk_buff, data_meta): case bpf_ctx_range(struct __sk_buff, data_end): - case bpf_ctx_range_ptr(struct __sk_buff, flow_keys): case bpf_ctx_range_till(struct __sk_buff, family, local_port): case bpf_ctx_range(struct __sk_buff, tstamp): + case bpf_ctx_range(struct __sk_buff, wire_len): return false; } @@ -4213,6 +6659,50 @@ static bool sk_filter_is_valid_access(int off, int size, return bpf_skb_is_valid_access(off, size, type, prog, info); } +static bool cg_skb_is_valid_access(int off, int size, + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + switch (off) { + case bpf_ctx_range(struct __sk_buff, tc_classid): + case bpf_ctx_range(struct __sk_buff, data_meta): + case bpf_ctx_range(struct __sk_buff, wire_len): + return false; + case bpf_ctx_range(struct __sk_buff, data): + case bpf_ctx_range(struct __sk_buff, data_end): + if (!capable(CAP_SYS_ADMIN)) + return false; + break; + } + + if (type == BPF_WRITE) { + switch (off) { + case bpf_ctx_range(struct __sk_buff, mark): + case bpf_ctx_range(struct __sk_buff, priority): + case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]): + break; + case bpf_ctx_range(struct __sk_buff, tstamp): + if (!capable(CAP_SYS_ADMIN)) + return false; + break; + default: + return false; + } + } + + switch (off) { + case bpf_ctx_range(struct __sk_buff, data): + info->reg_type = PTR_TO_PACKET; + break; + case bpf_ctx_range(struct __sk_buff, data_end): + info->reg_type = PTR_TO_PACKET_END; + break; + } + + return bpf_skb_is_valid_access(off, size, type, prog, info); +} + static bool lwt_is_valid_access(int off, int size, enum bpf_access_type type, const struct bpf_prog *prog, @@ -4220,10 +6710,10 @@ static bool lwt_is_valid_access(int off, int size, { switch (off) { case bpf_ctx_range(struct __sk_buff, tc_classid): - case bpf_ctx_range_ptr(struct __sk_buff, flow_keys): case bpf_ctx_range_till(struct __sk_buff, family, local_port): case bpf_ctx_range(struct __sk_buff, data_meta): case bpf_ctx_range(struct __sk_buff, tstamp): + case bpf_ctx_range(struct __sk_buff, wire_len): return false; } @@ -4250,7 +6740,6 @@ static bool lwt_is_valid_access(int off, int size, return bpf_skb_is_valid_access(off, size, type, prog, info); } - /* Attach type specific accesses */ static bool __sock_filter_check_attach_type(int off, enum bpf_access_type access_type, @@ -4295,21 +6784,6 @@ full_access: return true; } -static bool __sock_filter_check_size(int off, int size, - struct bpf_insn_access_aux *info) -{ - const int size_default = sizeof(__u32); - - switch (off) { - case bpf_ctx_range(struct bpf_sock, src_ip4): - case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]): - bpf_ctx_record_field_size(info, size_default); - return bpf_ctx_narrow_access_ok(off, size, size_default); - } - - return size == size_default; -} - bool bpf_sock_common_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info) @@ -4325,13 +6799,37 @@ bool bpf_sock_common_is_valid_access(int off, int size, bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info) { + const int size_default = sizeof(__u32); + int field_size; + if (off < 0 || off >= sizeof(struct bpf_sock)) return false; if (off % size != 0) return false; - if (!__sock_filter_check_size(off, size, info)) + + switch (off) { + case offsetof(struct bpf_sock, state): + case offsetof(struct bpf_sock, family): + case offsetof(struct bpf_sock, type): + case offsetof(struct bpf_sock, protocol): + case offsetof(struct bpf_sock, src_port): + case bpf_ctx_range(struct bpf_sock, src_ip4): + case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]): + case bpf_ctx_range(struct bpf_sock, dst_ip4): + case bpf_ctx_range_till(struct bpf_sock, dst_ip6[0], dst_ip6[3]): + bpf_ctx_record_field_size(info, size_default); + return bpf_ctx_narrow_access_ok(off, size, size_default); + case bpf_ctx_range(struct bpf_sock, dst_port): + field_size = size == size_default ? + size_default : sizeof_field(struct bpf_sock, dst_port); + bpf_ctx_record_field_size(info, field_size); + return bpf_ctx_narrow_access_ok(off, size, field_size); + case offsetofend(struct bpf_sock, dst_port) ... + offsetof(struct bpf_sock, dst_ip4) - 1: return false; - return true; + } + + return size == size_default; } static bool sock_filter_is_valid_access(int off, int size, @@ -4345,6 +6843,15 @@ static bool sock_filter_is_valid_access(int off, int size, prog->expected_attach_type); } +static int bpf_noop_prologue(struct bpf_insn *insn_buf, bool direct_write, + const struct bpf_prog *prog) +{ + /* Neither direct read nor direct write requires any preliminary + * action. + */ + return 0; +} + static int bpf_unclone_prologue(struct bpf_insn *insn_buf, bool direct_write, const struct bpf_prog *prog, int drop_verdict) { @@ -4384,6 +6891,41 @@ static int bpf_unclone_prologue(struct bpf_insn *insn_buf, bool direct_write, return insn - insn_buf; } +static int bpf_gen_ld_abs(const struct bpf_insn *orig, + struct bpf_insn *insn_buf) +{ + bool indirect = BPF_MODE(orig->code) == BPF_IND; + struct bpf_insn *insn = insn_buf; + + if (!indirect) { + *insn++ = BPF_MOV64_IMM(BPF_REG_2, orig->imm); + } else { + *insn++ = BPF_MOV64_REG(BPF_REG_2, orig->src_reg); + if (orig->imm) + *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, orig->imm); + } + /* We're guaranteed here that CTX is in R6. */ + *insn++ = BPF_MOV64_REG(BPF_REG_1, BPF_REG_CTX); + + switch (BPF_SIZE(orig->code)) { + case BPF_B: + *insn++ = BPF_EMIT_CALL(bpf_skb_load_helper_8_no_cache); + break; + case BPF_H: + *insn++ = BPF_EMIT_CALL(bpf_skb_load_helper_16_no_cache); + break; + case BPF_W: + *insn++ = BPF_EMIT_CALL(bpf_skb_load_helper_32_no_cache); + break; + } + + *insn++ = BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 2); + *insn++ = BPF_ALU32_REG(BPF_XOR, BPF_REG_0, BPF_REG_0); + *insn++ = BPF_EXIT_INSN(); + + return insn - insn_buf; +} + static int tc_cls_act_prologue(struct bpf_insn *insn_buf, bool direct_write, const struct bpf_prog *prog) { @@ -4420,7 +6962,6 @@ static bool tc_cls_act_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, data_end): info->reg_type = PTR_TO_PACKET_END; break; - case bpf_ctx_range_ptr(struct __sk_buff, flow_keys): case bpf_ctx_range_till(struct __sk_buff, family, local_port): return false; } @@ -4445,8 +6986,15 @@ static bool xdp_is_valid_access(int off, int size, const struct bpf_prog *prog, struct bpf_insn_access_aux *info) { - if (type == BPF_WRITE) + if (type == BPF_WRITE) { + if (bpf_prog_is_dev_bound(prog->aux)) { + switch (off) { + case offsetof(struct xdp_md, rx_queue_index): + return __is_valid_xdp_access(off, size); + } + } return false; + } switch (off) { case offsetof(struct xdp_md, data): @@ -4536,12 +7084,32 @@ static bool sock_addr_is_valid_access(int off, int size, case bpf_ctx_range(struct bpf_sock_addr, msg_src_ip4): case bpf_ctx_range_till(struct bpf_sock_addr, msg_src_ip6[0], msg_src_ip6[3]): - /* Only narrow read access allowed for now. */ if (type == BPF_READ) { bpf_ctx_record_field_size(info, size_default); + + if (bpf_ctx_wide_access_ok(off, size, + struct bpf_sock_addr, + user_ip6)) + return true; + + if (bpf_ctx_wide_access_ok(off, size, + struct bpf_sock_addr, + msg_src_ip6)) + return true; + if (!bpf_ctx_narrow_access_ok(off, size, size_default)) return false; } else { + if (bpf_ctx_wide_access_ok(off, size, + struct bpf_sock_addr, + user_ip6)) + return true; + + if (bpf_ctx_wide_access_ok(off, size, + struct bpf_sock_addr, + msg_src_ip6)) + return true; + if (size != size_default) return false; } @@ -4550,6 +7118,13 @@ static bool sock_addr_is_valid_access(int off, int size, if (size != size_default) return false; break; + case offsetof(struct bpf_sock_addr, sk): + if (type != BPF_READ) + return false; + if (size != sizeof(__u64)) + return false; + info->reg_type = PTR_TO_SOCKET; + break; default: if (type == BPF_READ) { if (size != size_default) @@ -4562,35 +7137,50 @@ static bool sock_addr_is_valid_access(int off, int size, return true; } -static bool __is_valid_sock_ops_access(int off, int size) -{ - if (off < 0 || off >= sizeof(struct bpf_sock_ops)) - return false; - /* The verifier guarantees that size > 0. */ - if (off % size != 0) - return false; - if (size != sizeof(__u32)) - return false; - - return true; -} - static bool sock_ops_is_valid_access(int off, int size, enum bpf_access_type type, const struct bpf_prog *prog, struct bpf_insn_access_aux *info) { + const int size_default = sizeof(__u32); + + if (off < 0 || off >= sizeof(struct bpf_sock_ops)) + return false; + + /* The verifier guarantees that size > 0. */ + if (off % size != 0) + return false; + if (type == BPF_WRITE) { switch (off) { - case offsetof(struct bpf_sock_ops, op) ... - offsetof(struct bpf_sock_ops, replylong[3]): + case offsetof(struct bpf_sock_ops, reply): + case offsetof(struct bpf_sock_ops, sk_txhash): + if (size != size_default) + return false; break; default: return false; } + } else { + switch (off) { + case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received, + bytes_acked): + if (size != sizeof(__u64)) + return false; + break; + case offsetof(struct bpf_sock_ops, sk): + if (size != sizeof(__u64)) + return false; + info->reg_type = PTR_TO_SOCKET_OR_NULL; + break; + default: + if (size != size_default) + return false; + break; + } } - return __is_valid_sock_ops_access(off, size); + return true; } static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write, @@ -4608,6 +7198,7 @@ static bool sk_skb_is_valid_access(int off, int size, case bpf_ctx_range(struct __sk_buff, tc_classid): case bpf_ctx_range(struct __sk_buff, data_meta): case bpf_ctx_range(struct __sk_buff, tstamp): + case bpf_ctx_range(struct __sk_buff, wire_len): return false; } @@ -4623,7 +7214,7 @@ static bool sk_skb_is_valid_access(int off, int size, switch (off) { case bpf_ctx_range(struct __sk_buff, mark): - case bpf_ctx_range_ptr(struct __sk_buff, flow_keys): + return false; case bpf_ctx_range(struct __sk_buff, data): info->reg_type = PTR_TO_PACKET; break; @@ -4635,31 +7226,42 @@ static bool sk_skb_is_valid_access(int off, int size, return bpf_skb_is_valid_access(off, size, type, prog, info); } - static bool sk_msg_is_valid_access(int off, int size, - enum bpf_access_type type, - const struct bpf_prog *prog, - struct bpf_insn_access_aux *info) + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) { if (type == BPF_WRITE) return false; + if (off % size != 0) + return false; + switch (off) { case offsetof(struct sk_msg_md, data): info->reg_type = PTR_TO_PACKET; + if (size != sizeof(__u64)) + return false; break; case offsetof(struct sk_msg_md, data_end): info->reg_type = PTR_TO_PACKET_END; + if (size != sizeof(__u64)) + return false; break; + case bpf_ctx_range(struct sk_msg_md, family): + case bpf_ctx_range(struct sk_msg_md, remote_ip4): + case bpf_ctx_range(struct sk_msg_md, local_ip4): + case bpf_ctx_range_till(struct sk_msg_md, remote_ip6[0], remote_ip6[3]): + case bpf_ctx_range_till(struct sk_msg_md, local_ip6[0], local_ip6[3]): + case bpf_ctx_range(struct sk_msg_md, remote_port): + case bpf_ctx_range(struct sk_msg_md, local_port): + case bpf_ctx_range(struct sk_msg_md, size): + if (size != sizeof(__u32)) + return false; + break; + default: + return false; } - - if (off < 0 || off >= sizeof(struct sk_msg_md)) - return false; - if (off % size != 0) - return false; - if (size != sizeof(__u64)) - return false; - return true; } @@ -4668,32 +7270,86 @@ static bool flow_dissector_is_valid_access(int off, int size, const struct bpf_prog *prog, struct bpf_insn_access_aux *info) { - if (type == BPF_WRITE) { - switch (off) { - case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]): - break; - default: - return false; - } - } + const int size_default = sizeof(__u32); + + if (off < 0 || off >= sizeof(struct __sk_buff)) + return false; + + if (type == BPF_WRITE) + return false; switch (off) { case bpf_ctx_range(struct __sk_buff, data): + if (size != size_default) + return false; info->reg_type = PTR_TO_PACKET; - break; + return true; case bpf_ctx_range(struct __sk_buff, data_end): + if (size != size_default) + return false; info->reg_type = PTR_TO_PACKET_END; - break; + return true; case bpf_ctx_range_ptr(struct __sk_buff, flow_keys): + if (size != sizeof(__u64)) + return false; info->reg_type = PTR_TO_FLOW_KEYS; - break; - case bpf_ctx_range(struct __sk_buff, tc_classid): - case bpf_ctx_range_till(struct __sk_buff, family, local_port): - case bpf_ctx_range(struct __sk_buff, tstamp): + return true; + default: return false; } +} - return bpf_skb_is_valid_access(off, size, type, prog, info); +static u32 flow_dissector_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, + u32 *target_size) + +{ + struct bpf_insn *insn = insn_buf; + + switch (si->off) { + case offsetof(struct __sk_buff, data): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_flow_dissector, data), + si->dst_reg, si->src_reg, + offsetof(struct bpf_flow_dissector, data)); + break; + + case offsetof(struct __sk_buff, data_end): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_flow_dissector, data_end), + si->dst_reg, si->src_reg, + offsetof(struct bpf_flow_dissector, data_end)); + break; + + case offsetof(struct __sk_buff, flow_keys): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_flow_dissector, flow_keys), + si->dst_reg, si->src_reg, + offsetof(struct bpf_flow_dissector, flow_keys)); + break; + } + + return insn - insn_buf; +} + +static struct bpf_insn *bpf_convert_shinfo_access(const struct bpf_insn *si, + struct bpf_insn *insn) +{ + /* si->dst_reg = skb_shinfo(SKB); */ +#ifdef NET_SKBUFF_DATA_USES_OFFSET + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, end), + BPF_REG_AX, si->src_reg, + offsetof(struct sk_buff, end)); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, head), + si->dst_reg, si->src_reg, + offsetof(struct sk_buff, head)); + *insn++ = BPF_ALU64_REG(BPF_ADD, si->dst_reg, BPF_REG_AX); +#else + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, end), + si->dst_reg, si->src_reg, + offsetof(struct sk_buff, end)); +#endif + + return insn; } static u32 bpf_convert_ctx_access(enum bpf_access_type type, @@ -4999,19 +7655,6 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, bpf_target_off(struct sock_common, skc_num, 2, target_size)); break; - case offsetof(struct __sk_buff, flow_keys): - off = si->off; - off -= offsetof(struct __sk_buff, flow_keys); - off += offsetof(struct sk_buff, cb); - off += offsetof(struct qdisc_skb_cb, flow_keys); - *insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, - si->src_reg, off); - break; - case offsetof(struct __sk_buff, sk): - *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, sk), - si->dst_reg, si->src_reg, - offsetof(struct sk_buff, sk)); - break; case offsetof(struct __sk_buff, tstamp): BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tstamp) != 8); @@ -5028,6 +7671,40 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type, bpf_target_off(struct sk_buff, tstamp, 8, target_size)); + break; + + case offsetof(struct __sk_buff, gso_segs): + insn = bpf_convert_shinfo_access(si, insn); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct skb_shared_info, gso_segs), + si->dst_reg, si->dst_reg, + bpf_target_off(struct skb_shared_info, + gso_segs, 2, + target_size)); + break; + case offsetof(struct __sk_buff, gso_size): + insn = bpf_convert_shinfo_access(si, insn); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct skb_shared_info, gso_size), + si->dst_reg, si->dst_reg, + bpf_target_off(struct skb_shared_info, + gso_size, 2, + target_size)); + break; + case offsetof(struct __sk_buff, wire_len): + BUILD_BUG_ON(FIELD_SIZEOF(struct qdisc_skb_cb, pkt_len) != 4); + + off = si->off; + off -= offsetof(struct __sk_buff, wire_len); + off += offsetof(struct sk_buff, cb); + off += offsetof(struct qdisc_skb_cb, pkt_len); + *target_size = 4; + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, off); + break; + + case offsetof(struct __sk_buff, sk): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_buff, sk)); + break; } return insn - insn_buf; @@ -5076,24 +7753,32 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type, break; case offsetof(struct bpf_sock, family): - BUILD_BUG_ON(FIELD_SIZEOF(struct sock, sk_family) != 2); - - *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg, - offsetof(struct sock, sk_family)); + *insn++ = BPF_LDX_MEM( + BPF_FIELD_SIZEOF(struct sock_common, skc_family), + si->dst_reg, si->src_reg, + bpf_target_off(struct sock_common, + skc_family, + FIELD_SIZEOF(struct sock_common, + skc_family), + target_size)); break; case offsetof(struct bpf_sock, type): + BUILD_BUG_ON(HWEIGHT32(SK_FL_TYPE_MASK) != BITS_PER_BYTE * 2); *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, offsetof(struct sock, __sk_flags_offset)); *insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_TYPE_MASK); *insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_TYPE_SHIFT); + *target_size = 2; break; case offsetof(struct bpf_sock, protocol): + BUILD_BUG_ON(HWEIGHT32(SK_FL_PROTO_MASK) != BITS_PER_BYTE); *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, offsetof(struct sock, __sk_flags_offset)); *insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_PROTO_MASK); *insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, SK_FL_PROTO_SHIFT); + *target_size = 1; break; case offsetof(struct bpf_sock, src_ip4): @@ -5105,6 +7790,15 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type, target_size)); break; + case offsetof(struct bpf_sock, dst_ip4): + *insn++ = BPF_LDX_MEM( + BPF_SIZE(si->code), si->dst_reg, si->src_reg, + bpf_target_off(struct sock_common, skc_daddr, + FIELD_SIZEOF(struct sock_common, + skc_daddr), + target_size)); + break; + case bpf_ctx_range_till(struct bpf_sock, src_ip6[0], src_ip6[3]): #if IS_ENABLED(CONFIG_IPV6) off = si->off; @@ -5123,6 +7817,23 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type, #endif break; + case bpf_ctx_range_till(struct bpf_sock, dst_ip6[0], dst_ip6[3]): +#if IS_ENABLED(CONFIG_IPV6) + off = si->off; + off -= offsetof(struct bpf_sock, dst_ip6[0]); + *insn++ = BPF_LDX_MEM( + BPF_SIZE(si->code), si->dst_reg, si->src_reg, + bpf_target_off(struct sock_common, + skc_v6_daddr.s6_addr32[0], + FIELD_SIZEOF(struct sock_common, + skc_v6_daddr.s6_addr32[0]), + target_size) + off); +#else + *insn++ = BPF_MOV32_IMM(si->dst_reg, 0); + *target_size = 4; +#endif + break; + case offsetof(struct bpf_sock, src_port): *insn++ = BPF_LDX_MEM( BPF_FIELD_SIZEOF(struct sock_common, skc_num), @@ -5132,6 +7843,26 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type, skc_num), target_size)); break; + + case offsetof(struct bpf_sock, dst_port): + *insn++ = BPF_LDX_MEM( + BPF_FIELD_SIZEOF(struct sock_common, skc_dport), + si->dst_reg, si->src_reg, + bpf_target_off(struct sock_common, skc_dport, + FIELD_SIZEOF(struct sock_common, + skc_dport), + target_size)); + break; + + case offsetof(struct bpf_sock, state): + *insn++ = BPF_LDX_MEM( + BPF_FIELD_SIZEOF(struct sock_common, skc_state), + si->dst_reg, si->src_reg, + bpf_target_off(struct sock_common, skc_state, + FIELD_SIZEOF(struct sock_common, + skc_state), + target_size)); + break; } return insn - insn_buf; @@ -5184,6 +7915,24 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type, si->dst_reg, si->src_reg, offsetof(struct xdp_buff, data_end)); break; + case offsetof(struct xdp_md, ingress_ifindex): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq), + si->dst_reg, si->src_reg, + offsetof(struct xdp_buff, rxq)); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_rxq_info, dev), + si->dst_reg, si->dst_reg, + offsetof(struct xdp_rxq_info, dev)); + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, + offsetof(struct net_device, ifindex)); + break; + case offsetof(struct xdp_md, rx_queue_index): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq), + si->dst_reg, si->src_reg, + offsetof(struct xdp_buff, rxq)); + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, + offsetof(struct xdp_rxq_info, + queue_index)); + break; } return insn - insn_buf; @@ -5217,9 +7966,6 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type, /* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to * SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation. * - * It doesn't support SIZE argument though since narrow stores are not - * supported for now. - * * In addition it uses Temporary Field TF (member of struct S) as the 3rd * "register" since two registers available in convert_ctx_access are not * enough: we can't override neither SRC, since it contains value to store, nor @@ -5227,7 +7973,7 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type, * instructions. But we need a temporary place to save pointer to nested * structure whose field we want to store to. */ -#define SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, OFF, TF) \ +#define SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, SIZE, OFF, TF) \ do { \ int tmp_reg = BPF_REG_9; \ if (si->src_reg == tmp_reg || si->dst_reg == tmp_reg) \ @@ -5238,8 +7984,7 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type, offsetof(S, TF)); \ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), tmp_reg, \ si->dst_reg, offsetof(S, F)); \ - *insn++ = BPF_STX_MEM( \ - BPF_FIELD_SIZEOF(NS, NF), tmp_reg, si->src_reg, \ + *insn++ = BPF_STX_MEM(SIZE, tmp_reg, si->src_reg, \ bpf_target_off(NS, NF, FIELD_SIZEOF(NS, NF), \ target_size) \ + OFF); \ @@ -5251,8 +7996,8 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type, TF) \ do { \ if (type == BPF_WRITE) { \ - SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, OFF, \ - TF); \ + SOCK_ADDR_STORE_NESTED_FIELD_OFF(S, NS, F, NF, SIZE, \ + OFF, TF); \ } else { \ SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF( \ S, NS, F, NF, SIZE, OFF); \ @@ -5347,6 +8092,11 @@ static u32 sock_addr_convert_ctx_access(enum bpf_access_type type, struct bpf_sock_addr_kern, struct in6_addr, t_ctx, s6_addr32[0], BPF_SIZE(si->code), off, tmp_reg); break; + case offsetof(struct bpf_sock_addr, sk): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_addr_kern, sk), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_addr_kern, sk)); + break; } return insn - insn_buf; @@ -5361,6 +8111,119 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, struct bpf_insn *insn = insn_buf; int off; +/* Helper macro for adding read access to tcp_sock or sock fields. */ +#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) \ + do { \ + BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) > \ + FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD)); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ + struct bpf_sock_ops_kern, \ + is_fullsock), \ + si->dst_reg, si->src_reg, \ + offsetof(struct bpf_sock_ops_kern, \ + is_fullsock)); \ + *insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 2); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ + struct bpf_sock_ops_kern, sk),\ + si->dst_reg, si->src_reg, \ + offsetof(struct bpf_sock_ops_kern, sk));\ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ, \ + OBJ_FIELD), \ + si->dst_reg, si->dst_reg, \ + offsetof(OBJ, OBJ_FIELD)); \ + } while (0) + +#define SOCK_OPS_GET_SK() \ + do { \ + int fullsock_reg = si->dst_reg, reg = BPF_REG_9, jmp = 1; \ + if (si->dst_reg == reg || si->src_reg == reg) \ + reg--; \ + if (si->dst_reg == reg || si->src_reg == reg) \ + reg--; \ + if (si->dst_reg == si->src_reg) { \ + *insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, reg, \ + offsetof(struct bpf_sock_ops_kern, \ + temp)); \ + fullsock_reg = reg; \ + jmp += 2; \ + } \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ + struct bpf_sock_ops_kern, \ + is_fullsock), \ + fullsock_reg, si->src_reg, \ + offsetof(struct bpf_sock_ops_kern, \ + is_fullsock)); \ + *insn++ = BPF_JMP_IMM(BPF_JEQ, fullsock_reg, 0, jmp); \ + if (si->dst_reg == si->src_reg) \ + *insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \ + offsetof(struct bpf_sock_ops_kern, \ + temp)); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ + struct bpf_sock_ops_kern, sk),\ + si->dst_reg, si->src_reg, \ + offsetof(struct bpf_sock_ops_kern, sk));\ + if (si->dst_reg == si->src_reg) { \ + *insn++ = BPF_JMP_A(1); \ + *insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \ + offsetof(struct bpf_sock_ops_kern, \ + temp)); \ + } \ + } while (0) + +#define SOCK_OPS_GET_TCP_SOCK_FIELD(FIELD) \ + SOCK_OPS_GET_FIELD(FIELD, FIELD, struct tcp_sock) + +/* Helper macro for adding write access to tcp_sock or sock fields. + * The macro is called with two registers, dst_reg which contains a pointer + * to ctx (context) and src_reg which contains the value that should be + * stored. However, we need an additional register since we cannot overwrite + * dst_reg because it may be used later in the program. + * Instead we "borrow" one of the other register. We first save its value + * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore + * it at the end of the macro. + */ +#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) \ + do { \ + int reg = BPF_REG_9; \ + BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) > \ + FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD)); \ + if (si->dst_reg == reg || si->src_reg == reg) \ + reg--; \ + if (si->dst_reg == reg || si->src_reg == reg) \ + reg--; \ + *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg, \ + offsetof(struct bpf_sock_ops_kern, \ + temp)); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ + struct bpf_sock_ops_kern, \ + is_fullsock), \ + reg, si->dst_reg, \ + offsetof(struct bpf_sock_ops_kern, \ + is_fullsock)); \ + *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2); \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \ + struct bpf_sock_ops_kern, sk),\ + reg, si->dst_reg, \ + offsetof(struct bpf_sock_ops_kern, sk));\ + *insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD), \ + reg, si->src_reg, \ + offsetof(OBJ, OBJ_FIELD)); \ + *insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg, \ + offsetof(struct bpf_sock_ops_kern, \ + temp)); \ + } while (0) + +#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE) \ + do { \ + if (TYPE == BPF_WRITE) \ + SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ); \ + else \ + SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ); \ + } while (0) + + if (insn > insn_buf) + return insn - insn_buf; + switch (si->off) { case offsetof(struct bpf_sock_ops, op) ... offsetof(struct bpf_sock_ops, replylong[3]): @@ -5404,7 +8267,8 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, break; case offsetof(struct bpf_sock_ops, local_ip4): - BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 4); + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, + skc_rcv_saddr) != 4); *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( struct bpf_sock_ops_kern, sk), @@ -5481,6 +8345,117 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg, offsetof(struct sock_common, skc_num)); break; + + case offsetof(struct bpf_sock_ops, is_fullsock): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct bpf_sock_ops_kern, + is_fullsock), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, + is_fullsock)); + break; + + case offsetof(struct bpf_sock_ops, state): + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct bpf_sock_ops_kern, sk), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, sk)); + *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, skc_state)); + break; + + case offsetof(struct bpf_sock_ops, rtt_min): + BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) != + sizeof(struct minmax)); + BUILD_BUG_ON(sizeof(struct minmax) < + sizeof(struct minmax_sample)); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct bpf_sock_ops_kern, sk), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, sk)); + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, + offsetof(struct tcp_sock, rtt_min) + + FIELD_SIZEOF(struct minmax_sample, t)); + break; + + case offsetof(struct bpf_sock_ops, bpf_sock_ops_cb_flags): + SOCK_OPS_GET_FIELD(bpf_sock_ops_cb_flags, bpf_sock_ops_cb_flags, + struct tcp_sock); + break; + + case offsetof(struct bpf_sock_ops, sk_txhash): + SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash, + struct sock, type); + break; + case offsetof(struct bpf_sock_ops, snd_cwnd): + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_cwnd); + break; + case offsetof(struct bpf_sock_ops, srtt_us): + SOCK_OPS_GET_TCP_SOCK_FIELD(srtt_us); + break; + case offsetof(struct bpf_sock_ops, snd_ssthresh): + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_ssthresh); + break; + case offsetof(struct bpf_sock_ops, rcv_nxt): + SOCK_OPS_GET_TCP_SOCK_FIELD(rcv_nxt); + break; + case offsetof(struct bpf_sock_ops, snd_nxt): + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_nxt); + break; + case offsetof(struct bpf_sock_ops, snd_una): + SOCK_OPS_GET_TCP_SOCK_FIELD(snd_una); + break; + case offsetof(struct bpf_sock_ops, mss_cache): + SOCK_OPS_GET_TCP_SOCK_FIELD(mss_cache); + break; + case offsetof(struct bpf_sock_ops, ecn_flags): + SOCK_OPS_GET_TCP_SOCK_FIELD(ecn_flags); + break; + case offsetof(struct bpf_sock_ops, rate_delivered): + SOCK_OPS_GET_TCP_SOCK_FIELD(rate_delivered); + break; + case offsetof(struct bpf_sock_ops, rate_interval_us): + SOCK_OPS_GET_TCP_SOCK_FIELD(rate_interval_us); + break; + case offsetof(struct bpf_sock_ops, packets_out): + SOCK_OPS_GET_TCP_SOCK_FIELD(packets_out); + break; + case offsetof(struct bpf_sock_ops, retrans_out): + SOCK_OPS_GET_TCP_SOCK_FIELD(retrans_out); + break; + case offsetof(struct bpf_sock_ops, total_retrans): + SOCK_OPS_GET_TCP_SOCK_FIELD(total_retrans); + break; + case offsetof(struct bpf_sock_ops, segs_in): + SOCK_OPS_GET_TCP_SOCK_FIELD(segs_in); + break; + case offsetof(struct bpf_sock_ops, data_segs_in): + SOCK_OPS_GET_TCP_SOCK_FIELD(data_segs_in); + break; + case offsetof(struct bpf_sock_ops, segs_out): + SOCK_OPS_GET_TCP_SOCK_FIELD(segs_out); + break; + case offsetof(struct bpf_sock_ops, data_segs_out): + SOCK_OPS_GET_TCP_SOCK_FIELD(data_segs_out); + break; + case offsetof(struct bpf_sock_ops, lost_out): + SOCK_OPS_GET_TCP_SOCK_FIELD(lost_out); + break; + case offsetof(struct bpf_sock_ops, sacked_out): + SOCK_OPS_GET_TCP_SOCK_FIELD(sacked_out); + break; + case offsetof(struct bpf_sock_ops, bytes_received): + SOCK_OPS_GET_TCP_SOCK_FIELD(bytes_received); + break; + case offsetof(struct bpf_sock_ops, bytes_acked): + SOCK_OPS_GET_TCP_SOCK_FIELD(bytes_acked); + break; + case offsetof(struct bpf_sock_ops, sk): + SOCK_OPS_GET_SK(); + break; } return insn - insn_buf; } @@ -5502,6 +8477,27 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type, *insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg, si->src_reg, off); break; + case offsetof(struct __sk_buff, cb[0]) ... + offsetofend(struct __sk_buff, cb[4]) - 1: + BUILD_BUG_ON(sizeof_field(struct sk_skb_cb, data) < 20); + BUILD_BUG_ON((offsetof(struct sk_buff, cb) + + offsetof(struct sk_skb_cb, data)) % + sizeof(__u64)); + + prog->cb_access = 1; + off = si->off; + off -= offsetof(struct __sk_buff, cb[0]); + off += offsetof(struct sk_buff, cb); + off += offsetof(struct sk_skb_cb, data); + if (type == BPF_WRITE) + *insn++ = BPF_STX_MEM(BPF_SIZE(si->code), si->dst_reg, + si->src_reg, off); + else + *insn++ = BPF_LDX_MEM(BPF_SIZE(si->code), si->dst_reg, + si->src_reg, off); + break; + + default: return bpf_convert_ctx_access(type, si, insn_buf, prog, target_size); @@ -5511,22 +8507,135 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type, } static u32 sk_msg_convert_ctx_access(enum bpf_access_type type, - const struct bpf_insn *si, - struct bpf_insn *insn_buf, - struct bpf_prog *prog, u32 *target_size) + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, u32 *target_size) { struct bpf_insn *insn = insn_buf; +#if IS_ENABLED(CONFIG_IPV6) + int off; +#endif + + /* convert ctx uses the fact sg element is first in struct */ + BUILD_BUG_ON(offsetof(struct sk_msg, sg) != 0); switch (si->off) { case offsetof(struct sk_msg_md, data): *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg, data), - si->dst_reg, si->src_reg, - offsetof(struct sk_msg, data)); + si->dst_reg, si->src_reg, + offsetof(struct sk_msg, data)); break; case offsetof(struct sk_msg_md, data_end): *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg, data_end), - si->dst_reg, si->src_reg, - offsetof(struct sk_msg, data_end)); + si->dst_reg, si->src_reg, + offsetof(struct sk_msg, data_end)); + break; + case offsetof(struct sk_msg_md, family): + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_family) != 2); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct sk_msg, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_msg, sk)); + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, skc_family)); + break; + + case offsetof(struct sk_msg_md, remote_ip4): + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct sk_msg, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_msg, sk)); + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, skc_daddr)); + break; + + case offsetof(struct sk_msg_md, local_ip4): + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, + skc_rcv_saddr) != 4); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct sk_msg, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_msg, sk)); + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, + skc_rcv_saddr)); + break; + + case offsetof(struct sk_msg_md, remote_ip6[0]) ... + offsetof(struct sk_msg_md, remote_ip6[3]): +#if IS_ENABLED(CONFIG_IPV6) + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, + skc_v6_daddr.s6_addr32[0]) != 4); + + off = si->off; + off -= offsetof(struct sk_msg_md, remote_ip6[0]); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct sk_msg, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_msg, sk)); + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, + skc_v6_daddr.s6_addr32[0]) + + off); +#else + *insn++ = BPF_MOV32_IMM(si->dst_reg, 0); +#endif + break; + + case offsetof(struct sk_msg_md, local_ip6[0]) ... + offsetof(struct sk_msg_md, local_ip6[3]): +#if IS_ENABLED(CONFIG_IPV6) + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, + skc_v6_rcv_saddr.s6_addr32[0]) != 4); + + off = si->off; + off -= offsetof(struct sk_msg_md, local_ip6[0]); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct sk_msg, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_msg, sk)); + *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, + skc_v6_rcv_saddr.s6_addr32[0]) + + off); +#else + *insn++ = BPF_MOV32_IMM(si->dst_reg, 0); +#endif + break; + + case offsetof(struct sk_msg_md, remote_port): + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_dport) != 2); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct sk_msg, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_msg, sk)); + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, skc_dport)); +#ifndef __BIG_ENDIAN_BITFIELD + *insn++ = BPF_ALU32_IMM(BPF_LSH, si->dst_reg, 16); +#endif + break; + + case offsetof(struct sk_msg_md, local_port): + BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_num) != 2); + + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( + struct sk_msg, sk), + si->dst_reg, si->src_reg, + offsetof(struct sk_msg, sk)); + *insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg, + offsetof(struct sock_common, skc_num)); + break; + + case offsetof(struct sk_msg_md, size): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_sg, size), + si->dst_reg, si->src_reg, + offsetof(struct sk_msg_sg, size)); break; } @@ -5537,9 +8646,11 @@ const struct bpf_verifier_ops sk_filter_verifier_ops = { .get_func_proto = sk_filter_func_proto, .is_valid_access = sk_filter_is_valid_access, .convert_ctx_access = bpf_convert_ctx_access, + .gen_ld_abs = bpf_gen_ld_abs, }; const struct bpf_prog_ops sk_filter_prog_ops = { + .test_run = bpf_prog_test_run_skb, }; const struct bpf_verifier_ops tc_cls_act_verifier_ops = { @@ -5547,6 +8658,7 @@ const struct bpf_verifier_ops tc_cls_act_verifier_ops = { .is_valid_access = tc_cls_act_is_valid_access, .convert_ctx_access = tc_cls_act_convert_ctx_access, .gen_prologue = tc_cls_act_prologue, + .gen_ld_abs = bpf_gen_ld_abs, }; const struct bpf_prog_ops tc_cls_act_prog_ops = { @@ -5557,6 +8669,7 @@ const struct bpf_verifier_ops xdp_verifier_ops = { .get_func_proto = xdp_func_proto, .is_valid_access = xdp_is_valid_access, .convert_ctx_access = xdp_convert_ctx_access, + .gen_prologue = bpf_noop_prologue, }; const struct bpf_prog_ops xdp_prog_ops = { @@ -5565,7 +8678,7 @@ const struct bpf_prog_ops xdp_prog_ops = { const struct bpf_verifier_ops cg_skb_verifier_ops = { .get_func_proto = cg_skb_func_proto, - .is_valid_access = sk_filter_is_valid_access, + .is_valid_access = cg_skb_is_valid_access, .convert_ctx_access = bpf_convert_ctx_access, }; @@ -5573,13 +8686,23 @@ const struct bpf_prog_ops cg_skb_prog_ops = { .test_run = bpf_prog_test_run_skb, }; -const struct bpf_verifier_ops lwt_inout_verifier_ops = { - .get_func_proto = lwt_inout_func_proto, +const struct bpf_verifier_ops lwt_in_verifier_ops = { + .get_func_proto = lwt_in_func_proto, .is_valid_access = lwt_is_valid_access, .convert_ctx_access = bpf_convert_ctx_access, }; -const struct bpf_prog_ops lwt_inout_prog_ops = { +const struct bpf_prog_ops lwt_in_prog_ops = { + .test_run = bpf_prog_test_run_skb, +}; + +const struct bpf_verifier_ops lwt_out_verifier_ops = { + .get_func_proto = lwt_out_func_proto, + .is_valid_access = lwt_is_valid_access, + .convert_ctx_access = bpf_convert_ctx_access, +}; + +const struct bpf_prog_ops lwt_out_prog_ops = { .test_run = bpf_prog_test_run_skb, }; @@ -5594,6 +8717,16 @@ const struct bpf_prog_ops lwt_xmit_prog_ops = { .test_run = bpf_prog_test_run_skb, }; +const struct bpf_verifier_ops lwt_seg6local_verifier_ops = { + .get_func_proto = lwt_seg6local_func_proto, + .is_valid_access = lwt_is_valid_access, + .convert_ctx_access = bpf_convert_ctx_access, +}; + +const struct bpf_prog_ops lwt_seg6local_prog_ops = { + .test_run = bpf_prog_test_run_skb, +}; + const struct bpf_verifier_ops cg_sock_verifier_ops = { .get_func_proto = sock_filter_func_proto, .is_valid_access = sock_filter_is_valid_access, @@ -5632,9 +8765,10 @@ const struct bpf_prog_ops sk_skb_prog_ops = { }; const struct bpf_verifier_ops sk_msg_verifier_ops = { - .get_func_proto = sk_msg_func_proto, - .is_valid_access = sk_msg_is_valid_access, - .convert_ctx_access = sk_msg_convert_ctx_access, + .get_func_proto = sk_msg_func_proto, + .is_valid_access = sk_msg_is_valid_access, + .convert_ctx_access = sk_msg_convert_ctx_access, + .gen_prologue = bpf_noop_prologue, }; const struct bpf_prog_ops sk_msg_prog_ops = { @@ -5643,10 +8777,11 @@ const struct bpf_prog_ops sk_msg_prog_ops = { const struct bpf_verifier_ops flow_dissector_verifier_ops = { .get_func_proto = flow_dissector_func_proto, .is_valid_access = flow_dissector_is_valid_access, - .convert_ctx_access = bpf_convert_ctx_access, + .convert_ctx_access = flow_dissector_convert_ctx_access, }; const struct bpf_prog_ops flow_dissector_prog_ops = { + .test_run = bpf_prog_test_run_flow_dissector, }; int sk_detach_filter(struct sock *sk) @@ -5712,3 +8847,271 @@ out: release_sock(sk); return ret; } + +#ifdef CONFIG_INET +struct sk_reuseport_kern { + struct sk_buff *skb; + struct sock *sk; + struct sock *selected_sk; + void *data_end; + u32 hash; + u32 reuseport_id; + bool bind_inany; +}; + +static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, + struct sock_reuseport *reuse, + struct sock *sk, struct sk_buff *skb, + u32 hash) +{ + reuse_kern->skb = skb; + reuse_kern->sk = sk; + reuse_kern->selected_sk = NULL; + reuse_kern->data_end = skb->data + skb_headlen(skb); + reuse_kern->hash = hash; + reuse_kern->reuseport_id = reuse->reuseport_id; + reuse_kern->bind_inany = reuse->bind_inany; +} + +struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, + struct bpf_prog *prog, struct sk_buff *skb, + u32 hash) +{ + struct sk_reuseport_kern reuse_kern; + enum sk_action action; + + bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash); + action = BPF_PROG_RUN(prog, &reuse_kern); + + if (action == SK_PASS) + return reuse_kern.selected_sk; + else + return ERR_PTR(-ECONNREFUSED); +} + +BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern, + struct bpf_map *, map, void *, key, u32, flags) +{ + struct sock_reuseport *reuse; + struct sock *selected_sk; + + selected_sk = map->ops->map_lookup_elem(map, key); + if (!selected_sk) + return -ENOENT; + + reuse = rcu_dereference(selected_sk->sk_reuseport_cb); + if (!reuse) + /* selected_sk is unhashed (e.g. by close()) after the + * above map_lookup_elem(). Treat selected_sk has already + * been removed from the map. + */ + return -ENOENT; + + if (unlikely(reuse->reuseport_id != reuse_kern->reuseport_id)) { + struct sock *sk; + + if (unlikely(!reuse_kern->reuseport_id)) + /* There is a small race between adding the + * sk to the map and setting the + * reuse_kern->reuseport_id. + * Treat it as the sk has not been added to + * the bpf map yet. + */ + return -ENOENT; + + sk = reuse_kern->sk; + if (sk->sk_protocol != selected_sk->sk_protocol) + return -EPROTOTYPE; + else if (sk->sk_family != selected_sk->sk_family) + return -EAFNOSUPPORT; + + /* Catch all. Likely bound to a different sockaddr. */ + return -EBADFD; + } + + reuse_kern->selected_sk = selected_sk; + + return 0; +} + +static const struct bpf_func_proto sk_select_reuseport_proto = { + .func = sk_select_reuseport, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_CONST_MAP_PTR, + .arg3_type = ARG_PTR_TO_MAP_KEY, + .arg4_type = ARG_ANYTHING, +}; + +BPF_CALL_4(sk_reuseport_load_bytes, + const struct sk_reuseport_kern *, reuse_kern, u32, offset, + void *, to, u32, len) +{ + return ____bpf_skb_load_bytes(reuse_kern->skb, offset, to, len); +} + +static const struct bpf_func_proto sk_reuseport_load_bytes_proto = { + .func = sk_reuseport_load_bytes, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_UNINIT_MEM, + .arg4_type = ARG_CONST_SIZE, +}; + +BPF_CALL_5(sk_reuseport_load_bytes_relative, + const struct sk_reuseport_kern *, reuse_kern, u32, offset, + void *, to, u32, len, u32, start_header) +{ + return ____bpf_skb_load_bytes_relative(reuse_kern->skb, offset, to, + len, start_header); +} + +static const struct bpf_func_proto sk_reuseport_load_bytes_relative_proto = { + .func = sk_reuseport_load_bytes_relative, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_UNINIT_MEM, + .arg4_type = ARG_CONST_SIZE, + .arg5_type = ARG_ANYTHING, +}; + +static const struct bpf_func_proto * +sk_reuseport_func_proto(enum bpf_func_id func_id, + const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_sk_select_reuseport: + return &sk_select_reuseport_proto; + case BPF_FUNC_skb_load_bytes: + return &sk_reuseport_load_bytes_proto; + case BPF_FUNC_skb_load_bytes_relative: + return &sk_reuseport_load_bytes_relative_proto; + default: + return bpf_base_func_proto(func_id); + } +} + +static bool +sk_reuseport_is_valid_access(int off, int size, + enum bpf_access_type type, + const struct bpf_prog *prog, + struct bpf_insn_access_aux *info) +{ + const u32 size_default = sizeof(__u32); + + if (off < 0 || off >= sizeof(struct sk_reuseport_md) || + off % size || type != BPF_READ) + return false; + + switch (off) { + case offsetof(struct sk_reuseport_md, data): + info->reg_type = PTR_TO_PACKET; + return size == sizeof(__u64); + + case offsetof(struct sk_reuseport_md, data_end): + info->reg_type = PTR_TO_PACKET_END; + return size == sizeof(__u64); + + case offsetof(struct sk_reuseport_md, hash): + return size == size_default; + + /* Fields that allow narrowing */ + case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): + if (size < FIELD_SIZEOF(struct sk_buff, protocol)) + return false; + /* fall through */ + case bpf_ctx_range(struct sk_reuseport_md, ip_protocol): + case bpf_ctx_range(struct sk_reuseport_md, bind_inany): + case bpf_ctx_range(struct sk_reuseport_md, len): + bpf_ctx_record_field_size(info, size_default); + return bpf_ctx_narrow_access_ok(off, size, size_default); + + default: + return false; + } +} + +#define SK_REUSEPORT_LOAD_FIELD(F) ({ \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_reuseport_kern, F), \ + si->dst_reg, si->src_reg, \ + bpf_target_off(struct sk_reuseport_kern, F, \ + FIELD_SIZEOF(struct sk_reuseport_kern, F), \ + target_size)); \ + }) + +#define SK_REUSEPORT_LOAD_SKB_FIELD(SKB_FIELD) \ + SOCK_ADDR_LOAD_NESTED_FIELD(struct sk_reuseport_kern, \ + struct sk_buff, \ + skb, \ + SKB_FIELD) + +#define SK_REUSEPORT_LOAD_SK_FIELD_SIZE_OFF(SK_FIELD, BPF_SIZE, EXTRA_OFF) \ + SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(struct sk_reuseport_kern, \ + struct sock, \ + sk, \ + SK_FIELD, BPF_SIZE, EXTRA_OFF) + +static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, + const struct bpf_insn *si, + struct bpf_insn *insn_buf, + struct bpf_prog *prog, + u32 *target_size) +{ + struct bpf_insn *insn = insn_buf; + + switch (si->off) { + case offsetof(struct sk_reuseport_md, data): + SK_REUSEPORT_LOAD_SKB_FIELD(data); + break; + + case offsetof(struct sk_reuseport_md, len): + SK_REUSEPORT_LOAD_SKB_FIELD(len); + break; + + case offsetof(struct sk_reuseport_md, eth_protocol): + SK_REUSEPORT_LOAD_SKB_FIELD(protocol); + break; + + case offsetof(struct sk_reuseport_md, ip_protocol): + BUILD_BUG_ON(HWEIGHT32(SK_FL_PROTO_MASK) != BITS_PER_BYTE); + SK_REUSEPORT_LOAD_SK_FIELD_SIZE_OFF(__sk_flags_offset, + BPF_W, 0); + *insn++ = BPF_ALU32_IMM(BPF_AND, si->dst_reg, SK_FL_PROTO_MASK); + *insn++ = BPF_ALU32_IMM(BPF_RSH, si->dst_reg, + SK_FL_PROTO_SHIFT); + /* SK_FL_PROTO_MASK and SK_FL_PROTO_SHIFT are endian + * aware. No further narrowing or masking is needed. + */ + *target_size = 1; + break; + + case offsetof(struct sk_reuseport_md, data_end): + SK_REUSEPORT_LOAD_FIELD(data_end); + break; + + case offsetof(struct sk_reuseport_md, hash): + SK_REUSEPORT_LOAD_FIELD(hash); + break; + + case offsetof(struct sk_reuseport_md, bind_inany): + SK_REUSEPORT_LOAD_FIELD(bind_inany); + break; + } + + return insn - insn_buf; +} + +const struct bpf_verifier_ops sk_reuseport_verifier_ops = { + .get_func_proto = sk_reuseport_func_proto, + .is_valid_access = sk_reuseport_is_valid_access, + .convert_ctx_access = sk_reuseport_convert_ctx_access, +}; + +const struct bpf_prog_ops sk_reuseport_prog_ops = { +}; +#endif /* CONFIG_INET */ diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 886283d20817..9475872940a3 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -62,6 +62,45 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector, } EXPORT_SYMBOL(skb_flow_dissector_init); +int skb_flow_dissector_prog_query(const union bpf_attr *attr, + union bpf_attr __user *uattr) +{ + __u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids); + u32 prog_id, prog_cnt = 0, flags = 0; + struct bpf_prog *attached; + struct net *net; + + if (attr->query.query_flags) + return -EINVAL; + + net = get_net_ns_by_fd(attr->query.target_fd); + if (IS_ERR(net)) + return PTR_ERR(net); + + rcu_read_lock(); + attached = rcu_dereference(net->flow_dissector_prog); + if (attached) { + prog_cnt = 1; + prog_id = attached->aux->id; + } + rcu_read_unlock(); + + put_net(net); + + if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags))) + return -EFAULT; + if (copy_to_user(&uattr->query.prog_cnt, &prog_cnt, sizeof(prog_cnt))) + return -EFAULT; + + if (!attr->query.prog_cnt || !prog_ids || !prog_cnt) + return 0; + + if (copy_to_user(prog_ids, &prog_id, sizeof(u32))) + return -EFAULT; + + return 0; +} + int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr, struct bpf_prog *prog) { @@ -100,7 +139,6 @@ int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr) mutex_unlock(&flow_dissector_mutex); return 0; } - /** * skb_flow_get_be16 - extract be16 entity * @skb: sk_buff to extract from @@ -458,6 +496,7 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys, struct flow_dissector_key_basic *key_basic; struct flow_dissector_key_addrs *key_addrs; struct flow_dissector_key_ports *key_ports; + struct flow_dissector_key_tags *key_tags; key_control = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_CONTROL, @@ -502,10 +541,50 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys, key_ports->src = flow_keys->sport; key_ports->dst = flow_keys->dport; } + + if (dissector_uses_key(flow_dissector, + FLOW_DISSECTOR_KEY_FLOW_LABEL)) { + key_tags = skb_flow_dissector_target(flow_dissector, + FLOW_DISSECTOR_KEY_FLOW_LABEL, + target_container); + key_tags->flow_label = ntohl(flow_keys->flow_label); + } +} + +bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx, + __be16 proto, int nhoff, int hlen, unsigned int flags) +{ + struct bpf_flow_keys *flow_keys = ctx->flow_keys; + u32 result; + + /* Pass parameters to the BPF program */ + memset(flow_keys, 0, sizeof(*flow_keys)); + flow_keys->n_proto = proto; + flow_keys->nhoff = nhoff; + flow_keys->thoff = flow_keys->nhoff; + + BUILD_BUG_ON((int)BPF_FLOW_DISSECTOR_F_PARSE_1ST_FRAG != + (int)FLOW_DISSECTOR_F_PARSE_1ST_FRAG); + BUILD_BUG_ON((int)BPF_FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL != + (int)FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); + BUILD_BUG_ON((int)BPF_FLOW_DISSECTOR_F_STOP_AT_ENCAP != + (int)FLOW_DISSECTOR_F_STOP_AT_ENCAP); + flow_keys->flags = flags; + + preempt_disable(); + result = BPF_PROG_RUN(prog, ctx); + preempt_enable(); + + flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, nhoff, hlen); + flow_keys->thoff = clamp_t(u16, flow_keys->thoff, + flow_keys->nhoff, hlen); + + return result == BPF_OK; } /** * __skb_flow_dissect - extract the flow_keys struct and return it + * @net: associated network namespace, derived from @skb if NULL * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified * @flow_dissector: list of keys to dissect * @target_container: target structure to put dissected values into @@ -513,6 +592,8 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys, * @proto: protocol for which to get the flow, if @data is NULL use skb->protocol * @nhoff: network header offset, if @data is NULL use skb_network_offset(skb) * @hlen: packet header length, if @data is NULL use skb_headlen(skb) + * @flags: flags that control the dissection process, e.g. + * FLOW_DISSECTOR_F_STOP_AT_ENCAP. * * The function will try to retrieve individual keys into target specified * by flow_dissector from either the skbuff or a raw buffer specified by the @@ -520,7 +601,8 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys, * * Caller must take care of zeroing target container memory. */ -bool __skb_flow_dissect(const struct sk_buff *skb, +bool __skb_flow_dissect(const struct net *net, + const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, void *data, __be16 proto, int nhoff, int hlen, @@ -533,9 +615,9 @@ bool __skb_flow_dissect(const struct sk_buff *skb, struct flow_dissector_key_icmp *key_icmp; struct flow_dissector_key_tags *key_tags; struct flow_dissector_key_vlan *key_vlan; + struct bpf_prog *attached = NULL; enum flow_dissect_ret fdret; bool skip_vlan = false; - struct bpf_prog *attached; int num_hdrs = 0; u8 ip_proto = 0; bool ret; @@ -576,43 +658,47 @@ bool __skb_flow_dissect(const struct sk_buff *skb, FLOW_DISSECTOR_KEY_BASIC, target_container); - rcu_read_lock(); - attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog) - : NULL; - if (attached) { - /* Note that even though the const qualifier is discarded - * throughout the execution of the BPF program, all changes(the - * control block) are reverted after the BPF program returns. - * Therefore, __skb_flow_dissect does not alter the skb. - */ - struct bpf_flow_keys flow_keys = {}; - struct bpf_skb_data_end cb_saved; - struct bpf_skb_data_end *cb; - u32 result; - - cb = (struct bpf_skb_data_end *)skb->cb; - - /* Save Control Block */ - memcpy(&cb_saved, cb, sizeof(cb_saved)); - memset(cb, 0, sizeof(cb_saved)); - - /* Pass parameters to the BPF program */ - cb->qdisc_cb.flow_keys = &flow_keys; - flow_keys.nhoff = nhoff; - - bpf_compute_data_pointers((struct sk_buff *)skb); - result = BPF_PROG_RUN(attached, skb); - - /* Restore state */ - memcpy(cb, &cb_saved, sizeof(cb_saved)); - - __skb_flow_bpf_to_target(&flow_keys, flow_dissector, - target_container); - key_control->thoff = min_t(u16, key_control->thoff, skb->len); - rcu_read_unlock(); - return result == BPF_OK; + if (skb) { + if (!net) { + if (skb->dev) + net = dev_net(skb->dev); + else if (skb->sk) + net = sock_net(skb->sk); + } + } + + WARN_ON_ONCE(!net); + if (net) { + rcu_read_lock(); + attached = rcu_dereference(net->flow_dissector_prog); + + if (attached) { + struct bpf_flow_keys flow_keys; + struct bpf_flow_dissector ctx = { + .flow_keys = &flow_keys, + .data = data, + .data_end = data + hlen, + }; + __be16 n_proto = proto; + + if (skb) { + ctx.skb = skb; + /* we can't use 'proto' in the skb case + * because it might be set to skb->vlan_proto + * which has been pulled from the data + */ + n_proto = skb->protocol; + } + + ret = bpf_flow_dissect(attached, &ctx, n_proto, nhoff, + hlen, flags); + __skb_flow_bpf_to_target(&flow_keys, flow_dissector, + target_container); + rcu_read_unlock(); + return ret; + } + rcu_read_unlock(); } - rcu_read_unlock(); if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) { @@ -675,11 +761,6 @@ proto_again: __skb_flow_dissect_ipv4(skb, flow_dissector, target_container, data, iph); - if (flags & FLOW_DISSECTOR_F_STOP_AT_L3) { - fdret = FLOW_DISSECT_RET_OUT_GOOD; - break; - } - break; } case htons(ETH_P_IPV6): { @@ -730,9 +811,6 @@ proto_again: __skb_flow_dissect_ipv6(skb, flow_dissector, target_container, data, iph); - if (flags & FLOW_DISSECTOR_F_STOP_AT_L3) - fdret = FLOW_DISSECT_RET_OUT_GOOD; - break; } case htons(ETH_P_8021AD): @@ -1190,8 +1268,8 @@ u32 __skb_get_hash_symmetric(const struct sk_buff *skb) __flow_hash_secret_init(); memset(&keys, 0, sizeof(keys)); - __skb_flow_dissect(skb, &flow_keys_dissector_symmetric, &keys, - NULL, 0, 0, 0, + __skb_flow_dissect(NULL, skb, &flow_keys_dissector_symmetric, + &keys, NULL, 0, 0, 0, FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL); return __flow_hash_from_keys(&keys, &hashrnd); @@ -1230,7 +1308,7 @@ __u32 skb_get_hash_perturb(const struct sk_buff *skb, EXPORT_SYMBOL(skb_get_hash_perturb); u32 __skb_get_poff(const struct sk_buff *skb, void *data, - const struct flow_keys *keys, int hlen) + const struct flow_keys_basic *keys, int hlen) { u32 poff = keys->control.thoff; @@ -1291,9 +1369,10 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data, */ u32 skb_get_poff(const struct sk_buff *skb) { - struct flow_keys keys; + struct flow_keys_basic keys; - if (!skb_flow_dissect_flow_keys(skb, &keys, 0)) + if (!skb_flow_dissect_flow_keys_basic(NULL, skb, &keys, + NULL, 0, 0, 0, 0)) return 0; return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb)); @@ -1396,7 +1475,7 @@ static const struct flow_dissector_key flow_keys_dissector_symmetric_keys[] = { }, }; -static const struct flow_dissector_key flow_keys_buf_dissector_keys[] = { +static const struct flow_dissector_key flow_keys_basic_dissector_keys[] = { { .key_id = FLOW_DISSECTOR_KEY_CONTROL, .offset = offsetof(struct flow_keys, control), @@ -1410,7 +1489,8 @@ static const struct flow_dissector_key flow_keys_buf_dissector_keys[] = { struct flow_dissector flow_keys_dissector __read_mostly; EXPORT_SYMBOL(flow_keys_dissector); -struct flow_dissector flow_keys_buf_dissector __read_mostly; +struct flow_dissector flow_keys_basic_dissector __read_mostly; +EXPORT_SYMBOL(flow_keys_basic_dissector); static int __init init_default_flow_dissectors(void) { @@ -1420,9 +1500,9 @@ static int __init init_default_flow_dissectors(void) skb_flow_dissector_init(&flow_keys_dissector_symmetric, flow_keys_dissector_symmetric_keys, ARRAY_SIZE(flow_keys_dissector_symmetric_keys)); - skb_flow_dissector_init(&flow_keys_buf_dissector, - flow_keys_buf_dissector_keys, - ARRAY_SIZE(flow_keys_buf_dissector_keys)); + skb_flow_dissector_init(&flow_keys_basic_dissector, + flow_keys_basic_dissector_keys, + ARRAY_SIZE(flow_keys_basic_dissector_keys)); return 0; } diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c index 680782c53225..b3ee5899838d 100644 --- a/net/core/lwt_bpf.c +++ b/net/core/lwt_bpf.c @@ -1,13 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2016 Thomas Graf - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include @@ -392,6 +384,71 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = { .owner = THIS_MODULE, }; +static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len) +{ + /* Handling of GSO-enabled packets is added in the next patch. */ + return -EOPNOTSUPP; +} + +int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress) +{ + struct iphdr *iph; + bool ipv4; + int err; + + if (unlikely(len < sizeof(struct iphdr) || len > LWT_BPF_MAX_HEADROOM)) + return -EINVAL; + + /* validate protocol and length */ + iph = (struct iphdr *)hdr; + if (iph->version == 4) { + ipv4 = true; + if (unlikely(len < iph->ihl * 4)) + return -EINVAL; + } else if (iph->version == 6) { + ipv4 = false; + if (unlikely(len < sizeof(struct ipv6hdr))) + return -EINVAL; + } else { + return -EINVAL; + } + + if (ingress) + err = skb_cow_head(skb, len + skb->mac_len); + else + err = skb_cow_head(skb, + len + LL_RESERVED_SPACE(skb_dst(skb)->dev)); + if (unlikely(err)) + return err; + + /* push the encap headers and fix pointers */ + skb_reset_inner_headers(skb); + skb->encapsulation = 1; + skb_push(skb, len); + if (ingress) + skb_postpush_rcsum(skb, iph, len); + skb_reset_network_header(skb); + memcpy(skb_network_header(skb), hdr, len); + bpf_compute_data_pointers(skb); + skb_clear_hash(skb); + + if (ipv4) { + skb->protocol = htons(ETH_P_IP); + iph = ip_hdr(skb); + + if (!iph->check) + iph->check = ip_fast_csum((unsigned char *)iph, + iph->ihl); + } else { + skb->protocol = htons(ETH_P_IPV6); + } + + if (skb_is_gso(skb)) + return handle_gso_encap(skb, ipv4, len); + + return 0; +} + static int __init bpf_lwt_init(void) { return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF); diff --git a/net/core/page_pool.c b/net/core/page_pool.c new file mode 100644 index 000000000000..68bf07206744 --- /dev/null +++ b/net/core/page_pool.c @@ -0,0 +1,317 @@ +/* SPDX-License-Identifier: GPL-2.0 + * + * page_pool.c + * Author: Jesper Dangaard Brouer + * Copyright (C) 2016 Red Hat, Inc. + */ +#include +#include +#include + +#include +#include +#include +#include +#include /* for __put_page() */ + +static int page_pool_init(struct page_pool *pool, + const struct page_pool_params *params) +{ + unsigned int ring_qsize = 1024; /* Default */ + + memcpy(&pool->p, params, sizeof(pool->p)); + + /* Validate only known flags were used */ + if (pool->p.flags & ~(PP_FLAG_ALL)) + return -EINVAL; + + if (pool->p.pool_size) + ring_qsize = pool->p.pool_size; + + /* Sanity limit mem that can be pinned down */ + if (ring_qsize > 32768) + return -E2BIG; + + /* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL. + * DMA_BIDIRECTIONAL is for allowing page used for DMA sending, + * which is the XDP_TX use-case. + */ + if ((pool->p.dma_dir != DMA_FROM_DEVICE) && + (pool->p.dma_dir != DMA_BIDIRECTIONAL)) + return -EINVAL; + + if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0) + return -ENOMEM; + + return 0; +} + +struct page_pool *page_pool_create(const struct page_pool_params *params) +{ + struct page_pool *pool; + int err = 0; + + pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid); + if (!pool) + return ERR_PTR(-ENOMEM); + + err = page_pool_init(pool, params); + if (err < 0) { + pr_warn("%s() gave up with errno %d\n", __func__, err); + kfree(pool); + return ERR_PTR(err); + } + return pool; +} +EXPORT_SYMBOL(page_pool_create); + +/* fast path */ +static struct page *__page_pool_get_cached(struct page_pool *pool) +{ + struct ptr_ring *r = &pool->ring; + struct page *page; + + /* Quicker fallback, avoid locks when ring is empty */ + if (__ptr_ring_empty(r)) + return NULL; + + /* Test for safe-context, caller should provide this guarantee */ + if (likely(in_serving_softirq())) { + if (likely(pool->alloc.count)) { + /* Fast-path */ + page = pool->alloc.cache[--pool->alloc.count]; + return page; + } + /* Slower-path: Alloc array empty, time to refill + * + * Open-coded bulk ptr_ring consumer. + * + * Discussion: the ring consumer lock is not really + * needed due to the softirq/NAPI protection, but + * later need the ability to reclaim pages on the + * ring. Thus, keeping the locks. + */ + spin_lock(&r->consumer_lock); + while ((page = __ptr_ring_consume(r))) { + if (pool->alloc.count == PP_ALLOC_CACHE_REFILL) + break; + pool->alloc.cache[pool->alloc.count++] = page; + } + spin_unlock(&r->consumer_lock); + return page; + } + + /* Slow-path: Get page from locked ring queue */ + page = ptr_ring_consume(&pool->ring); + return page; +} + +/* slow path */ +noinline +static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool, + gfp_t _gfp) +{ + struct page *page; + gfp_t gfp = _gfp; + dma_addr_t dma; + + /* We could always set __GFP_COMP, and avoid this branch, as + * prep_new_page() can handle order-0 with __GFP_COMP. + */ + if (pool->p.order) + gfp |= __GFP_COMP; + + /* FUTURE development: + * + * Current slow-path essentially falls back to single page + * allocations, which doesn't improve performance. This code + * need bulk allocation support from the page allocator code. + */ + + /* Cache was empty, do real allocation */ + page = alloc_pages_node(pool->p.nid, gfp, pool->p.order); + if (!page) + return NULL; + + if (!(pool->p.flags & PP_FLAG_DMA_MAP)) + goto skip_dma_map; + + /* Setup DMA mapping: use page->private for DMA-addr + * This mapping is kept for lifetime of page, until leaving pool. + */ + dma = dma_map_page(pool->p.dev, page, 0, + (PAGE_SIZE << pool->p.order), + pool->p.dma_dir); + if (dma_mapping_error(pool->p.dev, dma)) { + put_page(page); + return NULL; + } + set_page_private(page, dma); /* page->private = dma; */ + +skip_dma_map: + /* When page just alloc'ed is should/must have refcnt 1. */ + return page; +} + +/* For using page_pool replace: alloc_pages() API calls, but provide + * synchronization guarantee for allocation side. + */ +struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) +{ + struct page *page; + + /* Fast-path: Get a page from cache */ + page = __page_pool_get_cached(pool); + if (page) + return page; + + /* Slow-path: cache empty, do real allocation */ + page = __page_pool_alloc_pages_slow(pool, gfp); + return page; +} +EXPORT_SYMBOL(page_pool_alloc_pages); + +/* Cleanup page_pool state from page */ +static void __page_pool_clean_page(struct page_pool *pool, + struct page *page) +{ + if (!(pool->p.flags & PP_FLAG_DMA_MAP)) + return; + + /* DMA unmap */ + dma_unmap_page(pool->p.dev, page_private(page), + PAGE_SIZE << pool->p.order, pool->p.dma_dir); + set_page_private(page, 0); +} + +/* Return a page to the page allocator, cleaning up our state */ +static void __page_pool_return_page(struct page_pool *pool, struct page *page) +{ + __page_pool_clean_page(pool, page); + put_page(page); + /* An optimization would be to call __free_pages(page, pool->p.order) + * knowing page is not part of page-cache (thus avoiding a + * __page_cache_release() call). + */ +} + +static bool __page_pool_recycle_into_ring(struct page_pool *pool, + struct page *page) +{ + int ret; + /* BH protection not needed if current is serving softirq */ + if (in_serving_softirq()) + ret = ptr_ring_produce(&pool->ring, page); + else + ret = ptr_ring_produce_bh(&pool->ring, page); + + return (ret == 0) ? true : false; +} + +/* Only allow direct recycling in special circumstances, into the + * alloc side cache. E.g. during RX-NAPI processing for XDP_DROP use-case. + * + * Caller must provide appropriate safe context. + */ +static bool __page_pool_recycle_direct(struct page *page, + struct page_pool *pool) +{ + if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE)) + return false; + + /* Caller MUST have verified/know (page_ref_count(page) == 1) */ + pool->alloc.cache[pool->alloc.count++] = page; + return true; +} + +void __page_pool_put_page(struct page_pool *pool, + struct page *page, bool allow_direct) +{ + /* This allocator is optimized for the XDP mode that uses + * one-frame-per-page, but have fallbacks that act like the + * regular page allocator APIs. + * + * refcnt == 1 means page_pool owns page, and can recycle it. + */ + if (likely(page_ref_count(page) == 1)) { + /* Read barrier done in page_ref_count / READ_ONCE */ + + if (allow_direct && in_serving_softirq()) + if (__page_pool_recycle_direct(page, pool)) + return; + + if (!__page_pool_recycle_into_ring(pool, page)) { + /* Cache full, fallback to free pages */ + __page_pool_return_page(pool, page); + } + return; + } + /* Fallback/non-XDP mode: API user have elevated refcnt. + * + * Many drivers split up the page into fragments, and some + * want to keep doing this to save memory and do refcnt based + * recycling. Support this use case too, to ease drivers + * switching between XDP/non-XDP. + * + * In-case page_pool maintains the DMA mapping, API user must + * call page_pool_put_page once. In this elevated refcnt + * case, the DMA is unmapped/released, as driver is likely + * doing refcnt based recycle tricks, meaning another process + * will be invoking put_page. + */ + __page_pool_clean_page(pool, page); + put_page(page); +} +EXPORT_SYMBOL(__page_pool_put_page); + +static void __page_pool_empty_ring(struct page_pool *pool) +{ + struct page *page; + + /* Empty recycle ring */ + while ((page = ptr_ring_consume(&pool->ring))) { + /* Verify the refcnt invariant of cached pages */ + if (!(page_ref_count(page) == 1)) + pr_crit("%s() page_pool refcnt %d violation\n", + __func__, page_ref_count(page)); + + __page_pool_return_page(pool, page); + } +} + +static void __page_pool_destroy_rcu(struct rcu_head *rcu) +{ + struct page_pool *pool; + + pool = container_of(rcu, struct page_pool, rcu); + + WARN(pool->alloc.count, "API usage violation"); + + __page_pool_empty_ring(pool); + ptr_ring_cleanup(&pool->ring, NULL); + kfree(pool); +} + +/* Cleanup and release resources */ +void page_pool_destroy(struct page_pool *pool) +{ + struct page *page; + + /* Empty alloc cache, assume caller made sure this is + * no-longer in use, and page_pool_alloc_pages() cannot be + * call concurrently. + */ + while (pool->alloc.count) { + page = pool->alloc.cache[--pool->alloc.count]; + __page_pool_return_page(pool, page); + } + + /* No more consumers should exist, but producers could still + * be in-flight. + */ + __page_pool_empty_ring(pool); + + /* An xdp_mem_allocator can still ref page_pool pointer */ + call_rcu(&pool->rcu, __page_pool_destroy_rcu); +} +EXPORT_SYMBOL(page_pool_destroy); diff --git a/net/core/pktgen.c b/net/core/pktgen.c index a1a8afd64846..835ef82eeb45 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -399,7 +399,7 @@ struct pktgen_dev { __u8 ipsmode; /* IPSEC mode (config) */ __u8 ipsproto; /* IPSEC type (config) */ __u32 spi; - struct dst_entry dst; + struct xfrm_dst xdst; struct dst_ops dstops; #endif char result[512]; @@ -2609,7 +2609,7 @@ static int pktgen_output_ipsec(struct sk_buff *skb, struct pktgen_dev *pkt_dev) * supports both transport/tunnel mode + ESP/AH type. */ if ((x->props.mode == XFRM_MODE_TUNNEL) && (pkt_dev->spi != 0)) - skb->_skb_refdst = (unsigned long)&pkt_dev->dst | SKB_DST_NOREF; + skb->_skb_refdst = (unsigned long)&pkt_dev->xdst.u.dst | SKB_DST_NOREF; rcu_read_lock_bh(); err = x->outer_mode->output(x, skb); @@ -3745,10 +3745,10 @@ static int pktgen_add_device(struct pktgen_thread *t, const char *ifname) * performance under such circumstance. */ pkt_dev->dstops.family = AF_INET; - pkt_dev->dst.dev = pkt_dev->odev; - dst_init_metrics(&pkt_dev->dst, pktgen_dst_metrics, false); - pkt_dev->dst.child = &pkt_dev->dst; - pkt_dev->dst.ops = &pkt_dev->dstops; + pkt_dev->xdst.u.dst.dev = pkt_dev->odev; + dst_init_metrics(&pkt_dev->xdst.u.dst, pktgen_dst_metrics, false); + pkt_dev->xdst.child = &pkt_dev->xdst.u.dst; + pkt_dev->xdst.u.dst.ops = &pkt_dev->dstops; #endif return add_dev_to_thread(t, pkt_dev); diff --git a/net/core/ptp_classifier.c b/net/core/ptp_classifier.c index 703cf76aa7c2..4083e90e3ac0 100644 --- a/net/core/ptp_classifier.c +++ b/net/core/ptp_classifier.c @@ -1,13 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-only /* PTP classifier - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ /* The below program is the bpf_asm (tools/net/) representation of diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index ed3c304ab418..46c0eae20d96 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -714,13 +714,15 @@ int rtnl_put_cacheinfo(struct sk_buff *skb, struct dst_entry *dst, u32 id, long expires, u32 error) { struct rta_cacheinfo ci = { - .rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse), - .rta_used = dst->__use, - .rta_clntref = atomic_read(&(dst->__refcnt)), .rta_error = error, .rta_id = id, }; + if (dst) { + ci.rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse); + ci.rta_used = dst->__use; + ci.rta_clntref = atomic_read(&dst->__refcnt); + } if (expires) { unsigned long clock; @@ -879,7 +881,8 @@ static size_t rtnl_xdp_size(void) { size_t xdp_size = nla_total_size(0) + /* nest IFLA_XDP */ nla_total_size(1) + /* XDP_ATTACHED */ - nla_total_size(4); /* XDP_PROG_ID */ + nla_total_size(4) + /* XDP_PROG_ID (or 1st mode) */ + nla_total_size(4); /* XDP__PROG_ID */ return xdp_size; } @@ -1229,23 +1232,51 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev) return 0; } -static u8 rtnl_xdp_attached_mode(struct net_device *dev, u32 *prog_id) +static u32 rtnl_xdp_prog_skb(struct net_device *dev) { - const struct net_device_ops *ops = dev->netdev_ops; const struct bpf_prog *generic_xdp_prog; ASSERT_RTNL(); - *prog_id = 0; generic_xdp_prog = rtnl_dereference(dev->xdp_prog); - if (generic_xdp_prog) { - *prog_id = generic_xdp_prog->aux->id; - return XDP_ATTACHED_SKB; - } - if (!ops->ndo_bpf) - return XDP_ATTACHED_NONE; + if (!generic_xdp_prog) + return 0; + return generic_xdp_prog->aux->id; +} - return __dev_xdp_attached(dev, ops->ndo_bpf, prog_id); +static u32 rtnl_xdp_prog_drv(struct net_device *dev) +{ + return __dev_xdp_query(dev, dev->netdev_ops->ndo_bpf, XDP_QUERY_PROG); +} + +static u32 rtnl_xdp_prog_hw(struct net_device *dev) +{ + return __dev_xdp_query(dev, dev->netdev_ops->ndo_bpf, + XDP_QUERY_PROG_HW); +} + +static int rtnl_xdp_report_one(struct sk_buff *skb, struct net_device *dev, + u32 *prog_id, u8 *mode, u8 tgt_mode, u32 attr, + u32 (*get_prog_id)(struct net_device *dev)) +{ + u32 curr_id; + int err; + + curr_id = get_prog_id(dev); + if (!curr_id) + return 0; + + *prog_id = curr_id; + err = nla_put_u32(skb, attr, curr_id); + if (err) + return err; + + if (*mode != XDP_ATTACHED_NONE) + *mode = XDP_ATTACHED_MULTI; + else + *mode = tgt_mode; + + return 0; } static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) @@ -1253,17 +1284,29 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev) struct nlattr *xdp; u32 prog_id; int err; + u8 mode; xdp = nla_nest_start(skb, IFLA_XDP); if (!xdp) return -EMSGSIZE; - err = nla_put_u8(skb, IFLA_XDP_ATTACHED, - rtnl_xdp_attached_mode(dev, &prog_id)); + prog_id = 0; + mode = XDP_ATTACHED_NONE; + if (rtnl_xdp_report_one(skb, dev, &prog_id, &mode, XDP_ATTACHED_SKB, + IFLA_XDP_SKB_PROG_ID, rtnl_xdp_prog_skb)) + goto err_cancel; + if (rtnl_xdp_report_one(skb, dev, &prog_id, &mode, XDP_ATTACHED_DRV, + IFLA_XDP_DRV_PROG_ID, rtnl_xdp_prog_drv)) + goto err_cancel; + if (rtnl_xdp_report_one(skb, dev, &prog_id, &mode, XDP_ATTACHED_HW, + IFLA_XDP_HW_PROG_ID, rtnl_xdp_prog_hw)) + goto err_cancel; + + err = nla_put_u8(skb, IFLA_XDP_ATTACHED, mode); if (err) goto err_cancel; - if (prog_id) { + if (prog_id && mode != XDP_ATTACHED_MULTI) { err = nla_put_u32(skb, IFLA_XDP_PROG_ID, prog_id); if (err) goto err_cancel; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7c8965ae0718..4dff86beb55d 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -79,9 +79,6 @@ struct kmem_cache *skbuff_head_cache __read_mostly; static struct kmem_cache *skbuff_fclone_cache __read_mostly; -#ifdef CONFIG_SKB_EXTENSIONS -static struct kmem_cache *skbuff_ext_cache __ro_after_init; -#endif int sysctl_max_skb_frags __read_mostly = MAX_SKB_FRAGS; EXPORT_SYMBOL(sysctl_max_skb_frags); @@ -259,6 +256,33 @@ nodata: } EXPORT_SYMBOL(__alloc_skb); +/* Caller must provide SKB that is memset cleared */ +static struct sk_buff *__build_skb_around(struct sk_buff *skb, + void *data, unsigned int frag_size) +{ + struct skb_shared_info *shinfo; + unsigned int size = frag_size ? : ksize(data); + + size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + + /* Assumes caller memset cleared SKB */ + skb->truesize = SKB_TRUESIZE(size); + refcount_set(&skb->users, 1); + skb->head = data; + skb->data = data; + skb_reset_tail_pointer(skb); + skb->end = skb->tail + size; + skb->mac_header = (typeof(skb->mac_header))~0U; + skb->transport_header = (typeof(skb->transport_header))~0U; + + /* make sure we initialize shinfo sequentially */ + shinfo = skb_shinfo(skb); + memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); + atomic_set(&shinfo->dataref, 1); + + return skb; +} + /** * __build_skb - build a network buffer * @data: data buffer provided by caller @@ -280,32 +304,15 @@ EXPORT_SYMBOL(__alloc_skb); */ struct sk_buff *__build_skb(void *data, unsigned int frag_size) { - struct skb_shared_info *shinfo; struct sk_buff *skb; - unsigned int size = frag_size ? : ksize(data); skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); - if (!skb) + if (unlikely(!skb)) return NULL; - size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); - memset(skb, 0, offsetof(struct sk_buff, tail)); - skb->truesize = SKB_TRUESIZE(size); - refcount_set(&skb->users, 1); - skb->head = data; - skb->data = data; - skb_reset_tail_pointer(skb); - skb->end = skb->tail + size; - skb->mac_header = (typeof(skb->mac_header))~0U; - skb->transport_header = (typeof(skb->transport_header))~0U; - /* make sure we initialize shinfo sequentially */ - shinfo = skb_shinfo(skb); - memset(shinfo, 0, offsetof(struct skb_shared_info, dataref)); - atomic_set(&shinfo->dataref, 1); - - return skb; + return __build_skb_around(skb, data, frag_size); } /* build_skb() is wrapper over __build_skb(), that specifically @@ -326,6 +333,29 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size) } EXPORT_SYMBOL(build_skb); +/** + * build_skb_around - build a network buffer around provided skb + * @skb: sk_buff provide by caller, must be memset cleared + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + */ +struct sk_buff *build_skb_around(struct sk_buff *skb, + void *data, unsigned int frag_size) +{ + if (unlikely(!skb)) + return NULL; + + skb = __build_skb_around(skb, data, frag_size); + + if (skb && frag_size) { + skb->head_frag = 1; + if (page_is_pfmemalloc(virt_to_head_page(data))) + skb->pfmemalloc = 1; + } + return skb; +} +EXPORT_SYMBOL(build_skb_around); + #define NAPI_SKB_CACHE_SIZE 64 struct napi_alloc_cache { @@ -642,7 +672,6 @@ void skb_release_head_state(struct sk_buff *skb) #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) nf_bridge_put(skb->nf_bridge); #endif - skb_ext_put(skb); } /* Free everything but the sk_buff shell. */ @@ -822,7 +851,6 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old) new->dev = old->dev; memcpy(new->cb, old->cb, sizeof(old->cb)); skb_dst_copy(new, old); - __skb_ext_copy(new, old); #ifdef CONFIG_XFRM new->sp = secpath_get(old->sp); #endif @@ -3988,38 +4016,6 @@ done: } EXPORT_SYMBOL_GPL(skb_gro_receive); -#ifdef CONFIG_SKB_EXTENSIONS -#define SKB_EXT_ALIGN_VALUE 8 -#define SKB_EXT_CHUNKSIZEOF(x) (ALIGN((sizeof(x)), SKB_EXT_ALIGN_VALUE) / SKB_EXT_ALIGN_VALUE) -static const u8 skb_ext_type_len[] = { -#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) - [SKB_EXT_BRIDGE_NF] = SKB_EXT_CHUNKSIZEOF(struct nf_bridge_info), -#endif -}; - -static __always_inline unsigned int skb_ext_total_length(void) -{ - return SKB_EXT_CHUNKSIZEOF(struct skb_ext) + -#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER) - skb_ext_type_len[SKB_EXT_BRIDGE_NF] + -#endif - 0; -} - -static void skb_extensions_init(void) -{ - BUILD_BUG_ON(SKB_EXT_NUM >= 8); - BUILD_BUG_ON(skb_ext_total_length() > 255); - skbuff_ext_cache = kmem_cache_create("skbuff_ext_cache", - SKB_EXT_ALIGN_VALUE * skb_ext_total_length(), - 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC, - NULL); -} -#else -static void skb_extensions_init(void) {} -#endif - void __init skb_init(void) { skbuff_head_cache = kmem_cache_create("skbuff_head_cache", @@ -4032,7 +4028,6 @@ void __init skb_init(void) 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); - skb_extensions_init(); } static int @@ -5635,113 +5630,3 @@ void skb_condense(struct sk_buff *skb) */ skb->truesize = SKB_TRUESIZE(skb_end_offset(skb)); } - -#ifdef CONFIG_SKB_EXTENSIONS -static void *skb_ext_get_ptr(struct skb_ext *ext, enum skb_ext_id id) -{ - return (void *)ext + (ext->offset[id] * SKB_EXT_ALIGN_VALUE); -} - -static struct skb_ext *skb_ext_alloc(void) -{ - struct skb_ext *new = kmem_cache_alloc(skbuff_ext_cache, GFP_ATOMIC); - - if (new) { - memset(new->offset, 0, sizeof(new->offset)); - refcount_set(&new->refcnt, 1); - } - - return new; -} - -static struct skb_ext *skb_ext_maybe_cow(struct skb_ext *old) -{ - struct skb_ext *new; - - if (refcount_read(&old->refcnt) == 1) - return old; - - new = kmem_cache_alloc(skbuff_ext_cache, GFP_ATOMIC); - if (!new) - return NULL; - - memcpy(new, old, old->chunks * SKB_EXT_ALIGN_VALUE); - refcount_set(&new->refcnt, 1); - __skb_ext_put(old); - - return new; -} - -/** - * skb_ext_add - allocate space for given extension, COW if needed - * @skb: buffer - * @id: extension to allocate space for - * - * Allocates enough space for the given extension. - * If the extension is already present, a pointer to that extension - * is returned. - * - * If the skb was cloned, COW applies and the returned memory can be - * modified without changing the extension space of clones buffers. - * - * Returns pointer to the extension or NULL on allocation failure. - */ -void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id) -{ - struct skb_ext *new, *old = NULL; - unsigned int newlen, newoff; - - if (skb->active_extensions) { - old = skb->extensions; - new = skb_ext_maybe_cow(old); - if (!new) - return NULL; - if (__skb_ext_exist(old, id)) { - if (old != new) - skb->extensions = new; - goto set_active; - } - newoff = old->chunks; - } else { - newoff = SKB_EXT_CHUNKSIZEOF(*new); - new = skb_ext_alloc(); - if (!new) - return NULL; - } - - newlen = newoff + skb_ext_type_len[id]; - new->chunks = newlen; - new->offset[id] = newoff; - skb->extensions = new; - -set_active: - skb->active_extensions |= 1 << id; - return skb_ext_get_ptr(new, id); -} - -EXPORT_SYMBOL(skb_ext_add); -void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id) -{ - struct skb_ext *ext = skb->extensions; - skb->active_extensions &= ~(1 << id); - if (skb->active_extensions == 0) { - skb->extensions = NULL; - __skb_ext_put(ext); - } -} - -EXPORT_SYMBOL(__skb_ext_del); -void __skb_ext_put(struct skb_ext *ext) -{ - /* If this is last clone, nothing can increment - * it after check passes. Avoids one atomic op. - */ - if (refcount_read(&ext->refcnt) == 1) - goto free_now; - if (!refcount_dec_and_test(&ext->refcnt)) - return; -free_now: - kmem_cache_free(skbuff_ext_cache, ext); -} -EXPORT_SYMBOL(__skb_ext_put); -#endif /* CONFIG_SKB_EXTENSIONS */ diff --git a/net/core/skmsg.c b/net/core/skmsg.c index ae2b281c9c57..aad5fbe4933d 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -73,6 +73,45 @@ int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len, } EXPORT_SYMBOL_GPL(sk_msg_alloc); +int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src, + u32 off, u32 len) +{ + int i = src->sg.start; + struct scatterlist *sge = sk_msg_elem(src, i); + u32 sge_len, sge_off; + + if (sk_msg_full(dst)) + return -ENOSPC; + + while (off) { + if (sge->length > off) + break; + off -= sge->length; + sk_msg_iter_var_next(i); + if (i == src->sg.end && off) + return -ENOSPC; + sge = sk_msg_elem(src, i); + } + + while (len) { + sge_len = sge->length - off; + sge_off = sge->offset + off; + if (sge_len > len) + sge_len = len; + off = 0; + len -= sge_len; + sk_msg_page_add(dst, sg_page(sge), sge_len, sge_off); + sk_mem_charge(sk, sge_len); + sk_msg_iter_var_next(i); + if (i == src->sg.end && len) + return -ENOSPC; + sge = sk_msg_elem(src, i); + } + + return 0; +} +EXPORT_SYMBOL_GPL(sk_msg_clone); + void sk_msg_return_zero(struct sock *sk, struct sk_msg *msg, int bytes) { int i = msg->sg.start; @@ -220,18 +259,28 @@ void sk_msg_trim(struct sock *sk, struct sk_msg *msg, int len) msg->sg.data[i].length -= trim; sk_mem_uncharge(sk, trim); + /* Adjust copybreak if it falls into the trimmed part of last buf */ + if (msg->sg.curr == i && msg->sg.copybreak > msg->sg.data[i].length) + msg->sg.copybreak = msg->sg.data[i].length; out: - /* If we trim data before curr pointer update copybreak and current - * so that any future copy operations start at new copy location. + sk_msg_iter_var_next(i); + msg->sg.end = i; + + /* If we trim data a full sg elem before curr pointer update + * copybreak and current so that any future copy operations + * start at new copy location. * However trimed data that has not yet been used in a copy op * does not require an update. */ - if (msg->sg.curr >= i) { + if (!msg->sg.size) { + msg->sg.curr = msg->sg.start; + msg->sg.copybreak = 0; + } else if (sk_msg_iter_dist(msg->sg.start, msg->sg.curr) >= + sk_msg_iter_dist(msg->sg.start, msg->sg.end)) { + sk_msg_iter_var_prev(i); msg->sg.curr = i; msg->sg.copybreak = msg->sg.data[i].length; } - sk_msg_iter_var_next(i); - msg->sg.end = i; } EXPORT_SYMBOL_GPL(sk_msg_trim); @@ -360,7 +409,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb) sk_mem_charge(sk, skb->len); copied = skb->len; msg->sg.start = 0; - msg->sg.end = num_sge == MAX_MSG_FRAGS ? 0 : num_sge; + msg->sg.end = num_sge; msg->skb = skb; sk_psock_queue_msg(psock, msg); diff --git a/net/core/sock.c b/net/core/sock.c index be7e26cc67b4..30787fdf492b 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -136,6 +136,7 @@ #include #include +#include #include @@ -230,7 +231,8 @@ static struct lock_class_key af_family_kern_slock_keys[AF_MAX]; x "AF_RXRPC" , x "AF_ISDN" , x "AF_PHONET" , \ x "AF_IEEE802154", x "AF_CAIF" , x "AF_ALG" , \ x "AF_NFC" , x "AF_VSOCK" , x "AF_KCM" , \ - x "AF_QIPCRTR", x "AF_SMC" , x "AF_MAX" + x "AF_QIPCRTR", x "AF_SMC" , x "AF_XDP" , \ + x "AF_MAX" static const char *const af_family_key_strings[AF_MAX+1] = { _sock_locks("sk_lock-") @@ -266,7 +268,8 @@ static const char *const af_family_rlock_key_strings[AF_MAX+1] = { "rlock-AF_RXRPC" , "rlock-AF_ISDN" , "rlock-AF_PHONET" , "rlock-AF_IEEE802154", "rlock-AF_CAIF" , "rlock-AF_ALG" , "rlock-AF_NFC" , "rlock-AF_VSOCK" , "rlock-AF_KCM" , - "rlock-AF_QIPCRTR", "rlock-AF_SMC" , "rlock-AF_MAX" + "rlock-AF_QIPCRTR", "rlock-AF_SMC" , "rlock-AF_XDP" , + "rlock-AF_MAX" }; static const char *const af_family_wlock_key_strings[AF_MAX+1] = { "wlock-AF_UNSPEC", "wlock-AF_UNIX" , "wlock-AF_INET" , @@ -283,7 +286,8 @@ static const char *const af_family_wlock_key_strings[AF_MAX+1] = { "wlock-AF_RXRPC" , "wlock-AF_ISDN" , "wlock-AF_PHONET" , "wlock-AF_IEEE802154", "wlock-AF_CAIF" , "wlock-AF_ALG" , "wlock-AF_NFC" , "wlock-AF_VSOCK" , "wlock-AF_KCM" , - "wlock-AF_QIPCRTR", "wlock-AF_SMC" , "wlock-AF_MAX" + "wlock-AF_QIPCRTR", "wlock-AF_SMC" , "wlock-AF_XDP" , + "wlock-AF_MAX" }; static const char *const af_family_elock_key_strings[AF_MAX+1] = { "elock-AF_UNSPEC", "elock-AF_UNIX" , "elock-AF_INET" , @@ -300,7 +304,8 @@ static const char *const af_family_elock_key_strings[AF_MAX+1] = { "elock-AF_RXRPC" , "elock-AF_ISDN" , "elock-AF_PHONET" , "elock-AF_IEEE802154", "elock-AF_CAIF" , "elock-AF_ALG" , "elock-AF_NFC" , "elock-AF_VSOCK" , "elock-AF_KCM" , - "elock-AF_QIPCRTR", "elock-AF_SMC" , "elock-AF_MAX" + "elock-AF_QIPCRTR", "elock-AF_SMC" , "elock-AF_XDP" , + "elock-AF_MAX" }; /* @@ -1744,6 +1749,10 @@ static void __sk_destruct(struct rcu_head *head) sock_disable_timestamp(sk, SK_FLAGS_TIMESTAMP); +#ifdef CONFIG_BPF_SYSCALL + bpf_sk_storage_free(sk); +#endif + if (atomic_read(&sk->sk_omem_alloc)) pr_debug("%s: optmem leakage (%d bytes) detected\n", __func__, atomic_read(&sk->sk_omem_alloc)); @@ -1895,6 +1904,12 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority) } RCU_INIT_POINTER(newsk->sk_reuseport_cb, NULL); + if (bpf_sk_storage_clone(sk, newsk)) { + sk_free_unlock_clone(newsk); + newsk = NULL; + goto out; + } + newsk->sk_err = 0; newsk->sk_err_soft = 0; newsk->sk_priority = 0; @@ -2420,64 +2435,6 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag) } EXPORT_SYMBOL(sk_page_frag_refill); -int sk_alloc_sg(struct sock *sk, int len, struct scatterlist *sg, - int sg_start, int *sg_curr_index, unsigned int *sg_curr_size, - int first_coalesce) -{ - int sg_curr = *sg_curr_index, use = 0, rc = 0; - unsigned int size = *sg_curr_size; - struct page_frag *pfrag; - struct scatterlist *sge; - - len -= size; - pfrag = sk_page_frag(sk); - while (len > 0) { - unsigned int orig_offset; - - if (!sk_page_frag_refill(sk, pfrag)) { - rc = -ENOMEM; - goto out; - } - - use = min_t(int, len, pfrag->size - pfrag->offset); - if (!sk_wmem_schedule(sk, use)) { - rc = -ENOMEM; - goto out; - } - - sk_mem_charge(sk, use); - size += use; - orig_offset = pfrag->offset; - pfrag->offset += use; - - sge = sg + sg_curr - 1; - if (sg_curr > first_coalesce && sg_page(sg) == pfrag->page && - sg->offset + sg->length == orig_offset) { - sg->length += use; - } else { - sge = sg + sg_curr; - sg_unmark_end(sge); - sg_set_page(sge, pfrag->page, use, orig_offset); - get_page(pfrag->page); - sg_curr++; - if (sg_curr == MAX_SKB_FRAGS) - sg_curr = 0; - - if (sg_curr == sg_start) { - rc = -ENOSPC; - break; - } - } - - len -= use; - } -out: - *sg_curr_size = size; - *sg_curr_index = sg_curr; - return rc; -} -EXPORT_SYMBOL(sk_alloc_sg); - static void __lock_sock(struct sock *sk) __releases(&sk->sk_lock.slock) __acquires(&sk->sk_lock.slock) diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 3c0e44cb811a..d6b1073778f5 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -44,22 +44,17 @@ static struct bpf_map *sock_map_alloc(union bpf_attr *attr) /* Make sure page count doesn't overflow. */ cost = (u64) stab->map.max_entries * sizeof(struct sock *); - if (cost >= U32_MAX - PAGE_SIZE) { - err = -EINVAL; - goto free_stab; - } - - stab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT; - err = bpf_map_precharge_memlock(stab->map.pages); + err = bpf_map_charge_init(&stab->map.memory, cost); if (err) goto free_stab; - stab->sks = bpf_map_area_alloc(stab->map.max_entries * + stab->sks = bpf_map_area_alloc((u64) stab->map.max_entries * sizeof(struct sock *), stab->map.numa_node); if (stab->sks) return &stab->map; err = -ENOMEM; + bpf_map_charge_finish(&stab->map.memory); free_stab: kfree(stab); return ERR_PTR(err); @@ -76,7 +71,42 @@ int sock_map_get_from_fd(const union bpf_attr *attr, struct bpf_prog *prog) map = __bpf_map_get(f); if (IS_ERR(map)) return PTR_ERR(map); - ret = sock_map_prog_update(map, prog, attr->attach_type); + ret = sock_map_prog_update(map, prog, NULL, attr->attach_type); + fdput(f); + return ret; +} + +int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype) +{ + u32 ufd = attr->target_fd; + struct bpf_prog *prog; + struct bpf_map *map; + struct fd f; + int ret; + + if (attr->attach_flags) + return -EINVAL; + + f = fdget(ufd); + map = __bpf_map_get(f); + if (IS_ERR(map)) + return PTR_ERR(map); + + prog = bpf_prog_get(attr->attach_bpf_fd); + if (IS_ERR(prog)) { + ret = PTR_ERR(prog); + goto put_map; + } + + if (prog->type != ptype) { + ret = -EINVAL; + goto put_prog; + } + + ret = sock_map_prog_update(map, NULL, prog, attr->attach_type); +put_prog: + bpf_prog_put(prog); +put_map: fdput(f); return ret; } @@ -963,27 +993,32 @@ static struct sk_psock_progs *sock_map_progs(struct bpf_map *map) } int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog, - u32 which) + struct bpf_prog *old, u32 which) { struct sk_psock_progs *progs = sock_map_progs(map); + struct bpf_prog **pprog; if (!progs) return -EOPNOTSUPP; switch (which) { case BPF_SK_MSG_VERDICT: - psock_set_prog(&progs->msg_parser, prog); + pprog = &progs->msg_parser; break; case BPF_SK_SKB_STREAM_PARSER: - psock_set_prog(&progs->skb_parser, prog); + pprog = &progs->skb_parser; break; case BPF_SK_SKB_STREAM_VERDICT: - psock_set_prog(&progs->skb_verdict, prog); + pprog = &progs->skb_verdict; break; default: return -EOPNOTSUPP; } + if (old) + return psock_replace_prog(pprog, prog, old); + + psock_set_prog(pprog, prog); return 0; } diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 676092d7bd81..4a55e9d2dd03 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -8,11 +8,34 @@ #include #include +#include +#include #include #define INIT_SOCKS 128 -static DEFINE_SPINLOCK(reuseport_lock); +DEFINE_SPINLOCK(reuseport_lock); + +#define REUSEPORT_MIN_ID 1 +static DEFINE_IDA(reuseport_ida); + +int reuseport_get_id(struct sock_reuseport *reuse) +{ + int id; + + if (reuse->reuseport_id) + return reuse->reuseport_id; + + id = ida_simple_get(&reuseport_ida, REUSEPORT_MIN_ID, 0, + /* Called under reuseport_lock */ + GFP_ATOMIC); + if (id < 0) + return id; + + reuse->reuseport_id = id; + + return reuse->reuseport_id; +} static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks) { @@ -29,7 +52,7 @@ static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks) return reuse; } -int reuseport_alloc(struct sock *sk) +int reuseport_alloc(struct sock *sk, bool bind_inany) { struct sock_reuseport *reuse; @@ -41,9 +64,17 @@ int reuseport_alloc(struct sock *sk) /* Allocation attempts can occur concurrently via the setsockopt path * and the bind/hash path. Nothing to do when we lose the race. */ - if (rcu_dereference_protected(sk->sk_reuseport_cb, - lockdep_is_held(&reuseport_lock))) + reuse = rcu_dereference_protected(sk->sk_reuseport_cb, + lockdep_is_held(&reuseport_lock)); + if (reuse) { + /* Only set reuse->bind_inany if the bind_inany is true. + * Otherwise, it will overwrite the reuse->bind_inany + * which was set by the bind/hash path. + */ + if (bind_inany) + reuse->bind_inany = bind_inany; goto out; + } reuse = __reuseport_alloc(INIT_SOCKS); if (!reuse) { @@ -53,6 +84,7 @@ int reuseport_alloc(struct sock *sk) reuse->socks[0] = sk; reuse->num_socks = 1; + reuse->bind_inany = bind_inany; rcu_assign_pointer(sk->sk_reuseport_cb, reuse); out: @@ -78,6 +110,8 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse) more_reuse->max_socks = more_socks_size; more_reuse->num_socks = reuse->num_socks; more_reuse->prog = reuse->prog; + more_reuse->reuseport_id = reuse->reuseport_id; + more_reuse->bind_inany = reuse->bind_inany; memcpy(more_reuse->socks, reuse->socks, reuse->num_socks * sizeof(struct sock *)); @@ -99,8 +133,9 @@ static void reuseport_free_rcu(struct rcu_head *head) struct sock_reuseport *reuse; reuse = container_of(head, struct sock_reuseport, rcu); - if (reuse->prog) - bpf_prog_destroy(reuse->prog); + sk_reuseport_prog_free(rcu_dereference_protected(reuse->prog, 1)); + if (reuse->reuseport_id) + ida_simple_remove(&reuseport_ida, reuse->reuseport_id); kfree(reuse); } @@ -110,12 +145,12 @@ static void reuseport_free_rcu(struct rcu_head *head) * @sk2: Socket belonging to the existing reuseport group. * May return ENOMEM and not add socket to group under memory pressure. */ -int reuseport_add_sock(struct sock *sk, struct sock *sk2) +int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) { struct sock_reuseport *old_reuse, *reuse; if (!rcu_access_pointer(sk2->sk_reuseport_cb)) { - int err = reuseport_alloc(sk2); + int err = reuseport_alloc(sk2, bind_inany); if (err) return err; @@ -160,6 +195,14 @@ void reuseport_detach_sock(struct sock *sk) spin_lock_bh(&reuseport_lock); reuse = rcu_dereference_protected(sk->sk_reuseport_cb, lockdep_is_held(&reuseport_lock)); + + /* At least one of the sk in this reuseport group is added to + * a bpf map. Notify the bpf side. The bpf map logic will + * remove the sk if it is indeed added to a bpf map. + */ + if (reuse->reuseport_id) + bpf_sk_reuseport_detach(sk); + rcu_assign_pointer(sk->sk_reuseport_cb, NULL); for (i = 0; i < reuse->num_socks; i++) { @@ -175,9 +218,9 @@ void reuseport_detach_sock(struct sock *sk) } EXPORT_SYMBOL(reuseport_detach_sock); -static struct sock *run_bpf(struct sock_reuseport *reuse, u16 socks, - struct bpf_prog *prog, struct sk_buff *skb, - int hdr_len) +static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks, + struct bpf_prog *prog, struct sk_buff *skb, + int hdr_len) { struct sk_buff *nskb = NULL; u32 index; @@ -238,9 +281,17 @@ struct sock *reuseport_select_sock(struct sock *sk, /* paired with smp_wmb() in reuseport_add_sock() */ smp_rmb(); - if (prog && skb) - sk2 = run_bpf(reuse, socks, prog, skb, hdr_len); + if (!prog || !skb) + goto select_by_hash; + + if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) + sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash); else + sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len); + +select_by_hash: + /* no bpf or invalid bpf result: fall back to hash usage */ + if (!sk2) sk2 = reuse->socks[reciprocal_scale(hash, socks)]; } @@ -250,12 +301,21 @@ out: } EXPORT_SYMBOL(reuseport_select_sock); -struct bpf_prog * -reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog) +int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog) { struct sock_reuseport *reuse; struct bpf_prog *old_prog; + if (sk_unhashed(sk) && sk->sk_reuseport) { + int err = reuseport_alloc(sk, false); + + if (err) + return err; + } else if (!rcu_access_pointer(sk->sk_reuseport_cb)) { + /* The socket wasn't bound with SO_REUSEPORT */ + return -EINVAL; + } + spin_lock_bh(&reuseport_lock); reuse = rcu_dereference_protected(sk->sk_reuseport_cb, lockdep_is_held(&reuseport_lock)); @@ -264,6 +324,7 @@ reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog) rcu_assign_pointer(reuse->prog, prog); spin_unlock_bh(&reuseport_lock); - return old_prog; + sk_reuseport_prog_free(old_prog); + return 0; } EXPORT_SYMBOL(reuseport_attach_prog); diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c index 1b5749f2ef9c..3885c74f1110 100644 --- a/net/core/sysctl_net_core.c +++ b/net/core/sysctl_net_core.c @@ -270,9 +270,14 @@ static int proc_dointvec_minmax_bpf_enable(struct ctl_table *table, int write, tmp.data = &jit_enable; ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); if (write && !ret) { - *(int *)table->data = jit_enable; - if (jit_enable == 2) - pr_warn("bpf_jit_enable = 2 was set! NEVER use this in production, only for JIT debugging!\n"); + if (jit_enable < 2 || + (jit_enable == 2 && bpf_dump_raw_ok(current_cred()))) { + *(int *)table->data = jit_enable; + if (jit_enable == 2) + pr_warn("bpf_jit_enable = 2 was set! NEVER use this in production, only for JIT debugging!\n"); + } else { + ret = -EPERM; + } } return ret; } diff --git a/net/core/xdp.c b/net/core/xdp.c index 229bc5a0ee04..5df56e7eeeba 100644 --- a/net/core/xdp.c +++ b/net/core/xdp.c @@ -1,10 +1,17 @@ +// SPDX-License-Identifier: GPL-2.0-only /* net/core/xdp.c * * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc. - * Released under terms in GPL version 2. See COPYING. */ +#include +#include #include #include +#include +#include +#include +#include +#include #include @@ -13,6 +20,105 @@ #define REG_STATE_UNREGISTERED 0x2 #define REG_STATE_UNUSED 0x3 +static DEFINE_IDA(mem_id_pool); +static DEFINE_MUTEX(mem_id_lock); +#define MEM_ID_MAX 0xFFFE +#define MEM_ID_MIN 1 +static int mem_id_next = MEM_ID_MIN; + +static bool mem_id_init; /* false */ +static struct rhashtable *mem_id_ht; + +struct xdp_mem_allocator { + struct xdp_mem_info mem; + union { + void *allocator; + struct page_pool *page_pool; + struct zero_copy_allocator *zc_alloc; + }; + struct rhash_head node; + struct rcu_head rcu; +}; + +static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed) +{ + const u32 *k = data; + const u32 key = *k; + + BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_mem_allocator, mem.id) + != sizeof(u32)); + + /* Use cyclic increasing ID as direct hash key, see rht_bucket_index */ + return key << RHT_HASH_RESERVED_SPACE; +} + +static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg, + const void *ptr) +{ + const struct xdp_mem_allocator *xa = ptr; + u32 mem_id = *(u32 *)arg->key; + + return xa->mem.id != mem_id; +} + +static const struct rhashtable_params mem_id_rht_params = { + .nelem_hint = 64, + .head_offset = offsetof(struct xdp_mem_allocator, node), + .key_offset = offsetof(struct xdp_mem_allocator, mem.id), + .key_len = FIELD_SIZEOF(struct xdp_mem_allocator, mem.id), + .max_size = MEM_ID_MAX, + .min_size = 8, + .automatic_shrinking = true, + .hashfn = xdp_mem_id_hashfn, + .obj_cmpfn = xdp_mem_id_cmp, +}; + +static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu) +{ + struct xdp_mem_allocator *xa; + + xa = container_of(rcu, struct xdp_mem_allocator, rcu); + + /* Allow this ID to be reused */ + ida_simple_remove(&mem_id_pool, xa->mem.id); + + /* Notice, driver is expected to free the *allocator, + * e.g. page_pool, and MUST also use RCU free. + */ + + /* Poison memory */ + xa->mem.id = 0xFFFF; + xa->mem.type = 0xF0F0; + xa->allocator = (void *)0xDEAD9001; + + kfree(xa); +} + +static void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq) +{ + struct xdp_mem_allocator *xa; + int id = xdp_rxq->mem.id; + int err; + + if (id == 0) + return; + + mutex_lock(&mem_id_lock); + + xa = rhashtable_lookup(mem_id_ht, &id, mem_id_rht_params); + if (!xa) { + mutex_unlock(&mem_id_lock); + return; + } + + err = rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params); + WARN_ON(err); + + call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free); + + mutex_unlock(&mem_id_lock); +} + void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq) { /* Simplify driver cleanup code paths, allow unreg "unused" */ @@ -21,8 +127,14 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq) WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG"); + __xdp_rxq_info_unreg_mem_model(xdp_rxq); + xdp_rxq->reg_state = REG_STATE_UNREGISTERED; xdp_rxq->dev = NULL; + + /* Reset mem info to defaults */ + xdp_rxq->mem.id = 0; + xdp_rxq->mem.type = 0; } EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg); @@ -65,3 +177,247 @@ void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq) xdp_rxq->reg_state = REG_STATE_UNUSED; } EXPORT_SYMBOL_GPL(xdp_rxq_info_unused); + +bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq) +{ + return (xdp_rxq->reg_state == REG_STATE_REGISTERED); +} +EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg); + +static int __mem_id_init_hash_table(void) +{ + struct rhashtable *rht; + int ret; + + if (unlikely(mem_id_init)) + return 0; + + rht = kzalloc(sizeof(*rht), GFP_KERNEL); + if (!rht) + return -ENOMEM; + + ret = rhashtable_init(rht, &mem_id_rht_params); + if (ret < 0) { + kfree(rht); + return ret; + } + mem_id_ht = rht; + smp_mb(); /* mutex lock should provide enough pairing */ + mem_id_init = true; + + return 0; +} + +/* Allocate a cyclic ID that maps to allocator pointer. + * See: https://www.kernel.org/doc/html/latest/core-api/idr.html + * + * Caller must lock mem_id_lock. + */ +static int __mem_id_cyclic_get(gfp_t gfp) +{ + int retries = 1; + int id; + +again: + id = ida_simple_get(&mem_id_pool, mem_id_next, MEM_ID_MAX, gfp); + if (id < 0) { + if (id == -ENOSPC) { + /* Cyclic allocator, reset next id */ + if (retries--) { + mem_id_next = MEM_ID_MIN; + goto again; + } + } + return id; /* errno */ + } + mem_id_next = id + 1; + + return id; +} + +static bool __is_supported_mem_type(enum xdp_mem_type type) +{ + if (type == MEM_TYPE_PAGE_POOL) + return is_page_pool_compiled_in(); + + if (type >= MEM_TYPE_MAX) + return false; + + return true; +} + +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq, + enum xdp_mem_type type, void *allocator) +{ + struct xdp_mem_allocator *xdp_alloc; + gfp_t gfp = GFP_KERNEL; + int id, errno, ret; + void *ptr; + + if (xdp_rxq->reg_state != REG_STATE_REGISTERED) { + WARN(1, "Missing register, driver bug"); + return -EFAULT; + } + + if (!__is_supported_mem_type(type)) + return -EOPNOTSUPP; + + xdp_rxq->mem.type = type; + + if (!allocator) { + if (type == MEM_TYPE_PAGE_POOL || type == MEM_TYPE_ZERO_COPY) + return -EINVAL; /* Setup time check page_pool req */ + return 0; + } + + /* Delay init of rhashtable to save memory if feature isn't used */ + if (!mem_id_init) { + mutex_lock(&mem_id_lock); + ret = __mem_id_init_hash_table(); + mutex_unlock(&mem_id_lock); + if (ret < 0) { + WARN_ON(1); + return ret; + } + } + + xdp_alloc = kzalloc(sizeof(*xdp_alloc), gfp); + if (!xdp_alloc) + return -ENOMEM; + + mutex_lock(&mem_id_lock); + id = __mem_id_cyclic_get(gfp); + if (id < 0) { + errno = id; + goto err; + } + xdp_rxq->mem.id = id; + xdp_alloc->mem = xdp_rxq->mem; + xdp_alloc->allocator = allocator; + + /* Insert allocator into ID lookup table */ + ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node); + if (IS_ERR(ptr)) { + errno = PTR_ERR(ptr); + goto err; + } + + mutex_unlock(&mem_id_lock); + + return 0; +err: + mutex_unlock(&mem_id_lock); + kfree(xdp_alloc); + return errno; +} +EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model); + +/* XDP RX runs under NAPI protection, and in different delivery error + * scenarios (e.g. queue full), it is possible to return the xdp_frame + * while still leveraging this protection. The @napi_direct boolian + * is used for those calls sites. Thus, allowing for faster recycling + * of xdp_frames/pages in those cases. + */ +static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct, + unsigned long handle) +{ + struct xdp_mem_allocator *xa; + struct page *page; + + switch (mem->type) { + case MEM_TYPE_PAGE_POOL: + rcu_read_lock(); + /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */ + xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params); + page = virt_to_head_page(data); + if (xa) { + napi_direct &= !xdp_return_frame_no_direct(); + page_pool_put_page(xa->page_pool, page, napi_direct); + } else { + put_page(page); + } + rcu_read_unlock(); + break; + case MEM_TYPE_PAGE_SHARED: + page_frag_free(data); + break; + case MEM_TYPE_PAGE_ORDER0: + page = virt_to_page(data); /* Assumes order0 page*/ + put_page(page); + break; + case MEM_TYPE_ZERO_COPY: + /* NB! Only valid from an xdp_buff! */ + rcu_read_lock(); + /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */ + xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params); + xa->zc_alloc->free(xa->zc_alloc, handle); + rcu_read_unlock(); + default: + /* Not possible, checked in xdp_rxq_info_reg_mem_model() */ + break; + } +} + +void xdp_return_frame(struct xdp_frame *xdpf) +{ + __xdp_return(xdpf->data, &xdpf->mem, false, 0); +} +EXPORT_SYMBOL_GPL(xdp_return_frame); + +void xdp_return_frame_rx_napi(struct xdp_frame *xdpf) +{ + __xdp_return(xdpf->data, &xdpf->mem, true, 0); +} +EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi); + +void xdp_return_buff(struct xdp_buff *xdp) +{ + __xdp_return(xdp->data, &xdp->rxq->mem, true, xdp->handle); +} +EXPORT_SYMBOL_GPL(xdp_return_buff); + +/* Only called for MEM_TYPE_PAGE_POOL see xdp.h */ +void __xdp_release_frame(void *data, struct xdp_mem_info *mem) +{ + struct xdp_mem_allocator *xa; + struct page *page; + + rcu_read_lock(); + xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params); + page = virt_to_head_page(data); + if (xa) + page_pool_release_page(xa->page_pool, page); + rcu_read_unlock(); +} +EXPORT_SYMBOL_GPL(__xdp_release_frame); + +int xdp_attachment_query(struct xdp_attachment_info *info, + struct netdev_bpf *bpf) +{ + bpf->prog_id = info->prog ? info->prog->aux->id : 0; + bpf->prog_flags = info->prog ? info->flags : 0; + return 0; +} +EXPORT_SYMBOL_GPL(xdp_attachment_query); + +bool xdp_attachment_flags_ok(struct xdp_attachment_info *info, + struct netdev_bpf *bpf) +{ + if (info->prog && (bpf->flags ^ info->flags) & XDP_FLAGS_MODES) { + NL_SET_ERR_MSG(bpf->extack, + "program loaded with different flags"); + return false; + } + return true; +} +EXPORT_SYMBOL_GPL(xdp_attachment_flags_ok); + +void xdp_attachment_setup(struct xdp_attachment_info *info, + struct netdev_bpf *bpf) +{ + if (info->prog) + bpf_prog_put(info->prog); + info->prog = bpf->prog; + info->flags = bpf->flags; +} +EXPORT_SYMBOL_GPL(xdp_attachment_setup); diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c index 9d9f6f576217..4eac2d7a1167 100644 --- a/net/ethernet/eth.c +++ b/net/ethernet/eth.c @@ -128,15 +128,16 @@ u32 eth_get_headlen(void *data, unsigned int len) { const unsigned int flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG; const struct ethhdr *eth = (const struct ethhdr *)data; - struct flow_keys keys; + struct flow_keys_basic keys; /* this should never happen, but better safe than sorry */ if (unlikely(len < sizeof(*eth))) return len; /* parse any remaining L2/L3 headers, check for L4 */ - if (!skb_flow_dissect_flow_keys_buf(&keys, data, eth->h_proto, - sizeof(*eth), len, flags)) + if (!skb_flow_dissect_flow_keys_basic(NULL, NULL, &keys, data, + eth->h_proto, sizeof(*eth), + len, flags)) return max_t(u32, keys.control.thoff, sizeof(*eth)); /* parse for any L4 headers */ diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index 532dda77cdb4..9b6a0e031e22 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -13,7 +13,10 @@ obj-y := route.o inetpeer.o protocol.o \ tcp_offload.o datagram.o raw.o udp.o udplite.o \ udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \ fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \ - inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o + inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \ + metrics.o + +obj-$(CONFIG_BPFILTER) += bpfilter/ obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index ca240a689866..05edbe2d725b 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -196,7 +196,7 @@ int inet_listen(struct socket *sock, int backlog) { struct sock *sk = sock->sk; unsigned char old_state; - int err; + int err, tcp_fastopen; lock_sock(sk); @@ -218,16 +218,18 @@ int inet_listen(struct socket *sock, int backlog) * because the socket was in TCP_LISTEN state previously but * was shutdown() rather than close(). */ - if ((sysctl_tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) && - (sysctl_tcp_fastopen & TFO_SERVER_ENABLE) && + tcp_fastopen = sock_net(sk)->ipv4.sysctl_tcp_fastopen; + if ((tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) && + (tcp_fastopen & TFO_SERVER_ENABLE) && !inet_csk(sk)->icsk_accept_queue.fastopenq.max_qlen) { fastopen_queue_tune(sk, backlog); - tcp_fastopen_init_key_once(true); + tcp_fastopen_init_key_once(sock_net(sk)); } err = inet_csk_listen_start(sk, backlog); if (err) goto out; + tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_LISTEN_CB, 0, NULL); } sk->sk_max_ack_backlog = backlog; err = 0; @@ -551,8 +553,8 @@ int inet_dgram_connect(struct socket *sock, struct sockaddr *uaddr, int addr_len, int flags) { struct sock *sk = sock->sk; - int err; const struct proto *prot; + int err; if (addr_len < sizeof(uaddr->sa_family)) return -EINVAL; diff --git a/net/ipv4/bpfilter/Makefile b/net/ipv4/bpfilter/Makefile new file mode 100644 index 000000000000..ce262d76cc48 --- /dev/null +++ b/net/ipv4/bpfilter/Makefile @@ -0,0 +1,2 @@ +obj-$(CONFIG_BPFILTER) += sockopt.o + diff --git a/net/ipv4/bpfilter/sockopt.c b/net/ipv4/bpfilter/sockopt.c new file mode 100644 index 000000000000..e8ba37028d07 --- /dev/null +++ b/net/ipv4/bpfilter/sockopt.c @@ -0,0 +1,80 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct bpfilter_umh_ops bpfilter_ops; +EXPORT_SYMBOL_GPL(bpfilter_ops); + +static void bpfilter_umh_cleanup(struct umh_info *info) +{ + mutex_lock(&bpfilter_ops.lock); + bpfilter_ops.stop = true; + fput(info->pipe_to_umh); + fput(info->pipe_from_umh); + info->pid = 0; + mutex_unlock(&bpfilter_ops.lock); +} + +int bpfilter_mbox_request(struct sock *sk, int optname, char __user *optval, + unsigned int optlen, bool is_set) +{ + int err; + mutex_lock(&bpfilter_ops.lock); + if (!bpfilter_ops.sockopt) { + mutex_unlock(&bpfilter_ops.lock); + err = request_module("bpfilter"); + mutex_lock(&bpfilter_ops.lock); + + if (err) + goto out; + if (!bpfilter_ops.sockopt) { + err = -ECHILD; + goto out; + } + } + if (bpfilter_ops.stop) { + err = bpfilter_ops.start(); + if (err) + goto out; + } + err = bpfilter_ops.sockopt(sk, optname, optval, optlen, is_set); +out: + mutex_unlock(&bpfilter_ops.lock); + return err; +} + +int bpfilter_ip_set_sockopt(struct sock *sk, int optname, char __user *optval, + unsigned int optlen) +{ + return bpfilter_mbox_request(sk, optname, optval, optlen, true); +} + +int bpfilter_ip_get_sockopt(struct sock *sk, int optname, char __user *optval, + int __user *optlen) +{ + int len; + + if (get_user(len, optlen)) + return -EFAULT; + + return bpfilter_mbox_request(sk, optname, optval, len, false); +} + +static int __init bpfilter_sockopt_init(void) +{ + mutex_init(&bpfilter_ops.lock); + bpfilter_ops.stop = true; + bpfilter_ops.info.cmdline = "bpfilter_umh"; + bpfilter_ops.info.cleanup = &bpfilter_umh_cleanup; + + return 0; +} + +module_init(bpfilter_sockopt_init); diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index e883b920f2a1..79fa2d7852ef 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -121,32 +121,14 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp) static void esp_output_done(struct crypto_async_request *base, int err) { struct sk_buff *skb = base->data; - struct xfrm_offload *xo = xfrm_offload(skb); void *tmp; - struct xfrm_state *x; - - if (xo && (xo->flags & XFRM_DEV_RESUME)) - x = skb->sp->xvec[skb->sp->len - 1]; - else - x = skb_dst(skb)->xfrm; + struct dst_entry *dst = skb_dst(skb); + struct xfrm_state *x = dst->xfrm; tmp = ESP_SKB_CB(skb)->tmp; esp_ssg_unref(x, tmp); kfree(tmp); - - if (xo && (xo->flags & XFRM_DEV_RESUME)) { - if (err) { - XFRM_INC_STATS(xs_net(x), LINUX_MIB_XFRMOUTSTATEPROTOERROR); - kfree_skb(skb); - return; - } - - skb_push(skb, skb->data - skb_mac_header(skb)); - secpath_reset(skb); - xfrm_dev_resume(skb); - } else { - xfrm_output_resume(skb, err); - } + xfrm_output_resume(skb, err); } /* Move ESP header back into place. */ diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c index 815b90c9caf8..5be59ccb61aa 100644 --- a/net/ipv4/esp4_offload.c +++ b/net/ipv4/esp4_offload.c @@ -109,38 +109,78 @@ static void esp4_gso_encap(struct xfrm_state *x, struct sk_buff *skb) static struct sk_buff *esp4_gso_segment(struct sk_buff *skb, netdev_features_t features) { + __u32 seq; + int err = 0; + struct sk_buff *skb2; struct xfrm_state *x; struct ip_esp_hdr *esph; struct crypto_aead *aead; + struct sk_buff *segs = ERR_PTR(-EINVAL); netdev_features_t esp_features = features; struct xfrm_offload *xo = xfrm_offload(skb); if (!xo) - return ERR_PTR(-EINVAL); + goto out; if (!(skb_shinfo(skb)->gso_type & SKB_GSO_ESP)) - return ERR_PTR(-EINVAL); + goto out; + + seq = xo->seq.low; x = skb->sp->xvec[skb->sp->len - 1]; aead = x->data; esph = ip_esp_hdr(skb); if (esph->spi != x->id.spi) - return ERR_PTR(-EINVAL); + goto out; if (!pskb_may_pull(skb, sizeof(*esph) + crypto_aead_ivsize(aead))) - return ERR_PTR(-EINVAL); + goto out; __skb_pull(skb, sizeof(*esph) + crypto_aead_ivsize(aead)); skb->encap_hdr_csum = 1; - if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle || - (x->xso.dev != skb->dev)) + if (!(features & NETIF_F_HW_ESP)) esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK); - xo->flags |= XFRM_GSO_SEGMENT; - return x->outer_mode->gso_segment(x, skb, esp_features); + segs = x->outer_mode->gso_segment(x, skb, esp_features); + if (IS_ERR_OR_NULL(segs)) + goto out; + + __skb_pull(skb, skb->data - skb_mac_header(skb)); + + skb2 = segs; + do { + struct sk_buff *nskb = skb2->next; + + xo = xfrm_offload(skb2); + xo->flags |= XFRM_GSO_SEGMENT; + xo->seq.low = seq; + xo->seq.hi = xfrm_replay_seqhi(x, seq); + + if(!(features & NETIF_F_HW_ESP)) + xo->flags |= CRYPTO_FALLBACK; + + x->outer_mode->xmit(x, skb2); + + err = x->type_offload->xmit(x, skb2, esp_features); + if (err) { + kfree_skb_list(segs); + return ERR_PTR(err); + } + + if (!skb_is_gso(skb2)) + seq++; + else + seq += skb_shinfo(skb2)->gso_segs; + + skb_push(skb2, skb2->mac_len); + skb2 = nskb; + } while (skb2); + +out: + return segs; } static int esp_input_tail(struct xfrm_state *x, struct sk_buff *skb) @@ -167,7 +207,6 @@ static int esp_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features_ struct crypto_aead *aead; struct esp_info esp; bool hw_offload = true; - __u32 seq; esp.inplace = true; @@ -206,28 +245,23 @@ static int esp_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features_ return esp.nfrags; } - seq = xo->seq.low; - esph = esp.esph; esph->spi = x->id.spi; skb_push(skb, -skb_network_offset(skb)); if (xo->flags & XFRM_GSO_SEGMENT) { - esph->seq_no = htonl(seq); - if (!skb_is_gso(skb)) - xo->seq.low++; - else - xo->seq.low += skb_shinfo(skb)->gso_segs; + esph->seq_no = htonl(xo->seq.low); + } else { + ip_hdr(skb)->tot_len = htons(skb->len); + ip_send_check(ip_hdr(skb)); } - esp.seqno = cpu_to_be64(seq + ((u64)xo->seq.hi << 32)); - ip_hdr(skb)->tot_len = htons(skb->len); - ip_send_check(ip_hdr(skb)); - if (hw_offload) return 0; + esp.seqno = cpu_to_be64(xo->seq.low + ((u64)xo->seq.hi << 32)); + err = esp_output_tail(x, skb, &esp); if (err) return err; diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index bc233fdfae0f..22dca7273392 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -723,7 +723,7 @@ bool fib_metrics_match(struct fib_config *cfg, struct fib_info *fi) bool ecn_ca = false; nla_strlcpy(tmp, nla, sizeof(tmp)); - val = tcp_ca_get_key_by_name(tmp, &ecn_ca); + val = tcp_ca_get_key_by_name(fi->fib_net, tmp, &ecn_ca); } else { if (nla_len(nla) != sizeof(u32)) return false; @@ -1029,49 +1029,8 @@ static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc) static int fib_convert_metrics(struct fib_info *fi, const struct fib_config *cfg) { - bool ecn_ca = false; - struct nlattr *nla; - int remaining; - - if (!cfg->fc_mx) - return 0; - - nla_for_each_attr(nla, cfg->fc_mx, cfg->fc_mx_len, remaining) { - int type = nla_type(nla); - u32 val; - - if (!type) - continue; - if (type > RTAX_MAX) - return -EINVAL; - - if (type == RTAX_CC_ALGO) { - char tmp[TCP_CA_NAME_MAX]; - - nla_strlcpy(tmp, nla, sizeof(tmp)); - val = tcp_ca_get_key_by_name(tmp, &ecn_ca); - if (val == TCP_CA_UNSPEC) - return -EINVAL; - } else { - if (nla_len(nla) != sizeof(u32)) - return -EINVAL; - val = nla_get_u32(nla); - } - if (type == RTAX_ADVMSS && val > 65535 - 40) - val = 65535 - 40; - if (type == RTAX_MTU && val > 65535 - 15) - val = 65535 - 15; - if (type == RTAX_HOPLIMIT && val > 255) - val = 255; - if (type == RTAX_FEATURES && (val & ~RTAX_FEATURE_MASK)) - return -EINVAL; - fi->fib_metrics->metrics[type - 1] = val; - } - - if (ecn_ca) - fi->fib_metrics->metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA; - - return 0; + return ip_metrics_convert(fi->fib_net, cfg->fc_mx, cfg->fc_mx_len, + fi->fib_metrics->metrics); } struct fib_info *fib_create_info(struct fib_config *cfg, diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 0c8fcc050ad2..b9277a530eab 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -87,32 +87,32 @@ static int call_fib_entry_notifier(struct notifier_block *nb, struct net *net, enum fib_event_type event_type, u32 dst, - int dst_len, struct fib_info *fi, - u8 tos, u8 type, u32 tb_id) + int dst_len, struct fib_alias *fa) { struct fib_entry_notifier_info info = { .dst = dst, .dst_len = dst_len, - .fi = fi, - .tos = tos, - .type = type, - .tb_id = tb_id, + .fi = fa->fa_info, + .tos = fa->fa_tos, + .type = fa->fa_type, + .tb_id = fa->tb_id, }; return call_fib4_notifier(nb, net, event_type, &info.info); } static int call_fib_entry_notifiers(struct net *net, enum fib_event_type event_type, u32 dst, - int dst_len, struct fib_info *fi, - u8 tos, u8 type, u32 tb_id) + int dst_len, struct fib_alias *fa, + struct netlink_ext_ack *extack) { struct fib_entry_notifier_info info = { + .info.extack = extack, .dst = dst, .dst_len = dst_len, - .fi = fi, - .tos = tos, - .type = type, - .tb_id = tb_id, + .fi = fa->fa_info, + .tos = fa->fa_tos, + .type = fa->fa_type, + .tb_id = fa->tb_id, }; return call_fib4_notifiers(net, event_type, &info.info); } @@ -1216,9 +1216,7 @@ int fib_table_insert(struct net *net, struct fib_table *tb, new_fa->fa_default = -1; call_fib_entry_notifiers(net, FIB_EVENT_ENTRY_REPLACE, - key, plen, fi, - new_fa->fa_tos, cfg->fc_type, - tb->tb_id); + key, plen, new_fa, extack); rtmsg_fib(RTM_NEWROUTE, htonl(key), new_fa, plen, tb->tb_id, &cfg->fc_nlinfo, nlflags); @@ -1273,8 +1271,7 @@ int fib_table_insert(struct net *net, struct fib_table *tb, tb->tb_num_default++; rt_cache_flush(cfg->fc_nlinfo.nl_net); - call_fib_entry_notifiers(net, event, key, plen, fi, tos, cfg->fc_type, - tb->tb_id); + call_fib_entry_notifiers(net, event, key, plen, new_fa, extack); rtmsg_fib(RTM_NEWROUTE, htonl(key), new_fa, plen, new_fa->tb_id, &cfg->fc_nlinfo, nlflags); succeeded: @@ -1574,8 +1571,7 @@ int fib_table_delete(struct net *net, struct fib_table *tb, return -ESRCH; call_fib_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, key, plen, - fa_to_delete->fa_info, tos, - fa_to_delete->fa_type, tb->tb_id); + fa_to_delete, extack); rtmsg_fib(RTM_DELROUTE, htonl(key), fa_to_delete, plen, tb->tb_id, &cfg->fc_nlinfo, 0); @@ -1901,9 +1897,8 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) call_fib_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, n->key, - KEYLENGTH - fa->fa_slen, - fi, fa->fa_tos, fa->fa_type, - tb->tb_id); + KEYLENGTH - fa->fa_slen, fa, + NULL); hlist_del_rcu(&fa->fa_list); fib_release_info(fa->fa_info); alias_free_mem_rcu(fa); @@ -1941,8 +1936,7 @@ static void fib_leaf_notify(struct net *net, struct key_vector *l, continue; call_fib_entry_notifier(nb, net, FIB_EVENT_ENTRY_ADD, l->key, - KEYLENGTH - fa->fa_slen, fi, fa->fa_tos, - fa->fa_type, fa->tb_id); + KEYLENGTH - fa->fa_slen, fa); } } diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 6a8e8ceb46e2..11f6f4e7ce40 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -39,11 +39,11 @@ EXPORT_SYMBOL(inet_csk_timer_bug_msg); * IPV6_ADDR_ANY only equals to IPV6_ADDR_ANY, * and 0.0.0.0 equals to 0.0.0.0 only */ -static int ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6, - const struct in6_addr *sk2_rcv_saddr6, - __be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr, - bool sk1_ipv6only, bool sk2_ipv6only, - bool match_wildcard) +static bool ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6, + const struct in6_addr *sk2_rcv_saddr6, + __be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr, + bool sk1_ipv6only, bool sk2_ipv6only, + bool match_wildcard) { int addr_type = ipv6_addr_type(sk1_rcv_saddr6); int addr_type2 = sk2_rcv_saddr6 ? ipv6_addr_type(sk2_rcv_saddr6) : IPV6_ADDR_MAPPED; @@ -52,29 +52,29 @@ static int ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6, if (addr_type == IPV6_ADDR_MAPPED && addr_type2 == IPV6_ADDR_MAPPED) { if (!sk2_ipv6only) { if (sk1_rcv_saddr == sk2_rcv_saddr) - return 1; + return true; if (!sk1_rcv_saddr || !sk2_rcv_saddr) return match_wildcard; } - return 0; + return false; } if (addr_type == IPV6_ADDR_ANY && addr_type2 == IPV6_ADDR_ANY) - return 1; + return true; if (addr_type2 == IPV6_ADDR_ANY && match_wildcard && !(sk2_ipv6only && addr_type == IPV6_ADDR_MAPPED)) - return 1; + return true; if (addr_type == IPV6_ADDR_ANY && match_wildcard && !(sk1_ipv6only && addr_type2 == IPV6_ADDR_MAPPED)) - return 1; + return true; if (sk2_rcv_saddr6 && ipv6_addr_equal(sk1_rcv_saddr6, sk2_rcv_saddr6)) - return 1; + return true; - return 0; + return false; } #endif @@ -82,20 +82,20 @@ static int ipv6_rcv_saddr_equal(const struct in6_addr *sk1_rcv_saddr6, * match_wildcard == false: addresses must be exactly the same, i.e. * 0.0.0.0 only equals to 0.0.0.0 */ -static int ipv4_rcv_saddr_equal(__be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr, - bool sk2_ipv6only, bool match_wildcard) +static bool ipv4_rcv_saddr_equal(__be32 sk1_rcv_saddr, __be32 sk2_rcv_saddr, + bool sk2_ipv6only, bool match_wildcard) { if (!sk2_ipv6only) { if (sk1_rcv_saddr == sk2_rcv_saddr) - return 1; + return true; if (!sk1_rcv_saddr || !sk2_rcv_saddr) return match_wildcard; } - return 0; + return false; } -int inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2, - bool match_wildcard) +bool inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2, + bool match_wildcard) { #if IS_ENABLED(CONFIG_IPV6) if (sk->sk_family == AF_INET6) @@ -112,6 +112,15 @@ int inet_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2, } EXPORT_SYMBOL(inet_rcv_saddr_equal); +bool inet_rcv_saddr_any(const struct sock *sk) +{ +#if IS_ENABLED(CONFIG_IPV6) + if (sk->sk_family == AF_INET6) + return ipv6_addr_any(&sk->sk_v6_rcv_saddr); +#endif + return !sk->sk_rcv_saddr; +} + void inet_get_local_port_range(struct net *net, int *low, int *high) { unsigned int seq; diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index c5092e2b5933..7ef23d5fe1ff 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -172,6 +173,60 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child) } EXPORT_SYMBOL_GPL(__inet_inherit_port); +static struct inet_listen_hashbucket * +inet_lhash2_bucket_sk(struct inet_hashinfo *h, struct sock *sk) +{ + u32 hash; + +#if IS_ENABLED(CONFIG_IPV6) + if (sk->sk_family == AF_INET6) + hash = ipv6_portaddr_hash(sock_net(sk), + &sk->sk_v6_rcv_saddr, + inet_sk(sk)->inet_num); + else +#endif + hash = ipv4_portaddr_hash(sock_net(sk), + inet_sk(sk)->inet_rcv_saddr, + inet_sk(sk)->inet_num); + return inet_lhash2_bucket(h, hash); +} + +static void inet_hash2(struct inet_hashinfo *h, struct sock *sk) +{ + struct inet_listen_hashbucket *ilb2; + + if (!h->lhash2) + return; + + ilb2 = inet_lhash2_bucket_sk(h, sk); + + spin_lock(&ilb2->lock); + if (sk->sk_reuseport && sk->sk_family == AF_INET6) + hlist_add_tail_rcu(&inet_csk(sk)->icsk_listen_portaddr_node, + &ilb2->head); + else + hlist_add_head_rcu(&inet_csk(sk)->icsk_listen_portaddr_node, + &ilb2->head); + ilb2->count++; + spin_unlock(&ilb2->lock); +} + +static void inet_unhash2(struct inet_hashinfo *h, struct sock *sk) +{ + struct inet_listen_hashbucket *ilb2; + + if (!h->lhash2 || + WARN_ON_ONCE(hlist_unhashed(&inet_csk(sk)->icsk_listen_portaddr_node))) + return; + + ilb2 = inet_lhash2_bucket_sk(h, sk); + + spin_lock(&ilb2->lock); + hlist_del_init_rcu(&inet_csk(sk)->icsk_listen_portaddr_node); + ilb2->count--; + spin_unlock(&ilb2->lock); +} + static inline int compute_score(struct sock *sk, struct net *net, const unsigned short hnum, const __be32 daddr, const int dif, const int sdif, bool exact_dif) @@ -211,6 +266,40 @@ static inline int compute_score(struct sock *sk, struct net *net, */ /* called with rcu_read_lock() : No refcount taken on the socket */ +static struct sock *inet_lhash2_lookup(struct net *net, + struct inet_listen_hashbucket *ilb2, + struct sk_buff *skb, int doff, + const __be32 saddr, __be16 sport, + const __be32 daddr, const unsigned short hnum, + const int dif, const int sdif) +{ + bool exact_dif = inet_exact_dif_match(net, skb); + struct inet_connection_sock *icsk; + struct sock *sk, *result = NULL; + int score, hiscore = 0; + u32 phash = 0; + + inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) { + sk = (struct sock *)icsk; + score = compute_score(sk, net, hnum, daddr, + dif, sdif, exact_dif); + if (score > hiscore) { + if (sk->sk_reuseport) { + phash = inet_ehashfn(net, daddr, hnum, + saddr, sport); + result = reuseport_select_sock(sk, phash, + skb, doff); + if (result) + return result; + } + result = sk; + hiscore = score; + } + } + + return result; +} + struct sock *__inet_lookup_listener(struct net *net, struct inet_hashinfo *hashinfo, struct sk_buff *skb, int doff, @@ -220,35 +309,64 @@ struct sock *__inet_lookup_listener(struct net *net, { unsigned int hash = inet_lhashfn(net, hnum); struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash]; - int score, hiscore = 0, matches = 0, reuseport = 0; bool exact_dif = inet_exact_dif_match(net, skb); + struct inet_listen_hashbucket *ilb2; struct sock *sk, *result = NULL; struct hlist_nulls_node *node; + int score, hiscore = 0; + unsigned int hash2; u32 phash = 0; + if (ilb->count <= 10 || !hashinfo->lhash2) + goto port_lookup; + + /* Too many sk in the ilb bucket (which is hashed by port alone). + * Try lhash2 (which is hashed by port and addr) instead. + */ + + hash2 = ipv4_portaddr_hash(net, daddr, hnum); + ilb2 = inet_lhash2_bucket(hashinfo, hash2); + if (ilb2->count > ilb->count) + goto port_lookup; + + result = inet_lhash2_lookup(net, ilb2, skb, doff, + saddr, sport, daddr, hnum, + dif, sdif); + if (result) + goto done; + + /* Lookup lhash2 with INADDR_ANY */ + + hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum); + ilb2 = inet_lhash2_bucket(hashinfo, hash2); + if (ilb2->count > ilb->count) + goto port_lookup; + + result = inet_lhash2_lookup(net, ilb2, skb, doff, + saddr, sport, daddr, hnum, + dif, sdif); + goto done; + +port_lookup: sk_nulls_for_each_rcu(sk, node, &ilb->nulls_head) { score = compute_score(sk, net, hnum, daddr, dif, sdif, exact_dif); if (score > hiscore) { - reuseport = sk->sk_reuseport; - if (reuseport) { + if (sk->sk_reuseport) { phash = inet_ehashfn(net, daddr, hnum, saddr, sport); result = reuseport_select_sock(sk, phash, skb, doff); if (result) - return result; - matches = 1; + goto done; } result = sk; hiscore = score; - } else if (score == hiscore && reuseport) { - matches++; - if (reciprocal_scale(phash, matches) == 0) - result = sk; - phash = next_pseudo_random32(phash); } } +done: + if (unlikely(IS_ERR(result))) + return NULL; return result; } EXPORT_SYMBOL_GPL(__inet_lookup_listener); @@ -508,10 +626,11 @@ static int inet_reuseport_add_sock(struct sock *sk, inet_csk(sk2)->icsk_bind_hash == tb && sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) && inet_rcv_saddr_equal(sk, sk2, false)) - return reuseport_add_sock(sk, sk2); + return reuseport_add_sock(sk, sk2, + inet_rcv_saddr_any(sk)); } - return reuseport_alloc(sk); + return reuseport_alloc(sk, inet_rcv_saddr_any(sk)); } int __inet_hash(struct sock *sk, struct sock *osk) @@ -538,6 +657,8 @@ int __inet_hash(struct sock *sk, struct sock *osk) __sk_nulls_add_node_tail_rcu(sk, &ilb->nulls_head); else __sk_nulls_add_node_rcu(sk, &ilb->nulls_head); + inet_hash2(hashinfo, sk); + ilb->count++; sock_set_flag(sk, SOCK_RCU_FREE); sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1); unlock: @@ -564,25 +685,33 @@ EXPORT_SYMBOL_GPL(inet_hash); void inet_unhash(struct sock *sk) { struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo; + struct inet_listen_hashbucket *ilb; spinlock_t *lock; bool listener = false; - int done; if (sk_unhashed(sk)) return; if (sk->sk_state == TCP_LISTEN) { - lock = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)].lock; + ilb = &hashinfo->listening_hash[inet_sk_listen_hashfn(sk)]; + lock = &ilb->lock; listener = true; } else { lock = inet_ehash_lockp(hashinfo, sk->sk_hash); } spin_lock_bh(lock); + if (sk_unhashed(sk)) + goto unlock; + if (rcu_access_pointer(sk->sk_reuseport_cb)) reuseport_detach_sock(sk); - done = __sk_nulls_del_node_init_rcu(sk); - if (done) - sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); + if (listener) { + inet_unhash2(hashinfo, sk); + ilb->count--; + } + __sk_nulls_del_node_init_rcu(sk); + sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); +unlock: spin_unlock_bh(lock); } EXPORT_SYMBOL_GPL(inet_unhash); @@ -733,8 +862,11 @@ void inet_hashinfo_init(struct inet_hashinfo *h) spin_lock_init(&h->listening_hash[i].lock); INIT_HLIST_NULLS_HEAD(&h->listening_hash[i].nulls_head, i + LISTENING_NULLS_BASE); + h->listening_hash[i].count = 0; } + h->lhash2 = NULL; + if (h != &tcp_hashinfo) return; @@ -746,6 +878,30 @@ void inet_hashinfo_init(struct inet_hashinfo *h) } EXPORT_SYMBOL_GPL(inet_hashinfo_init); +void __init inet_hashinfo2_init(struct inet_hashinfo *h, const char *name, + unsigned long numentries, int scale, + unsigned long low_limit, + unsigned long high_limit) +{ + unsigned int i; + + h->lhash2 = alloc_large_system_hash(name, + sizeof(*h->lhash2), + numentries, + scale, + 0, + NULL, + &h->lhash2_mask, + low_limit, + high_limit); + + for (i = 0; i <= h->lhash2_mask; i++) { + spin_lock_init(&h->lhash2[i].lock); + INIT_HLIST_HEAD(&h->lhash2[i].head); + h->lhash2[i].count = 0; + } +} + int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo) { unsigned int locksz = sizeof(spinlock_t); diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index eacffdb0bae0..5ffd206a1a22 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -114,7 +114,8 @@ MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN"); static struct rtnl_link_ops ipgre_link_ops __read_mostly; static int ipgre_tunnel_init(struct net_device *dev); static void erspan_build_header(struct sk_buff *skb, - __be32 id, u32 index, bool truncate); + __be32 id, u32 index, + bool truncate, bool is_ipv4); static unsigned int ipgre_net_id __read_mostly; static unsigned int gre_tap_net_id __read_mostly; @@ -515,6 +516,7 @@ err_free_skb: static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev, __be16 proto) { + struct ip_tunnel *tunnel = netdev_priv(dev); struct ip_tunnel_info *tun_info; const struct ip_tunnel_key *key; struct rtable *rt = NULL; @@ -538,9 +540,11 @@ static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev, if (gre_handle_offloads(skb, !!(tun_info->key.tun_flags & TUNNEL_CSUM))) goto err_free_rt; - flags = tun_info->key.tun_flags & (TUNNEL_CSUM | TUNNEL_KEY); + flags = tun_info->key.tun_flags & + (TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ); gre_build_header(skb, tunnel_hlen, flags, proto, - tunnel_id_to_key32(tun_info->key.tun_id), 0); + tunnel_id_to_key32(tun_info->key.tun_id), + (flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0); df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0; @@ -574,6 +578,8 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev, goto err_free_skb; key = &tun_info->key; + if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT)) + goto err_free_rt; /* ERSPAN has fixed 8 byte GRE header */ tunnel_hlen = 8 + sizeof(struct erspanhdr); @@ -598,7 +604,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev, goto err_free_rt; erspan_build_header(skb, tunnel_id_to_key32(key->tun_id), - ntohl(md->index), truncate); + ntohl(md->index), truncate, true); gre_build_header(skb, 8, TUNNEL_SEQ, htons(ETH_P_ERSPAN), 0, htonl(tunnel->o_seqno++)); @@ -680,52 +686,6 @@ free_skb: return NETDEV_TX_OK; } -static inline u8 tos_to_cos(u8 tos) -{ - u8 dscp, cos; - - dscp = tos >> 2; - cos = dscp >> 3; - return cos; -} - -static void erspan_build_header(struct sk_buff *skb, - __be32 id, u32 index, bool truncate) -{ - struct iphdr *iphdr = ip_hdr(skb); - struct ethhdr *eth = (struct ethhdr *)skb->data; - enum erspan_encap_type enc_type; - struct erspanhdr *ershdr; - struct qtag_prefix { - __be16 eth_type; - __be16 tci; - } *qp; - u16 vlan_tci = 0; - - enc_type = ERSPAN_ENCAP_NOVLAN; - - /* If mirrored packet has vlan tag, extract tci and - * perserve vlan header in the mirrored frame. - */ - if (eth->h_proto == htons(ETH_P_8021Q)) { - qp = (struct qtag_prefix *)(skb->data + 2 * ETH_ALEN); - vlan_tci = ntohs(qp->tci); - enc_type = ERSPAN_ENCAP_INFRAME; - } - - skb_push(skb, sizeof(*ershdr)); - ershdr = (struct erspanhdr *)skb->data; - memset(ershdr, 0, sizeof(*ershdr)); - - ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) | - (ERSPAN_VERSION << VER_OFFSET)); - ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) | - ((tos_to_cos(iphdr->tos) << COS_OFFSET) & COS_MASK) | - (enc_type << EN_OFFSET & EN_MASK) | - ((truncate << T_OFFSET) & T_MASK)); - ershdr->md.index = htonl(index & INDEX_MASK); -} - static netdev_tx_t erspan_xmit(struct sk_buff *skb, struct net_device *dev) { @@ -752,7 +712,8 @@ static netdev_tx_t erspan_xmit(struct sk_buff *skb, } /* Push ERSPAN header */ - erspan_build_header(skb, tunnel->parms.o_key, tunnel->index, truncate); + erspan_build_header(skb, tunnel->parms.o_key, tunnel->index, + truncate, true); tunnel->parms.o_flags &= ~TUNNEL_KEY; __gre_xmit(skb, dev, &tunnel->parms.iph, htons(ETH_P_ERSPAN)); return NETDEV_TX_OK; diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 32066e2375c9..ecf367ae4d7a 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -542,7 +542,6 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from) to->tc_index = from->tc_index; #endif nf_copy(to, from); - skb_ext_copy(to, from); #if IS_ENABLED(CONFIG_IP_VS) to->ipvs_property = from->ipvs_property; #endif diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index d1081eac3b49..d9c17b50b6f3 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -47,6 +47,8 @@ #include #include +#include + /* * SOL_IP control messages. */ @@ -1246,6 +1248,11 @@ int ip_setsockopt(struct sock *sk, int level, return -ENOPROTOOPT; err = do_ip_setsockopt(sk, level, optname, optval, optlen); +#ifdef CONFIG_BPFILTER + if (optname >= BPFILTER_IPT_SO_SET_REPLACE && + optname < BPFILTER_IPT_SET_MAX) + err = bpfilter_ip_set_sockopt(sk, optname, optval, optlen); +#endif #ifdef CONFIG_NETFILTER /* we need to exclude all possible ENOPROTOOPTs except default case */ if (err == -ENOPROTOOPT && optname != IP_HDRINCL && @@ -1554,6 +1561,11 @@ int ip_getsockopt(struct sock *sk, int level, int err; err = do_ip_getsockopt(sk, level, optname, optval, optlen, 0); +#ifdef CONFIG_BPFILTER + if (optname >= BPFILTER_IPT_SO_GET_INFO && + optname < BPFILTER_IPT_GET_MAX) + err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen); +#endif #ifdef CONFIG_NETFILTER /* we need to exclude all possible ENOPROTOOPTs except default case */ if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS && @@ -1586,6 +1598,11 @@ int compat_ip_getsockopt(struct sock *sk, int level, int optname, err = do_ip_getsockopt(sk, level, optname, optval, optlen, MSG_CMSG_COMPAT); +#ifdef CONFIG_BPFILTER + if (optname >= BPFILTER_IPT_SO_GET_INFO && + optname < BPFILTER_IPT_GET_MAX) + err = bpfilter_ip_get_sockopt(sk, optname, optval, optlen); +#endif #ifdef CONFIG_NETFILTER /* we need to exclude all possible ENOPROTOOPTs except default case */ if (err == -ENOPROTOOPT && optname != IP_PKTOPTIONS && diff --git a/net/ipv4/metrics.c b/net/ipv4/metrics.c new file mode 100644 index 000000000000..55baf3e251aa --- /dev/null +++ b/net/ipv4/metrics.c @@ -0,0 +1,56 @@ +#include +#include +#include +#include +#include +#include + +int ip_metrics_convert(struct net *net, struct nlattr *fc_mx, int fc_mx_len, + u32 *metrics) +{ + bool ecn_ca = false; + struct nlattr *nla; + int remaining; + + if (!fc_mx) + return 0; + + nla_for_each_attr(nla, fc_mx, fc_mx_len, remaining) { + int type = nla_type(nla); + u32 val; + + if (!type) + continue; + if (type > RTAX_MAX) + return -EINVAL; + + if (type == RTAX_CC_ALGO) { + char tmp[TCP_CA_NAME_MAX]; + + nla_strlcpy(tmp, nla, sizeof(tmp)); + val = tcp_ca_get_key_by_name(net, tmp, &ecn_ca); + if (val == TCP_CA_UNSPEC) + return -EINVAL; + } else { + if (nla_len(nla) != sizeof(u32)) + return -EINVAL; + + val = nla_get_u32(nla); + } + if (type == RTAX_ADVMSS && val > 65535 - 40) + val = 65535 - 40; + if (type == RTAX_MTU && val > 65535 - 15) + val = 65535 - 15; + if (type == RTAX_HOPLIMIT && val > 255) + val = 255; + if (type == RTAX_FEATURES && (val & ~RTAX_FEATURE_MASK)) + return -EINVAL; + metrics[type - 1] = val; + } + + if (ecn_ca) + metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA; + + return 0; +} +EXPORT_SYMBOL_GPL(ip_metrics_convert); diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index 88aaf14983e8..65dfa9dffa88 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -299,6 +299,8 @@ static const struct snmp_mib snmp4_net_list[] = { SNMP_MIB_ITEM("TCPKeepAlive", LINUX_MIB_TCPKEEPALIVE), SNMP_MIB_ITEM("TCPMTUPFail", LINUX_MIB_TCPMTUPFAIL), SNMP_MIB_ITEM("TCPMTUPSuccess", LINUX_MIB_TCPMTUPSUCCESS), + SNMP_MIB_ITEM("TCPDelivered", LINUX_MIB_TCPDELIVERED), + SNMP_MIB_ITEM("TCPDeliveredCE", LINUX_MIB_TCPDELIVEREDCE), SNMP_MIB_ITEM("TCPWqueueTooBig", LINUX_MIB_TCPWQUEUETOOBIG), SNMP_MIB_SENTINEL }; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 2ce7fbec55ea..368df8c52bd8 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -141,7 +141,8 @@ static int ip_rt_gc_timeout __read_mostly = RT_GC_TIMEOUT; static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie); static unsigned int ipv4_default_advmss(const struct dst_entry *dst); static unsigned int ipv4_mtu(const struct dst_entry *dst); -static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst); +static void ipv4_negative_advice(struct sock *sk, + struct dst_entry *dst); static void ipv4_link_failure(struct sk_buff *skb); static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk, struct sk_buff *skb, u32 mtu, @@ -863,22 +864,15 @@ static void ip_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buf __ip_do_redirect(rt, skb, &fl4, true); } -static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst) +static void ipv4_negative_advice(struct sock *sk, + struct dst_entry *dst) { struct rtable *rt = (struct rtable *)dst; - struct dst_entry *ret = dst; - if (rt) { - if (dst->obsolete > 0) { - ip_rt_put(rt); - ret = NULL; - } else if ((rt->rt_flags & RTCF_REDIRECTED) || - rt->dst.expires) { - ip_rt_put(rt); - ret = NULL; - } - } - return ret; + if ((dst->obsolete > 0) || + (rt->rt_flags & RTCF_REDIRECTED) || + rt->dst.expires) + sk_dst_reset(sk); } /* @@ -1145,7 +1139,7 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu) new = true; } - __ip_rt_update_pmtu((struct rtable *) rt->dst.path, &fl4, mtu); + __ip_rt_update_pmtu((struct rtable *) xfrm_dst_path(&rt->dst), &fl4, mtu); if (!dst_check(&rt->dst, 0)) { if (new) @@ -1408,6 +1402,37 @@ static struct fib_nh_exception *find_exception(struct fib_nh *nh, __be32 daddr) return NULL; } +/* MTU selection: + * 1. mtu on route is locked - use it + * 2. mtu from nexthop exception + * 3. mtu from egress device + */ + +u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr) +{ + struct fib_info *fi = res->fi; + struct fib_nh *nh = &fi->fib_nh[res->nh_sel]; + struct net_device *dev = nh->nh_dev; + u32 mtu = 0; + + if (dev_net(dev)->ipv4.sysctl_ip_fwd_use_pmtu || + fi->fib_metrics->metrics[RTAX_LOCK - 1] & (1 << RTAX_MTU)) + mtu = fi->fib_mtu; + + if (likely(!mtu)) { + struct fib_nh_exception *fnhe; + + fnhe = find_exception(nh, daddr); + if (fnhe && !time_after_eq(jiffies, fnhe->fnhe_expires)) + mtu = fnhe->fnhe_pmtu; + } + + if (likely(!mtu)) + mtu = min(READ_ONCE(dev->mtu), IP_MAX_MTU); + + return mtu - lwtunnel_headroom(nh->nh_lwtstate, mtu); +} + static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe, __be32 daddr, const bool do_cache) { @@ -1725,19 +1750,6 @@ static void ip_handle_martian_source(struct net_device *dev, #endif } -static void set_lwt_redirect(struct rtable *rth) -{ - if (lwtunnel_output_redirect(rth->dst.lwtstate)) { - rth->dst.lwtstate->orig_output = rth->dst.output; - rth->dst.output = lwtunnel_output; - } - - if (lwtunnel_input_redirect(rth->dst.lwtstate)) { - rth->dst.lwtstate->orig_input = rth->dst.input; - rth->dst.input = lwtunnel_input; - } -} - /* called in rcu_read_lock() section */ static int __mkroute_input(struct sk_buff *skb, const struct fib_result *res, @@ -1818,7 +1830,7 @@ static int __mkroute_input(struct sk_buff *skb, rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag, do_cache); - set_lwt_redirect(rth); + lwtunnel_set_redirect(&rth->dst); skb_dst_set(skb, &rth->dst); out: err = 0; @@ -2333,7 +2345,7 @@ add: } rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0, do_cache); - set_lwt_redirect(rth); + lwtunnel_set_redirect(&rth->dst); return rth; } diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 55ef264a8293..34f02b25a084 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -26,6 +26,7 @@ #include #include #include +#include static int zero; static int one = 1; @@ -210,6 +211,8 @@ static int ipv4_ping_group_range(struct ctl_table *table, int write, static int proc_tcp_congestion_control(struct ctl_table *ctl, int write, void __user *buffer, size_t *lenp, loff_t *ppos) { + struct net *net = container_of(ctl->data, struct net, + ipv4.tcp_congestion_control); char val[TCP_CA_NAME_MAX]; struct ctl_table tbl = { .data = val, @@ -217,11 +220,11 @@ static int proc_tcp_congestion_control(struct ctl_table *ctl, int write, }; int ret; - tcp_get_default_congestion_control(val); + tcp_get_default_congestion_control(net, val); ret = proc_dostring(&tbl, write, buffer, lenp, ppos); if (write && ret == 0) - ret = tcp_set_default_congestion_control(val); + ret = tcp_set_default_congestion_control(net, val); return ret; } @@ -262,10 +265,12 @@ static int proc_allowed_congestion_control(struct ctl_table *ctl, return ret; } -static int proc_tcp_fastopen_key(struct ctl_table *ctl, int write, +static int proc_tcp_fastopen_key(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) { + struct net *net = container_of(table->data, struct net, + ipv4.sysctl_tcp_fastopen); struct ctl_table tbl = { .maxlen = (TCP_FASTOPEN_KEY_LENGTH * 2 + 10) }; struct tcp_fastopen_context *ctxt; u32 user_key[4]; /* 16 bytes, matching TCP_FASTOPEN_KEY_LENGTH */ @@ -277,7 +282,7 @@ static int proc_tcp_fastopen_key(struct ctl_table *ctl, int write, return -ENOMEM; rcu_read_lock(); - ctxt = rcu_dereference(tcp_fastopen_ctx); + ctxt = rcu_dereference(net->ipv4.tcp_fastopen_ctx); if (ctxt) memcpy(key, ctxt->key, TCP_FASTOPEN_KEY_LENGTH); else @@ -297,16 +302,11 @@ static int proc_tcp_fastopen_key(struct ctl_table *ctl, int write, ret = -EINVAL; goto bad_key; } - /* Generate a dummy secret but don't publish it. This - * is needed so we don't regenerate a new key on the - * first invocation of tcp_fastopen_cookie_gen - */ - tcp_fastopen_init_key_once(false); for (i = 0; i < ARRAY_SIZE(user_key); i++) key[i] = cpu_to_le32(user_key[i]); - tcp_fastopen_reset_cipher(key, TCP_FASTOPEN_KEY_LENGTH); + tcp_fastopen_reset_cipher(net, key, TCP_FASTOPEN_KEY_LENGTH); } bad_key: @@ -322,11 +322,13 @@ static int proc_tfo_blackhole_detect_timeout(struct ctl_table *table, void __user *buffer, size_t *lenp, loff_t *ppos) { + struct net *net = container_of(table->data, struct net, + ipv4.sysctl_tcp_fastopen_blackhole_timeout); int ret; ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); if (write && ret == 0) - tcp_fastopen_active_timeout_reset(); + atomic_set(&net->ipv4.tfo_active_disable_times, 0); return ret; } @@ -349,6 +351,23 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl, return ret; } +#ifdef CONFIG_IP_ROUTE_MULTIPATH +static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + struct net *net = container_of(table->data, struct net, + ipv4.sysctl_fib_multipath_hash_policy); + int ret; + + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (write && ret == 0) + call_netevent_notifiers(NETEVENT_IPV4_MPATH_HASH_UPDATE, net); + + return ret; +} +#endif + static struct ctl_table ipv4_table[] = { { .procname = "tcp_retrans_collapse", @@ -364,27 +383,6 @@ static struct ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = proc_dointvec }, - { - .procname = "tcp_fastopen", - .data = &sysctl_tcp_fastopen, - .maxlen = sizeof(int), - .mode = 0644, - .proc_handler = proc_dointvec, - }, - { - .procname = "tcp_fastopen_key", - .mode = 0600, - .maxlen = ((TCP_FASTOPEN_KEY_LENGTH * 2) + 10), - .proc_handler = proc_tcp_fastopen_key, - }, - { - .procname = "tcp_fastopen_blackhole_timeout_sec", - .data = &sysctl_tcp_fastopen_blackhole_timeout, - .maxlen = sizeof(int), - .mode = 0644, - .proc_handler = proc_tfo_blackhole_detect_timeout, - .extra1 = &zero, - }, { .procname = "tcp_abort_on_overflow", .data = &sysctl_tcp_abort_on_overflow, @@ -538,12 +536,6 @@ static struct ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, - { - .procname = "tcp_congestion_control", - .mode = 0644, - .maxlen = TCP_CA_NAME_MAX, - .proc_handler = proc_tcp_congestion_control, - }, { .procname = "tcp_workaround_signed_windows", .data = &sysctl_tcp_workaround_signed_windows, @@ -967,6 +959,13 @@ static struct ctl_table ipv4_net_table[] = { .extra1 = &one }, #endif + { + .procname = "tcp_congestion_control", + .data = &init_net.ipv4.tcp_congestion_control, + .mode = 0644, + .maxlen = TCP_CA_NAME_MAX, + .proc_handler = proc_tcp_congestion_control, + }, { .procname = "tcp_keepalive_time", .data = &init_net.ipv4.sysctl_tcp_keepalive_time, @@ -1077,6 +1076,28 @@ static struct ctl_table ipv4_net_table[] = { .mode = 0644, .proc_handler = proc_dointvec }, + { + .procname = "tcp_fastopen", + .data = &init_net.ipv4.sysctl_tcp_fastopen, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, + { + .procname = "tcp_fastopen_key", + .mode = 0600, + .data = &init_net.ipv4.sysctl_tcp_fastopen, + .maxlen = ((TCP_FASTOPEN_KEY_LENGTH * 2) + 10), + .proc_handler = proc_tcp_fastopen_key, + }, + { + .procname = "tcp_fastopen_blackhole_timeout_sec", + .data = &init_net.ipv4.sysctl_tcp_fastopen_blackhole_timeout, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_tfo_blackhole_detect_timeout, + .extra1 = &zero, + }, #ifdef CONFIG_IP_ROUTE_MULTIPATH { .procname = "fib_multipath_use_neigh", @@ -1092,7 +1113,7 @@ static struct ctl_table ipv4_net_table[] = { .data = &init_net.ipv4.sysctl_fib_multipath_hash_policy, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec_minmax, + .proc_handler = proc_fib_multipath_hash_policy, .extra1 = &zero, .extra2 = &one, }, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 52c1bd2bd9b5..c944a7218008 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -462,6 +462,18 @@ void tcp_init_sock(struct sock *sk) } EXPORT_SYMBOL(tcp_init_sock); +void tcp_init_transfer(struct sock *sk, int bpf_op) +{ + struct inet_connection_sock *icsk = inet_csk(sk); + + tcp_mtup_init(sk); + icsk->icsk_af_ops->rebuild_header(sk); + tcp_init_metrics(sk); + tcp_call_bpf(sk, bpf_op, 0, NULL); + tcp_init_congestion_control(sk); + tcp_init_buffer_space(sk); +} + static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb) { if (tsflags && skb) { @@ -1159,7 +1171,7 @@ static int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg, struct sockaddr *uaddr = msg->msg_name; int err, flags; - if (!(sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) || + if (!(sock_net(sk)->ipv4.sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) || (uaddr && msg->msg_namelen >= sizeof(uaddr->sa_family) && uaddr->sa_family == AF_UNSPEC)) return -EOPNOTSUPP; @@ -2052,6 +2064,30 @@ void tcp_set_state(struct sock *sk, int state) { int oldstate = sk->sk_state; + /* We defined a new enum for TCP states that are exported in BPF + * so as not force the internal TCP states to be frozen. The + * following checks will detect if an internal state value ever + * differs from the BPF value. If this ever happens, then we will + * need to remap the internal value to the BPF value before calling + * tcp_call_bpf_2arg. + */ + BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED); + BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT); + BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV); + BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1); + BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2); + BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT); + BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE); + BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT); + BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK); + BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN); + BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING); + BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV); + BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES); + + if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG)) + tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state); + switch (state) { case TCP_ESTABLISHED: if (oldstate != TCP_ESTABLISHED) @@ -2394,6 +2430,7 @@ int tcp_disconnect(struct sock *sk, int flags) tp->max_packets_out = 0; tp->window_clamp = 0; tp->delivered = 0; + tp->delivered_ce = 0; if (icsk->icsk_ca_ops->release) icsk->icsk_ca_ops->release(sk); memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv)); @@ -2413,10 +2450,13 @@ int tcp_disconnect(struct sock *sk, int flags) tcp_saved_syn_free(tp); tp->segs_in = 0; tp->segs_out = 0; + tp->bytes_sent = 0; tp->bytes_acked = 0; tp->bytes_received = 0; + tp->bytes_retrans = 0; tp->data_segs_in = 0; tp->data_segs_out = 0; + tp->dsack_dups = 0; /* Clean up fastopen related fields */ tcp_free_fastopen_req(tp); @@ -2805,7 +2845,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, case TCP_FASTOPEN: if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) { - tcp_fastopen_init_key_once(true); + tcp_fastopen_init_key_once(net); fastopen_queue_tune(sk, val); } else { @@ -2815,7 +2855,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, case TCP_FASTOPEN_CONNECT: if (val > 1 || val < 0) { err = -EINVAL; - } else if (sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) { + } else if (net->ipv4.sysctl_tcp_fastopen & TFO_CLIENT_ENABLE) { if (sk->sk_state == TCP_CLOSE) tp->fastopen_connect = val; else @@ -2997,10 +3037,41 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info) rate64 = tcp_compute_delivery_rate(tp); if (rate64) info->tcpi_delivery_rate = rate64; + info->tcpi_delivered = tp->delivered; + info->tcpi_delivered_ce = tp->delivered_ce; + info->tcpi_bytes_sent = tp->bytes_sent; + info->tcpi_bytes_retrans = tp->bytes_retrans; + info->tcpi_dsack_dups = tp->dsack_dups; unlock_sock_fast(sk, slow); } EXPORT_SYMBOL_GPL(tcp_get_info); +static size_t tcp_opt_stats_get_size(void) +{ + return + nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BUSY */ + nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_RWND_LIMITED */ + nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_SNDBUF_LIMITED */ + nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_DATA_SEGS_OUT */ + nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_TOTAL_RETRANS */ + nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_PACING_RATE */ + nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_DELIVERY_RATE */ + nla_total_size(sizeof(u32)) + /* TCP_NLA_SND_CWND */ + nla_total_size(sizeof(u32)) + /* TCP_NLA_REORDERING */ + nla_total_size(sizeof(u32)) + /* TCP_NLA_MIN_RTT */ + nla_total_size(sizeof(u8)) + /* TCP_NLA_RECUR_RETRANS */ + nla_total_size(sizeof(u8)) + /* TCP_NLA_DELIVERY_RATE_APP_LMT */ + nla_total_size(sizeof(u32)) + /* TCP_NLA_SNDQ_SIZE */ + nla_total_size(sizeof(u8)) + /* TCP_NLA_CA_STATE */ + nla_total_size(sizeof(u32)) + /* TCP_NLA_SND_SSTHRESH */ + nla_total_size(sizeof(u32)) + /* TCP_NLA_DELIVERED */ + nla_total_size(sizeof(u32)) + /* TCP_NLA_DELIVERED_CE */ + nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BYTES_SENT */ + nla_total_size_64bit(sizeof(u64)) + /* TCP_NLA_BYTES_RETRANS */ + nla_total_size(sizeof(u32)) + /* TCP_NLA_DSACK_DUPS */ + 0; +} + struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk) { const struct tcp_sock *tp = tcp_sk(sk); @@ -3009,9 +3080,7 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk) u64 rate64; u32 rate; - stats = alloc_skb(7 * nla_total_size_64bit(sizeof(u64)) + - 3 * nla_total_size(sizeof(u32)) + - 2 * nla_total_size(sizeof(u8)), GFP_ATOMIC); + stats = alloc_skb(tcp_opt_stats_get_size(), GFP_ATOMIC); if (!stats) return NULL; @@ -3040,6 +3109,19 @@ struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk) nla_put_u8(stats, TCP_NLA_RECUR_RETRANS, inet_csk(sk)->icsk_retransmits); nla_put_u8(stats, TCP_NLA_DELIVERY_RATE_APP_LMT, !!tp->rate_app_limited); + nla_put_u32(stats, TCP_NLA_SND_SSTHRESH, tp->snd_ssthresh); + nla_put_u32(stats, TCP_NLA_DELIVERED, tp->delivered); + nla_put_u32(stats, TCP_NLA_DELIVERED_CE, tp->delivered_ce); + + nla_put_u32(stats, TCP_NLA_SNDQ_SIZE, tp->write_seq - tp->snd_una); + nla_put_u8(stats, TCP_NLA_CA_STATE, inet_csk(sk)->icsk_ca_state); + + nla_put_u64_64bit(stats, TCP_NLA_BYTES_SENT, tp->bytes_sent, + TCP_NLA_PAD); + nla_put_u64_64bit(stats, TCP_NLA_BYTES_RETRANS, tp->bytes_retrans, + TCP_NLA_PAD); + nla_put_u32(stats, TCP_NLA_DSACK_DUPS, tp->dsack_dups); + return stats; } diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 80debb0daf37..52ccefc77e4e 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -39,17 +39,19 @@ static int tcp_bpf_wait_data(struct sock *sk, struct sk_psock *psock, } int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, - struct msghdr *msg, int len) + struct msghdr *msg, int len, int flags) { struct iov_iter *iter = &msg->msg_iter; + int peek = flags & MSG_PEEK; int i, ret, copied = 0; + struct sk_msg *msg_rx; + + msg_rx = list_first_entry_or_null(&psock->ingress_msg, + struct sk_msg, list); while (copied != len) { struct scatterlist *sge; - struct sk_msg *msg_rx; - msg_rx = list_first_entry_or_null(&psock->ingress_msg, - struct sk_msg, list); if (unlikely(!msg_rx)) break; @@ -70,21 +72,30 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, } copied += copy; - sge->offset += copy; - sge->length -= copy; - sk_mem_uncharge(sk, copy); - if (!sge->length) { - i++; - if (i == MAX_SKB_FRAGS) - i = 0; - if (!msg_rx->skb) - put_page(page); + if (likely(!peek)) { + sge->offset += copy; + sge->length -= copy; + sk_mem_uncharge(sk, copy); + msg_rx->sg.size -= copy; + + if (!sge->length) { + sk_msg_iter_var_next(i); + if (!msg_rx->skb) + put_page(page); + } + } else { + sk_msg_iter_var_next(i); } if (copied == len) break; } while (i != msg_rx->sg.end); + if (unlikely(peek)) { + msg_rx = list_next_entry(msg_rx, list); + continue; + } + msg_rx->sg.start = i; if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) { list_del(&msg_rx->list); @@ -92,6 +103,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, consume_skb(msg_rx->skb); kfree(msg_rx); } + msg_rx = list_first_entry_or_null(&psock->ingress_msg, + struct sk_msg, list); } return copied; @@ -114,7 +127,7 @@ int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len); lock_sock(sk); msg_bytes_ready: - copied = __tcp_bpf_recvmsg(sk, psock, msg, len); + copied = __tcp_bpf_recvmsg(sk, psock, msg, len, flags); if (!copied) { int data, err = 0; long timeo; @@ -273,14 +286,25 @@ EXPORT_SYMBOL_GPL(tcp_bpf_sendmsg_redir); static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock, struct sk_msg *msg, int *copied, int flags) { - bool cork = false, enospc = msg->sg.start == msg->sg.end; + bool cork = false, enospc = sk_msg_full(msg); struct sock *sk_redir; - u32 tosend; + u32 tosend, delta = 0; int ret; more_data: - if (psock->eval == __SK_NONE) + if (psock->eval == __SK_NONE) { + /* Track delta in msg size to add/subtract it on SK_DROP from + * returned to user copied size. This ensures user doesn't + * get a positive return code with msg_cut_data and SK_DROP + * verdict. + */ + delta = msg->sg.size; psock->eval = sk_psock_msg_verdict(sk, psock, msg); + if (msg->sg.size < delta) + delta -= msg->sg.size; + else + delta = 0; + } if (msg->cork_bytes && msg->cork_bytes > msg->sg.size && !enospc) { @@ -336,7 +360,7 @@ more_data: default: sk_msg_free_partial(sk, msg, tosend); sk_msg_apply_bytes(psock, tosend); - *copied -= tosend; + *copied -= (tosend + delta); return -EACCES; } diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c index b193dcebbf7e..533f8d84d2f7 100644 --- a/net/ipv4/tcp_cong.c +++ b/net/ipv4/tcp_cong.c @@ -33,9 +33,11 @@ static struct tcp_congestion_ops *tcp_ca_find(const char *name) } /* Must be called with rcu lock held */ -static const struct tcp_congestion_ops *__tcp_ca_find_autoload(const char *name) +static struct tcp_congestion_ops *tcp_ca_find_autoload(struct net *net, + const char *name) { - const struct tcp_congestion_ops *ca = tcp_ca_find(name); + struct tcp_congestion_ops *ca = tcp_ca_find(name); + #ifdef CONFIG_MODULES if (!ca && capable(CAP_NET_ADMIN)) { rcu_read_unlock(); @@ -115,7 +117,7 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca) } EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control); -u32 tcp_ca_get_key_by_name(const char *name, bool *ecn_ca) +u32 tcp_ca_get_key_by_name(struct net *net, const char *name, bool *ecn_ca) { const struct tcp_congestion_ops *ca; u32 key = TCP_CA_UNSPEC; @@ -123,7 +125,7 @@ u32 tcp_ca_get_key_by_name(const char *name, bool *ecn_ca) might_sleep(); rcu_read_lock(); - ca = __tcp_ca_find_autoload(name); + ca = tcp_ca_find_autoload(net, name); if (ca) { key = ca->key; *ecn_ca = ca->flags & TCP_CONG_NEEDS_ECN; @@ -153,23 +155,18 @@ EXPORT_SYMBOL_GPL(tcp_ca_get_name_by_key); /* Assign choice of congestion control. */ void tcp_assign_congestion_control(struct sock *sk) { + struct net *net = sock_net(sk); struct inet_connection_sock *icsk = inet_csk(sk); - struct tcp_congestion_ops *ca; + const struct tcp_congestion_ops *ca; rcu_read_lock(); - list_for_each_entry_rcu(ca, &tcp_cong_list, list) { - if (likely(try_module_get(ca->owner))) { - icsk->icsk_ca_ops = ca; - goto out; - } - /* Fallback to next available. The last really - * guaranteed fallback is Reno from this list. - */ - } -out: + ca = rcu_dereference(net->ipv4.tcp_congestion_control); + if (unlikely(!try_module_get(ca->owner))) + ca = &tcp_reno; + icsk->icsk_ca_ops = ca; rcu_read_unlock(); - memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv)); + memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv)); if (ca->flags & TCP_CONG_NEEDS_ECN) INET_ECN_xmit(sk); else @@ -219,29 +216,31 @@ void tcp_cleanup_congestion_control(struct sock *sk) } /* Used by sysctl to change default congestion control */ -int tcp_set_default_congestion_control(const char *name) +int tcp_set_default_congestion_control(struct net *net, const char *name) { struct tcp_congestion_ops *ca; - int ret = -ENOENT; + const struct tcp_congestion_ops *prev; + int ret; - spin_lock(&tcp_cong_list_lock); - ca = tcp_ca_find(name); -#ifdef CONFIG_MODULES - if (!ca && capable(CAP_NET_ADMIN)) { - spin_unlock(&tcp_cong_list_lock); + rcu_read_lock(); + ca = tcp_ca_find_autoload(net, name); + if (!ca) { + ret = -ENOENT; + } else if (!try_module_get(ca->owner)) { + ret = -EBUSY; + } else if (!net_eq(net, &init_net) && + !(ca->flags & TCP_CONG_NON_RESTRICTED)) { + /* Only init netns can set default to a restricted algorithm */ + ret = -EPERM; + } else { + prev = xchg(&net->ipv4.tcp_congestion_control, ca); + if (prev) + module_put(prev->owner); - request_module("tcp_%s", name); - spin_lock(&tcp_cong_list_lock); - ca = tcp_ca_find(name); - } -#endif - - if (ca) { - ca->flags |= TCP_CONG_NON_RESTRICTED; /* default is always allowed */ - list_move(&ca->list, &tcp_cong_list); + ca->flags |= TCP_CONG_NON_RESTRICTED; ret = 0; } - spin_unlock(&tcp_cong_list_lock); + rcu_read_unlock(); return ret; } @@ -249,7 +248,8 @@ int tcp_set_default_congestion_control(const char *name) /* Set default value from kernel configuration at bootup */ static int __init tcp_congestion_default(void) { - return tcp_set_default_congestion_control(CONFIG_DEFAULT_TCP_CONG); + return tcp_set_default_congestion_control(&init_net, + CONFIG_DEFAULT_TCP_CONG); } late_initcall(tcp_congestion_default); @@ -269,14 +269,12 @@ void tcp_get_available_congestion_control(char *buf, size_t maxlen) } /* Get current default congestion control */ -void tcp_get_default_congestion_control(char *name) +void tcp_get_default_congestion_control(struct net *net, char *name) { - struct tcp_congestion_ops *ca; - /* We will always have reno... */ - BUG_ON(list_empty(&tcp_cong_list)); + const struct tcp_congestion_ops *ca; rcu_read_lock(); - ca = list_entry(tcp_cong_list.next, struct tcp_congestion_ops, list); + ca = rcu_dereference(net->ipv4.tcp_congestion_control); strncpy(name, ca->name, TCP_CA_NAME_MAX); rcu_read_unlock(); } @@ -357,12 +355,14 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, if (!load) ca = tcp_ca_find(name); else - ca = __tcp_ca_find_autoload(name); + ca = tcp_ca_find_autoload(sock_net(sk), name); + /* No change asking for existing value */ if (ca == icsk->icsk_ca_ops) { icsk->icsk_ca_setsockopt = 1; goto out; } + if (!ca) { err = -ENOENT; } else if (!load) { diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c index 0edd8d357e3d..813cc24b4b37 100644 --- a/net/ipv4/tcp_fastopen.c +++ b/net/ipv4/tcp_fastopen.c @@ -10,15 +10,18 @@ #include #include -int sysctl_tcp_fastopen __read_mostly = TFO_CLIENT_ENABLE; - -struct tcp_fastopen_context __rcu *tcp_fastopen_ctx; - -static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock); - -void tcp_fastopen_init_key_once(bool publish) +void tcp_fastopen_init_key_once(struct net *net) { - static u8 key[TCP_FASTOPEN_KEY_LENGTH]; + u8 key[TCP_FASTOPEN_KEY_LENGTH]; + struct tcp_fastopen_context *ctxt; + + rcu_read_lock(); + ctxt = rcu_dereference(net->ipv4.tcp_fastopen_ctx); + if (ctxt) { + rcu_read_unlock(); + return; + } + rcu_read_unlock(); /* tcp_fastopen_reset_cipher publishes the new context * atomically, so we allow this race happening here. @@ -26,8 +29,8 @@ void tcp_fastopen_init_key_once(bool publish) * All call sites of tcp_fastopen_cookie_gen also check * for a valid cookie, so this is an acceptable risk. */ - if (net_get_random_once(key, sizeof(key)) && publish) - tcp_fastopen_reset_cipher(key, sizeof(key)); + get_random_bytes(key, sizeof(key)); + tcp_fastopen_reset_cipher(net, key, sizeof(key)); } static void tcp_fastopen_ctx_free(struct rcu_head *head) @@ -38,7 +41,22 @@ static void tcp_fastopen_ctx_free(struct rcu_head *head) kfree(ctx); } -int tcp_fastopen_reset_cipher(void *key, unsigned int len) +void tcp_fastopen_ctx_destroy(struct net *net) +{ + struct tcp_fastopen_context *ctxt; + + spin_lock(&net->ipv4.tcp_fastopen_ctx_lock); + + ctxt = rcu_dereference_protected(net->ipv4.tcp_fastopen_ctx, + lockdep_is_held(&net->ipv4.tcp_fastopen_ctx_lock)); + rcu_assign_pointer(net->ipv4.tcp_fastopen_ctx, NULL); + spin_unlock(&net->ipv4.tcp_fastopen_ctx_lock); + + if (ctxt) + call_rcu(&ctxt->rcu, tcp_fastopen_ctx_free); +} + +int tcp_fastopen_reset_cipher(struct net *net, void *key, unsigned int len) { int err; struct tcp_fastopen_context *ctx, *octx; @@ -62,26 +80,27 @@ error: kfree(ctx); } memcpy(ctx->key, key, len); - spin_lock(&tcp_fastopen_ctx_lock); + spin_lock(&net->ipv4.tcp_fastopen_ctx_lock); - octx = rcu_dereference_protected(tcp_fastopen_ctx, - lockdep_is_held(&tcp_fastopen_ctx_lock)); - rcu_assign_pointer(tcp_fastopen_ctx, ctx); - spin_unlock(&tcp_fastopen_ctx_lock); + octx = rcu_dereference_protected(net->ipv4.tcp_fastopen_ctx, + lockdep_is_held(&net->ipv4.tcp_fastopen_ctx_lock)); + rcu_assign_pointer(net->ipv4.tcp_fastopen_ctx, ctx); + spin_unlock(&net->ipv4.tcp_fastopen_ctx_lock); if (octx) call_rcu(&octx->rcu, tcp_fastopen_ctx_free); return err; } -static bool __tcp_fastopen_cookie_gen(const void *path, +static bool __tcp_fastopen_cookie_gen(struct net *net, + const void *path, struct tcp_fastopen_cookie *foc) { struct tcp_fastopen_context *ctx; bool ok = false; rcu_read_lock(); - ctx = rcu_dereference(tcp_fastopen_ctx); + ctx = rcu_dereference(net->ipv4.tcp_fastopen_ctx); if (ctx) { crypto_cipher_encrypt_one(ctx->tfm, foc->val, path); foc->len = TCP_FASTOPEN_COOKIE_SIZE; @@ -97,7 +116,8 @@ static bool __tcp_fastopen_cookie_gen(const void *path, * * XXX (TFO) - refactor when TCP_FASTOPEN_COOKIE_SIZE != AES_BLOCK_SIZE. */ -static bool tcp_fastopen_cookie_gen(struct request_sock *req, +static bool tcp_fastopen_cookie_gen(struct net *net, + struct request_sock *req, struct sk_buff *syn, struct tcp_fastopen_cookie *foc) { @@ -105,7 +125,7 @@ static bool tcp_fastopen_cookie_gen(struct request_sock *req, const struct iphdr *iph = ip_hdr(syn); __be32 path[4] = { iph->saddr, iph->daddr, 0, 0 }; - return __tcp_fastopen_cookie_gen(path, foc); + return __tcp_fastopen_cookie_gen(net, path, foc); } #if IS_ENABLED(CONFIG_IPV6) @@ -113,13 +133,13 @@ static bool tcp_fastopen_cookie_gen(struct request_sock *req, const struct ipv6hdr *ip6h = ipv6_hdr(syn); struct tcp_fastopen_cookie tmp; - if (__tcp_fastopen_cookie_gen(&ip6h->saddr, &tmp)) { + if (__tcp_fastopen_cookie_gen(net, &ip6h->saddr, &tmp)) { struct in6_addr *buf = &tmp.addr; int i; for (i = 0; i < 4; i++) buf->s6_addr32[i] ^= ip6h->daddr.s6_addr32[i]; - return __tcp_fastopen_cookie_gen(buf, foc); + return __tcp_fastopen_cookie_gen(net, buf, foc); } } #endif @@ -217,12 +237,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk, refcount_set(&req->rsk_refcnt, 2); /* Now finish processing the fastopen child socket. */ - inet_csk(child)->icsk_af_ops->rebuild_header(child); - tcp_init_congestion_control(child); - tcp_mtup_init(child); - tcp_init_metrics(child); - tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); - tcp_init_buffer_space(child); + tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1; @@ -282,25 +297,26 @@ struct sock *tcp_try_fastopen(struct sock *sk, struct sk_buff *skb, struct request_sock *req, struct tcp_fastopen_cookie *foc) { - struct tcp_fastopen_cookie valid_foc = { .len = -1 }; bool syn_data = TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq + 1; + int tcp_fastopen = sock_net(sk)->ipv4.sysctl_tcp_fastopen; + struct tcp_fastopen_cookie valid_foc = { .len = -1 }; struct sock *child; if (foc->len == 0) /* Client requests a cookie */ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENCOOKIEREQD); - if (!((sysctl_tcp_fastopen & TFO_SERVER_ENABLE) && + if (!((tcp_fastopen & TFO_SERVER_ENABLE) && (syn_data || foc->len >= 0) && tcp_fastopen_queue_check(sk))) { foc->len = -1; return NULL; } - if (syn_data && (sysctl_tcp_fastopen & TFO_SERVER_COOKIE_NOT_REQD)) + if (syn_data && (tcp_fastopen & TFO_SERVER_COOKIE_NOT_REQD)) goto fastopen; if (foc->len >= 0 && /* Client presents or requests a cookie */ - tcp_fastopen_cookie_gen(req, skb, &valid_foc) && + tcp_fastopen_cookie_gen(sock_net(sk), req, skb, &valid_foc) && foc->len == TCP_FASTOPEN_COOKIE_SIZE && foc->len == valid_foc.len && !memcmp(foc->val, valid_foc.val, foc->len)) { @@ -332,17 +348,7 @@ fastopen: bool tcp_fastopen_cookie_check(struct sock *sk, u16 *mss, struct tcp_fastopen_cookie *cookie) { - unsigned long last_syn_loss = 0; - int syn_loss = 0; - - tcp_fastopen_cache_get(sk, mss, cookie, &syn_loss, &last_syn_loss); - - /* Recurring FO SYN losses: no cookie or data in SYN */ - if (syn_loss > 1 && - time_before(jiffies, last_syn_loss + (60*HZ << syn_loss))) { - cookie->len = -1; - return false; - } + tcp_fastopen_cache_get(sk, mss, cookie); /* Firewall blackhole issue check */ if (tcp_fastopen_active_should_disable(sk)) { @@ -350,7 +356,7 @@ bool tcp_fastopen_cookie_check(struct sock *sk, u16 *mss, return false; } - if (sysctl_tcp_fastopen & TFO_CLIENT_NO_COOKIE) { + if (sock_net(sk)->ipv4.sysctl_tcp_fastopen & TFO_CLIENT_NO_COOKIE) { cookie->len = -1; return true; } @@ -398,31 +404,24 @@ EXPORT_SYMBOL(tcp_fastopen_defer_connect); * following circumstances: * 1. client side TFO socket receives out of order FIN * 2. client side TFO socket receives out of order RST + * 3. client side TFO socket has timed out three times consecutively during + * or after handshake * We disable active side TFO globally for 1hr at first. Then if it * happens again, we disable it for 2h, then 4h, 8h, ... * And we reset the timeout back to 1hr when we see a successful active * TFO connection with data exchanges. */ -/* Default to 1hr */ -unsigned int sysctl_tcp_fastopen_blackhole_timeout __read_mostly = 60 * 60; -static atomic_t tfo_active_disable_times __read_mostly = ATOMIC_INIT(0); -static unsigned long tfo_active_disable_stamp __read_mostly; - /* Disable active TFO and record current jiffies and * tfo_active_disable_times */ void tcp_fastopen_active_disable(struct sock *sk) { - atomic_inc(&tfo_active_disable_times); - tfo_active_disable_stamp = jiffies; - NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENBLACKHOLE); -} + struct net *net = sock_net(sk); -/* Reset tfo_active_disable_times to 0 */ -void tcp_fastopen_active_timeout_reset(void) -{ - atomic_set(&tfo_active_disable_times, 0); + atomic_inc(&net->ipv4.tfo_active_disable_times); + net->ipv4.tfo_active_disable_stamp = jiffies; + NET_INC_STATS(net, LINUX_MIB_TCPFASTOPENBLACKHOLE); } /* Calculate timeout for tfo active disable @@ -431,17 +430,18 @@ void tcp_fastopen_active_timeout_reset(void) */ bool tcp_fastopen_active_should_disable(struct sock *sk) { - int tfo_da_times = atomic_read(&tfo_active_disable_times); - int multiplier; + unsigned int tfo_bh_timeout = sock_net(sk)->ipv4.sysctl_tcp_fastopen_blackhole_timeout; + int tfo_da_times = atomic_read(&sock_net(sk)->ipv4.tfo_active_disable_times); unsigned long timeout; + int multiplier; if (!tfo_da_times) return false; /* Limit timout to max: 2^6 * initial timeout */ multiplier = 1 << min(tfo_da_times - 1, 6); - timeout = multiplier * sysctl_tcp_fastopen_blackhole_timeout * HZ; - if (time_before(jiffies, tfo_active_disable_stamp + timeout)) + timeout = multiplier * tfo_bh_timeout * HZ; + if (time_before(jiffies, sock_net(sk)->ipv4.tfo_active_disable_stamp + timeout)) return true; /* Mark check bit so we can check for successful active TFO @@ -475,10 +475,27 @@ void tcp_fastopen_active_disable_ofo_check(struct sock *sk) } } } else if (tp->syn_fastopen_ch && - atomic_read(&tfo_active_disable_times)) { + atomic_read(&sock_net(sk)->ipv4.tfo_active_disable_times)) { dst = sk_dst_get(sk); if (!(dst && dst->dev && (dst->dev->flags & IFF_LOOPBACK))) - tcp_fastopen_active_timeout_reset(); + atomic_set(&sock_net(sk)->ipv4.tfo_active_disable_times, 0); dst_release(dst); } } + +void tcp_fastopen_active_detect_blackhole(struct sock *sk, bool expired) +{ + u32 timeouts = inet_csk(sk)->icsk_retransmits; + struct tcp_sock *tp = tcp_sk(sk); + + /* Broken middle-boxes may black-hole Fast Open connection during or + * even after the handshake. Be extremely conservative and pause + * Fast Open globally after hitting the third consecutive timeout or + * exceeding the configured timeout limit. + */ + if ((tp->syn_fastopen || tp->syn_data || tp->syn_data_acked) && + (timeouts == 2 || (timeouts < 2 && expired))) { + tcp_fastopen_active_disable(sk); + NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPFASTOPENACTIVEFAIL); + } +} diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9b0878780430..35e64a75c96f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -791,6 +791,8 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us) tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2; tp->rtt_seq = tp->snd_nxt; tp->mdev_max_us = tcp_rto_min_us(sk); + + tcp_bpf_rtt(sk); } } else { /* no previous measure. */ @@ -799,6 +801,8 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us) tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk)); tp->mdev_max_us = tp->rttvar_us; tp->rtt_seq = tp->snd_nxt; + + tcp_bpf_rtt(sk); } tp->srtt_us = max(1U, srtt); } @@ -901,6 +905,7 @@ void tcp_disable_fack(struct tcp_sock *tp) static void tcp_dsack_seen(struct tcp_sock *tp) { tp->rx_opt.sack_ok |= TCP_DSACK_SEEN; + tp->dsack_dups++; } static void tcp_update_reordering(struct sock *sk, const int metric, @@ -3633,6 +3638,22 @@ static void tcp_xmit_recovery(struct sock *sk, int rexmit) tcp_xmit_retransmit_queue(sk); } +/* Returns the number of packets newly acked or sacked by the current ACK */ +static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag) +{ + const struct net *net = sock_net(sk); + struct tcp_sock *tp = tcp_sk(sk); + u32 delivered; + + delivered = tp->delivered - prior_delivered; + NET_ADD_STATS(net, LINUX_MIB_TCPDELIVERED, delivered); + if (flag & FLAG_ECE) { + tp->delivered_ce += delivered; + NET_ADD_STATS(net, LINUX_MIB_TCPDELIVEREDCE, delivered); + } + return delivered; +} + /* This routine deals with incoming acks, but not outgoing ones. */ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) { @@ -3760,7 +3781,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) sk_dst_confirm(sk); - delivered = tp->delivered - delivered; /* freshly ACKed or SACKed */ + delivered = tcp_newly_delivered(sk, delivered, flag); lost = tp->lost - lost; /* freshly marked lost */ tcp_rate_gen(sk, delivered, lost, is_sack_reneg, sack_state.rate); tcp_cong_control(sk, ack, delivered, flag, sack_state.rate); @@ -3769,8 +3790,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) no_queue: /* If data was DSACKed, see if we can undo a cwnd reduction. */ - if (flag & FLAG_DSACKING_ACK) + if (flag & FLAG_DSACKING_ACK) { tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit); + tcp_newly_delivered(sk, delivered, flag); + } /* If this ack opens up a zero window, clear backoff. It was * being used to time the probes, and is probably far higher than * it needs to be for normal retransmission. @@ -3794,6 +3817,7 @@ old_ack: flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una, &sack_state); tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit); + tcp_newly_delivered(sk, delivered, flag); tcp_xmit_recovery(sk, rexmit); } @@ -3818,6 +3842,49 @@ static void tcp_parse_fastopen_option(int len, const unsigned char *cookie, foc->exp = exp_opt; } +/* Try to parse the MSS option from the TCP header. Return 0 on failure, clamped + * value on success. + */ +static u16 tcp_parse_mss_option(const struct tcphdr *th, u16 user_mss) +{ + const unsigned char *ptr = (const unsigned char *)(th + 1); + int length = (th->doff * 4) - sizeof(struct tcphdr); + u16 mss = 0; + + while (length > 0) { + int opcode = *ptr++; + int opsize; + + switch (opcode) { + case TCPOPT_EOL: + return mss; + case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */ + length--; + continue; + default: + if (length < 2) + return mss; + opsize = *ptr++; + if (opsize < 2) /* "silly options" */ + return mss; + if (opsize > length) + return mss; /* fail on partial options */ + if (opcode == TCPOPT_MSS && opsize == TCPOLEN_MSS) { + u16 in_mss = get_unaligned_be16(ptr); + + if (in_mss) { + if (user_mss && user_mss < in_mss) + in_mss = user_mss; + mss = in_mss; + } + } + ptr += opsize - 2; + length -= opsize; + } + } + return mss; +} + /* Look for tcp options. Normally only called on SYN and SYNACK packets. * But, this can also be called on packets in the established flow when * the fast version below fails. @@ -5675,20 +5742,13 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb) security_inet_conn_established(sk, skb); } - /* Make sure socket is routed, for correct metrics. */ - icsk->icsk_af_ops->rebuild_header(sk); - - tcp_init_metrics(sk); - tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB); - tcp_init_congestion_control(sk); + tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB); /* Prevent spurious tcp_cwnd_restart() on first data * packet. */ tp->lsndtime = tcp_jiffies32; - tcp_init_buffer_space(sk); - if (sock_flag(sk, SOCK_KEEPOPEN)) inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp)); @@ -5855,7 +5915,6 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb, if (tcp_is_sack(tp) && sysctl_tcp_fack) tcp_enable_fack(tp); - tcp_mtup_init(sk); tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); tcp_initialize_rcv_mss(sk); @@ -6084,14 +6143,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) inet_csk(sk)->icsk_retransmits = 0; reqsk_fastopen_remove(sk, req, false); } else { - /* Make sure socket is routed, for correct metrics. */ - icsk->icsk_af_ops->rebuild_header(sk); - tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); - tcp_init_congestion_control(sk); - - tcp_mtup_init(sk); + tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); tp->copied_seq = tp->rcv_nxt; - tcp_init_buffer_space(sk); } smp_mb(); tcp_set_state(sk, TCP_ESTABLISHED); @@ -6121,8 +6174,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) * are sent out. */ tcp_rearm_rto(sk); - } else - tcp_init_metrics(sk); + } if (!inet_csk(sk)->icsk_ca_ops->cong_control) tcp_update_pacing_rate(sk); @@ -6359,9 +6411,7 @@ EXPORT_SYMBOL(inet_reqsk_alloc); /* * Return true if a syncookie should be sent */ -static bool tcp_syn_flood_action(const struct sock *sk, - const struct sk_buff *skb, - const char *proto) +static bool tcp_syn_flood_action(const struct sock *sk, const char *proto) { struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue; const char *msg = "Dropping request"; @@ -6381,7 +6431,7 @@ static bool tcp_syn_flood_action(const struct sock *sk, net->ipv4.sysctl_tcp_syncookies != 2 && xchg(&queue->synflood_warned, 1) == 0) pr_info("%s: Possible SYN flooding on port %d. %s. Check SNMP counters.\n", - proto, ntohs(tcp_hdr(skb)->dest), msg); + proto, sk->sk_num, msg); return want_cookie; } @@ -6403,6 +6453,36 @@ static void tcp_reqsk_record_syn(const struct sock *sk, } } +/* If a SYN cookie is required and supported, returns a clamped MSS value to be + * used for SYN cookie generation. + */ +u16 tcp_get_syncookie_mss(struct request_sock_ops *rsk_ops, + const struct tcp_request_sock_ops *af_ops, + struct sock *sk, struct tcphdr *th) +{ + struct tcp_sock *tp = tcp_sk(sk); + u16 mss; + + if (sock_net(sk)->ipv4.sysctl_tcp_syncookies != 2 && + !inet_csk_reqsk_queue_is_full(sk)) + return 0; + + if (!tcp_syn_flood_action(sk, rsk_ops->slab_name)) + return 0; + + if (sk_acceptq_is_full(sk)) { + NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS); + return 0; + } + + mss = tcp_parse_mss_option(th, tp->rx_opt.user_mss); + if (!mss) + mss = af_ops->mss_clamp; + + return mss; +} +EXPORT_SYMBOL_GPL(tcp_get_syncookie_mss); + int tcp_conn_request(struct request_sock_ops *rsk_ops, const struct tcp_request_sock_ops *af_ops, struct sock *sk, struct sk_buff *skb) @@ -6424,7 +6504,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, */ if ((net->ipv4.sysctl_tcp_syncookies == 2 || inet_csk_reqsk_queue_is_full(sk)) && !isn) { - want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name); + want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name); if (!want_cookie) goto drop; } diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 98ff399740b4..b4908f96aa84 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1477,6 +1477,21 @@ static struct sock *tcp_v4_cookie_check(struct sock *sk, struct sk_buff *skb) return sk; } +u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph, + struct tcphdr *th, u32 *cookie) +{ + u16 mss = 0; +#ifdef CONFIG_SYN_COOKIES + mss = tcp_get_syncookie_mss(&tcp_request_sock_ops, + &tcp_request_sock_ipv4_ops, sk, th); + if (mss) { + *cookie = __cookie_v4_init_sequence(iph, th, &mss); + tcp_synq_overflow(sk); + } +#endif + return mss; +} + /* The socket must have it's spinlock held when we get * here, unless it is a TCP_LISTEN socket. * @@ -2490,6 +2505,8 @@ static void __net_exit tcp_sk_exit(struct net *net) { int cpu; + module_put(net->ipv4.tcp_congestion_control->owner); + for_each_possible_cpu(cpu) inet_ctl_sock_destroy(*per_cpu_ptr(net->ipv4.tcp_sk, cpu)); free_percpu(net->ipv4.tcp_sk); @@ -2554,6 +2571,18 @@ static int __net_init tcp_sk_init(struct net *net) net->ipv4.sysctl_tcp_early_retrans = 3; net->ipv4.sysctl_tcp_default_init_rwnd = TCP_INIT_CWND * 2; + net->ipv4.sysctl_tcp_fastopen = TFO_CLIENT_ENABLE; + spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock); + net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60; + atomic_set(&net->ipv4.tfo_active_disable_times, 0); + + /* Reno is always built in */ + if (!net_eq(net, &init_net) && + try_module_get(init_net.ipv4.tcp_congestion_control->owner)) + net->ipv4.tcp_congestion_control = init_net.ipv4.tcp_congestion_control; + else + net->ipv4.tcp_congestion_control = &tcp_reno; + return 0; fail: tcp_sk_exit(net); @@ -2563,7 +2592,12 @@ fail: static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list) { + struct net *net; + inet_twsk_purge(&tcp_hashinfo, AF_INET); + + list_for_each_entry(net, net_exit_list, exit_list) + tcp_fastopen_ctx_destroy(net); } static struct pernet_operations __net_initdata tcp_sk_ops = { diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c index e433b222368c..7688e07b7387 100644 --- a/net/ipv4/tcp_metrics.c +++ b/net/ipv4/tcp_metrics.c @@ -567,8 +567,7 @@ bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst) } void tcp_fastopen_cache_get(struct sock *sk, u16 *mss, - struct tcp_fastopen_cookie *cookie, - int *syn_loss, unsigned long *last_syn_loss) + struct tcp_fastopen_cookie *cookie) { struct tcp_metrics_block *tm; @@ -585,8 +584,6 @@ void tcp_fastopen_cache_get(struct sock *sk, u16 *mss, *cookie = tfom->cookie; if (cookie->len <= 0 && tfom->try_exp == 1) cookie->exp = true; - *syn_loss = tfom->syn_loss; - *last_syn_loss = *syn_loss ? tfom->last_syn_loss : 0; } while (read_seqretry(&fastopen_seqlock, seq)); } rcu_read_unlock(); diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c index 09f8773fd769..37f343470b28 100644 --- a/net/ipv4/tcp_nv.c +++ b/net/ipv4/tcp_nv.c @@ -39,7 +39,7 @@ * nv_cong_dec_mult Decrease cwnd by X% (30%) of congestion when detected * nv_ssthresh_factor On congestion set ssthresh to this * / 8 * nv_rtt_factor RTT averaging factor - * nv_loss_dec_factor Decrease cwnd by this (50%) when losses occur + * nv_loss_dec_factor Decrease cwnd to this (80%) when losses occur * nv_dec_eval_min_calls Wait this many RTT measurements before dec cwnd * nv_inc_eval_min_calls Wait this many RTT measurements before inc cwnd * nv_ssthresh_eval_min_calls Wait this many RTT measurements before stopping @@ -61,7 +61,7 @@ static int nv_min_cwnd __read_mostly = 2; static int nv_cong_dec_mult __read_mostly = 30 * 128 / 100; /* = 30% */ static int nv_ssthresh_factor __read_mostly = 8; /* = 1 */ static int nv_rtt_factor __read_mostly = 128; /* = 1/2*old + 1/2*new */ -static int nv_loss_dec_factor __read_mostly = 512; /* => 50% */ +static int nv_loss_dec_factor __read_mostly = 819; /* => 80% */ static int nv_cwnd_growth_rate_neg __read_mostly = 8; static int nv_cwnd_growth_rate_pos __read_mostly; /* 0 => fixed like Reno */ static int nv_dec_eval_min_calls __read_mostly = 60; @@ -101,6 +101,11 @@ struct tcpnv { u32 nv_last_rtt; /* last rtt */ u32 nv_min_rtt; /* active min rtt. Used to determine slope */ u32 nv_min_rtt_new; /* min rtt for future use */ + u32 nv_base_rtt; /* If non-zero it represents the threshold for + * congestion */ + u32 nv_lower_bound_rtt; /* Used in conjunction with nv_base_rtt. It is + * set to 80% of nv_base_rtt. It helps reduce + * unfairness between flows */ u32 nv_rtt_max_rate; /* max rate seen during current RTT */ u32 nv_rtt_start_seq; /* current RTT ends when packet arrives * acking beyond nv_rtt_start_seq */ @@ -132,9 +137,24 @@ static inline void tcpnv_reset(struct tcpnv *ca, struct sock *sk) static void tcpnv_init(struct sock *sk) { struct tcpnv *ca = inet_csk_ca(sk); + int base_rtt; tcpnv_reset(ca, sk); + /* See if base_rtt is available from socket_ops bpf program. + * It is meant to be used in environments, such as communication + * within a datacenter, where we have reasonable estimates of + * RTTs + */ + base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL); + if (base_rtt > 0) { + ca->nv_base_rtt = base_rtt; + ca->nv_lower_bound_rtt = (base_rtt * 205) >> 8; /* 80% */ + } else { + ca->nv_base_rtt = 0; + ca->nv_lower_bound_rtt = 0; + } + ca->nv_allow_cwnd_growth = 1; ca->nv_min_rtt_reset_jiffies = jiffies + 2 * HZ; ca->nv_min_rtt = NV_INIT_RTT; @@ -144,6 +164,19 @@ static void tcpnv_init(struct sock *sk) ca->cwnd_growth_factor = 0; } +/* If provided, apply upper (base_rtt) and lower (lower_bound_rtt) + * bounds to RTT. + */ +inline u32 nv_get_bounded_rtt(struct tcpnv *ca, u32 val) +{ + if (ca->nv_lower_bound_rtt > 0 && val < ca->nv_lower_bound_rtt) + return ca->nv_lower_bound_rtt; + else if (ca->nv_base_rtt > 0 && val > ca->nv_base_rtt) + return ca->nv_base_rtt; + else + return val; +} + static void tcpnv_cong_avoid(struct sock *sk, u32 ack, u32 acked) { struct tcp_sock *tp = tcp_sk(sk); @@ -265,6 +298,9 @@ static void tcpnv_acked(struct sock *sk, const struct ack_sample *sample) if (ca->nv_eval_call_cnt < 255) ca->nv_eval_call_cnt++; + /* Apply bounds to rtt. Only used to update min_rtt */ + avg_rtt = nv_get_bounded_rtt(ca, avg_rtt); + /* update min rtt if necessary */ if (avg_rtt < ca->nv_min_rtt) ca->nv_min_rtt = avg_rtt; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 84dbb5d9f928..6e9bf2b142b3 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -62,6 +62,22 @@ int sysctl_tcp_tso_win_divisor __read_mostly = 3; /* By default, RFC2861 behavior. */ int sysctl_tcp_slow_start_after_idle __read_mostly = 1; +/* Refresh clocks of a TCP socket, + * ensuring monotically increasing values. + */ +void tcp_mstamp_refresh(struct tcp_sock *tp) +{ + u64 val = tcp_clock_ns(); + + /* departure time for next data packet */ + if (val > tp->tcp_wstamp_ns) + tp->tcp_wstamp_ns = val; + + val = div_u64(val, NSEC_PER_USEC); + if (val > tp->tcp_mstamp) + tp->tcp_mstamp = val; +} + static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle, int push_one, gfp_t gfp); @@ -1116,6 +1132,7 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, if (skb->len != tcp_header_size) { tcp_event_data_sent(tp, sk); tp->data_segs_out += tcp_skb_pcount(skb); + tp->bytes_sent += skb->len - tcp_header_size; tcp_internal_pacing(sk, skb); } @@ -2955,6 +2972,7 @@ start: if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN) __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS); tp->total_retrans += segs; + tp->bytes_retrans += skb->len; /* make sure skb->data is aligned on arches that require it * and check if ack-trimming & collapsing extended the headroom @@ -2975,6 +2993,10 @@ start: err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC); } + if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG)) + tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RETRANS_CB, + TCP_SKB_CB(skb)->seq, segs, err); + if (likely(!err)) { TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS; } else if (err != -EBUSY) { @@ -3542,7 +3564,7 @@ int tcp_connect(struct sock *sk) struct sk_buff *buff; int err; - tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB); + tcp_call_bpf(sk, BPF_SOCK_OPS_TCP_CONNECT_CB, 0, NULL); if (inet_csk(sk)->icsk_af_ops->rebuild_header(sk)) return -EHOSTUNREACH; /* Routing failure or similar. */ diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 49ca020bf5ac..abe1b222555b 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -232,11 +232,6 @@ static int tcp_write_timeout(struct sock *sk) if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) { if (icsk->icsk_retransmits) { dst_negative_advice(sk); - if (tp->syn_fastopen || tp->syn_data) - tcp_fastopen_cache_set(sk, 0, NULL, true, 0); - if (tp->syn_data && icsk->icsk_retransmits == 1) - NET_INC_STATS(sock_net(sk), - LINUX_MIB_TCPFASTOPENACTIVEFAIL); } else if (!tp->syn_data && !tp->syn_fastopen) { sk_rethink_txhash(sk); } @@ -244,17 +239,6 @@ static int tcp_write_timeout(struct sock *sk) expired = icsk->icsk_retransmits >= retry_until; } else { if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1, 0)) { - /* Some middle-boxes may black-hole Fast Open _after_ - * the handshake. Therefore we conservatively disable - * Fast Open on this path on recurring timeouts after - * successful Fast Open. - */ - if (tp->syn_data_acked) { - tcp_fastopen_cache_set(sk, 0, NULL, true, 0); - if (icsk->icsk_retransmits == net->ipv4.sysctl_tcp_retries1) - NET_INC_STATS(sock_net(sk), - LINUX_MIB_TCPFASTOPENACTIVEFAIL); - } /* Black hole detection */ tcp_mtu_probing(icsk, sk); @@ -277,11 +261,19 @@ static int tcp_write_timeout(struct sock *sk) expired = retransmits_timed_out(sk, retry_until, icsk->icsk_user_timeout); } + tcp_fastopen_active_detect_blackhole(sk, expired); + + if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG)) + tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB, + icsk->icsk_retransmits, + icsk->icsk_rto, (int)expired); + if (expired) { /* Has it gone just too far? */ tcp_write_err(sk); return 1; } + return 0; } diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c index 7dd44b6156c7..07bf9e02df13 100644 --- a/net/ipv4/tcp_ulp.c +++ b/net/ipv4/tcp_ulp.c @@ -6,7 +6,7 @@ * */ -#include +#include #include #include #include @@ -29,18 +29,6 @@ static struct tcp_ulp_ops *tcp_ulp_find(const char *name) return NULL; } -static struct tcp_ulp_ops *tcp_ulp_find_id(const int ulp) -{ - struct tcp_ulp_ops *e; - - list_for_each_entry_rcu(e, &tcp_ulp_list, list) { - if (e->uid == ulp) - return e; - } - - return NULL; -} - static const struct tcp_ulp_ops *__tcp_ulp_find_autoload(const char *name) { const struct tcp_ulp_ops *ulp = NULL; @@ -63,18 +51,6 @@ static const struct tcp_ulp_ops *__tcp_ulp_find_autoload(const char *name) return ulp; } -static const struct tcp_ulp_ops *__tcp_ulp_lookup(const int uid) -{ - const struct tcp_ulp_ops *ulp; - - rcu_read_lock(); - ulp = tcp_ulp_find_id(uid); - if (!ulp || !try_module_get(ulp->owner)) - ulp = NULL; - rcu_read_unlock(); - return ulp; -} - /* Attach new upper layer protocol to the list * of available protocols. */ @@ -131,54 +107,35 @@ void tcp_cleanup_ulp(struct sock *sk) module_put(icsk->icsk_ulp_ops->owner); } -/* Change upper layer protocol for socket */ -int tcp_set_ulp(struct sock *sk, const char *name) +static int __tcp_set_ulp(struct sock *sk, const struct tcp_ulp_ops *ulp_ops) { struct inet_connection_sock *icsk = inet_csk(sk); - const struct tcp_ulp_ops *ulp_ops; - int err = 0; + int err; + err = -EEXIST; if (icsk->icsk_ulp_ops) - return -EEXIST; + goto out_err; + + err = ulp_ops->init(sk); + if (err) + goto out_err; + + icsk->icsk_ulp_ops = ulp_ops; + return 0; +out_err: + module_put(ulp_ops->owner); + return err; +} + +int tcp_set_ulp(struct sock *sk, const char *name) +{ + const struct tcp_ulp_ops *ulp_ops; + + sock_owned_by_me(sk); ulp_ops = __tcp_ulp_find_autoload(name); if (!ulp_ops) return -ENOENT; - if (!ulp_ops->user_visible) { - module_put(ulp_ops->owner); - return -ENOENT; - } - - err = ulp_ops->init(sk); - if (err) { - module_put(ulp_ops->owner); - return err; - } - - icsk->icsk_ulp_ops = ulp_ops; - return 0; -} - -int tcp_set_ulp_id(struct sock *sk, int ulp) -{ - struct inet_connection_sock *icsk = inet_csk(sk); - const struct tcp_ulp_ops *ulp_ops; - int err; - - if (icsk->icsk_ulp_ops) - return -EEXIST; - - ulp_ops = __tcp_ulp_lookup(ulp); - if (!ulp_ops) - return -ENOENT; - - err = ulp_ops->init(sk); - if (err) { - module_put(ulp_ops->owner); - return err; - } - - icsk->icsk_ulp_ops = ulp_ops; - return 0; + return __tcp_set_ulp(sk, ulp_ops); } diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 74d0e661c9e5..1615ccd99c28 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -231,11 +231,12 @@ static int udp_reuseport_add_sock(struct sock *sk, struct udp_hslot *hslot) (sk2->sk_bound_dev_if == sk->sk_bound_dev_if) && sk2->sk_reuseport && uid_eq(uid, sock_i_uid(sk2)) && inet_rcv_saddr_equal(sk, sk2, false)) { - return reuseport_add_sock(sk, sk2); + return reuseport_add_sock(sk, sk2, + inet_rcv_saddr_any(sk)); } } - return reuseport_alloc(sk); + return reuseport_alloc(sk, inet_rcv_saddr_any(sk)); } /** @@ -366,18 +367,12 @@ fail: } EXPORT_SYMBOL(udp_lib_get_port); -static u32 udp4_portaddr_hash(const struct net *net, __be32 saddr, - unsigned int port) -{ - return jhash_1word((__force u32)saddr, net_hash_mix(net)) ^ port; -} - int udp_v4_get_port(struct sock *sk, unsigned short snum) { unsigned int hash2_nulladdr = - udp4_portaddr_hash(sock_net(sk), htonl(INADDR_ANY), snum); + ipv4_portaddr_hash(sock_net(sk), htonl(INADDR_ANY), snum); unsigned int hash2_partial = - udp4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, 0); + ipv4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, 0); /* precompute partial secondary hash */ udp_sk(sk)->udp_portaddr_hash = hash2_partial; @@ -454,7 +449,7 @@ static struct sock *udp4_lib_lookup2(struct net *net, struct sk_buff *skb) { struct sock *sk, *result; - int score, badness, matches = 0, reuseport = 0; + int score, badness; u32 hash = 0; result = NULL; @@ -463,23 +458,16 @@ static struct sock *udp4_lib_lookup2(struct net *net, score = compute_score(sk, net, saddr, sport, daddr, hnum, dif, sdif, exact_dif); if (score > badness) { - reuseport = sk->sk_reuseport; - if (reuseport) { + if (sk->sk_reuseport) { hash = udp_ehashfn(net, daddr, hnum, saddr, sport); result = reuseport_select_sock(sk, hash, skb, sizeof(struct udphdr)); if (result) return result; - matches = 1; } badness = score; result = sk; - } else if (score == badness && reuseport) { - matches++; - if (reciprocal_scale(hash, matches) == 0) - result = sk; - hash = next_pseudo_random32(hash); } } return result; @@ -497,11 +485,11 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, unsigned int hash2, slot2, slot = udp_hashfn(net, hnum, udptable->mask); struct udp_hslot *hslot2, *hslot = &udptable->hash[slot]; bool exact_dif = udp_lib_exact_dif_match(net, skb); - int score, badness, matches = 0, reuseport = 0; + int score, badness; u32 hash = 0; if (hslot->count > 10) { - hash2 = udp4_portaddr_hash(net, daddr, hnum); + hash2 = ipv4_portaddr_hash(net, daddr, hnum); slot2 = hash2 & udptable->mask; hslot2 = &udptable->hash2[slot2]; if (hslot->count < hslot2->count) @@ -512,7 +500,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, exact_dif, hslot2, skb); if (!result) { unsigned int old_slot2 = slot2; - hash2 = udp4_portaddr_hash(net, htonl(INADDR_ANY), hnum); + hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum); slot2 = hash2 & udptable->mask; /* avoid searching the same slot again. */ if (unlikely(slot2 == old_slot2)) @@ -526,6 +514,8 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr, daddr, hnum, dif, sdif, exact_dif, hslot2, skb); } + if (unlikely(IS_ERR(result))) + return NULL; return result; } begin: @@ -535,23 +525,18 @@ begin: score = compute_score(sk, net, saddr, sport, daddr, hnum, dif, sdif, exact_dif); if (score > badness) { - reuseport = sk->sk_reuseport; - if (reuseport) { + if (sk->sk_reuseport) { hash = udp_ehashfn(net, daddr, hnum, saddr, sport); result = reuseport_select_sock(sk, hash, skb, sizeof(struct udphdr)); + if (unlikely(IS_ERR(result))) + return NULL; if (result) return result; - matches = 1; } result = sk; badness = score; - } else if (score == badness && reuseport) { - matches++; - if (reciprocal_scale(hash, matches) == 0) - result = sk; - hash = next_pseudo_random32(hash); } } return result; @@ -1901,7 +1886,7 @@ EXPORT_SYMBOL(udp_lib_rehash); static void udp_v4_rehash(struct sock *sk) { - u16 new_hash = udp4_portaddr_hash(sock_net(sk), + u16 new_hash = ipv4_portaddr_hash(sock_net(sk), inet_sk(sk)->inet_rcv_saddr, inet_sk(sk)->inet_num); udp_lib_rehash(sk, new_hash); @@ -2113,9 +2098,9 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb, struct sk_buff *nskb; if (use_hash2) { - hash2_any = udp4_portaddr_hash(net, htonl(INADDR_ANY), hnum) & + hash2_any = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum) & udptable->mask; - hash2 = udp4_portaddr_hash(net, daddr, hnum) & udptable->mask; + hash2 = ipv4_portaddr_hash(net, daddr, hnum) & udptable->mask; start_lookup: hslot = &udptable->hash2[hash2]; offset = offsetof(typeof(*sk), __sk_common.skc_portaddr_node); @@ -2523,7 +2508,7 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net, int dif, int sdif) { unsigned short hnum = ntohs(loc_port); - unsigned int hash2 = udp4_portaddr_hash(net, loc_addr, hnum); + unsigned int hash2 = ipv4_portaddr_hash(net, loc_addr, hnum); unsigned int slot2 = hash2 & udp_table.mask; struct udp_hslot *hslot2 = &udp_table.hash2[slot2]; INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr); diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c index 0614b4b4015e..7d885a44dc9d 100644 --- a/net/ipv4/xfrm4_mode_tunnel.c +++ b/net/ipv4/xfrm4_mode_tunnel.c @@ -62,7 +62,7 @@ static int xfrm4_mode_tunnel_output(struct xfrm_state *x, struct sk_buff *skb) top_iph->frag_off = (flags & XFRM_STATE_NOPMTUDISC) ? 0 : (XFRM_MODE_SKB_CB(skb)->frag_off & htons(IP_DF)); - top_iph->ttl = ip4_dst_hoplimit(dst->child); + top_iph->ttl = ip4_dst_hoplimit(xfrm_dst_child(dst)); top_iph->saddr = x->props.saddr.a4; top_iph->daddr = x->id.daddr.a4; @@ -112,9 +112,11 @@ static void xfrm4_mode_tunnel_xmit(struct xfrm_state *x, struct sk_buff *skb) { struct xfrm_offload *xo = xfrm_offload(skb); - if (xo->flags & XFRM_GSO_SEGMENT) + if (xo->flags & XFRM_GSO_SEGMENT) { + skb->network_header = skb->network_header - x->props.header_len; skb->transport_header = skb->network_header + sizeof(struct iphdr); + } skb_reset_mac_len(skb); pskb_pull(skb, skb->mac_len + x->props.header_len); diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig index a941f09a3fce..897893c0d6b8 100644 --- a/net/ipv6/Kconfig +++ b/net/ipv6/Kconfig @@ -331,4 +331,9 @@ config IPV6_SEG6_HMAC If unsure, say N. +config IPV6_SEG6_BPF + def_bool y + depends on IPV6_SEG6_LWTUNNEL + depends on IPV6 = y + endif # IPV6 diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 095e109eda6e..9d1078c70de0 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -176,7 +176,7 @@ static void addrconf_type_change(struct net_device *dev, unsigned long event); static int addrconf_ifdown(struct net_device *dev, int how); -static struct rt6_info *addrconf_get_prefix_route(const struct in6_addr *pfx, +static struct fib6_info *addrconf_get_prefix_route(const struct in6_addr *pfx, int plen, const struct net_device *dev, u32 flags, u32 noflags); @@ -925,7 +925,6 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp) pr_warn("Freeing alive inet6 address %p\n", ifp); return; } - ip6_rt_put(ifp->rt); kfree_rcu(ifp, rcu); } @@ -955,18 +954,43 @@ static u32 inet6_addr_hash(const struct in6_addr *addr) return hash_32(ipv6_addr_hash(addr), IN6_ADDR_HSIZE_SHIFT); } +static int ipv6_add_addr_hash(struct net_device *dev, struct inet6_ifaddr *ifa) +{ + unsigned int hash; + int err = 0; + + spin_lock(&addrconf_hash_lock); + + /* Ignore adding duplicate addresses on an interface */ + if (ipv6_chk_same_addr(dev_net(dev), &ifa->addr, dev)) { + ADBG("ipv6_add_addr: already assigned\n"); + err = -EEXIST; + goto out; + } + + /* Add to big hash table */ + hash = inet6_addr_hash(&ifa->addr); + hlist_add_head_rcu(&ifa->addr_lst, &inet6_addr_lst[hash]); + +out: + spin_unlock(&addrconf_hash_lock); + + return err; +} + /* On success it returns ifp with increased reference count */ static struct inet6_ifaddr * ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr, const struct in6_addr *peer_addr, int pfxlen, - int scope, u32 flags, u32 valid_lft, u32 prefered_lft) + int scope, u32 flags, u32 valid_lft, u32 prefered_lft, + bool can_block) { + gfp_t gfp_flags = can_block ? GFP_KERNEL : GFP_ATOMIC; struct net *net = dev_net(idev->dev); struct inet6_ifaddr *ifa = NULL; - struct rt6_info *rt; + struct fib6_info *f6i = NULL; struct in6_validator_info i6vi; - unsigned int hash; int err = 0; int addr_type = ipv6_addr_type(addr); @@ -976,57 +1000,40 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr, addr_type & IPV6_ADDR_LOOPBACK)) return ERR_PTR(-EADDRNOTAVAIL); - rcu_read_lock_bh(); - - in6_dev_hold(idev); - if (idev->dead) { err = -ENODEV; /*XXX*/ - goto out2; + goto out; } if (idev->cnf.disable_ipv6) { err = -EACCES; - goto out2; + goto out; } i6vi.i6vi_addr = *addr; i6vi.i6vi_dev = idev; - rcu_read_unlock_bh(); - err = inet6addr_validator_notifier_call_chain(NETDEV_UP, &i6vi); - - rcu_read_lock_bh(); err = notifier_to_errno(err); - if (err) - goto out2; - - spin_lock(&addrconf_hash_lock); - - /* Ignore adding duplicate addresses on an interface */ - if (ipv6_chk_same_addr(dev_net(idev->dev), addr, idev->dev)) { - ADBG("ipv6_add_addr: already assigned\n"); - err = -EEXIST; + if (err < 0) goto out; - } - - ifa = kzalloc(sizeof(struct inet6_ifaddr), GFP_ATOMIC); + ifa = kzalloc(sizeof(*ifa), gfp_flags); if (!ifa) { ADBG("ipv6_add_addr: malloc failed\n"); err = -ENOBUFS; goto out; } - rt = addrconf_dst_alloc(idev, addr, false); - if (IS_ERR(rt)) { - err = PTR_ERR(rt); + f6i = addrconf_f6i_alloc(net, idev, addr, false, gfp_flags); + if (IS_ERR(f6i)) { + err = PTR_ERR(f6i); + f6i = NULL; goto out; } if (net->ipv6.devconf_all->disable_policy || idev->cnf.disable_policy) - rt->dst.flags |= DST_NOPOLICY; + f6i->dst_nopolicy = true; neigh_parms_data_state_setall(idev->nd_parms); @@ -1048,19 +1055,24 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr, ifa->cstamp = ifa->tstamp = jiffies; ifa->tokenized = false; - ifa->rt = rt; + ifa->rt = f6i; ifa->idev = idev; + in6_dev_hold(idev); + /* For caller */ refcount_set(&ifa->refcnt, 1); - /* Add to big hash table */ - hash = inet6_addr_hash(addr); + rcu_read_lock_bh(); - hlist_add_head_rcu(&ifa->addr_lst, &inet6_addr_lst[hash]); - spin_unlock(&addrconf_hash_lock); + err = ipv6_add_addr_hash(idev->dev, ifa); + if (err < 0) { + rcu_read_unlock_bh(); + goto out; + } write_lock(&idev->lock); + /* Add to inet6_dev unicast addr list. */ ipv6_link_dev_addr(idev, ifa); @@ -1071,21 +1083,23 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr, in6_ifa_hold(ifa); write_unlock(&idev->lock); -out2: + rcu_read_unlock_bh(); - if (likely(err == 0)) - inet6addr_notifier_call_chain(NETDEV_UP, ifa); - else { - kfree(ifa); - in6_dev_put(idev); + inet6addr_notifier_call_chain(NETDEV_UP, ifa); +out: + if (unlikely(err < 0)) { + fib6_info_release(f6i); + + if (ifa) { + if (ifa->idev) + in6_dev_put(ifa->idev); + kfree(ifa); + } ifa = ERR_PTR(err); } return ifa; -out: - spin_unlock(&addrconf_hash_lock); - goto out2; } enum cleanup_prefix_rt_t { @@ -1153,19 +1167,19 @@ check_cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned long *expires) static void cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned long expires, bool del_rt) { - struct rt6_info *rt; + struct fib6_info *f6i; - rt = addrconf_get_prefix_route(&ifp->addr, + f6i = addrconf_get_prefix_route(&ifp->addr, ifp->prefix_len, ifp->idev->dev, 0, RTF_GATEWAY | RTF_DEFAULT); - if (rt) { + if (f6i) { if (del_rt) - ip6_del_rt(rt); + ip6_del_rt(dev_net(ifp->idev->dev), f6i); else { - if (!(rt->rt6i_flags & RTF_EXPIRES)) - rt6_set_expires(rt, expires); - ip6_rt_put(rt); + if (!(f6i->fib6_flags & RTF_EXPIRES)) + fib6_set_expires(f6i, expires); + fib6_info_release(f6i); } } } @@ -1327,7 +1341,7 @@ retry: ift = ipv6_add_addr(idev, &addr, NULL, tmp_plen, ipv6_addr_scope(&addr), addr_flags, - tmp_valid_lft, tmp_prefered_lft); + tmp_valid_lft, tmp_prefered_lft, true); if (IS_ERR(ift)) { in6_ifa_put(ifp); in6_dev_put(idev); @@ -2023,7 +2037,7 @@ void addrconf_dad_failure(struct inet6_ifaddr *ifp) ifp2 = ipv6_add_addr(idev, &new_addr, NULL, pfxlen, scope, flags, valid_lft, - preferred_lft); + preferred_lft, false); if (IS_ERR(ifp2)) goto lock_errdad; @@ -2310,7 +2324,7 @@ u32 addrconf_rt_table(const struct net_device *dev, u32 default_table) { static void addrconf_prefix_route(struct in6_addr *pfx, int plen, struct net_device *dev, - unsigned long expires, u32 flags) + unsigned long expires, u32 flags, gfp_t gfp_flags) { struct fib6_config cfg = { .fc_table = l3mdev_fib_table(dev) ? : addrconf_rt_table(dev, RT6_TABLE_PREFIX), @@ -2321,6 +2335,7 @@ addrconf_prefix_route(struct in6_addr *pfx, int plen, struct net_device *dev, .fc_flags = RTF_UP | flags, .fc_nlinfo.nl_net = dev_net(dev), .fc_protocol = RTPROT_KERNEL, + .fc_type = RTN_UNICAST, }; cfg.fc_dst = *pfx; @@ -2334,17 +2349,17 @@ addrconf_prefix_route(struct in6_addr *pfx, int plen, struct net_device *dev, cfg.fc_flags |= RTF_NONEXTHOP; #endif - ip6_route_add(&cfg, NULL); + ip6_route_add(&cfg, gfp_flags, NULL); } -static struct rt6_info *addrconf_get_prefix_route(const struct in6_addr *pfx, +static struct fib6_info *addrconf_get_prefix_route(const struct in6_addr *pfx, int plen, const struct net_device *dev, u32 flags, u32 noflags) { struct fib6_node *fn; - struct rt6_info *rt = NULL; + struct fib6_info *rt = NULL; struct fib6_table *table; u32 tb_id = l3mdev_fib_table(dev) ? : addrconf_rt_table(dev, RT6_TABLE_PREFIX); @@ -2352,24 +2367,24 @@ static struct rt6_info *addrconf_get_prefix_route(const struct in6_addr *pfx, if (!table) return NULL; - read_lock_bh(&table->tb6_lock); - fn = fib6_locate(&table->tb6_root, pfx, plen, NULL, 0); + rcu_read_lock(); + fn = fib6_locate(&table->tb6_root, pfx, plen, NULL, 0, true); if (!fn) goto out; - noflags |= RTF_CACHE; - for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) { - if (rt->dst.dev->ifindex != dev->ifindex) + for_each_fib6_node_rt_rcu(fn) { + if (rt->fib6_nh.nh_dev->ifindex != dev->ifindex) continue; - if ((rt->rt6i_flags & flags) != flags) + if ((rt->fib6_flags & flags) != flags) continue; - if ((rt->rt6i_flags & noflags) != 0) + if ((rt->fib6_flags & noflags) != 0) + continue; + if (!fib6_info_hold_safe(rt)) continue; - dst_hold(&rt->dst); break; } out: - read_unlock_bh(&table->tb6_lock); + rcu_read_unlock(); return rt; } @@ -2384,13 +2399,14 @@ static void addrconf_add_mroute(struct net_device *dev) .fc_ifindex = dev->ifindex, .fc_dst_len = 8, .fc_flags = RTF_UP, + .fc_type = RTN_UNICAST, .fc_nlinfo.nl_net = dev_net(dev), .fc_protocol = RTPROT_KERNEL, }; ipv6_addr_set(&cfg.fc_dst, htonl(0xFF000000), 0, 0, 0); - ip6_route_add(&cfg, NULL); + ip6_route_add(&cfg, GFP_ATOMIC, NULL); } static struct inet6_dev *addrconf_add_dev(struct net_device *dev) @@ -2521,7 +2537,7 @@ int addrconf_prefix_rcv_add_addr(struct net *net, struct net_device *dev, pinfo->prefix_len, addr_type&IPV6_ADDR_SCOPE_MASK, addr_flags, valid_lft, - prefered_lft); + prefered_lft, false); if (IS_ERR_OR_NULL(ifp)) return -1; @@ -2639,7 +2655,7 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao) */ if (pinfo->onlink) { - struct rt6_info *rt; + struct fib6_info *rt; unsigned long rt_expires; /* Avoid arithmetic overflow. Really, we could @@ -2664,13 +2680,13 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao) if (rt) { /* Autoconf prefix route */ if (valid_lft == 0) { - ip6_del_rt(rt); + ip6_del_rt(net, rt); rt = NULL; } else if (addrconf_finite_timeout(rt_expires)) { /* not infinity */ - rt6_set_expires(rt, jiffies + rt_expires); + fib6_set_expires(rt, jiffies + rt_expires); } else { - rt6_clean_expires(rt); + fib6_clean_expires(rt); } } else if (valid_lft) { clock_t expires = 0; @@ -2683,10 +2699,11 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao) if (dev->ip6_ptr->cnf.accept_ra_prefix_route) { addrconf_prefix_route(&pinfo->prefix, pinfo->prefix_len, - dev, expires, flags); + dev, expires, flags, + GFP_ATOMIC); } } - ip6_rt_put(rt); + fib6_info_release(rt); } /* Try to figure out our local address for this prefix */ @@ -2893,12 +2910,12 @@ static int inet6_addr_add(struct net *net, int ifindex, } ifp = ipv6_add_addr(idev, pfx, peer_pfx, plen, scope, ifa_flags, - valid_lft, prefered_lft); + valid_lft, prefered_lft, true); if (!IS_ERR(ifp)) { if (!(ifa_flags & IFA_F_NOPREFIXROUTE)) { addrconf_prefix_route(&ifp->addr, ifp->prefix_len, dev, - expires, flags); + expires, flags, GFP_KERNEL); } /* @@ -3008,7 +3025,8 @@ static void add_addr(struct inet6_dev *idev, const struct in6_addr *addr, ifp = ipv6_add_addr(idev, addr, NULL, plen, scope, IFA_F_PERMANENT, - INFINITY_LIFE_TIME, INFINITY_LIFE_TIME); + INFINITY_LIFE_TIME, INFINITY_LIFE_TIME, + true); if (!IS_ERR(ifp)) { spin_lock_bh(&ifp->lock); ifp->flags &= ~IFA_F_TENTATIVE; @@ -3048,7 +3066,8 @@ static void sit_add_v4_addrs(struct inet6_dev *idev) if (addr.s6_addr32[3]) { add_addr(idev, &addr, plen, scope); - addrconf_prefix_route(&addr, plen, idev->dev, 0, pflags); + addrconf_prefix_route(&addr, plen, idev->dev, 0, pflags, + GFP_ATOMIC); return; } @@ -3073,7 +3092,7 @@ static void sit_add_v4_addrs(struct inet6_dev *idev) add_addr(idev, &addr, plen, flag); addrconf_prefix_route(&addr, plen, idev->dev, 0, - pflags); + pflags, GFP_ATOMIC); } } } @@ -3111,9 +3130,10 @@ void addrconf_add_linklocal(struct inet6_dev *idev, #endif ifp = ipv6_add_addr(idev, addr, NULL, 64, IFA_LINK, addr_flags, - INFINITY_LIFE_TIME, INFINITY_LIFE_TIME); + INFINITY_LIFE_TIME, INFINITY_LIFE_TIME, true); if (!IS_ERR(ifp)) { - addrconf_prefix_route(&ifp->addr, ifp->prefix_len, idev->dev, 0, 0); + addrconf_prefix_route(&ifp->addr, ifp->prefix_len, idev->dev, + 0, 0, GFP_ATOMIC); addrconf_dad_start(ifp); in6_ifa_put(ifp); } @@ -3232,7 +3252,8 @@ static void addrconf_addr_gen(struct inet6_dev *idev, bool prefix_route) addrconf_add_linklocal(idev, &addr, IFA_F_STABLE_PRIVACY); else if (prefix_route) - addrconf_prefix_route(&addr, 64, idev->dev, 0, 0); + addrconf_prefix_route(&addr, 64, idev->dev, + 0, 0, GFP_KERNEL); break; case IN6_ADDR_GEN_MODE_EUI64: /* addrconf_add_linklocal also adds a prefix_route and we @@ -3242,7 +3263,8 @@ static void addrconf_addr_gen(struct inet6_dev *idev, bool prefix_route) if (ipv6_generate_eui64(addr.s6_addr + 8, idev->dev) == 0) addrconf_add_linklocal(idev, &addr, 0); else if (prefix_route) - addrconf_prefix_route(&addr, 64, idev->dev, 0, 0); + addrconf_prefix_route(&addr, 64, idev->dev, + 0, 0, GFP_ATOMIC); break; case IN6_ADDR_GEN_MODE_NONE: default: @@ -3339,32 +3361,34 @@ static void addrconf_gre_config(struct net_device *dev) } #endif -static int fixup_permanent_addr(struct inet6_dev *idev, +static int fixup_permanent_addr(struct net *net, + struct inet6_dev *idev, struct inet6_ifaddr *ifp) { - /* !rt6i_node means the host route was removed from the + /* !fib6_node means the host route was removed from the * FIB, for example, if 'lo' device is taken down. In that * case regenerate the host route. */ - if (!ifp->rt || !ifp->rt->rt6i_node) { - struct rt6_info *rt, *prev; + if (!ifp->rt || !ifp->rt->fib6_node) { + struct fib6_info *f6i, *prev; - rt = addrconf_dst_alloc(idev, &ifp->addr, false); - if (unlikely(IS_ERR(rt))) - return PTR_ERR(rt); + f6i = addrconf_f6i_alloc(net, idev, &ifp->addr, false, + GFP_ATOMIC); + if (unlikely(IS_ERR(f6i))) + return PTR_ERR(f6i); /* ifp->rt can be accessed outside of rtnl */ spin_lock(&ifp->lock); prev = ifp->rt; - ifp->rt = rt; + ifp->rt = f6i; spin_unlock(&ifp->lock); - ip6_rt_put(prev); + fib6_info_release(prev); } if (!(ifp->flags & IFA_F_NOPREFIXROUTE)) { addrconf_prefix_route(&ifp->addr, ifp->prefix_len, - idev->dev, 0, 0); + idev->dev, 0, 0, GFP_ATOMIC); } if (ifp->state == INET6_IFADDR_STATE_PREDAD) @@ -3373,7 +3397,7 @@ static int fixup_permanent_addr(struct inet6_dev *idev, return 0; } -static void addrconf_permanent_addr(struct net_device *dev) +static void addrconf_permanent_addr(struct net *net, struct net_device *dev) { struct inet6_ifaddr *ifp, *tmp; struct inet6_dev *idev; @@ -3386,7 +3410,7 @@ static void addrconf_permanent_addr(struct net_device *dev) list_for_each_entry_safe(ifp, tmp, &idev->addr_list, if_list) { if ((ifp->flags & IFA_F_PERMANENT) && - fixup_permanent_addr(idev, ifp) < 0) { + fixup_permanent_addr(net, idev, ifp) < 0) { write_unlock_bh(&idev->lock); in6_ifa_hold(ifp); ipv6_del_addr(ifp); @@ -3455,7 +3479,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event, if (event == NETDEV_UP) { /* restore routes for permanent addresses */ - addrconf_permanent_addr(dev); + addrconf_permanent_addr(net, dev); if (!addrconf_link_ready(dev)) { /* device is not ready yet. */ @@ -3474,6 +3498,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event, } else if (event == NETDEV_CHANGE) { if (!addrconf_link_ready(dev)) { /* device is still not ready. */ + rt6_sync_down_dev(dev, event); break; } @@ -3485,6 +3510,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event, * multicast snooping switches */ ipv6_mc_up(idev); + rt6_sync_up(dev, RTNH_F_LINKDOWN); break; } idev->if_flags |= IF_READY; @@ -3520,6 +3546,9 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event, if (run_pending) addrconf_dad_run(idev); + /* Device has an address by now */ + rt6_sync_up(dev, RTNH_F_DEAD); + /* * If the MTU changed during the interface down, * when the interface up, the changed MTU must be @@ -3613,6 +3642,7 @@ static bool addr_is_local(const struct in6_addr *addr) static int addrconf_ifdown(struct net_device *dev, int how) { + unsigned long event = how ? NETDEV_UNREGISTER : NETDEV_DOWN; struct net *net = dev_net(dev); struct inet6_dev *idev; struct inet6_ifaddr *ifa, *tmp; @@ -3624,8 +3654,7 @@ static int addrconf_ifdown(struct net_device *dev, int how) ASSERT_RTNL(); - rt6_ifdown(net, dev); - neigh_ifdown(&nd_tbl, dev); + rt6_disable_ip(dev, event); idev = __in6_dev_get(dev); if (!idev) @@ -3714,7 +3743,7 @@ restart: INIT_LIST_HEAD(&del_list); list_for_each_entry_safe(ifa, tmp, &idev->addr_list, if_list) { - struct rt6_info *rt = NULL; + struct fib6_info *rt = NULL; bool keep; addrconf_del_dad_work(ifa); @@ -3744,7 +3773,7 @@ restart: spin_unlock_bh(&ifa->lock); if (rt) - ip6_del_rt(rt); + ip6_del_rt(net, rt); if (state != INET6_IFADDR_STATE_DEAD) { __ipv6_ifa_notify(RTM_DELADDR, ifa); @@ -3867,6 +3896,7 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp) struct inet6_dev *idev = ifp->idev; struct net_device *dev = idev->dev; bool bump_id, notify = false; + struct net *net; addrconf_join_solict(dev, &ifp->addr); @@ -3877,8 +3907,9 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp) if (ifp->state == INET6_IFADDR_STATE_DEAD) goto out; + net = dev_net(dev); if (dev->flags&(IFF_NOARP|IFF_LOOPBACK) || - (dev_net(dev)->ipv6.devconf_all->accept_dad < 1 && + (net->ipv6.devconf_all->accept_dad < 1 && idev->cnf.accept_dad < 1) || !(ifp->flags&IFA_F_TENTATIVE) || ifp->flags & IFA_F_NODAD) { @@ -3914,8 +3945,8 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp) * Frames right away */ if (ifp->flags & IFA_F_OPTIMISTIC) { - ip6_ins_rt(ifp->rt); - if (ipv6_use_optimistic_addr(dev_net(dev), idev)) { + ip6_ins_rt(net, ifp->rt); + if (ipv6_use_optimistic_addr(net, idev)) { /* Because optimistic nodes can use this address, * notify listeners. If DAD fails, RTM_DELADDR is sent. */ @@ -4585,8 +4616,9 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, u32 ifa_flags, ipv6_ifa_notify(0, ifp); if (!(ifa_flags & IFA_F_NOPREFIXROUTE)) { - addrconf_prefix_route(&ifp->addr, ifp->prefix_len, ifp->idev->dev, - expires, flags); + addrconf_prefix_route(&ifp->addr, ifp->prefix_len, + ifp->idev->dev, expires, flags, + GFP_KERNEL); } else if (had_prefixroute) { enum cleanup_prefix_rt_t action; unsigned long rt_expires; @@ -4813,9 +4845,10 @@ static int inet6_fill_ifmcaddr(struct sk_buff *skb, struct ifmcaddr6 *ifmca, static int inet6_fill_ifacaddr(struct sk_buff *skb, struct ifacaddr6 *ifaca, u32 portid, u32 seq, int event, unsigned int flags) { + struct net_device *dev = fib6_info_nh_dev(ifaca->aca_rt); + int ifindex = dev ? dev->ifindex : 1; struct nlmsghdr *nlh; u8 scope = RT_SCOPE_UNIVERSE; - int ifindex = ifaca->aca_idev->dev->ifindex; if (ipv6_addr_scope(&ifaca->aca_addr) & IFA_SITE) scope = RT_SCOPE_SITE; @@ -5620,8 +5653,8 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp) * host route, so nothing to insert. That will be fixed when * the device is brought up. */ - if (ifp->rt && !rcu_access_pointer(ifp->rt->rt6i_node)) { - ip6_ins_rt(ifp->rt); + if (ifp->rt && !rcu_access_pointer(ifp->rt->fib6_node)) { + ip6_ins_rt(net, ifp->rt); } else if (!ifp->rt && (ifp->idev->dev->flags & IFF_UP)) { pr_warn("BUG: Address %pI6c on device %s is missing its host route.\n", &ifp->addr, ifp->idev->dev->name); @@ -5631,23 +5664,24 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp) addrconf_join_anycast(ifp); if (!ipv6_addr_any(&ifp->peer_addr)) addrconf_prefix_route(&ifp->peer_addr, 128, - ifp->idev->dev, 0, 0); + ifp->idev->dev, 0, 0, + GFP_KERNEL); break; case RTM_DELADDR: if (ifp->idev->cnf.forwarding) addrconf_leave_anycast(ifp); addrconf_leave_solict(ifp->idev, &ifp->addr); if (!ipv6_addr_any(&ifp->peer_addr)) { - struct rt6_info *rt; + struct fib6_info *rt; rt = addrconf_get_prefix_route(&ifp->peer_addr, 128, ifp->idev->dev, 0, 0); if (rt) - ip6_del_rt(rt); + ip6_del_rt(net, rt); } if (ifp->rt) { - if (dst_hold_safe(&ifp->rt->dst)) - ip6_del_rt(ifp->rt); + ip6_del_rt(net, ifp->rt); + ifp->rt = NULL; } rt_genid_bump_ipv6(net); break; @@ -5994,12 +6028,11 @@ void addrconf_disable_policy_idev(struct inet6_dev *idev, int val) list_for_each_entry(ifa, &idev->addr_list, if_list) { spin_lock(&ifa->lock); if (ifa->rt) { - struct rt6_info *rt = ifa->rt; - struct fib6_table *table = rt->rt6i_table; + struct fib6_info *rt = ifa->rt; int cpu; - read_lock(&table->tb6_lock); - addrconf_set_nopolicy(ifa->rt, val); + rcu_read_lock(); + ifa->rt->dst_nopolicy = val ? true : false; if (rt->rt6i_pcpu) { for_each_possible_cpu(cpu) { struct rt6_info **rtp; @@ -6008,7 +6041,7 @@ void addrconf_disable_policy_idev(struct inet6_dev *idev, int val) addrconf_set_nopolicy(*rtp, val); } } - read_unlock(&table->tb6_lock); + rcu_read_unlock(); } spin_unlock(&ifa->lock); } diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c index f5a267972c57..c1dedcfbcae1 100644 --- a/net/ipv6/addrconf_core.c +++ b/net/ipv6/addrconf_core.c @@ -134,8 +134,47 @@ static struct dst_entry *eafnosupport_ipv6_dst_lookup_flow(struct net *net, return ERR_PTR(-EAFNOSUPPORT); } +static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id) +{ + return NULL; +} + +static struct fib6_info * +eafnosupport_fib6_table_lookup(struct net *net, struct fib6_table *table, + int oif, struct flowi6 *fl6, int flags) +{ + return NULL; +} + +static struct fib6_info * +eafnosupport_fib6_lookup(struct net *net, int oif, struct flowi6 *fl6, + int flags) +{ + return NULL; +} + +static struct fib6_info * +eafnosupport_fib6_multipath_select(const struct net *net, struct fib6_info *f6i, + struct flowi6 *fl6, int oif, + const struct sk_buff *skb, int strict) +{ + return f6i; +} + +static u32 +eafnosupport_ip6_mtu_from_fib6(struct fib6_info *f6i, struct in6_addr *daddr, + struct in6_addr *saddr) +{ + return 0; +} + const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) { .ipv6_dst_lookup_flow = eafnosupport_ipv6_dst_lookup_flow, + .fib6_get_table = eafnosupport_fib6_get_table, + .fib6_table_lookup = eafnosupport_fib6_table_lookup, + .fib6_lookup = eafnosupport_fib6_lookup, + .fib6_multipath_select = eafnosupport_fib6_multipath_select, + .ip6_mtu_from_fib6 = eafnosupport_ip6_mtu_from_fib6, }; EXPORT_SYMBOL_GPL(ipv6_stub); diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c index afb20eb37f54..0947d4b06dcb 100644 --- a/net/ipv6/af_inet6.c +++ b/net/ipv6/af_inet6.c @@ -870,6 +870,10 @@ static int __net_init inet6_net_init(struct net *net) net->ipv6.sysctl.idgen_retries = 3; net->ipv6.sysctl.idgen_delay = 1 * HZ; net->ipv6.sysctl.flowlabel_state_ranges = 0; + net->ipv6.sysctl.max_dst_opts_cnt = IP6_DEFAULT_MAX_DST_OPTS_CNT; + net->ipv6.sysctl.max_hbh_opts_cnt = IP6_DEFAULT_MAX_HBH_OPTS_CNT; + net->ipv6.sysctl.max_dst_opts_len = IP6_DEFAULT_MAX_DST_OPTS_LEN; + net->ipv6.sysctl.max_hbh_opts_len = IP6_DEFAULT_MAX_HBH_OPTS_LEN; atomic_set(&net->ipv6.fib6_sernum, 1); err = ipv6_init_mibs(net); @@ -918,6 +922,11 @@ static const struct ipv6_stub ipv6_stub_impl = { .ipv6_sock_mc_join = ipv6_sock_mc_join, .ipv6_sock_mc_drop = ipv6_sock_mc_drop, .ipv6_dst_lookup_flow = ip6_dst_lookup_flow, + .fib6_get_table = fib6_get_table, + .fib6_table_lookup = fib6_table_lookup, + .fib6_lookup = fib6_lookup, + .fib6_multipath_select = fib6_multipath_select, + .ip6_mtu_from_fib6 = ip6_mtu_from_fib6, .udpv6_encap_enable = udpv6_encap_enable, .ndisc_send_na = ndisc_send_na, .nd_tbl = &nd_tbl, @@ -925,6 +934,7 @@ static const struct ipv6_stub ipv6_stub_impl = { static const struct ipv6_bpf_stub ipv6_bpf_stub_impl = { .inet6_bind = __inet6_bind, + .udp6_lib_lookup = __udp6_lib_lookup, }; static int __init inet6_init(void) diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c index 4d8b3d1d530b..36f388b5922a 100644 --- a/net/ipv6/anycast.c +++ b/net/ipv6/anycast.c @@ -78,7 +78,7 @@ int ipv6_sock_ac_join(struct sock *sk, int ifindex, const struct in6_addr *addr) if (ifindex == 0) { struct rt6_info *rt; - rt = rt6_lookup(net, addr, NULL, 0, 0); + rt = rt6_lookup(net, addr, NULL, 0, NULL, 0); if (rt) { dev = rt->dst.dev; ip6_rt_put(rt); @@ -216,16 +216,14 @@ static void aca_get(struct ifacaddr6 *aca) static void aca_put(struct ifacaddr6 *ac) { if (refcount_dec_and_test(&ac->aca_refcnt)) { - in6_dev_put(ac->aca_idev); - dst_release(&ac->aca_rt->dst); + fib6_info_release(ac->aca_rt); kfree(ac); } } -static struct ifacaddr6 *aca_alloc(struct rt6_info *rt, +static struct ifacaddr6 *aca_alloc(struct fib6_info *f6i, const struct in6_addr *addr) { - struct inet6_dev *idev = rt->rt6i_idev; struct ifacaddr6 *aca; aca = kzalloc(sizeof(*aca), GFP_ATOMIC); @@ -233,9 +231,8 @@ static struct ifacaddr6 *aca_alloc(struct rt6_info *rt, return NULL; aca->aca_addr = *addr; - in6_dev_hold(idev); - aca->aca_idev = idev; - aca->aca_rt = rt; + fib6_info_hold(f6i); + aca->aca_rt = f6i; aca->aca_users = 1; /* aca_tstamp should be updated upon changes */ aca->aca_cstamp = aca->aca_tstamp = jiffies; @@ -250,7 +247,8 @@ static struct ifacaddr6 *aca_alloc(struct rt6_info *rt, int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr) { struct ifacaddr6 *aca; - struct rt6_info *rt; + struct fib6_info *f6i; + struct net *net; int err; ASSERT_RTNL(); @@ -269,14 +267,15 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr) } } - rt = addrconf_dst_alloc(idev, addr, true); - if (IS_ERR(rt)) { - err = PTR_ERR(rt); + net = dev_net(idev->dev); + f6i = addrconf_f6i_alloc(net, idev, addr, true, GFP_ATOMIC); + if (IS_ERR(f6i)) { + err = PTR_ERR(f6i); goto out; } - aca = aca_alloc(rt, addr); + aca = aca_alloc(f6i, addr); if (!aca) { - ip6_rt_put(rt); + fib6_info_release(f6i); err = -ENOMEM; goto out; } @@ -290,7 +289,7 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr) aca_get(aca); write_unlock_bh(&idev->lock); - ip6_ins_rt(rt); + ip6_ins_rt(net, f6i); addrconf_join_solict(idev->dev, &aca->aca_addr); @@ -332,8 +331,7 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr) write_unlock_bh(&idev->lock); addrconf_leave_solict(idev, &aca->aca_addr); - dst_hold(&aca->aca_rt->dst); - ip6_del_rt(aca->aca_rt); + ip6_del_rt(dev_net(idev->dev), aca->aca_rt); aca_put(aca); return 0; @@ -360,8 +358,7 @@ void ipv6_ac_destroy_dev(struct inet6_dev *idev) addrconf_leave_solict(idev, &aca->aca_addr); - dst_hold(&aca->aca_rt->dst); - ip6_del_rt(aca->aca_rt); + ip6_del_rt(dev_net(idev->dev), aca->aca_rt); aca_put(aca); diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 8fcfcab407d8..25c12d0ccd28 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -141,32 +141,14 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp) static void esp_output_done(struct crypto_async_request *base, int err) { struct sk_buff *skb = base->data; - struct xfrm_offload *xo = xfrm_offload(skb); void *tmp; - - struct xfrm_state *x; - if (xo && (xo->flags & XFRM_DEV_RESUME)) - x = skb->sp->xvec[skb->sp->len - 1]; - else - x = skb_dst(skb)->xfrm; + struct dst_entry *dst = skb_dst(skb); + struct xfrm_state *x = dst->xfrm; tmp = ESP_SKB_CB(skb)->tmp; esp_ssg_unref(x, tmp); kfree(tmp); - - if (xo && (xo->flags & XFRM_DEV_RESUME)) { - if (err) { - XFRM_INC_STATS(xs_net(x), LINUX_MIB_XFRMOUTSTATEPROTOERROR); - kfree_skb(skb); - return; - } - - skb_push(skb, skb->data - skb_mac_header(skb)); - secpath_reset(skb); - xfrm_dev_resume(skb); - } else { - xfrm_output_resume(skb, err); - } + xfrm_output_resume(skb, err); } /* Move ESP header back into place. */ diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c index d5902d19a1c7..7c72b85c9339 100644 --- a/net/ipv6/esp6_offload.c +++ b/net/ipv6/esp6_offload.c @@ -143,38 +143,78 @@ static void esp6_gso_encap(struct xfrm_state *x, struct sk_buff *skb) static struct sk_buff *esp6_gso_segment(struct sk_buff *skb, netdev_features_t features) { + __u32 seq; + int err = 0; + struct sk_buff *skb2; struct xfrm_state *x; struct ip_esp_hdr *esph; struct crypto_aead *aead; + struct sk_buff *segs = ERR_PTR(-EINVAL); netdev_features_t esp_features = features; struct xfrm_offload *xo = xfrm_offload(skb); if (!xo) - return ERR_PTR(-EINVAL); + goto out; if (!(skb_shinfo(skb)->gso_type & SKB_GSO_ESP)) - return ERR_PTR(-EINVAL); + goto out; + + seq = xo->seq.low; x = skb->sp->xvec[skb->sp->len - 1]; aead = x->data; esph = ip_esp_hdr(skb); if (esph->spi != x->id.spi) - return ERR_PTR(-EINVAL); + goto out; if (!pskb_may_pull(skb, sizeof(*esph) + crypto_aead_ivsize(aead))) - return ERR_PTR(-EINVAL); + goto out; __skb_pull(skb, sizeof(*esph) + crypto_aead_ivsize(aead)); skb->encap_hdr_csum = 1; - if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle || - (x->xso.dev != skb->dev)) + if (!(features & NETIF_F_HW_ESP)) esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK); - xo->flags |= XFRM_GSO_SEGMENT; - return x->outer_mode->gso_segment(x, skb, esp_features); + segs = x->outer_mode->gso_segment(x, skb, esp_features); + if (IS_ERR_OR_NULL(segs)) + goto out; + + __skb_pull(skb, skb->data - skb_mac_header(skb)); + + skb2 = segs; + do { + struct sk_buff *nskb = skb2->next; + + xo = xfrm_offload(skb2); + xo->flags |= XFRM_GSO_SEGMENT; + xo->seq.low = seq; + xo->seq.hi = xfrm_replay_seqhi(x, seq); + + if(!(features & NETIF_F_HW_ESP)) + xo->flags |= CRYPTO_FALLBACK; + + x->outer_mode->xmit(x, skb2); + + err = x->type_offload->xmit(x, skb2, esp_features); + if (err) { + kfree_skb_list(segs); + return ERR_PTR(err); + } + + if (!skb_is_gso(skb2)) + seq++; + else + seq += skb_shinfo(skb2)->gso_segs; + + skb_push(skb2, skb2->mac_len); + skb2 = nskb; + } while (skb2); + +out: + return segs; } static int esp6_input_tail(struct xfrm_state *x, struct sk_buff *skb) @@ -193,7 +233,6 @@ static int esp6_input_tail(struct xfrm_state *x, struct sk_buff *skb) static int esp6_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features_t features) { - int len; int err; int alen; int blksize; @@ -202,7 +241,6 @@ static int esp6_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features struct crypto_aead *aead; struct esp_info esp; bool hw_offload = true; - __u32 seq; esp.inplace = true; @@ -238,32 +276,28 @@ static int esp6_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features return esp.nfrags; } - seq = xo->seq.low; - esph = ip_esp_hdr(skb); esph->spi = x->id.spi; skb_push(skb, -skb_network_offset(skb)); if (xo->flags & XFRM_GSO_SEGMENT) { - esph->seq_no = htonl(seq); - if (!skb_is_gso(skb)) - xo->seq.low++; - else - xo->seq.low += skb_shinfo(skb)->gso_segs; + esph->seq_no = htonl(xo->seq.low); + } else { + int len; + + len = skb->len - sizeof(struct ipv6hdr); + if (len > IPV6_MAXPLEN) + len = 0; + + ipv6_hdr(skb)->payload_len = htons(len); } - esp.seqno = cpu_to_be64(xo->seq.low + ((u64)xo->seq.hi << 32)); - len = skb->len - sizeof(struct ipv6hdr); - - if (len > IPV6_MAXPLEN) - len = 0; - - ipv6_hdr(skb)->payload_len = htons(len); - if (hw_offload) return 0; + esp.seqno = cpu_to_be64(xo->seq.low + ((u64)xo->seq.hi << 32)); + err = esp6_output_tail(x, skb, &esp); if (err) return err; diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c index 47a5f8f88c70..6d2429b0376d 100644 --- a/net/ipv6/exthdrs.c +++ b/net/ipv6/exthdrs.c @@ -74,8 +74,20 @@ struct tlvtype_proc { /* An unknown option is detected, decide what to do */ -static bool ip6_tlvopt_unknown(struct sk_buff *skb, int optoff) +static bool ip6_tlvopt_unknown(struct sk_buff *skb, int optoff, + bool disallow_unknowns) { + if (disallow_unknowns) { + /* If unknown TLVs are disallowed by configuration + * then always silently drop packet. Note this also + * means no ICMP parameter problem is sent which + * could be a good property to mitigate a reflection DOS + * attack. + */ + + goto drop; + } + switch ((skb_network_header(skb)[optoff] & 0xC0) >> 6) { case 0: /* ignore */ return true; @@ -94,20 +106,30 @@ static bool ip6_tlvopt_unknown(struct sk_buff *skb, int optoff) return false; } +drop: kfree_skb(skb); return false; } /* Parse tlv encoded option header (hop-by-hop or destination) */ -static bool ip6_parse_tlv(const struct tlvtype_proc *procs, struct sk_buff *skb) +static bool ip6_parse_tlv(const struct tlvtype_proc *procs, + struct sk_buff *skb, + int max_count) { - const struct tlvtype_proc *curr; + int len = (skb_transport_header(skb)[1] + 1) << 3; const unsigned char *nh = skb_network_header(skb); int off = skb_network_header_len(skb); - int len = (skb_transport_header(skb)[1] + 1) << 3; + const struct tlvtype_proc *curr; + bool disallow_unknowns = false; + int tlv_count = 0; int padlen = 0; + if (unlikely(max_count < 0)) { + disallow_unknowns = true; + max_count = -max_count; + } + if (skb_transport_offset(skb) + len > skb_headlen(skb)) goto bad; @@ -148,6 +170,11 @@ static bool ip6_parse_tlv(const struct tlvtype_proc *procs, struct sk_buff *skb) default: /* Other TLV code so scan list */ if (optlen > len) goto bad; + + tlv_count++; + if (tlv_count > max_count) + goto bad; + for (curr = procs; curr->type >= 0; curr++) { if (curr->type == nh[off]) { /* type specific length/alignment @@ -158,10 +185,10 @@ static bool ip6_parse_tlv(const struct tlvtype_proc *procs, struct sk_buff *skb) break; } } - if (curr->type < 0) { - if (ip6_tlvopt_unknown(skb, off) == 0) - return false; - } + if (curr->type < 0 && + !ip6_tlvopt_unknown(skb, off, disallow_unknowns)) + return false; + padlen = 0; break; } @@ -255,28 +282,37 @@ static const struct tlvtype_proc tlvprocdestopt_lst[] = { static int ipv6_destopt_rcv(struct sk_buff *skb) { + struct inet6_dev *idev = __in6_dev_get(skb->dev); struct inet6_skb_parm *opt = IP6CB(skb); #if IS_ENABLED(CONFIG_IPV6_MIP6) __u16 dstbuf; #endif struct dst_entry *dst = skb_dst(skb); + struct net *net = dev_net(skb->dev); + int extlen; if (!pskb_may_pull(skb, skb_transport_offset(skb) + 8) || !pskb_may_pull(skb, (skb_transport_offset(skb) + ((skb_transport_header(skb)[1] + 1) << 3)))) { - __IP6_INC_STATS(dev_net(dst->dev), ip6_dst_idev(dst), + __IP6_INC_STATS(dev_net(dst->dev), idev, IPSTATS_MIB_INHDRERRORS); +fail_and_free: kfree_skb(skb); return -1; } + extlen = (skb_transport_header(skb)[1] + 1) << 3; + if (extlen > net->ipv6.sysctl.max_dst_opts_len) + goto fail_and_free; + opt->lastopt = opt->dst1 = skb_network_header_len(skb); #if IS_ENABLED(CONFIG_IPV6_MIP6) dstbuf = opt->dst1; #endif - if (ip6_parse_tlv(tlvprocdestopt_lst, skb)) { - skb->transport_header += (skb_transport_header(skb)[1] + 1) << 3; + if (ip6_parse_tlv(tlvprocdestopt_lst, skb, + init_net.ipv6.sysctl.max_dst_opts_cnt)) { + skb->transport_header += extlen; opt = IP6CB(skb); #if IS_ENABLED(CONFIG_IPV6_MIP6) opt->nhoff = dstbuf; @@ -286,8 +322,7 @@ static int ipv6_destopt_rcv(struct sk_buff *skb) return 1; } - __IP6_INC_STATS(dev_net(dst->dev), - ip6_dst_idev(dst), IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); return -1; } @@ -383,8 +418,7 @@ looped_back: } if (hdr->segments_left >= (hdr->hdrlen >> 1)) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, ((&hdr->segments_left) - skb_network_header(skb))); @@ -423,8 +457,7 @@ looped_back: if (skb_dst(skb)->dev->flags & IFF_LOOPBACK) { if (ipv6_hdr(skb)->hop_limit <= 1) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, 0); kfree_skb(skb); @@ -448,10 +481,10 @@ looped_back: /* called with rcu_read_lock() */ static int ipv6_rthdr_rcv(struct sk_buff *skb) { + struct inet6_dev *idev = __in6_dev_get(skb->dev); struct inet6_skb_parm *opt = IP6CB(skb); struct in6_addr *addr = NULL; struct in6_addr daddr; - struct inet6_dev *idev; int n, i; struct ipv6_rt_hdr *hdr; struct rt0_hdr *rthdr; @@ -465,8 +498,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb) if (!pskb_may_pull(skb, skb_transport_offset(skb) + 8) || !pskb_may_pull(skb, (skb_transport_offset(skb) + ((skb_transport_header(skb)[1] + 1) << 3)))) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); kfree_skb(skb); return -1; } @@ -475,8 +507,7 @@ static int ipv6_rthdr_rcv(struct sk_buff *skb) if (ipv6_addr_is_multicast(&ipv6_hdr(skb)->daddr) || skb->pkt_type != PACKET_HOST) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INADDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS); kfree_skb(skb); return -1; } @@ -494,7 +525,7 @@ looped_back: * processed by own */ if (!addr) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS); kfree_skb(skb); return -1; @@ -520,8 +551,7 @@ looped_back: goto unknown_rh; /* Silently discard invalid RTH type 2 */ if (hdr->hdrlen != 2 || hdr->segments_left != 1) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); kfree_skb(skb); return -1; } @@ -539,8 +569,7 @@ looped_back: n = hdr->hdrlen >> 1; if (hdr->segments_left > n) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, ((&hdr->segments_left) - skb_network_header(skb))); @@ -576,14 +605,12 @@ looped_back: if (xfrm6_input_addr(skb, (xfrm_address_t *)addr, (xfrm_address_t *)&ipv6_hdr(skb)->saddr, IPPROTO_ROUTING) < 0) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INADDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS); kfree_skb(skb); return -1; } if (!ipv6_chk_home_addr(dev_net(skb_dst(skb)->dev), addr)) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INADDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS); kfree_skb(skb); return -1; } @@ -594,8 +621,7 @@ looped_back: } if (ipv6_addr_is_multicast(addr)) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INADDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS); kfree_skb(skb); return -1; } @@ -614,8 +640,7 @@ looped_back: if (skb_dst(skb)->dev->flags&IFF_LOOPBACK) { if (ipv6_hdr(skb)->hop_limit <= 1) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, 0); kfree_skb(skb); @@ -630,7 +655,7 @@ looped_back: return -1; unknown_rh: - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, (&hdr->type) - skb_network_header(skb)); return -1; @@ -722,34 +747,31 @@ static bool ipv6_hop_ra(struct sk_buff *skb, int optoff) static bool ipv6_hop_jumbo(struct sk_buff *skb, int optoff) { const unsigned char *nh = skb_network_header(skb); + struct inet6_dev *idev = __in6_dev_get_safely(skb->dev); struct net *net = ipv6_skb_net(skb); u32 pkt_len; if (nh[optoff + 1] != 4 || (optoff & 3) != 2) { net_dbg_ratelimited("ipv6_hop_jumbo: wrong jumbo opt length/alignment %d\n", nh[optoff+1]); - __IP6_INC_STATS(net, ipv6_skb_idev(skb), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); goto drop; } pkt_len = ntohl(*(__be32 *)(nh + optoff + 2)); if (pkt_len <= IPV6_MAXPLEN) { - __IP6_INC_STATS(net, ipv6_skb_idev(skb), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, optoff+2); return false; } if (ipv6_hdr(skb)->payload_len) { - __IP6_INC_STATS(net, ipv6_skb_idev(skb), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, optoff); return false; } if (pkt_len > skb->len - sizeof(struct ipv6hdr)) { - __IP6_INC_STATS(net, ipv6_skb_idev(skb), - IPSTATS_MIB_INTRUNCATEDPKTS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INTRUNCATEDPKTS); goto drop; } @@ -805,6 +827,8 @@ static const struct tlvtype_proc tlvprochopopt_lst[] = { int ipv6_parse_hopopts(struct sk_buff *skb) { struct inet6_skb_parm *opt = IP6CB(skb); + struct net *net = dev_net(skb->dev); + int extlen; /* * skb_network_header(skb) is equal to skb->data, and @@ -815,13 +839,19 @@ int ipv6_parse_hopopts(struct sk_buff *skb) if (!pskb_may_pull(skb, sizeof(struct ipv6hdr) + 8) || !pskb_may_pull(skb, (sizeof(struct ipv6hdr) + ((skb_transport_header(skb)[1] + 1) << 3)))) { +fail_and_free: kfree_skb(skb); return -1; } + extlen = (skb_transport_header(skb)[1] + 1) << 3; + if (extlen > net->ipv6.sysctl.max_hbh_opts_len) + goto fail_and_free; + opt->flags |= IP6SKB_HOPBYHOP; - if (ip6_parse_tlv(tlvprochopopt_lst, skb)) { - skb->transport_header += (skb_transport_header(skb)[1] + 1) << 3; + if (ip6_parse_tlv(tlvprochopopt_lst, skb, + init_net.ipv6.sysctl.max_hbh_opts_cnt)) { + skb->transport_header += extlen; opt = IP6CB(skb); opt->nhoff = sizeof(struct ipv6hdr); return 1; diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c index 1e19fdbbd2bf..bd519be3f833 100644 --- a/net/ipv6/fib6_rules.c +++ b/net/ipv6/fib6_rules.c @@ -60,12 +60,47 @@ unsigned int fib6_rules_seq_read(struct net *net) return fib_rules_seq_read(net, AF_INET6); } +/* called with rcu lock held; no reference taken on fib6_info */ +struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6, + int flags) +{ + struct fib6_info *f6i; + int err; + + if (net->ipv6.fib6_has_custom_rules) { + struct fib_lookup_arg arg = { + .lookup_ptr = fib6_table_lookup, + .lookup_data = &oif, + .flags = FIB_LOOKUP_NOREF, + }; + + l3mdev_update_flow(net, flowi6_to_flowi(fl6)); + + err = fib_rules_lookup(net->ipv6.fib6_rules_ops, + flowi6_to_flowi(fl6), flags, &arg); + if (err) + return ERR_PTR(err); + + f6i = arg.result ? : net->ipv6.fib6_null_entry; + } else { + f6i = fib6_table_lookup(net, net->ipv6.fib6_local_tbl, + oif, fl6, flags); + if (!f6i || f6i == net->ipv6.fib6_null_entry) + f6i = fib6_table_lookup(net, net->ipv6.fib6_main_tbl, + oif, fl6, flags); + } + + return f6i; +} + struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6, + const struct sk_buff *skb, int flags, pol_lookup_t lookup) { if (net->ipv6.fib6_has_custom_rules) { struct fib_lookup_arg arg = { .lookup_ptr = lookup, + .lookup_data = skb, .flags = FIB_LOOKUP_NOREF, }; @@ -80,11 +115,11 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6, } else { struct rt6_info *rt; - rt = lookup(net, net->ipv6.fib6_local_tbl, fl6, flags); + rt = lookup(net, net->ipv6.fib6_local_tbl, fl6, skb, flags); if (rt != net->ipv6.ip6_null_entry && rt->dst.error != -EAGAIN) return &rt->dst; ip6_rt_put(rt); - rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, flags); + rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, skb, flags); if (rt->dst.error != -EAGAIN) return &rt->dst; ip6_rt_put(rt); @@ -119,8 +154,48 @@ static int fib6_rule_saddr(struct net *net, struct fib_rule *rule, int flags, return 0; } -static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp, - int flags, struct fib_lookup_arg *arg) +static int fib6_rule_action_alt(struct fib_rule *rule, struct flowi *flp, + int flags, struct fib_lookup_arg *arg) +{ + struct flowi6 *flp6 = &flp->u.ip6; + struct net *net = rule->fr_net; + struct fib6_table *table; + struct fib6_info *f6i; + int err = -EAGAIN, *oif; + u32 tb_id; + + switch (rule->action) { + case FR_ACT_TO_TBL: + break; + case FR_ACT_UNREACHABLE: + return -ENETUNREACH; + case FR_ACT_PROHIBIT: + return -EACCES; + case FR_ACT_BLACKHOLE: + default: + return -EINVAL; + } + + tb_id = fib_rule_get_table(rule, arg); + table = fib6_get_table(net, tb_id); + if (!table) + return -EAGAIN; + + oif = (int *)arg->lookup_data; + f6i = fib6_table_lookup(net, table, *oif, flp6, flags); + if (f6i != net->ipv6.fib6_null_entry) { + err = fib6_rule_saddr(net, rule, flags, flp6, + fib6_info_nh_dev(f6i)); + + if (likely(!err)) + arg->result = f6i; + } + + return err; +} + +static int __fib6_rule_action(struct fib_rule *rule, struct flowi *flp, + int flags, struct fib_lookup_arg *arg) { struct flowi6 *flp6 = &flp->u.ip6; struct rt6_info *rt = NULL; @@ -155,7 +230,7 @@ static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp, goto out; } - rt = lookup(net, table, flp6, flags); + rt = lookup(net, table, flp6, arg->lookup_data, flags); if (rt != net->ipv6.ip6_null_entry) { struct inet6_dev *idev = ip6_dst_idev(&rt->dst); @@ -184,6 +259,15 @@ out: return err; } +static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp, + int flags, struct fib_lookup_arg *arg) +{ + if (arg->lookup_ptr == fib6_table_lookup) + return fib6_rule_action_alt(rule, flp, flags, arg); + + return __fib6_rule_action(rule, flp, flags, arg); +} + static bool fib6_rule_suppress(struct fib_rule *rule, struct fib_lookup_arg *arg) { struct rt6_info *rt = (struct rt6_info *) arg->result; @@ -272,12 +356,26 @@ static int fib6_rule_configure(struct fib_rule *rule, struct sk_buff *skb, rule6->dst.plen = frh->dst_len; rule6->tclass = frh->tos; + if (fib_rule_requires_fldissect(rule)) + net->ipv6.fib6_rules_require_fldissect++; + net->ipv6.fib6_has_custom_rules = true; err = 0; errout: return err; } +static int fib6_rule_delete(struct fib_rule *rule) +{ + struct net *net = rule->fr_net; + + if (net->ipv6.fib6_rules_require_fldissect && + fib_rule_requires_fldissect(rule)) + net->ipv6.fib6_rules_require_fldissect--; + + return 0; +} + static int fib6_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh, struct nlattr **tb) { @@ -342,6 +440,7 @@ static const struct fib_rules_ops __net_initconst fib6_rules_ops_template = { .match = fib6_rule_match, .suppress = fib6_rule_suppress, .configure = fib6_rule_configure, + .delete = fib6_rule_delete, .compare = fib6_rule_compare, .fill = fib6_rule_fill, .nlmsg_payload = fib6_rule_nlmsg_payload, @@ -370,6 +469,7 @@ static int __net_init fib6_rules_net_init(struct net *net) goto out_fib6_rules_ops; net->ipv6.fib6_rules_ops = ops; + net->ipv6.fib6_rules_require_fldissect = 0; out: return err; diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index d47dbe421922..041cc640ebe6 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -527,7 +527,7 @@ void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, fl6.fl6_icmp_type = type; fl6.fl6_icmp_code = code; fl6.flowi6_uid = sock_net_uid(net, NULL); - fl6.mp_hash = rt6_multipath_hash(&fl6, skb); + fl6.mp_hash = rt6_multipath_hash(net, &fl6, skb, NULL); security_skb_classify_flow(skb, flowi6_to_flowi(&fl6)); sk = icmpv6_xmit_lock(net); @@ -636,7 +636,8 @@ int ip6_err_gen_icmpv6_unreach(struct sk_buff *skb, int nhs, int type, skb_pull(skb2, nhs); skb_reset_network_header(skb2); - rt = rt6_lookup(dev_net(skb->dev), &ipv6_hdr(skb2)->saddr, NULL, 0, 0); + rt = rt6_lookup(dev_net(skb->dev), &ipv6_hdr(skb2)->saddr, NULL, 0, + skb, 0); if (rt && rt->dst.dev) skb2->dev = rt->dst.dev; diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c index 7d83ab627b09..d8391921363f 100644 --- a/net/ipv6/inet6_hashtables.c +++ b/net/ipv6/inet6_hashtables.c @@ -125,6 +125,40 @@ static inline int compute_score(struct sock *sk, struct net *net, } /* called with rcu_read_lock() */ +static struct sock *inet6_lhash2_lookup(struct net *net, + struct inet_listen_hashbucket *ilb2, + struct sk_buff *skb, int doff, + const struct in6_addr *saddr, + const __be16 sport, const struct in6_addr *daddr, + const unsigned short hnum, const int dif, const int sdif) +{ + bool exact_dif = inet6_exact_dif_match(net, skb); + struct inet_connection_sock *icsk; + struct sock *sk, *result = NULL; + int score, hiscore = 0; + u32 phash = 0; + + inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) { + sk = (struct sock *)icsk; + score = compute_score(sk, net, hnum, daddr, dif, sdif, + exact_dif); + if (score > hiscore) { + if (sk->sk_reuseport) { + phash = inet6_ehashfn(net, daddr, hnum, + saddr, sport); + result = reuseport_select_sock(sk, phash, + skb, doff); + if (result) + return result; + } + result = sk; + hiscore = score; + } + } + + return result; +} + struct sock *inet6_lookup_listener(struct net *net, struct inet_hashinfo *hashinfo, struct sk_buff *skb, int doff, @@ -134,34 +168,63 @@ struct sock *inet6_lookup_listener(struct net *net, { unsigned int hash = inet_lhashfn(net, hnum); struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash]; - int score, hiscore = 0, matches = 0, reuseport = 0; bool exact_dif = inet6_exact_dif_match(net, skb); + struct inet_listen_hashbucket *ilb2; struct sock *sk, *result = NULL; struct hlist_nulls_node *node; + int score, hiscore = 0; + unsigned int hash2; u32 phash = 0; + if (ilb->count <= 10 || !hashinfo->lhash2) + goto port_lookup; + + /* Too many sk in the ilb bucket (which is hashed by port alone). + * Try lhash2 (which is hashed by port and addr) instead. + */ + + hash2 = ipv6_portaddr_hash(net, daddr, hnum); + ilb2 = inet_lhash2_bucket(hashinfo, hash2); + if (ilb2->count > ilb->count) + goto port_lookup; + + result = inet6_lhash2_lookup(net, ilb2, skb, doff, + saddr, sport, daddr, hnum, + dif, sdif); + if (result) + goto done; + + /* Lookup lhash2 with in6addr_any */ + + hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum); + ilb2 = inet_lhash2_bucket(hashinfo, hash2); + if (ilb2->count > ilb->count) + goto port_lookup; + + result = inet6_lhash2_lookup(net, ilb2, skb, doff, + saddr, sport, daddr, hnum, + dif, sdif); + goto done; + +port_lookup: sk_nulls_for_each(sk, node, &ilb->nulls_head) { score = compute_score(sk, net, hnum, daddr, dif, sdif, exact_dif); if (score > hiscore) { - reuseport = sk->sk_reuseport; - if (reuseport) { + if (sk->sk_reuseport) { phash = inet6_ehashfn(net, daddr, hnum, saddr, sport); result = reuseport_select_sock(sk, phash, skb, doff); if (result) - return result; - matches = 1; + goto done; } result = sk; hiscore = score; - } else if (score == hiscore && reuseport) { - matches++; - if (reciprocal_scale(phash, matches) == 0) - result = sk; - phash = next_pseudo_random32(phash); } } +done: + if (unlikely(IS_ERR(result))) + return NULL; return result; } EXPORT_SYMBOL_GPL(inet6_lookup_listener); diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index aae8dab8b031..92914be99e9f 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -38,21 +38,12 @@ #include #include -#include -#define RT6_DEBUG 2 - -#if RT6_DEBUG >= 3 -#define RT6_TRACE(x...) pr_debug(x) -#else -#define RT6_TRACE(x...) do { ; } while (0) -#endif - static struct kmem_cache *fib6_node_kmem __read_mostly; struct fib6_cleaner { struct fib6_walker w; struct net *net; - int (*func)(struct rt6_info *, void *arg); + int (*func)(struct fib6_info *, void *arg); int sernum; void *arg; }; @@ -63,9 +54,12 @@ struct fib6_cleaner { #define FWS_INIT FWS_L #endif -static void fib6_prune_clones(struct net *net, struct fib6_node *fn); -static struct rt6_info *fib6_find_prefix(struct net *net, struct fib6_node *fn); -static struct fib6_node *fib6_repair_tree(struct net *net, struct fib6_node *fn); +static struct fib6_info *fib6_find_prefix(struct net *net, + struct fib6_table *table, + struct fib6_node *fn); +static struct fib6_node *fib6_repair_tree(struct net *net, + struct fib6_table *table, + struct fib6_node *fn); static int fib6_walk(struct net *net, struct fib6_walker *w); static int fib6_walk_continue(struct fib6_walker *w); @@ -111,6 +105,16 @@ enum { FIB6_NO_SERNUM_CHANGE = 0, }; +void fib6_update_sernum(struct net *net, struct fib6_info *f6i) +{ + struct fib6_node *fn; + + fn = rcu_dereference_protected(f6i->fib6_node, + lockdep_is_held(&f6i->fib6_table->tb6_lock)); + if (fn) + WRITE_ONCE(fn->fn_sernum, fib6_new_sernum(net)); +} + /* * Auxiliary address test functions for the radix tree. * @@ -141,18 +145,85 @@ static __be32 addr_bit_set(const void *token, int fn_bit) addr[fn_bit >> 5]; } -static struct fib6_node *node_alloc(void) +struct fib6_info *fib6_info_alloc(gfp_t gfp_flags) +{ + struct fib6_info *f6i; + + f6i = kzalloc(sizeof(*f6i), gfp_flags); + if (!f6i) + return NULL; + + f6i->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, gfp_flags); + if (!f6i->rt6i_pcpu) { + kfree(f6i); + return NULL; + } + + INIT_LIST_HEAD(&f6i->fib6_siblings); + f6i->fib6_metrics = (struct dst_metrics *)&dst_default_metrics; + + atomic_inc(&f6i->fib6_ref); + + return f6i; +} + +void fib6_info_destroy_rcu(struct rcu_head *head) +{ + struct fib6_info *f6i = container_of(head, struct fib6_info, rcu); + struct rt6_exception_bucket *bucket; + struct dst_metrics *m; + + WARN_ON(f6i->fib6_node); + + bucket = rcu_dereference_protected(f6i->rt6i_exception_bucket, 1); + if (bucket) { + f6i->rt6i_exception_bucket = NULL; + kfree(bucket); + } + + if (f6i->rt6i_pcpu) { + int cpu; + + for_each_possible_cpu(cpu) { + struct rt6_info **ppcpu_rt; + struct rt6_info *pcpu_rt; + + ppcpu_rt = per_cpu_ptr(f6i->rt6i_pcpu, cpu); + pcpu_rt = *ppcpu_rt; + if (pcpu_rt) { + dst_dev_put(&pcpu_rt->dst); + dst_release(&pcpu_rt->dst); + *ppcpu_rt = NULL; + } + } + } + + if (f6i->fib6_nh.nh_dev) + dev_put(f6i->fib6_nh.nh_dev); + + m = f6i->fib6_metrics; + if (m != &dst_default_metrics && refcount_dec_and_test(&m->refcnt)) + kfree(m); + + kfree(f6i); +} +EXPORT_SYMBOL_GPL(fib6_info_destroy_rcu); + +static struct fib6_node *node_alloc(struct net *net) { struct fib6_node *fn; fn = kmem_cache_zalloc(fib6_node_kmem, GFP_ATOMIC); + if (fn) + net->ipv6.rt6_stats->fib_nodes++; return fn; } -static void node_free_immediate(struct fib6_node *fn) +static void node_free_immediate(struct net *net, struct fib6_node *fn) { kmem_cache_free(fib6_node_kmem, fn); + net->ipv6.rt6_stats->fib_nodes--; } static void node_free_rcu(struct rcu_head *head) @@ -162,36 +233,12 @@ static void node_free_rcu(struct rcu_head *head) kmem_cache_free(fib6_node_kmem, fn); } -static void node_free(struct fib6_node *fn) +static void node_free(struct net *net, struct fib6_node *fn) { call_rcu(&fn->rcu, node_free_rcu); + net->ipv6.rt6_stats->fib_nodes--; } -void rt6_free_pcpu(struct rt6_info *non_pcpu_rt) -{ - int cpu; - - if (!non_pcpu_rt->rt6i_pcpu) - return; - - for_each_possible_cpu(cpu) { - struct rt6_info **ppcpu_rt; - struct rt6_info *pcpu_rt; - - ppcpu_rt = per_cpu_ptr(non_pcpu_rt->rt6i_pcpu, cpu); - pcpu_rt = *ppcpu_rt; - if (pcpu_rt) { - dst_dev_put(&pcpu_rt->dst); - dst_release(&pcpu_rt->dst); - *ppcpu_rt = NULL; - } - } - - free_percpu(non_pcpu_rt->rt6i_pcpu); - non_pcpu_rt->rt6i_pcpu = NULL; -} -EXPORT_SYMBOL_GPL(rt6_free_pcpu); - static void fib6_free_table(struct fib6_table *table) { inetpeer_invalidate_tree(&table->tb6_peers); @@ -206,8 +253,7 @@ static void fib6_link_table(struct net *net, struct fib6_table *tb) * Initialize table lock at a single place to give lockdep a key, * tables aren't visible prior to being linked to the list. */ - rwlock_init(&tb->tb6_lock); - + spin_lock_init(&tb->tb6_lock); h = tb->tb6_id & (FIB6_TABLE_HASHSZ - 1); /* @@ -226,7 +272,8 @@ static struct fib6_table *fib6_alloc_table(struct net *net, u32 id) table = kzalloc(sizeof(*table), GFP_ATOMIC); if (table) { table->tb6_id = id; - table->tb6_root.leaf = net->ipv6.ip6_null_entry; + rcu_assign_pointer(table->tb6_root.leaf, + net->ipv6.fib6_null_entry); table->tb6_root.fn_flags = RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO; inet_peer_base_init(&table->tb6_peers); } @@ -293,11 +340,12 @@ struct fib6_table *fib6_get_table(struct net *net, u32 id) } struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6, + const struct sk_buff *skb, int flags, pol_lookup_t lookup) { struct rt6_info *rt; - rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, flags); + rt = lookup(net, net->ipv6.fib6_main_tbl, fl6, skb, flags); if (rt->dst.error == -EAGAIN) { ip6_rt_put(rt); rt = net->ipv6.ip6_null_entry; @@ -307,6 +355,13 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6, return &rt->dst; } +/* called with rcu lock held; no reference taken on fib6_info */ +struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6, + int flags) +{ + return fib6_table_lookup(net, net->ipv6.fib6_main_tbl, oif, fl6, flags); +} + static void __net_init fib6_tables_init(struct net *net) { fib6_link_table(net, net->ipv6.fib6_main_tbl); @@ -323,11 +378,8 @@ unsigned int fib6_tables_seq_read(struct net *net) struct hlist_head *head = &net->ipv6.fib_table_hash[h]; struct fib6_table *tb; - hlist_for_each_entry_rcu(tb, head, tb6_hlist) { - read_lock_bh(&tb->tb6_lock); + hlist_for_each_entry_rcu(tb, head, tb6_hlist) fib_seq += tb->fib_seq; - read_unlock_bh(&tb->tb6_lock); - } } rcu_read_unlock(); @@ -336,7 +388,7 @@ unsigned int fib6_tables_seq_read(struct net *net) static int call_fib6_entry_notifier(struct notifier_block *nb, struct net *net, enum fib_event_type event_type, - struct rt6_info *rt) + struct fib6_info *rt) { struct fib6_entry_notifier_info info = { .rt = rt, @@ -347,13 +399,15 @@ static int call_fib6_entry_notifier(struct notifier_block *nb, struct net *net, static int call_fib6_entry_notifiers(struct net *net, enum fib_event_type event_type, - struct rt6_info *rt) + struct fib6_info *rt, + struct netlink_ext_ack *extack) { struct fib6_entry_notifier_info info = { + .info.extack = extack, .rt = rt, }; - rt->rt6i_table->fib_seq++; + rt->fib6_table->fib_seq++; return call_fib6_notifiers(net, event_type, &info.info); } @@ -362,18 +416,18 @@ struct fib6_dump_arg { struct notifier_block *nb; }; -static void fib6_rt_dump(struct rt6_info *rt, struct fib6_dump_arg *arg) +static void fib6_rt_dump(struct fib6_info *rt, struct fib6_dump_arg *arg) { - if (rt == arg->net->ipv6.ip6_null_entry) + if (rt == arg->net->ipv6.fib6_null_entry) return; call_fib6_entry_notifier(arg->nb, arg->net, FIB_EVENT_ENTRY_ADD, rt); } static int fib6_node_dump(struct fib6_walker *w) { - struct rt6_info *rt; + struct fib6_info *rt; - for (rt = w->leaf; rt; rt = rt->dst.rt6_next) + for_each_fib6_walker_rt(w) fib6_rt_dump(rt, w->args); w->leaf = NULL; return 0; @@ -383,9 +437,9 @@ static void fib6_table_dump(struct net *net, struct fib6_table *tb, struct fib6_walker *w) { w->root = &tb->tb6_root; - read_lock_bh(&tb->tb6_lock); + spin_lock_bh(&tb->tb6_lock); fib6_walk(net, w); - read_unlock_bh(&tb->tb6_lock); + spin_unlock_bh(&tb->tb6_lock); } /* Called with rcu_read_lock() */ @@ -420,9 +474,9 @@ int fib6_tables_dump(struct net *net, struct notifier_block *nb) static int fib6_dump_node(struct fib6_walker *w) { int res; - struct rt6_info *rt; + struct fib6_info *rt; - for (rt = w->leaf; rt; rt = rt->dst.rt6_next) { + for_each_fib6_walker_rt(w) { res = rt6_dump_route(rt, w->args); if (res < 0) { /* Frame is full, suspend walking */ @@ -435,10 +489,10 @@ static int fib6_dump_node(struct fib6_walker *w) * last sibling of this route (no need to dump the * sibling routes again) */ - if (rt->rt6i_nsiblings) - rt = list_last_entry(&rt->rt6i_siblings, - struct rt6_info, - rt6i_siblings); + if (rt->fib6_nsiblings) + rt = list_last_entry(&rt->fib6_siblings, + struct fib6_info, + fib6_siblings); } w->leaf = NULL; return 0; @@ -481,26 +535,27 @@ static int fib6_dump_table(struct fib6_table *table, struct sk_buff *skb, w->count = 0; w->skip = 0; - read_lock_bh(&table->tb6_lock); + spin_lock_bh(&table->tb6_lock); res = fib6_walk(net, w); - read_unlock_bh(&table->tb6_lock); + spin_unlock_bh(&table->tb6_lock); if (res > 0) { cb->args[4] = 1; - cb->args[5] = w->root->fn_sernum; + cb->args[5] = READ_ONCE(w->root->fn_sernum); } } else { - if (cb->args[5] != w->root->fn_sernum) { + int sernum = READ_ONCE(w->root->fn_sernum); + if (cb->args[5] != sernum) { /* Begin at the root if the tree changed */ - cb->args[5] = w->root->fn_sernum; + cb->args[5] = sernum; w->state = FWS_INIT; w->node = w->root; w->skip = w->count; } else w->skip = 0; - read_lock_bh(&table->tb6_lock); + spin_lock_bh(&table->tb6_lock); res = fib6_walk_continue(w); - read_unlock_bh(&table->tb6_lock); + spin_unlock_bh(&table->tb6_lock); if (res <= 0) { fib6_walker_unlink(net, w); cb->args[4] = 0; @@ -573,6 +628,24 @@ out: return res; } +void fib6_metric_set(struct fib6_info *f6i, int metric, u32 val) +{ + if (!f6i) + return; + + if (f6i->fib6_metrics == &dst_default_metrics) { + struct dst_metrics *p = kzalloc(sizeof(*p), GFP_ATOMIC); + + if (!p) + return; + + refcount_set(&p->refcnt, 1); + f6i->fib6_metrics = p; + } + + f6i->fib6_metrics->metrics[metric - 1] = val; +} + /* * Routing Table * @@ -581,11 +654,13 @@ out: * node. */ -static struct fib6_node *fib6_add_1(struct fib6_node *root, - struct in6_addr *addr, int plen, - int offset, int allow_create, - int replace_required, int sernum, - struct netlink_ext_ack *extack) +static struct fib6_node *fib6_add_1(struct net *net, + struct fib6_table *table, + struct fib6_node *root, + struct in6_addr *addr, int plen, + int offset, int allow_create, + int replace_required, + struct netlink_ext_ack *extack) { struct fib6_node *fn, *in, *ln; struct fib6_node *pn = NULL; @@ -600,7 +675,9 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root, fn = root; do { - key = (struct rt6key *)((u8 *)fn->leaf + offset); + struct fib6_info *leaf = rcu_dereference_protected(fn->leaf, + lockdep_is_held(&table->tb6_lock)); + key = (struct rt6key *)((u8 *)leaf + offset); /* * Prefix match @@ -626,12 +703,15 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root, if (plen == fn->fn_bit) { /* clean up an intermediate node */ if (!(fn->fn_flags & RTN_RTINFO)) { - rt6_release(fn->leaf); - fn->leaf = NULL; + RCU_INIT_POINTER(fn->leaf, NULL); + fib6_info_release(leaf); + /* remove null_entry in the root node */ + } else if (fn->fn_flags & RTN_TL_ROOT && + rcu_access_pointer(fn->leaf) == + net->ipv6.fib6_null_entry) { + RCU_INIT_POINTER(fn->leaf, NULL); } - fn->fn_sernum = sernum; - return fn; } @@ -640,10 +720,13 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root, */ /* Try to walk down on tree. */ - fn->fn_sernum = sernum; dir = addr_bit_set(addr, fn->fn_bit); pn = fn; - fn = dir ? fn->right : fn->left; + fn = dir ? + rcu_dereference_protected(fn->right, + lockdep_is_held(&table->tb6_lock)) : + rcu_dereference_protected(fn->left, + lockdep_is_held(&table->tb6_lock)); } while (fn); if (!allow_create) { @@ -669,19 +752,17 @@ static struct fib6_node *fib6_add_1(struct fib6_node *root, * Create new leaf node without children. */ - ln = node_alloc(); + ln = node_alloc(net); if (!ln) return ERR_PTR(-ENOMEM); ln->fn_bit = plen; - - ln->parent = pn; - ln->fn_sernum = sernum; + RCU_INIT_POINTER(ln->parent, pn); if (dir) - pn->right = ln; + rcu_assign_pointer(pn->right, ln); else - pn->left = ln; + rcu_assign_pointer(pn->left, ln); return ln; @@ -695,7 +776,8 @@ insert_above: * and the current */ - pn = fn->parent; + pn = rcu_dereference_protected(fn->parent, + lockdep_is_held(&table->tb6_lock)); /* find 1st bit in difference between the 2 addrs. @@ -711,14 +793,14 @@ insert_above: * (new leaf node)[ln] (old node)[fn] */ if (plen > bit) { - in = node_alloc(); - ln = node_alloc(); + in = node_alloc(net); + ln = node_alloc(net); if (!in || !ln) { if (in) - node_free_immediate(in); + node_free_immediate(net, in); if (ln) - node_free_immediate(ln); + node_free_immediate(net, ln); return ERR_PTR(-ENOMEM); } @@ -732,31 +814,28 @@ insert_above: in->fn_bit = bit; - in->parent = pn; + RCU_INIT_POINTER(in->parent, pn); in->leaf = fn->leaf; - atomic_inc(&in->leaf->rt6i_ref); - - in->fn_sernum = sernum; + atomic_inc(&rcu_dereference_protected(in->leaf, + lockdep_is_held(&table->tb6_lock))->fib6_ref); /* update parent pointer */ if (dir) - pn->right = in; + rcu_assign_pointer(pn->right, in); else - pn->left = in; + rcu_assign_pointer(pn->left, in); ln->fn_bit = plen; - ln->parent = in; - fn->parent = in; - - ln->fn_sernum = sernum; + RCU_INIT_POINTER(ln->parent, in); + rcu_assign_pointer(fn->parent, in); if (addr_bit_set(addr, bit)) { - in->right = ln; - in->left = fn; + rcu_assign_pointer(in->right, ln); + rcu_assign_pointer(in->left, fn); } else { - in->left = ln; - in->right = fn; + rcu_assign_pointer(in->left, ln); + rcu_assign_pointer(in->right, fn); } } else { /* plen <= bit */ @@ -766,74 +845,70 @@ insert_above: * (old node)[fn] NULL */ - ln = node_alloc(); + ln = node_alloc(net); if (!ln) return ERR_PTR(-ENOMEM); ln->fn_bit = plen; - ln->parent = pn; - - ln->fn_sernum = sernum; - - if (dir) - pn->right = ln; - else - pn->left = ln; + RCU_INIT_POINTER(ln->parent, pn); if (addr_bit_set(&key->addr, plen)) - ln->right = fn; + RCU_INIT_POINTER(ln->right, fn); else - ln->left = fn; + RCU_INIT_POINTER(ln->left, fn); - fn->parent = ln; + rcu_assign_pointer(fn->parent, ln); + + if (dir) + rcu_assign_pointer(pn->right, ln); + else + rcu_assign_pointer(pn->left, ln); } return ln; } -static bool rt6_qualify_for_ecmp(struct rt6_info *rt) +static void fib6_drop_pcpu_from(struct fib6_info *f6i, + const struct fib6_table *table) { - return (rt->rt6i_flags & (RTF_GATEWAY|RTF_ADDRCONF|RTF_DYNAMIC)) == - RTF_GATEWAY; -} + int cpu; -static void fib6_copy_metrics(u32 *mp, const struct mx6_config *mxc) -{ - int i; + /* Make sure rt6_make_pcpu_route() wont add other percpu routes + * while we are cleaning them here. + */ + f6i->fib6_destroying = 1; + mb(); /* paired with the cmpxchg() in rt6_make_pcpu_route() */ - for (i = 0; i < RTAX_MAX; i++) { - if (test_bit(i, mxc->mx_valid)) - mp[i] = mxc->mx[i]; + /* release the reference to this fib entry from + * all of its cached pcpu routes + */ + for_each_possible_cpu(cpu) { + struct rt6_info **ppcpu_rt; + struct rt6_info *pcpu_rt; + + ppcpu_rt = per_cpu_ptr(f6i->rt6i_pcpu, cpu); + pcpu_rt = *ppcpu_rt; + if (pcpu_rt) { + struct fib6_info *from; + + from = rcu_dereference_protected(pcpu_rt->from, + lockdep_is_held(&table->tb6_lock)); + rcu_assign_pointer(pcpu_rt->from, NULL); + fib6_info_release(from); + } } } -static int fib6_commit_metrics(struct dst_entry *dst, struct mx6_config *mxc) -{ - if (!mxc->mx) - return 0; - - if (dst->flags & DST_HOST) { - u32 *mp = dst_metrics_write_ptr(dst); - - if (unlikely(!mp)) - return -ENOMEM; - - fib6_copy_metrics(mp, mxc); - } else { - dst_init_metrics(dst, mxc->mx, false); - - /* We've stolen mx now. */ - mxc->mx = NULL; - } - - return 0; -} - -static void fib6_purge_rt(struct rt6_info *rt, struct fib6_node *fn, +static void fib6_purge_rt(struct fib6_info *rt, struct fib6_node *fn, struct net *net) { - if (atomic_read(&rt->rt6i_ref) != 1) { + struct fib6_table *table = rt->fib6_table; + + if (rt->rt6i_pcpu) + fib6_drop_pcpu_from(rt, table); + + if (atomic_read(&rt->fib6_ref) != 1) { /* This route is used as dummy address holder in some split * nodes. It is not leaked, but it still holds other resources, * which must be released in time. So, scan ascendant nodes @@ -841,12 +916,18 @@ static void fib6_purge_rt(struct rt6_info *rt, struct fib6_node *fn, * to still alive ones. */ while (fn) { - if (!(fn->fn_flags & RTN_RTINFO) && fn->leaf == rt) { - fn->leaf = fib6_find_prefix(net, fn); - atomic_inc(&fn->leaf->rt6i_ref); - rt6_release(rt); + struct fib6_info *leaf = rcu_dereference_protected(fn->leaf, + lockdep_is_held(&table->tb6_lock)); + struct fib6_info *new_leaf; + if (!(fn->fn_flags & RTN_RTINFO) && leaf == rt) { + new_leaf = fib6_find_prefix(net, table, fn); + atomic_inc(&new_leaf->fib6_ref); + + rcu_assign_pointer(fn->leaf, new_leaf); + fib6_info_release(rt); } - fn = fn->parent; + fn = rcu_dereference_protected(fn->parent, + lockdep_is_held(&table->tb6_lock)); } } } @@ -855,12 +936,15 @@ static void fib6_purge_rt(struct rt6_info *rt, struct fib6_node *fn, * Insert routing information in a node. */ -static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt, - struct nl_info *info, struct mx6_config *mxc) +static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt, + struct nl_info *info, + struct netlink_ext_ack *extack) { - struct rt6_info *iter = NULL; - struct rt6_info **ins; - struct rt6_info **fallback_ins = NULL; + struct fib6_info *leaf = rcu_dereference_protected(fn->leaf, + lockdep_is_held(&rt->fib6_table->tb6_lock)); + struct fib6_info *iter = NULL; + struct fib6_info __rcu **ins; + struct fib6_info __rcu **fallback_ins = NULL; int replace = (info->nlh && (info->nlh->nlmsg_flags & NLM_F_REPLACE)); int add = (!info->nlh || @@ -875,12 +959,14 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt, ins = &fn->leaf; - for (iter = fn->leaf; iter; iter = iter->dst.rt6_next) { + for (iter = leaf; iter; + iter = rcu_dereference_protected(iter->fib6_next, + lockdep_is_held(&rt->fib6_table->tb6_lock))) { /* * Search for duplicates */ - if (iter->rt6i_metric == rt->rt6i_metric) { + if (iter->fib6_metric == rt->fib6_metric) { /* * Same priority level */ @@ -899,15 +985,15 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt, } if (rt6_duplicate_nexthop(iter, rt)) { - if (rt->rt6i_nsiblings) - rt->rt6i_nsiblings = 0; - if (!(iter->rt6i_flags & RTF_EXPIRES)) + if (rt->fib6_nsiblings) + rt->fib6_nsiblings = 0; + if (!(iter->fib6_flags & RTF_EXPIRES)) return -EEXIST; - if (!(rt->rt6i_flags & RTF_EXPIRES)) - rt6_clean_expires(iter); + if (!(rt->fib6_flags & RTF_EXPIRES)) + fib6_clean_expires(iter); else - rt6_set_expires(iter, rt->dst.expires); - iter->rt6i_pmtu = rt->rt6i_pmtu; + fib6_set_expires(iter, rt->expires); + fib6_metric_set(iter, RTAX_MTU, rt->fib6_pmtu); return -EEXIST; } /* If we have the same destination and the same metric, @@ -923,14 +1009,14 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt, */ if (rt_can_ecmp && rt6_qualify_for_ecmp(iter)) - rt->rt6i_nsiblings++; + rt->fib6_nsiblings++; } - if (iter->rt6i_metric > rt->rt6i_metric) + if (iter->fib6_metric > rt->fib6_metric) break; next_iter: - ins = &iter->dst.rt6_next; + ins = &iter->fib6_next; } if (fallback_ins && !found) { @@ -938,7 +1024,8 @@ next_iter: * first matching route */ ins = fallback_ins; - iter = *ins; + iter = rcu_dereference_protected(*ins, + lockdep_is_held(&rt->fib6_table->tb6_lock)); found++; } @@ -947,33 +1034,35 @@ next_iter: fn->rr_ptr = NULL; /* Link this route to others same route. */ - if (rt->rt6i_nsiblings) { - unsigned int rt6i_nsiblings; - struct rt6_info *sibling, *temp_sibling; + if (rt->fib6_nsiblings) { + unsigned int fib6_nsiblings; + struct fib6_info *sibling, *temp_sibling; /* Find the first route that have the same metric */ - sibling = fn->leaf; + sibling = leaf; while (sibling) { - if (sibling->rt6i_metric == rt->rt6i_metric && + if (sibling->fib6_metric == rt->fib6_metric && rt6_qualify_for_ecmp(sibling)) { - list_add_tail(&rt->rt6i_siblings, - &sibling->rt6i_siblings); + list_add_tail(&rt->fib6_siblings, + &sibling->fib6_siblings); break; } - sibling = sibling->dst.rt6_next; + sibling = rcu_dereference_protected(sibling->fib6_next, + lockdep_is_held(&rt->fib6_table->tb6_lock)); } /* For each sibling in the list, increment the counter of * siblings. BUG() if counters does not match, list of siblings * is broken! */ - rt6i_nsiblings = 0; + fib6_nsiblings = 0; list_for_each_entry_safe(sibling, temp_sibling, - &rt->rt6i_siblings, rt6i_siblings) { - sibling->rt6i_nsiblings++; - BUG_ON(sibling->rt6i_nsiblings != rt->rt6i_nsiblings); - rt6i_nsiblings++; + &rt->fib6_siblings, fib6_siblings) { + sibling->fib6_nsiblings++; + BUG_ON(sibling->fib6_nsiblings != rt->fib6_nsiblings); + fib6_nsiblings++; } - BUG_ON(rt6i_nsiblings != rt->rt6i_nsiblings); + BUG_ON(fib6_nsiblings != rt->fib6_nsiblings); + rt6_multipath_rebalance(temp_sibling); } /* @@ -985,16 +1074,17 @@ next_iter: add: nlflags |= NLM_F_CREATE; - err = fib6_commit_metrics(&rt->dst, mxc); + + err = call_fib6_entry_notifiers(info->nl_net, + FIB_EVENT_ENTRY_ADD, + rt, extack); if (err) return err; - rt->dst.rt6_next = iter; - *ins = rt; - rcu_assign_pointer(rt->rt6i_node, fn); - atomic_inc(&rt->rt6i_ref); - call_fib6_entry_notifiers(info->nl_net, FIB_EVENT_ENTRY_ADD, - rt); + rcu_assign_pointer(rt->fib6_next, iter); + atomic_inc(&rt->fib6_ref); + rcu_assign_pointer(rt->fib6_node, fn); + rcu_assign_pointer(*ins, rt); if (!info->skip_notify) inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags); info->nl_net->ipv6.rt6_stats->fib_rt_entries++; @@ -1014,48 +1104,51 @@ add: return -ENOENT; } - err = fib6_commit_metrics(&rt->dst, mxc); + err = call_fib6_entry_notifiers(info->nl_net, + FIB_EVENT_ENTRY_REPLACE, + rt, extack); if (err) return err; - *ins = rt; - rcu_assign_pointer(rt->rt6i_node, fn); - rt->dst.rt6_next = iter->dst.rt6_next; - atomic_inc(&rt->rt6i_ref); - call_fib6_entry_notifiers(info->nl_net, FIB_EVENT_ENTRY_REPLACE, - rt); + atomic_inc(&rt->fib6_ref); + rcu_assign_pointer(rt->fib6_node, fn); + rt->fib6_next = iter->fib6_next; + rcu_assign_pointer(*ins, rt); if (!info->skip_notify) inet6_rt_notify(RTM_NEWROUTE, rt, info, NLM_F_REPLACE); if (!(fn->fn_flags & RTN_RTINFO)) { info->nl_net->ipv6.rt6_stats->fib_route_nodes++; fn->fn_flags |= RTN_RTINFO; } - nsiblings = iter->rt6i_nsiblings; - iter->rt6i_node = NULL; + nsiblings = iter->fib6_nsiblings; + iter->fib6_node = NULL; fib6_purge_rt(iter, fn, info->nl_net); - if (fn->rr_ptr == iter) + if (rcu_access_pointer(fn->rr_ptr) == iter) fn->rr_ptr = NULL; - rt6_release(iter); + fib6_info_release(iter); if (nsiblings) { /* Replacing an ECMP route, remove all siblings */ - ins = &rt->dst.rt6_next; - iter = *ins; + ins = &rt->fib6_next; + iter = rcu_dereference_protected(*ins, + lockdep_is_held(&rt->fib6_table->tb6_lock)); while (iter) { - if (iter->rt6i_metric > rt->rt6i_metric) + if (iter->fib6_metric > rt->fib6_metric) break; if (rt6_qualify_for_ecmp(iter)) { - *ins = iter->dst.rt6_next; - iter->rt6i_node = NULL; + *ins = iter->fib6_next; + iter->fib6_node = NULL; fib6_purge_rt(iter, fn, info->nl_net); - if (fn->rr_ptr == iter) + if (rcu_access_pointer(fn->rr_ptr) == iter) fn->rr_ptr = NULL; - rt6_release(iter); + fib6_info_release(iter); nsiblings--; + info->nl_net->ipv6.rt6_stats->fib_rt_entries--; } else { - ins = &iter->dst.rt6_next; + ins = &iter->fib6_next; } - iter = *ins; + iter = rcu_dereference_protected(*ins, + lockdep_is_held(&rt->fib6_table->tb6_lock)); } WARN_ON(nsiblings != 0); } @@ -1064,10 +1157,10 @@ add: return 0; } -static void fib6_start_gc(struct net *net, struct rt6_info *rt) +static void fib6_start_gc(struct net *net, struct fib6_info *rt) { if (!timer_pending(&net->ipv6.ip6_fib_timer) && - (rt->rt6i_flags & (RTF_EXPIRES | RTF_CACHE))) + (rt->fib6_flags & RTF_EXPIRES)) mod_timer(&net->ipv6.ip6_fib_timer, jiffies + net->ipv6.sysctl.ip6_rt_gc_interval); } @@ -1079,25 +1172,43 @@ void fib6_force_start_gc(struct net *net) jiffies + net->ipv6.sysctl.ip6_rt_gc_interval); } +static void __fib6_update_sernum_upto_root(struct fib6_info *rt, + int sernum) +{ + struct fib6_node *fn = rcu_dereference_protected(rt->fib6_node, + lockdep_is_held(&rt->fib6_table->tb6_lock)); + + /* paired with smp_rmb() in rt6_get_cookie_safe() */ + smp_wmb(); + while (fn) { + WRITE_ONCE(fn->fn_sernum, sernum); + fn = rcu_dereference_protected(fn->parent, + lockdep_is_held(&rt->fib6_table->tb6_lock)); + } +} + +void fib6_update_sernum_upto_root(struct net *net, struct fib6_info *rt) +{ + __fib6_update_sernum_upto_root(rt, fib6_new_sernum(net)); +} + /* * Add routing information to the routing tree. * / * with source addr info in sub-trees + * Need to own table->tb6_lock */ -int fib6_add(struct fib6_node *root, struct rt6_info *rt, - struct nl_info *info, struct mx6_config *mxc, - struct netlink_ext_ack *extack) +int fib6_add(struct fib6_node *root, struct fib6_info *rt, + struct nl_info *info, struct netlink_ext_ack *extack) { + struct fib6_table *table = rt->fib6_table; struct fib6_node *fn, *pn = NULL; int err = -ENOMEM; int allow_create = 1; int replace_required = 0; int sernum = fib6_new_sernum(info->nl_net); - if (WARN_ON_ONCE(!atomic_read(&rt->dst.__refcnt))) - return -EINVAL; - if (info->nlh) { if (!(info->nlh->nlmsg_flags & NLM_F_CREATE)) allow_create = 0; @@ -1107,9 +1218,10 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, if (!allow_create && !replace_required) pr_warn("RTM_NEWROUTE with no NLM_F_CREATE or NLM_F_REPLACE\n"); - fn = fib6_add_1(root, &rt->rt6i_dst.addr, rt->rt6i_dst.plen, - offsetof(struct rt6_info, rt6i_dst), allow_create, - replace_required, sernum, extack); + fn = fib6_add_1(info->nl_net, table, root, + &rt->fib6_dst.addr, rt->fib6_dst.plen, + offsetof(struct fib6_info, fib6_dst), allow_create, + replace_required, extack); if (IS_ERR(fn)) { err = PTR_ERR(fn); fn = NULL; @@ -1119,10 +1231,10 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, pn = fn; #ifdef CONFIG_IPV6_SUBTREES - if (rt->rt6i_src.plen) { + if (rt->fib6_src.plen) { struct fib6_node *sn; - if (!fn->subtree) { + if (!rcu_access_pointer(fn->subtree)) { struct fib6_node *sfn; /* @@ -1136,42 +1248,40 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, */ /* Create subtree root node */ - sfn = node_alloc(); + sfn = node_alloc(info->nl_net); if (!sfn) goto failure; - sfn->leaf = info->nl_net->ipv6.ip6_null_entry; - atomic_inc(&info->nl_net->ipv6.ip6_null_entry->rt6i_ref); + atomic_inc(&info->nl_net->ipv6.fib6_null_entry->fib6_ref); + rcu_assign_pointer(sfn->leaf, + info->nl_net->ipv6.fib6_null_entry); sfn->fn_flags = RTN_ROOT; - sfn->fn_sernum = sernum; /* Now add the first leaf node to new subtree */ - sn = fib6_add_1(sfn, &rt->rt6i_src.addr, - rt->rt6i_src.plen, - offsetof(struct rt6_info, rt6i_src), - allow_create, replace_required, sernum, - extack); + sn = fib6_add_1(info->nl_net, table, sfn, + &rt->fib6_src.addr, rt->fib6_src.plen, + offsetof(struct fib6_info, fib6_src), + allow_create, replace_required, extack); if (IS_ERR(sn)) { /* If it is failed, discard just allocated root, and then (in failure) stale node in main tree. */ - node_free_immediate(sfn); + node_free_immediate(info->nl_net, sfn); err = PTR_ERR(sn); goto failure; } /* Now link new subtree to main tree */ - sfn->parent = fn; - fn->subtree = sfn; + rcu_assign_pointer(sfn->parent, fn); + rcu_assign_pointer(fn->subtree, sfn); } else { - sn = fib6_add_1(fn->subtree, &rt->rt6i_src.addr, - rt->rt6i_src.plen, - offsetof(struct rt6_info, rt6i_src), - allow_create, replace_required, sernum, - extack); + sn = fib6_add_1(info->nl_net, table, FIB6_SUBTREE(fn), + &rt->fib6_src.addr, rt->fib6_src.plen, + offsetof(struct fib6_info, fib6_src), + allow_create, replace_required, extack); if (IS_ERR(sn)) { err = PTR_ERR(sn); @@ -1179,19 +1289,24 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, } } - if (!fn->leaf) { - fn->leaf = rt; - atomic_inc(&rt->rt6i_ref); + if (!rcu_access_pointer(fn->leaf)) { + if (fn->fn_flags & RTN_TL_ROOT) { + /* put back null_entry for root node */ + rcu_assign_pointer(fn->leaf, + info->nl_net->ipv6.fib6_null_entry); + } else { + atomic_inc(&rt->fib6_ref); + rcu_assign_pointer(fn->leaf, rt); + } } fn = sn; } #endif - err = fib6_add_rt2node(fn, rt, info, mxc); + err = fib6_add_rt2node(fn, rt, info, extack); if (!err) { + __fib6_update_sernum_upto_root(rt, sernum); fib6_start_gc(info->nl_net, rt); - if (!(rt->rt6i_flags & RTF_CACHE)) - fib6_prune_clones(info->nl_net, pn); } out: @@ -1201,42 +1316,47 @@ out: * If fib6_add_1 has cleared the old leaf pointer in the * super-tree leaf node we have to find a new one for it. */ - if (pn != fn && pn->leaf == rt) { - pn->leaf = NULL; - atomic_dec(&rt->rt6i_ref); - } - if (pn != fn && !pn->leaf && !(pn->fn_flags & RTN_RTINFO)) { - pn->leaf = fib6_find_prefix(info->nl_net, pn); -#if RT6_DEBUG >= 2 - if (!pn->leaf) { - WARN_ON(pn->leaf == NULL); - pn->leaf = info->nl_net->ipv6.ip6_null_entry; + if (pn != fn) { + struct fib6_info *pn_leaf = + rcu_dereference_protected(pn->leaf, + lockdep_is_held(&table->tb6_lock)); + if (pn_leaf == rt) { + pn_leaf = NULL; + RCU_INIT_POINTER(pn->leaf, NULL); + fib6_info_release(rt); } + if (!pn_leaf && !(pn->fn_flags & RTN_RTINFO)) { + pn_leaf = fib6_find_prefix(info->nl_net, table, + pn); +#if RT6_DEBUG >= 2 + if (!pn_leaf) { + WARN_ON(!pn_leaf); + pn_leaf = + info->nl_net->ipv6.fib6_null_entry; + } #endif - atomic_inc(&pn->leaf->rt6i_ref); + fib6_info_hold(pn_leaf); + rcu_assign_pointer(pn->leaf, pn_leaf); + } } #endif goto failure; } - if (!err && strstr(rt->dst.dev->name, "rmnet_data")) - net_log("fib6_add(): %s : Prefix: %pI6/%u, GW: %pI6, table: %u, proto: %u\n", - rt->dst.dev->name, &rt->rt6i_dst.addr, rt->rt6i_dst.plen, &rt->rt6i_gateway, - rt->rt6i_table->tb6_id, rt->rt6i_protocol); return err; failure: - /* fn->leaf could be NULL if fn is an intermediate node and we - * failed to add the new route to it in both subtree creation - * failure and fib6_add_rt2node() failure case. - * In both cases, fib6_repair_tree() should be called to fix - * fn->leaf. + /* fn->leaf could be NULL and fib6_repair_tree() needs to be called if: + * 1. fn is an intermediate node and we failed to add the new + * route to it in both subtree creation failure and fib6_add_rt2node() + * failure case. + * 2. fn is the root node in the table and we fail to add the first + * default route to it. */ - if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT))) - fib6_repair_tree(info->nl_net, fn); - /* Always release dst as dst->__refcnt is guaranteed - * to be taken before entering this function - */ - dst_release_immediate(&rt->dst); + if (fn && + (!(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)) || + (fn->fn_flags & RTN_TL_ROOT && + !rcu_access_pointer(fn->leaf)))) + fib6_repair_tree(info->nl_net, table, fn); return err; } @@ -1246,12 +1366,12 @@ failure: */ struct lookup_args { - int offset; /* key offset on rt6_info */ + int offset; /* key offset on fib6_info */ const struct in6_addr *addr; /* search key */ }; -static struct fib6_node *fib6_lookup_1(struct fib6_node *root, - struct lookup_args *args) +static struct fib6_node *fib6_node_lookup_1(struct fib6_node *root, + struct lookup_args *args) { struct fib6_node *fn; __be32 dir; @@ -1270,7 +1390,8 @@ static struct fib6_node *fib6_lookup_1(struct fib6_node *root, dir = addr_bit_set(args->addr, fn->fn_bit); - next = dir ? fn->right : fn->left; + next = dir ? rcu_dereference(fn->right) : + rcu_dereference(fn->left); if (next) { fn = next; @@ -1280,18 +1401,23 @@ static struct fib6_node *fib6_lookup_1(struct fib6_node *root, } while (fn) { - if (FIB6_SUBTREE(fn) || fn->fn_flags & RTN_RTINFO) { + struct fib6_node *subtree = FIB6_SUBTREE(fn); + + if (subtree || fn->fn_flags & RTN_RTINFO) { + struct fib6_info *leaf = rcu_dereference(fn->leaf); struct rt6key *key; - key = (struct rt6key *) ((u8 *) fn->leaf + - args->offset); + if (!leaf) + goto backtrack; + + key = (struct rt6key *) ((u8 *)leaf + args->offset); if (ipv6_prefix_equal(&key->addr, args->addr, key->plen)) { #ifdef CONFIG_IPV6_SUBTREES - if (fn->subtree) { + if (subtree) { struct fib6_node *sfn; - sfn = fib6_lookup_1(fn->subtree, - args + 1); + sfn = fib6_node_lookup_1(subtree, + args + 1); if (!sfn) goto backtrack; fn = sfn; @@ -1301,30 +1427,31 @@ static struct fib6_node *fib6_lookup_1(struct fib6_node *root, return fn; } } -#ifdef CONFIG_IPV6_SUBTREES backtrack: -#endif if (fn->fn_flags & RTN_ROOT) break; - fn = fn->parent; + fn = rcu_dereference(fn->parent); } return NULL; } -struct fib6_node *fib6_lookup(struct fib6_node *root, const struct in6_addr *daddr, - const struct in6_addr *saddr) +/* called with rcu_read_lock() held + */ +struct fib6_node *fib6_node_lookup(struct fib6_node *root, + const struct in6_addr *daddr, + const struct in6_addr *saddr) { struct fib6_node *fn; struct lookup_args args[] = { { - .offset = offsetof(struct rt6_info, rt6i_dst), + .offset = offsetof(struct fib6_info, fib6_dst), .addr = daddr, }, #ifdef CONFIG_IPV6_SUBTREES { - .offset = offsetof(struct rt6_info, rt6i_src), + .offset = offsetof(struct fib6_info, fib6_src), .addr = saddr, }, #endif @@ -1333,7 +1460,7 @@ struct fib6_node *fib6_lookup(struct fib6_node *root, const struct in6_addr *dad } }; - fn = fib6_lookup_1(root, daddr ? args : args + 1); + fn = fib6_node_lookup_1(root, daddr ? args : args + 1); if (!fn || fn->fn_flags & RTN_TL_ROOT) fn = root; @@ -1343,54 +1470,87 @@ struct fib6_node *fib6_lookup(struct fib6_node *root, const struct in6_addr *dad /* * Get node with specified destination prefix (and source prefix, * if subtrees are used) + * exact_match == true means we try to find fn with exact match of + * the passed in prefix addr + * exact_match == false means we try to find fn with longest prefix + * match of the passed in prefix addr. This is useful for finding fn + * for cached route as it will be stored in the exception table under + * the node with longest prefix length. */ static struct fib6_node *fib6_locate_1(struct fib6_node *root, const struct in6_addr *addr, - int plen, int offset) + int plen, int offset, + bool exact_match) { - struct fib6_node *fn; + struct fib6_node *fn, *prev = NULL; for (fn = root; fn ; ) { - struct rt6key *key = (struct rt6key *)((u8 *)fn->leaf + offset); + struct fib6_info *leaf = rcu_dereference(fn->leaf); + struct rt6key *key; + + /* This node is being deleted */ + if (!leaf) { + if (plen <= fn->fn_bit) + goto out; + else + goto next; + } + + key = (struct rt6key *)((u8 *)leaf + offset); /* * Prefix match */ if (plen < fn->fn_bit || !ipv6_prefix_equal(&key->addr, addr, fn->fn_bit)) - return NULL; + goto out; if (plen == fn->fn_bit) return fn; + prev = fn; + +next: /* * We have more bits to go */ if (addr_bit_set(addr, fn->fn_bit)) - fn = fn->right; + fn = rcu_dereference(fn->right); else - fn = fn->left; + fn = rcu_dereference(fn->left); } - return NULL; +out: + if (exact_match) + return NULL; + else + return prev; } struct fib6_node *fib6_locate(struct fib6_node *root, const struct in6_addr *daddr, int dst_len, - const struct in6_addr *saddr, int src_len) + const struct in6_addr *saddr, int src_len, + bool exact_match) { struct fib6_node *fn; fn = fib6_locate_1(root, daddr, dst_len, - offsetof(struct rt6_info, rt6i_dst)); + offsetof(struct fib6_info, fib6_dst), + exact_match); #ifdef CONFIG_IPV6_SUBTREES if (src_len) { WARN_ON(saddr == NULL); - if (fn && fn->subtree) - fn = fib6_locate_1(fn->subtree, saddr, src_len, - offsetof(struct rt6_info, rt6i_src)); + if (fn) { + struct fib6_node *subtree = FIB6_SUBTREE(fn); + + if (subtree) { + fn = fib6_locate_1(subtree, saddr, src_len, + offsetof(struct fib6_info, fib6_src), + exact_match); + } + } } #endif @@ -1406,16 +1566,26 @@ struct fib6_node *fib6_locate(struct fib6_node *root, * */ -static struct rt6_info *fib6_find_prefix(struct net *net, struct fib6_node *fn) +static struct fib6_info *fib6_find_prefix(struct net *net, + struct fib6_table *table, + struct fib6_node *fn) { + struct fib6_node *child_left, *child_right; + if (fn->fn_flags & RTN_ROOT) - return net->ipv6.ip6_null_entry; + return net->ipv6.fib6_null_entry; while (fn) { - if (fn->left) - return fn->left->leaf; - if (fn->right) - return fn->right->leaf; + child_left = rcu_dereference_protected(fn->left, + lockdep_is_held(&table->tb6_lock)); + child_right = rcu_dereference_protected(fn->right, + lockdep_is_held(&table->tb6_lock)); + if (child_left) + return rcu_dereference_protected(child_left->leaf, + lockdep_is_held(&table->tb6_lock)); + if (child_right) + return rcu_dereference_protected(child_right->leaf, + lockdep_is_held(&table->tb6_lock)); fn = FIB6_SUBTREE(fn); } @@ -1425,31 +1595,55 @@ static struct rt6_info *fib6_find_prefix(struct net *net, struct fib6_node *fn) /* * Called to trim the tree of intermediate nodes when possible. "fn" * is the node we want to try and remove. + * Need to own table->tb6_lock */ static struct fib6_node *fib6_repair_tree(struct net *net, - struct fib6_node *fn) + struct fib6_table *table, + struct fib6_node *fn) { int children; int nstate; - struct fib6_node *child, *pn; + struct fib6_node *child; struct fib6_walker *w; int iter = 0; + /* Set fn->leaf to null_entry for root node. */ + if (fn->fn_flags & RTN_TL_ROOT) { + rcu_assign_pointer(fn->leaf, net->ipv6.fib6_null_entry); + return fn; + } + for (;;) { + struct fib6_node *fn_r = rcu_dereference_protected(fn->right, + lockdep_is_held(&table->tb6_lock)); + struct fib6_node *fn_l = rcu_dereference_protected(fn->left, + lockdep_is_held(&table->tb6_lock)); + struct fib6_node *pn = rcu_dereference_protected(fn->parent, + lockdep_is_held(&table->tb6_lock)); + struct fib6_node *pn_r = rcu_dereference_protected(pn->right, + lockdep_is_held(&table->tb6_lock)); + struct fib6_node *pn_l = rcu_dereference_protected(pn->left, + lockdep_is_held(&table->tb6_lock)); + struct fib6_info *fn_leaf = rcu_dereference_protected(fn->leaf, + lockdep_is_held(&table->tb6_lock)); + struct fib6_info *pn_leaf = rcu_dereference_protected(pn->leaf, + lockdep_is_held(&table->tb6_lock)); + struct fib6_info *new_fn_leaf; + RT6_TRACE("fixing tree: plen=%d iter=%d\n", fn->fn_bit, iter); iter++; WARN_ON(fn->fn_flags & RTN_RTINFO); WARN_ON(fn->fn_flags & RTN_TL_ROOT); - WARN_ON(fn->leaf); + WARN_ON(fn_leaf); children = 0; child = NULL; - if (fn->right) - child = fn->right, children |= 1; - if (fn->left) - child = fn->left, children |= 2; + if (fn_r) + child = fn_r, children |= 1; + if (fn_l) + child = fn_l, children |= 2; if (children == 3 || FIB6_SUBTREE(fn) #ifdef CONFIG_IPV6_SUBTREES @@ -1457,36 +1651,36 @@ static struct fib6_node *fib6_repair_tree(struct net *net, || (children && fn->fn_flags & RTN_ROOT) #endif ) { - fn->leaf = fib6_find_prefix(net, fn); + new_fn_leaf = fib6_find_prefix(net, table, fn); #if RT6_DEBUG >= 2 - if (!fn->leaf) { - WARN_ON(!fn->leaf); - fn->leaf = net->ipv6.ip6_null_entry; + if (!new_fn_leaf) { + WARN_ON(!new_fn_leaf); + new_fn_leaf = net->ipv6.fib6_null_entry; } #endif - atomic_inc(&fn->leaf->rt6i_ref); - return fn->parent; + fib6_info_hold(new_fn_leaf); + rcu_assign_pointer(fn->leaf, new_fn_leaf); + return pn; } - pn = fn->parent; #ifdef CONFIG_IPV6_SUBTREES if (FIB6_SUBTREE(pn) == fn) { WARN_ON(!(fn->fn_flags & RTN_ROOT)); - FIB6_SUBTREE(pn) = NULL; + RCU_INIT_POINTER(pn->subtree, NULL); nstate = FWS_L; } else { WARN_ON(fn->fn_flags & RTN_ROOT); #endif - if (pn->right == fn) - pn->right = child; - else if (pn->left == fn) - pn->left = child; + if (pn_r == fn) + rcu_assign_pointer(pn->right, child); + else if (pn_l == fn) + rcu_assign_pointer(pn->left, child); #if RT6_DEBUG >= 2 else WARN_ON(1); #endif if (child) - child->parent = pn; + rcu_assign_pointer(child->parent, pn); nstate = FWS_R; #ifdef CONFIG_IPV6_SUBTREES } @@ -1495,19 +1689,12 @@ static struct fib6_node *fib6_repair_tree(struct net *net, read_lock(&net->ipv6.fib6_walker_lock); FOR_WALKERS(net, w) { if (!child) { - if (w->root == fn) { - w->root = w->node = NULL; - RT6_TRACE("W %p adjusted by delroot 1\n", w); - } else if (w->node == fn) { + if (w->node == fn) { RT6_TRACE("W %p adjusted by delnode 1, s=%d/%d\n", w, w->state, nstate); w->node = pn; w->state = nstate; } } else { - if (w->root == fn) { - w->root = child; - RT6_TRACE("W %p adjusted by delroot 2\n", w); - } if (w->node == fn) { w->node = child; if (children&2) { @@ -1522,44 +1709,49 @@ static struct fib6_node *fib6_repair_tree(struct net *net, } read_unlock(&net->ipv6.fib6_walker_lock); - node_free(fn); + node_free(net, fn); if (pn->fn_flags & RTN_RTINFO || FIB6_SUBTREE(pn)) return pn; - rt6_release(pn->leaf); - pn->leaf = NULL; + RCU_INIT_POINTER(pn->leaf, NULL); + fib6_info_release(pn_leaf); fn = pn; } } -static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp, - struct nl_info *info) +static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn, + struct fib6_info __rcu **rtp, struct nl_info *info) { struct fib6_walker *w; - struct rt6_info *rt = *rtp; + struct fib6_info *rt = rcu_dereference_protected(*rtp, + lockdep_is_held(&table->tb6_lock)); struct net *net = info->nl_net; RT6_TRACE("fib6_del_route\n"); /* Unlink it */ - *rtp = rt->dst.rt6_next; - rt->rt6i_node = NULL; + *rtp = rt->fib6_next; + rt->fib6_node = NULL; net->ipv6.rt6_stats->fib_rt_entries--; net->ipv6.rt6_stats->fib_discarded_routes++; + /* Flush all cached dst in exception table */ + rt6_flush_exceptions(rt); + /* Reset round-robin state, if necessary */ - if (fn->rr_ptr == rt) + if (rcu_access_pointer(fn->rr_ptr) == rt) fn->rr_ptr = NULL; /* Remove this entry from other siblings */ - if (rt->rt6i_nsiblings) { - struct rt6_info *sibling, *next_sibling; + if (rt->fib6_nsiblings) { + struct fib6_info *sibling, *next_sibling; list_for_each_entry_safe(sibling, next_sibling, - &rt->rt6i_siblings, rt6i_siblings) - sibling->rt6i_nsiblings--; - rt->rt6i_nsiblings = 0; - list_del_init(&rt->rt6i_siblings); + &rt->fib6_siblings, fib6_siblings) + sibling->fib6_nsiblings--; + rt->fib6_nsiblings = 0; + list_del_init(&rt->fib6_siblings); + rt6_multipath_rebalance(next_sibling); } /* Adjust walkers */ @@ -1567,70 +1759,61 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp, FOR_WALKERS(net, w) { if (w->state == FWS_C && w->leaf == rt) { RT6_TRACE("walker %p adjusted by delroute\n", w); - w->leaf = rt->dst.rt6_next; + w->leaf = rcu_dereference_protected(rt->fib6_next, + lockdep_is_held(&table->tb6_lock)); if (!w->leaf) w->state = FWS_U; } } read_unlock(&net->ipv6.fib6_walker_lock); - rt->dst.rt6_next = NULL; - - /* If it was last route, expunge its radix tree node */ - if (!fn->leaf) { - fn->fn_flags &= ~RTN_RTINFO; - net->ipv6.rt6_stats->fib_route_nodes--; - fn = fib6_repair_tree(net, fn); + /* If it was last route, call fib6_repair_tree() to: + * 1. For root node, put back null_entry as how the table was created. + * 2. For other nodes, expunge its radix tree node. + */ + if (!rcu_access_pointer(fn->leaf)) { + if (!(fn->fn_flags & RTN_TL_ROOT)) { + fn->fn_flags &= ~RTN_RTINFO; + net->ipv6.rt6_stats->fib_route_nodes--; + } + fn = fib6_repair_tree(net, table, fn); } fib6_purge_rt(rt, fn, net); - call_fib6_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, rt); + call_fib6_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, rt, NULL); if (!info->skip_notify) inet6_rt_notify(RTM_DELROUTE, rt, info, 0); - rt6_release(rt); + fib6_info_release(rt); } -int fib6_del(struct rt6_info *rt, struct nl_info *info) +/* Need to own table->tb6_lock */ +int fib6_del(struct fib6_info *rt, struct nl_info *info) { - struct fib6_node *fn = rcu_dereference_protected(rt->rt6i_node, - lockdep_is_held(&rt->rt6i_table->tb6_lock)); + struct fib6_node *fn = rcu_dereference_protected(rt->fib6_node, + lockdep_is_held(&rt->fib6_table->tb6_lock)); + struct fib6_table *table = rt->fib6_table; struct net *net = info->nl_net; - struct rt6_info **rtp; + struct fib6_info __rcu **rtp; + struct fib6_info __rcu **rtp_next; -#if RT6_DEBUG >= 2 - if (rt->dst.obsolete > 0) { - WARN_ON(fn); - return -ENOENT; - } -#endif - if (!fn || rt == net->ipv6.ip6_null_entry) + if (!fn || rt == net->ipv6.fib6_null_entry) return -ENOENT; WARN_ON(!(fn->fn_flags & RTN_RTINFO)); - if (!(rt->rt6i_flags & RTF_CACHE)) { - struct fib6_node *pn = fn; -#ifdef CONFIG_IPV6_SUBTREES - /* clones of this route might be in another subtree */ - if (rt->rt6i_src.plen) { - while (!(pn->fn_flags & RTN_ROOT)) - pn = pn->parent; - pn = pn->parent; - } -#endif - fib6_prune_clones(info->nl_net, pn); - } - /* * Walk the leaf entries looking for ourself */ - for (rtp = &fn->leaf; *rtp; rtp = &(*rtp)->dst.rt6_next) { - if (*rtp == rt) { - fib6_del_route(fn, rtp, info); + for (rtp = &fn->leaf; *rtp; rtp = rtp_next) { + struct fib6_info *cur = rcu_dereference_protected(*rtp, + lockdep_is_held(&table->tb6_lock)); + if (rt == cur) { + fib6_del_route(table, fn, rtp, info); return 0; } + rtp_next = &cur->fib6_next; } return -ENOENT; } @@ -1657,22 +1840,22 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info) * 0 -> walk is complete. * >0 -> walk is incomplete (i.e. suspended) * <0 -> walk is terminated by an error. + * + * This function is called with tb6_lock held. */ static int fib6_walk_continue(struct fib6_walker *w) { - struct fib6_node *fn, *pn; + struct fib6_node *fn, *pn, *left, *right; + + /* w->root should always be table->tb6_root */ + WARN_ON_ONCE(!(w->root->fn_flags & RTN_TL_ROOT)); for (;;) { fn = w->node; if (!fn) return 0; - if (w->prune && fn != w->root && - fn->fn_flags & RTN_RTINFO && w->state < FWS_C) { - w->state = FWS_C; - w->leaf = fn->leaf; - } switch (w->state) { #ifdef CONFIG_IPV6_SUBTREES case FWS_S: @@ -1683,20 +1866,22 @@ static int fib6_walk_continue(struct fib6_walker *w) w->state = FWS_L; #endif case FWS_L: - if (fn->left) { - w->node = fn->left; + left = rcu_dereference_protected(fn->left, 1); + if (left) { + w->node = left; w->state = FWS_INIT; continue; } w->state = FWS_R; case FWS_R: - if (fn->right) { - w->node = fn->right; + right = rcu_dereference_protected(fn->right, 1); + if (right) { + w->node = right; w->state = FWS_INIT; continue; } w->state = FWS_C; - w->leaf = fn->leaf; + w->leaf = rcu_dereference_protected(fn->leaf, 1); case FWS_C: if (w->leaf && fn->fn_flags & RTN_RTINFO) { int err; @@ -1718,7 +1903,9 @@ skip: case FWS_U: if (fn == w->root) return 0; - pn = fn->parent; + pn = rcu_dereference_protected(fn->parent, 1); + left = rcu_dereference_protected(pn->left, 1); + right = rcu_dereference_protected(pn->right, 1); w->node = pn; #ifdef CONFIG_IPV6_SUBTREES if (FIB6_SUBTREE(pn) == fn) { @@ -1727,13 +1914,13 @@ skip: continue; } #endif - if (pn->left == fn) { + if (left == fn) { w->state = FWS_R; continue; } - if (pn->right == fn) { + if (right == fn) { w->state = FWS_C; - w->leaf = w->node->leaf; + w->leaf = rcu_dereference_protected(w->node->leaf, 1); continue; } #if RT6_DEBUG >= 2 @@ -1760,15 +1947,15 @@ static int fib6_walk(struct net *net, struct fib6_walker *w) static int fib6_clean_node(struct fib6_walker *w) { int res; - struct rt6_info *rt; + struct fib6_info *rt; struct fib6_cleaner *c = container_of(w, struct fib6_cleaner, w); struct nl_info info = { .nl_net = c->net, }; if (c->sernum != FIB6_NO_SERNUM_CHANGE && - w->node->fn_sernum != c->sernum) - w->node->fn_sernum = c->sernum; + READ_ONCE(w->node->fn_sernum) != c->sernum) + WRITE_ONCE(w->node->fn_sernum, c->sernum); if (!c->func) { WARN_ON_ONCE(c->sernum == FIB6_NO_SERNUM_CHANGE); @@ -1776,21 +1963,27 @@ static int fib6_clean_node(struct fib6_walker *w) return 0; } - for (rt = w->leaf; rt; rt = rt->dst.rt6_next) { + for_each_fib6_walker_rt(w) { res = c->func(rt, c->arg); - if (res < 0) { + if (res == -1) { w->leaf = rt; res = fib6_del(rt, &info); if (res) { #if RT6_DEBUG >= 2 pr_debug("%s: del failed: rt=%p@%p err=%d\n", __func__, rt, - rcu_access_pointer(rt->rt6i_node), + rcu_access_pointer(rt->fib6_node), res); #endif continue; } return 0; + } else if (res == -2) { + if (WARN_ON(!rt->fib6_nsiblings)) + continue; + rt = list_last_entry(&rt->fib6_siblings, + struct fib6_info, fib6_siblings); + continue; } WARN_ON(res != 0); } @@ -1802,22 +1995,19 @@ static int fib6_clean_node(struct fib6_walker *w) * Convenient frontend to tree walker. * * func is called on each route. - * It may return -1 -> delete this route. + * It may return -2 -> skip multipath route. + * -1 -> delete this route. * 0 -> continue walking - * - * prune==1 -> only immediate children of node (certainly, - * ignoring pure split nodes) will be scanned. */ static void fib6_clean_tree(struct net *net, struct fib6_node *root, - int (*func)(struct rt6_info *, void *arg), - bool prune, int sernum, void *arg) + int (*func)(struct fib6_info *, void *arg), + int sernum, void *arg) { struct fib6_cleaner c; c.w.root = root; c.w.func = fib6_clean_node; - c.w.prune = prune; c.w.count = 0; c.w.skip = 0; c.func = func; @@ -1829,7 +2019,7 @@ static void fib6_clean_tree(struct net *net, struct fib6_node *root, } static void __fib6_clean_all(struct net *net, - int (*func)(struct rt6_info *, void *), + int (*func)(struct fib6_info *, void *), int sernum, void *arg) { struct fib6_table *table; @@ -1840,37 +2030,21 @@ static void __fib6_clean_all(struct net *net, for (h = 0; h < FIB6_TABLE_HASHSZ; h++) { head = &net->ipv6.fib_table_hash[h]; hlist_for_each_entry_rcu(table, head, tb6_hlist) { - write_lock_bh(&table->tb6_lock); + spin_lock_bh(&table->tb6_lock); fib6_clean_tree(net, &table->tb6_root, - func, false, sernum, arg); - write_unlock_bh(&table->tb6_lock); + func, sernum, arg); + spin_unlock_bh(&table->tb6_lock); } } rcu_read_unlock(); } -void fib6_clean_all(struct net *net, int (*func)(struct rt6_info *, void *), +void fib6_clean_all(struct net *net, int (*func)(struct fib6_info *, void *), void *arg) { __fib6_clean_all(net, func, FIB6_NO_SERNUM_CHANGE, arg); } -static int fib6_prune_clone(struct rt6_info *rt, void *arg) -{ - if (rt->rt6i_flags & RTF_CACHE) { - RT6_TRACE("pruning clone %p\n", rt); - return -1; - } - - return 0; -} - -static void fib6_prune_clones(struct net *net, struct fib6_node *fn) -{ - fib6_clean_tree(net, fn, fib6_prune_clone, true, - FIB6_NO_SERNUM_CHANGE, NULL); -} - static void fib6_flush_trees(struct net *net) { int new_sernum = fib6_new_sernum(net); @@ -1882,13 +2056,7 @@ static void fib6_flush_trees(struct net *net) * Garbage collection */ -struct fib6_gc_args -{ - int timeout; - int more; -}; - -static int fib6_age(struct rt6_info *rt, void *arg) +static int fib6_age(struct fib6_info *rt, void *arg) { struct fib6_gc_args *gc_args = arg; unsigned long now = jiffies; @@ -1896,42 +2064,22 @@ static int fib6_age(struct rt6_info *rt, void *arg) /* * check addrconf expiration here. * Routes are expired even if they are in use. - * - * Also age clones. Note, that clones are aged out - * only if they are not in use now. */ - if (rt->rt6i_flags & RTF_EXPIRES && rt->dst.expires) { - if (time_after(now, rt->dst.expires)) { + if (rt->fib6_flags & RTF_EXPIRES && rt->expires) { + if (time_after(now, rt->expires)) { RT6_TRACE("expiring %p\n", rt); return -1; } gc_args->more++; - } else if (rt->rt6i_flags & RTF_CACHE) { - if (time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) - rt->dst.obsolete = DST_OBSOLETE_KILL; - if (atomic_read(&rt->dst.__refcnt) == 1 && - rt->dst.obsolete == DST_OBSOLETE_KILL) { - RT6_TRACE("aging clone %p\n", rt); - return -1; - } else if (rt->rt6i_flags & RTF_GATEWAY) { - struct neighbour *neigh; - __u8 neigh_flags = 0; - - neigh = dst_neigh_lookup(&rt->dst, &rt->rt6i_gateway); - if (neigh) { - neigh_flags = neigh->flags; - neigh_release(neigh); - } - if (!(neigh_flags & NTF_ROUTER)) { - RT6_TRACE("purging route %p via non-router but gateway\n", - rt); - return -1; - } - } - gc_args->more++; } + /* Also age clones in the exception table. + * Note, that clones are aged out + * only if they are not in use now. + */ + rt6_age_exceptions(rt, gc_args, now); + return 0; } @@ -1999,7 +2147,8 @@ static int __net_init fib6_net_init(struct net *net) goto out_fib_table_hash; net->ipv6.fib6_main_tbl->tb6_id = RT6_TABLE_MAIN; - net->ipv6.fib6_main_tbl->tb6_root.leaf = net->ipv6.ip6_null_entry; + rcu_assign_pointer(net->ipv6.fib6_main_tbl->tb6_root.leaf, + net->ipv6.fib6_null_entry); net->ipv6.fib6_main_tbl->tb6_root.fn_flags = RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO; inet_peer_base_init(&net->ipv6.fib6_main_tbl->tb6_peers); @@ -2010,7 +2159,8 @@ static int __net_init fib6_net_init(struct net *net) if (!net->ipv6.fib6_local_tbl) goto out_fib6_main_tbl; net->ipv6.fib6_local_tbl->tb6_id = RT6_TABLE_LOCAL; - net->ipv6.fib6_local_tbl->tb6_root.leaf = net->ipv6.ip6_null_entry; + rcu_assign_pointer(net->ipv6.fib6_local_tbl->tb6_root.leaf, + net->ipv6.fib6_null_entry); net->ipv6.fib6_local_tbl->tb6_root.fn_flags = RTN_ROOT | RTN_TL_ROOT | RTN_RTINFO; inet_peer_base_init(&net->ipv6.fib6_local_tbl->tb6_peers); @@ -2036,7 +2186,6 @@ static void fib6_net_exit(struct net *net) { unsigned int i; - rt6_ifdown(net, NULL); del_timer_sync(&net->ipv6.ip6_fib_timer); for (i = 0; i < FIB6_TABLE_HASHSZ; i++) { @@ -2109,25 +2258,26 @@ struct ipv6_route_iter { static int ipv6_route_seq_show(struct seq_file *seq, void *v) { - struct rt6_info *rt = v; + struct fib6_info *rt = v; struct ipv6_route_iter *iter = seq->private; + const struct net_device *dev; - seq_printf(seq, "%pi6 %02x ", &rt->rt6i_dst.addr, rt->rt6i_dst.plen); + seq_printf(seq, "%pi6 %02x ", &rt->fib6_dst.addr, rt->fib6_dst.plen); #ifdef CONFIG_IPV6_SUBTREES - seq_printf(seq, "%pi6 %02x ", &rt->rt6i_src.addr, rt->rt6i_src.plen); + seq_printf(seq, "%pi6 %02x ", &rt->fib6_src.addr, rt->fib6_src.plen); #else seq_puts(seq, "00000000000000000000000000000000 00 "); #endif - if (rt->rt6i_flags & RTF_GATEWAY) - seq_printf(seq, "%pi6", &rt->rt6i_gateway); + if (rt->fib6_flags & RTF_GATEWAY) + seq_printf(seq, "%pi6", &rt->fib6_nh.nh_gw); else seq_puts(seq, "00000000000000000000000000000000"); + dev = rt->fib6_nh.nh_dev; seq_printf(seq, " %08x %08x %08x %08x %8s\n", - rt->rt6i_metric, atomic_read(&rt->dst.__refcnt), - rt->dst.__use, rt->rt6i_flags, - rt->dst.dev ? rt->dst.dev->name : ""); + rt->fib6_metric, atomic_read(&rt->fib6_ref), 0, + rt->fib6_flags, dev ? dev->name : ""); iter->w.leaf = NULL; return 0; } @@ -2140,7 +2290,9 @@ static int ipv6_route_yield(struct fib6_walker *w) return 1; do { - iter->w.leaf = iter->w.leaf->dst.rt6_next; + iter->w.leaf = rcu_dereference_protected( + iter->w.leaf->fib6_next, + lockdep_is_held(&iter->tbl->tb6_lock)); iter->skip--; if (!iter->skip && iter->w.leaf) return 1; @@ -2158,7 +2310,7 @@ static void ipv6_route_seq_setup_walk(struct ipv6_route_iter *iter, iter->w.state = FWS_INIT; iter->w.node = iter->w.root; iter->w.args = iter; - iter->sernum = iter->w.root->fn_sernum; + iter->sernum = READ_ONCE(iter->w.root->fn_sernum); INIT_LIST_HEAD(&iter->w.lh); fib6_walker_link(net, &iter->w); } @@ -2186,8 +2338,10 @@ static struct fib6_table *ipv6_route_seq_next_table(struct fib6_table *tbl, static void ipv6_route_check_sernum(struct ipv6_route_iter *iter) { - if (iter->sernum != iter->w.root->fn_sernum) { - iter->sernum = iter->w.root->fn_sernum; + int sernum = READ_ONCE(iter->w.root->fn_sernum); + + if (iter->sernum != sernum) { + iter->sernum = sernum; iter->w.state = FWS_INIT; iter->w.node = iter->w.root; WARN_ON(iter->w.skip); @@ -2198,14 +2352,14 @@ static void ipv6_route_check_sernum(struct ipv6_route_iter *iter) static void *ipv6_route_seq_next(struct seq_file *seq, void *v, loff_t *pos) { int r; - struct rt6_info *n; + struct fib6_info *n; struct net *net = seq_file_net(seq); struct ipv6_route_iter *iter = seq->private; if (!v) goto iter_table; - n = ((struct rt6_info *)v)->dst.rt6_next; + n = rcu_dereference_bh(((struct fib6_info *)v)->fib6_next); if (n) { ++*pos; return n; @@ -2213,9 +2367,9 @@ static void *ipv6_route_seq_next(struct seq_file *seq, void *v, loff_t *pos) iter_table: ipv6_route_check_sernum(iter); - read_lock(&iter->tbl->tb6_lock); + spin_lock_bh(&iter->tbl->tb6_lock); r = fib6_walk_continue(&iter->w); - read_unlock(&iter->tbl->tb6_lock); + spin_unlock_bh(&iter->tbl->tb6_lock); if (r > 0) { if (v) ++*pos; diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c index 5ca2ab56901d..32e1091a77fc 100644 --- a/net/ipv6/ip6_gre.c +++ b/net/ipv6/ip6_gre.c @@ -55,6 +55,8 @@ #include #include #include +#include +#include static bool log_ecn_error = true; @@ -68,11 +70,13 @@ static unsigned int ip6gre_net_id __read_mostly; struct ip6gre_net { struct ip6_tnl __rcu *tunnels[4][IP6_GRE_HASH_SIZE]; + struct ip6_tnl __rcu *collect_md_tun; struct net_device *fb_tunnel_dev; }; static struct rtnl_link_ops ip6gre_link_ops __read_mostly; static struct rtnl_link_ops ip6gre_tap_ops __read_mostly; +static struct rtnl_link_ops ip6erspan_tap_ops __read_mostly; static int ip6gre_tunnel_init(struct net_device *dev); static void ip6gre_tunnel_setup(struct net_device *dev); static void ip6gre_tunnel_link(struct ip6gre_net *ign, struct ip6_tnl *t); @@ -121,7 +125,8 @@ static struct ip6_tnl *ip6gre_tunnel_lookup(struct net_device *dev, unsigned int h1 = HASH_KEY(key); struct ip6_tnl *t, *cand = NULL; struct ip6gre_net *ign = net_generic(net, ip6gre_net_id); - int dev_type = (gre_proto == htons(ETH_P_TEB)) ? + int dev_type = (gre_proto == htons(ETH_P_TEB) || + gre_proto == htons(ETH_P_ERSPAN)) ? ARPHRD_ETHER : ARPHRD_IP6GRE; int score, cand_score = 4; struct net_device *ndev; @@ -227,6 +232,10 @@ static struct ip6_tnl *ip6gre_tunnel_lookup(struct net_device *dev, if (cand) return cand; + t = rcu_dereference(ign->collect_md_tun); + if (t && t->dev->flags & IFF_UP) + return t; + ndev = READ_ONCE(ign->fb_tunnel_dev); if (ndev && ndev->flags & IFF_UP) return netdev_priv(ndev); @@ -262,6 +271,9 @@ static void ip6gre_tunnel_link(struct ip6gre_net *ign, struct ip6_tnl *t) { struct ip6_tnl __rcu **tp = ip6gre_bucket(ign, t); + if (t->parms.collect_md) + rcu_assign_pointer(ign->collect_md_tun, t); + rcu_assign_pointer(t->next, rtnl_dereference(*tp)); rcu_assign_pointer(*tp, t); } @@ -271,6 +283,9 @@ static void ip6gre_tunnel_unlink(struct ip6gre_net *ign, struct ip6_tnl *t) struct ip6_tnl __rcu **tp; struct ip6_tnl *iter; + if (t->parms.collect_md) + rcu_assign_pointer(ign->collect_md_tun, NULL); + for (tp = ip6gre_bucket(ign, t); (iter = rtnl_dereference(*tp)) != NULL; tp = &iter->next) { @@ -374,6 +389,7 @@ static void ip6gre_tunnel_uninit(struct net_device *dev) static void ip6gre_err(struct sk_buff *skb, struct inet6_skb_parm *opt, u8 type, u8 code, int offset, __be32 info) { + struct net *net = dev_net(skb->dev); const struct gre_base_hdr *greh; const struct ipv6hdr *ipv6h; int grehlen = sizeof(*greh); @@ -407,9 +423,8 @@ static void ip6gre_err(struct sk_buff *skb, struct inet6_skb_parm *opt, return; switch (type) { - __u32 teli; struct ipv6_tlv_tnl_enc_lim *tel; - __u32 mtu; + __u32 teli; case ICMPV6_DEST_UNREACH: net_dbg_ratelimited("%s: Path to destination invalid or inactive!\n", t->parms.name); @@ -440,12 +455,11 @@ static void ip6gre_err(struct sk_buff *skb, struct inet6_skb_parm *opt, } return; case ICMPV6_PKT_TOOBIG: - mtu = be32_to_cpu(info) - offset - t->tun_hlen; - if (t->dev->type == ARPHRD_ETHER) - mtu -= ETH_HLEN; - if (mtu < IPV6_MIN_MTU) - mtu = IPV6_MIN_MTU; - t->dev->mtu = mtu; + ip6_update_pmtu(skb, net, info, 0, 0, sock_net_uid(net, NULL)); + return; + case NDISC_REDIRECT: + ip6_redirect(skb, net, skb->dev->ifindex, 0, + sock_net_uid(net, NULL)); return; } @@ -466,7 +480,86 @@ static int ip6gre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi) &ipv6h->saddr, &ipv6h->daddr, tpi->key, tpi->proto); if (tunnel) { - ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error); + if (tunnel->parms.collect_md) { + struct metadata_dst *tun_dst; + __be64 tun_id; + __be16 flags; + + flags = tpi->flags; + tun_id = key32_to_tunnel_id(tpi->key); + + tun_dst = ipv6_tun_rx_dst(skb, flags, tun_id, 0); + if (!tun_dst) + return PACKET_REJECT; + + ip6_tnl_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error); + } else { + ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error); + } + + return PACKET_RCVD; + } + + return PACKET_REJECT; +} + +static int ip6erspan_rcv(struct sk_buff *skb, int gre_hdr_len, + struct tnl_ptk_info *tpi) +{ + const struct ipv6hdr *ipv6h; + struct erspanhdr *ershdr; + struct ip6_tnl *tunnel; + __be32 index; + + ipv6h = ipv6_hdr(skb); + ershdr = (struct erspanhdr *)skb->data; + + if (unlikely(!pskb_may_pull(skb, sizeof(*ershdr)))) + return PACKET_REJECT; + + tpi->key = cpu_to_be32(ntohs(ershdr->session_id) & ID_MASK); + index = ershdr->md.index; + + tunnel = ip6gre_tunnel_lookup(skb->dev, + &ipv6h->saddr, &ipv6h->daddr, tpi->key, + tpi->proto); + if (tunnel) { + if (__iptunnel_pull_header(skb, sizeof(*ershdr), + htons(ETH_P_TEB), + false, false) < 0) + return PACKET_REJECT; + + if (tunnel->parms.collect_md) { + struct metadata_dst *tun_dst; + struct ip_tunnel_info *info; + struct erspan_metadata *md; + __be64 tun_id; + __be16 flags; + + tpi->flags |= TUNNEL_KEY; + flags = tpi->flags; + tun_id = key32_to_tunnel_id(tpi->key); + + tun_dst = ipv6_tun_rx_dst(skb, flags, tun_id, + sizeof(*md)); + if (!tun_dst) + return PACKET_REJECT; + + info = &tun_dst->u.tun_info; + md = ip_tunnel_info_opts(info); + if (!md) + return PACKET_REJECT; + + md->index = index; + info->key.tun_flags |= TUNNEL_ERSPAN_OPT; + info->options_len = sizeof(*md); + + ip6_tnl_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error); + + } else { + tunnel->parms.index = ntohl(index); + ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error); + } return PACKET_RCVD; } @@ -487,6 +580,12 @@ static int gre_rcv(struct sk_buff *skb) if (iptunnel_pull_header(skb, hdr_len, tpi.proto, false)) goto drop; + if (unlikely(tpi.proto == htons(ETH_P_ERSPAN))) { + if (ip6erspan_rcv(skb, hdr_len, &tpi) == PACKET_RCVD) + return 0; + goto drop; + } + if (ip6gre_rcv(skb, &tpi) == PACKET_RCVD) return 0; @@ -502,13 +601,84 @@ static int gre_handle_offloads(struct sk_buff *skb, bool csum) csum ? SKB_GSO_GRE_CSUM : SKB_GSO_GRE); } +static void prepare_ip6gre_xmit_ipv4(struct sk_buff *skb, + struct net_device *dev, + struct flowi6 *fl6, __u8 *dsfield, + int *encap_limit) +{ + const struct iphdr *iph = ip_hdr(skb); + struct ip6_tnl *t = netdev_priv(dev); + + if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) + *encap_limit = t->parms.encap_limit; + + memcpy(fl6, &t->fl.u.ip6, sizeof(*fl6)); + + if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS) + *dsfield = ipv4_get_dsfield(iph); + else + *dsfield = ip6_tclass(t->parms.flowinfo); + + if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK) + fl6->flowi6_mark = skb->mark; + else + fl6->flowi6_mark = t->parms.fwmark; + + fl6->flowi6_uid = sock_net_uid(dev_net(dev), NULL); +} + +static int prepare_ip6gre_xmit_ipv6(struct sk_buff *skb, + struct net_device *dev, + struct flowi6 *fl6, __u8 *dsfield, + int *encap_limit) +{ + struct ipv6hdr *ipv6h = ipv6_hdr(skb); + struct ip6_tnl *t = netdev_priv(dev); + __u16 offset; + + offset = ip6_tnl_parse_tlv_enc_lim(skb, skb_network_header(skb)); + /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head */ + + if (offset > 0) { + struct ipv6_tlv_tnl_enc_lim *tel; + + tel = (struct ipv6_tlv_tnl_enc_lim *)&skb_network_header(skb)[offset]; + if (tel->encap_limit == 0) { + icmpv6_send(skb, ICMPV6_PARAMPROB, + ICMPV6_HDR_FIELD, offset + 2); + return -1; + } + *encap_limit = tel->encap_limit - 1; + } else if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) { + *encap_limit = t->parms.encap_limit; + } + + memcpy(fl6, &t->fl.u.ip6, sizeof(*fl6)); + + if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS) + *dsfield = ipv6_get_dsfield(ipv6h); + else + *dsfield = ip6_tclass(t->parms.flowinfo); + + if (t->parms.flags & IP6_TNL_F_USE_ORIG_FLOWLABEL) + fl6->flowlabel |= ip6_flowlabel(ipv6h); + + if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK) + fl6->flowi6_mark = skb->mark; + else + fl6->flowi6_mark = t->parms.fwmark; + + fl6->flowi6_uid = sock_net_uid(dev_net(dev), NULL); + + return 0; +} + static netdev_tx_t __gre6_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield, struct flowi6 *fl6, int encap_limit, __u32 *pmtu, __be16 proto) { struct ip6_tnl *tunnel = netdev_priv(dev); - struct dst_entry *dst = skb_dst(skb); __be16 protocol; if (dev->type == ARPHRD_ETHER) @@ -519,17 +689,46 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb, else fl6->daddr = tunnel->parms.raddr; - if (tunnel->parms.o_flags & TUNNEL_SEQ) - tunnel->o_seqno++; - /* Push GRE header. */ protocol = (dev->type == ARPHRD_ETHER) ? htons(ETH_P_TEB) : proto; - gre_build_header(skb, tunnel->tun_hlen, tunnel->parms.o_flags, - protocol, tunnel->parms.o_key, htonl(tunnel->o_seqno)); - /* TooBig packet may have updated dst->dev's mtu */ - if (dst && dst_mtu(dst) > dst->dev->mtu) - dst->ops->update_pmtu(dst, NULL, skb, dst->dev->mtu, false); + if (tunnel->parms.collect_md) { + struct ip_tunnel_info *tun_info; + const struct ip_tunnel_key *key; + __be16 flags; + + tun_info = skb_tunnel_info(skb); + if (unlikely(!tun_info || + !(tun_info->mode & IP_TUNNEL_INFO_TX) || + ip_tunnel_info_af(tun_info) != AF_INET6)) + return -EINVAL; + + key = &tun_info->key; + memset(fl6, 0, sizeof(*fl6)); + fl6->flowi6_proto = IPPROTO_GRE; + fl6->daddr = key->u.ipv6.dst; + fl6->flowlabel = key->label; + fl6->flowi6_uid = sock_net_uid(dev_net(dev), NULL); + + dsfield = key->tos; + flags = key->tun_flags & + (TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ); + tunnel->tun_hlen = gre_calc_hlen(flags); + + gre_build_header(skb, tunnel->tun_hlen, + flags, protocol, + tunnel_id_to_key32(tun_info->key.tun_id), + (flags | TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) + : 0); + + } else { + if (tunnel->parms.o_flags & TUNNEL_SEQ) + tunnel->o_seqno++; + + gre_build_header(skb, tunnel->tun_hlen, tunnel->parms.o_flags, + protocol, tunnel->parms.o_key, + htonl(tunnel->o_seqno)); + } return ip6_tnl_xmit(skb, dev, dsfield, fl6, encap_limit, pmtu, NEXTHDR_GRE); @@ -538,30 +737,17 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb, static inline int ip6gre_xmit_ipv4(struct sk_buff *skb, struct net_device *dev) { struct ip6_tnl *t = netdev_priv(dev); - const struct iphdr *iph = ip_hdr(skb); int encap_limit = -1; struct flowi6 fl6; - __u8 dsfield; + __u8 dsfield = 0; __u32 mtu; int err; memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt)); - if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) - encap_limit = t->parms.encap_limit; - - memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6)); - - if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS) - dsfield = ipv4_get_dsfield(iph); - else - dsfield = ip6_tclass(t->parms.flowinfo); - if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK) - fl6.flowi6_mark = skb->mark; - else - fl6.flowi6_mark = t->parms.fwmark; - - fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL); + if (!t->parms.collect_md) + prepare_ip6gre_xmit_ipv4(skb, dev, &fl6, + &dsfield, &encap_limit); err = gre_handle_offloads(skb, !!(t->parms.o_flags & TUNNEL_CSUM)); if (err) @@ -585,46 +771,17 @@ static inline int ip6gre_xmit_ipv6(struct sk_buff *skb, struct net_device *dev) struct ip6_tnl *t = netdev_priv(dev); struct ipv6hdr *ipv6h = ipv6_hdr(skb); int encap_limit = -1; - __u16 offset; struct flowi6 fl6; - __u8 dsfield; + __u8 dsfield = 0; __u32 mtu; int err; if (ipv6_addr_equal(&t->parms.raddr, &ipv6h->saddr)) return -1; - offset = ip6_tnl_parse_tlv_enc_lim(skb, skb_network_header(skb)); - /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head */ - ipv6h = ipv6_hdr(skb); - - if (offset > 0) { - struct ipv6_tlv_tnl_enc_lim *tel; - tel = (struct ipv6_tlv_tnl_enc_lim *)&skb_network_header(skb)[offset]; - if (tel->encap_limit == 0) { - icmpv6_send(skb, ICMPV6_PARAMPROB, - ICMPV6_HDR_FIELD, offset + 2); - return -1; - } - encap_limit = tel->encap_limit - 1; - } else if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) - encap_limit = t->parms.encap_limit; - - memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6)); - - if (t->parms.flags & IP6_TNL_F_USE_ORIG_TCLASS) - dsfield = ipv6_get_dsfield(ipv6h); - else - dsfield = ip6_tclass(t->parms.flowinfo); - - if (t->parms.flags & IP6_TNL_F_USE_ORIG_FLOWLABEL) - fl6.flowlabel |= ip6_flowlabel(ipv6h); - if (t->parms.flags & IP6_TNL_F_USE_ORIG_FWMARK) - fl6.flowi6_mark = skb->mark; - else - fl6.flowi6_mark = t->parms.fwmark; - - fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL); + if (!t->parms.collect_md && + prepare_ip6gre_xmit_ipv6(skb, dev, &fl6, &dsfield, &encap_limit)) + return -1; if (gre_handle_offloads(skb, !!(t->parms.o_flags & TUNNEL_CSUM))) return -1; @@ -671,7 +828,8 @@ static int ip6gre_xmit_other(struct sk_buff *skb, struct net_device *dev) if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) encap_limit = t->parms.encap_limit; - memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6)); + if (!t->parms.collect_md) + memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6)); err = gre_handle_offloads(skb, !!(t->parms.o_flags & TUNNEL_CSUM)); if (err) @@ -719,6 +877,121 @@ tx_err: return NETDEV_TX_OK; } +static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb, + struct net_device *dev) +{ + struct ipv6hdr *ipv6h = ipv6_hdr(skb); + struct ip6_tnl *t = netdev_priv(dev); + struct dst_entry *dst = skb_dst(skb); + struct net_device_stats *stats; + bool truncate = false; + int encap_limit = -1; + __u8 dsfield = false; + struct flowi6 fl6; + int err = -EINVAL; + __u32 mtu; + + if (!ip6_tnl_xmit_ctl(t, &t->parms.laddr, &t->parms.raddr)) + goto tx_err; + + if (gre_handle_offloads(skb, false)) + goto tx_err; + + if (skb->len > dev->mtu + dev->hard_header_len) { + pskb_trim(skb, dev->mtu + dev->hard_header_len); + truncate = true; + } + + t->parms.o_flags &= ~TUNNEL_KEY; + IPCB(skb)->flags = 0; + + /* For collect_md mode, derive fl6 from the tunnel key, + * for native mode, call prepare_ip6gre_xmit_{ipv4,ipv6}. + */ + if (t->parms.collect_md) { + struct ip_tunnel_info *tun_info; + const struct ip_tunnel_key *key; + struct erspan_metadata *md; + + tun_info = skb_tunnel_info(skb); + if (unlikely(!tun_info || + !(tun_info->mode & IP_TUNNEL_INFO_TX) || + ip_tunnel_info_af(tun_info) != AF_INET6)) + return -EINVAL; + + key = &tun_info->key; + memset(&fl6, 0, sizeof(fl6)); + fl6.flowi6_proto = IPPROTO_GRE; + fl6.daddr = key->u.ipv6.dst; + fl6.flowlabel = key->label; + fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL); + + dsfield = key->tos; + if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT)) + goto tx_err; + md = ip_tunnel_info_opts(tun_info); + if (!md) + goto tx_err; + + erspan_build_header(skb, tunnel_id_to_key32(key->tun_id), + ntohl(md->index), truncate, false); + + } else { + switch (skb->protocol) { + case htons(ETH_P_IP): + memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt)); + prepare_ip6gre_xmit_ipv4(skb, dev, &fl6, + &dsfield, &encap_limit); + break; + case htons(ETH_P_IPV6): + if (ipv6_addr_equal(&t->parms.raddr, &ipv6h->saddr)) + goto tx_err; + if (prepare_ip6gre_xmit_ipv6(skb, dev, &fl6, + &dsfield, &encap_limit)) + goto tx_err; + break; + default: + memcpy(&fl6, &t->fl.u.ip6, sizeof(fl6)); + break; + } + + erspan_build_header(skb, t->parms.o_key, t->parms.index, + truncate, false); + fl6.daddr = t->parms.raddr; + } + + /* Push GRE header. */ + gre_build_header(skb, 8, TUNNEL_SEQ, + htons(ETH_P_ERSPAN), 0, htonl(t->o_seqno++)); + + /* TooBig packet may have updated dst->dev's mtu */ + if (!t->parms.collect_md && dst && dst_mtu(dst) > dst->dev->mtu) + dst->ops->update_pmtu(dst, NULL, skb, dst->dev->mtu, false); + + err = ip6_tnl_xmit(skb, dev, dsfield, &fl6, encap_limit, &mtu, + NEXTHDR_GRE); + if (err != 0) { + /* XXX: send ICMP error even if DF is not set. */ + if (err == -EMSGSIZE) { + if (skb->protocol == htons(ETH_P_IP)) + icmp_send(skb, ICMP_DEST_UNREACH, + ICMP_FRAG_NEEDED, htonl(mtu)); + else + icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu); + } + + goto tx_err; + } + return NETDEV_TX_OK; + +tx_err: + stats = &t->dev->stats; + stats->tx_errors++; + stats->tx_dropped++; + kfree_skb(skb); + return NETDEV_TX_OK; +} + static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu) { struct net_device *dev = t->dev; @@ -764,7 +1037,7 @@ static void ip6gre_tnl_link_config(struct ip6_tnl *t, int set_mtu) struct rt6_info *rt = rt6_lookup(t->net, &p->raddr, &p->laddr, - p->link, strict); + p->link, NULL, strict); if (!rt) return; @@ -1094,6 +1367,10 @@ static int ip6gre_tunnel_init_common(struct net_device *dev) if (!(tunnel->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) dev->mtu -= 8; + if (tunnel->parms.collect_md) { + dev->features |= NETIF_F_NETNS_LOCAL; + netif_keep_dst(dev); + } ip6gre_tnl_init_features(dev); return 0; @@ -1110,6 +1387,9 @@ static int ip6gre_tunnel_init(struct net_device *dev) tunnel = netdev_priv(dev); + if (tunnel->parms.collect_md) + return 0; + memcpy(dev->dev_addr, &tunnel->parms.laddr, sizeof(struct in6_addr)); memcpy(dev->broadcast, &tunnel->parms.raddr, sizeof(struct in6_addr)); @@ -1130,7 +1410,6 @@ static void ip6gre_fb_tunnel_init(struct net_device *dev) tunnel->hlen = sizeof(struct ipv6hdr) + 4; } - static struct inet6_protocol ip6gre_protocol __read_mostly = { .handler = gre_rcv, .err_handler = ip6gre_err, @@ -1145,7 +1424,8 @@ static void ip6gre_destroy_tunnels(struct net *net, struct list_head *head) for_each_netdev_safe(net, dev, aux) if (dev->rtnl_link_ops == &ip6gre_link_ops || - dev->rtnl_link_ops == &ip6gre_tap_ops) + dev->rtnl_link_ops == &ip6gre_tap_ops || + dev->rtnl_link_ops == &ip6erspan_tap_ops) unregister_netdevice_queue(dev, head); for (prio = 0; prio < 4; prio++) { @@ -1205,19 +1485,21 @@ err_alloc_dev: return err; } -static void __net_exit ip6gre_exit_net(struct net *net) +static void __net_exit ip6gre_exit_batch_net(struct list_head *net_list) { + struct net *net; LIST_HEAD(list); rtnl_lock(); - ip6gre_destroy_tunnels(net, &list); + list_for_each_entry(net, net_list, exit_list) + ip6gre_destroy_tunnels(net, &list); unregister_netdevice_many(&list); rtnl_unlock(); } static struct pernet_operations ip6gre_net_ops = { .init = ip6gre_init_net, - .exit = ip6gre_exit_net, + .exit_batch = ip6gre_exit_batch_net, .id = &ip6gre_net_id, .size = sizeof(struct ip6gre_net), }; @@ -1266,6 +1548,47 @@ out: return ip6gre_tunnel_validate(tb, data, extack); } +static int ip6erspan_tap_validate(struct nlattr *tb[], struct nlattr *data[], + struct netlink_ext_ack *extack) +{ + __be16 flags = 0; + int ret; + + if (!data) + return 0; + + ret = ip6gre_tap_validate(tb, data, extack); + if (ret) + return ret; + + /* ERSPAN should only have GRE sequence and key flag */ + if (data[IFLA_GRE_OFLAGS]) + flags |= nla_get_be16(data[IFLA_GRE_OFLAGS]); + if (data[IFLA_GRE_IFLAGS]) + flags |= nla_get_be16(data[IFLA_GRE_IFLAGS]); + if (!data[IFLA_GRE_COLLECT_METADATA] && + flags != (GRE_SEQ | GRE_KEY)) + return -EINVAL; + + /* ERSPAN Session ID only has 10-bit. Since we reuse + * 32-bit key field as ID, check it's range. + */ + if (data[IFLA_GRE_IKEY] && + (ntohl(nla_get_be32(data[IFLA_GRE_IKEY])) & ~ID_MASK)) + return -EINVAL; + + if (data[IFLA_GRE_OKEY] && + (ntohl(nla_get_be32(data[IFLA_GRE_OKEY])) & ~ID_MASK)) + return -EINVAL; + + if (data[IFLA_GRE_ERSPAN_INDEX]) { + u32 index = nla_get_u32(data[IFLA_GRE_ERSPAN_INDEX]); + + if (index & ~INDEX_MASK) + return -EINVAL; + } + return 0; +} static void ip6gre_netlink_parms(struct nlattr *data[], struct __ip6_tnl_parm *parms) @@ -1312,6 +1635,12 @@ static void ip6gre_netlink_parms(struct nlattr *data[], if (data[IFLA_GRE_FWMARK]) parms->fwmark = nla_get_u32(data[IFLA_GRE_FWMARK]); + + if (data[IFLA_GRE_ERSPAN_INDEX]) + parms->index = nla_get_u32(data[IFLA_GRE_ERSPAN_INDEX]); + + if (data[IFLA_GRE_COLLECT_METADATA]) + parms->collect_md = true; } static int ip6gre_tap_init(struct net_device *dev) @@ -1338,6 +1667,59 @@ static const struct net_device_ops ip6gre_tap_netdev_ops = { .ndo_get_iflink = ip6_tnl_get_iflink, }; +static int ip6erspan_tap_init(struct net_device *dev) +{ + struct ip6_tnl *tunnel; + int t_hlen; + int ret; + + tunnel = netdev_priv(dev); + + tunnel->dev = dev; + tunnel->net = dev_net(dev); + strcpy(tunnel->parms.name, dev->name); + + dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats); + if (!dev->tstats) + return -ENOMEM; + + ret = dst_cache_init(&tunnel->dst_cache, GFP_KERNEL); + if (ret) { + free_percpu(dev->tstats); + dev->tstats = NULL; + return ret; + } + + tunnel->tun_hlen = 8; + tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen + + sizeof(struct erspanhdr); + t_hlen = tunnel->hlen + sizeof(struct ipv6hdr); + + dev->hard_header_len = LL_MAX_HEADER + t_hlen; + dev->mtu = ETH_DATA_LEN - t_hlen; + if (dev->type == ARPHRD_ETHER) + dev->mtu -= ETH_HLEN; + if (!(tunnel->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) + dev->mtu -= 8; + + dev->priv_flags |= IFF_LIVE_ADDR_CHANGE; + tunnel = netdev_priv(dev); + ip6gre_tnl_link_config(tunnel, 1); + + return 0; +} + +static const struct net_device_ops ip6erspan_netdev_ops = { + .ndo_init = ip6erspan_tap_init, + .ndo_uninit = ip6gre_tunnel_uninit, + .ndo_start_xmit = ip6erspan_tunnel_xmit, + .ndo_set_mac_address = eth_mac_addr, + .ndo_validate_addr = eth_validate_addr, + .ndo_change_mtu = ip6_tnl_change_mtu, + .ndo_get_stats64 = ip_tunnel_get_stats64, + .ndo_get_iflink = ip6_tnl_get_iflink, +}; + static void ip6gre_tap_setup(struct net_device *dev) { @@ -1408,8 +1790,13 @@ static int ip6gre_newlink(struct net *src_net, struct net_device *dev, ip6gre_netlink_parms(data, &nt->parms); - if (ip6gre_tunnel_find(net, &nt->parms, dev->type)) - return -EEXIST; + if (nt->parms.collect_md) { + if (rtnl_dereference(ign->collect_md_tun)) + return -EEXIST; + } else { + if (ip6gre_tunnel_find(net, &nt->parms, dev->type)) + return -EEXIST; + } if (dev->type == ARPHRD_ETHER && !tb[IFLA_ADDRESS]) eth_hw_addr_random(dev); @@ -1512,8 +1899,12 @@ static size_t ip6gre_get_size(const struct net_device *dev) nla_total_size(2) + /* IFLA_GRE_ENCAP_DPORT */ nla_total_size(2) + + /* IFLA_GRE_COLLECT_METADATA */ + nla_total_size(0) + /* IFLA_GRE_FWMARK */ nla_total_size(4) + + /* IFLA_GRE_ERSPAN_INDEX */ + nla_total_size(4) + 0; } @@ -1535,7 +1926,8 @@ static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev) nla_put_u8(skb, IFLA_GRE_ENCAP_LIMIT, p->encap_limit) || nla_put_be32(skb, IFLA_GRE_FLOWINFO, p->flowinfo) || nla_put_u32(skb, IFLA_GRE_FLAGS, p->flags) || - nla_put_u32(skb, IFLA_GRE_FWMARK, p->fwmark)) + nla_put_u32(skb, IFLA_GRE_FWMARK, p->fwmark) || + nla_put_u32(skb, IFLA_GRE_ERSPAN_INDEX, p->index)) goto nla_put_failure; if (nla_put_u16(skb, IFLA_GRE_ENCAP_TYPE, @@ -1548,6 +1940,11 @@ static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev) t->encap.flags)) goto nla_put_failure; + if (p->collect_md) { + if (nla_put_flag(skb, IFLA_GRE_COLLECT_METADATA)) + goto nla_put_failure; + } + return 0; nla_put_failure: @@ -1570,9 +1967,25 @@ static const struct nla_policy ip6gre_policy[IFLA_GRE_MAX + 1] = { [IFLA_GRE_ENCAP_FLAGS] = { .type = NLA_U16 }, [IFLA_GRE_ENCAP_SPORT] = { .type = NLA_U16 }, [IFLA_GRE_ENCAP_DPORT] = { .type = NLA_U16 }, + [IFLA_GRE_COLLECT_METADATA] = { .type = NLA_FLAG }, [IFLA_GRE_FWMARK] = { .type = NLA_U32 }, + [IFLA_GRE_ERSPAN_INDEX] = { .type = NLA_U32 }, }; +static void ip6erspan_tap_setup(struct net_device *dev) +{ + ether_setup(dev); + + dev->netdev_ops = &ip6erspan_netdev_ops; + dev->needs_free_netdev = true; + dev->priv_destructor = ip6gre_dev_free; + + dev->features |= NETIF_F_NETNS_LOCAL; + dev->priv_flags &= ~IFF_TX_SKB_SHARING; + dev->priv_flags |= IFF_LIVE_ADDR_CHANGE; + netif_keep_dst(dev); +} + static struct rtnl_link_ops ip6gre_link_ops __read_mostly = { .kind = "ip6gre", .maxtype = IFLA_GRE_MAX, @@ -1602,6 +2015,20 @@ static struct rtnl_link_ops ip6gre_tap_ops __read_mostly = { .get_link_net = ip6_tnl_get_link_net, }; +static struct rtnl_link_ops ip6erspan_tap_ops __read_mostly = { + .kind = "ip6erspan", + .maxtype = IFLA_GRE_MAX, + .policy = ip6gre_policy, + .priv_size = sizeof(struct ip6_tnl), + .setup = ip6erspan_tap_setup, + .validate = ip6erspan_tap_validate, + .newlink = ip6gre_newlink, + .changelink = ip6gre_changelink, + .get_size = ip6gre_get_size, + .fill_info = ip6gre_fill_info, + .get_link_net = ip6_tnl_get_link_net, +}; + /* * And now the modules code and kernel interface. */ @@ -1630,9 +2057,15 @@ static int __init ip6gre_init(void) if (err < 0) goto tap_ops_failed; + err = rtnl_link_register(&ip6erspan_tap_ops); + if (err < 0) + goto erspan_link_failed; + out: return err; +erspan_link_failed: + rtnl_link_unregister(&ip6gre_tap_ops); tap_ops_failed: rtnl_link_unregister(&ip6gre_link_ops); rtnl_link_failed: @@ -1646,6 +2079,7 @@ static void __exit ip6gre_fini(void) { rtnl_link_unregister(&ip6gre_tap_ops); rtnl_link_unregister(&ip6gre_link_ops); + rtnl_link_unregister(&ip6erspan_tap_ops); inet6_del_protocol(&ip6gre_protocol, IPPROTO_GRE); unregister_pernet_device(&ip6gre_net_ops); } diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c index 9d3bfaec452b..d4dd28afbd6a 100644 --- a/net/ipv6/ip6_input.c +++ b/net/ipv6/ip6_input.c @@ -347,7 +347,7 @@ int ip6_mc_input(struct sk_buff *skb) bool deliver; __IP6_UPD_PO_STATS(dev_net(skb_dst(skb)->dev), - ip6_dst_idev(skb_dst(skb)), IPSTATS_MIB_INMCAST, + __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INMCAST, skb->len); hdr = ipv6_hdr(skb); diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 2e66900a753d..4282229e6519 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -425,27 +425,6 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk, return dst_output(net, sk, skb); } -static unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst) -{ - unsigned int mtu; - struct inet6_dev *idev; - - if (dst_metric_locked(dst, RTAX_MTU)) { - mtu = dst_metric_raw(dst, RTAX_MTU); - if (mtu) - return mtu; - } - - mtu = IPV6_MIN_MTU; - rcu_read_lock(); - idev = __in6_dev_get(dst->dev); - if (idev) - mtu = idev->cnf.mtu6; - rcu_read_unlock(); - - return mtu; -} - static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) { if (skb->len <= mtu) @@ -466,6 +445,7 @@ static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) int ip6_forward(struct sk_buff *skb) { + struct inet6_dev *idev = __in6_dev_get_safely(skb->dev); struct dst_entry *dst = skb_dst(skb); struct ipv6hdr *hdr = ipv6_hdr(skb); struct inet6_skb_parm *opt = IP6CB(skb); @@ -485,8 +465,7 @@ int ip6_forward(struct sk_buff *skb) goto drop; if (!xfrm6_policy_check(NULL, XFRM_POLICY_FWD, skb)) { - __IP6_INC_STATS(net, ip6_dst_idev(dst), - IPSTATS_MIB_INDISCARDS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS); goto drop; } @@ -517,8 +496,7 @@ int ip6_forward(struct sk_buff *skb) /* Force OUTPUT device used as source address */ skb->dev = dst->dev; icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, 0); - __IP6_INC_STATS(net, ip6_dst_idev(dst), - IPSTATS_MIB_INHDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); kfree_skb(skb); return -ETIMEDOUT; @@ -531,15 +509,13 @@ int ip6_forward(struct sk_buff *skb) if (proxied > 0) return ip6_input(skb); else if (proxied < 0) { - __IP6_INC_STATS(net, ip6_dst_idev(dst), - IPSTATS_MIB_INDISCARDS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS); goto drop; } } if (!xfrm6_route_forward(skb)) { - __IP6_INC_STATS(net, ip6_dst_idev(dst), - IPSTATS_MIB_INDISCARDS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS); goto drop; } dst = skb_dst(skb); @@ -596,8 +572,7 @@ int ip6_forward(struct sk_buff *skb) /* Again, force OUTPUT device used as source address */ skb->dev = dst->dev; icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu); - __IP6_INC_STATS(net, ip6_dst_idev(dst), - IPSTATS_MIB_INTOOBIGERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INTOOBIGERRORS); __IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_FRAGFAILS); kfree_skb(skb); @@ -621,7 +596,7 @@ int ip6_forward(struct sk_buff *skb) ip6_forward_finish); error: - __IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_INADDRERRORS); + __IP6_INC_STATS(net, idev, IPSTATS_MIB_INADDRERRORS); drop: kfree_skb(skb); return -EINVAL; @@ -643,7 +618,6 @@ static void ip6_copy_metadata(struct sk_buff *to, struct sk_buff *from) to->tc_index = from->tc_index; #endif nf_copy(to, from); - skb_ext_copy(to, from); skb_copy_secmark(to, from); } @@ -1018,15 +992,21 @@ static int ip6_dst_lookup_tail(struct net *net, const struct sock *sk, * that's why we try it again later. */ if (ipv6_addr_any(&fl6->saddr) && (!*dst || !(*dst)->error)) { + struct fib6_info *from; struct rt6_info *rt; bool had_dst = *dst != NULL; if (!had_dst) *dst = ip6_route_output(net, sk, fl6); rt = (*dst)->error ? NULL : (struct rt6_info *)*dst; - err = ip6_route_get_saddr(net, rt, &fl6->daddr, + + rcu_read_lock(); + from = rt ? rcu_dereference(rt->from) : NULL; + err = ip6_route_get_saddr(net, from, &fl6->daddr, sk ? inet6_sk(sk)->srcprefs : 0, &fl6->saddr); + rcu_read_unlock(); + if (err) goto out_err_release; @@ -1273,7 +1253,7 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork, READ_ONCE(rt->dst.dev->mtu) : dst_mtu(&rt->dst); else mtu = np->pmtudisc >= IPV6_PMTUDISC_PROBE ? - READ_ONCE(rt->dst.dev->mtu) : dst_mtu(rt->dst.path); + READ_ONCE(rt->dst.dev->mtu) : dst_mtu(xfrm_dst_path(&rt->dst)); if (np->frag_size < mtu) { if (np->frag_size) mtu = np->frag_size; @@ -1282,7 +1262,7 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork, cork->base.gso_size = sk->sk_type == SOCK_DGRAM && sk->sk_protocol == IPPROTO_UDP ? ipc6->gso_size : 0; - if (dst_allfrag(rt->dst.path)) + if (dst_allfrag(xfrm_dst_path(&rt->dst))) cork->base.flags |= IPCORK_ALLFRAG; cork->base.length = 0; diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 7fa0d474d47a..bf992c0d05ad 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -691,7 +691,7 @@ ip6ip6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, /* Try to guess incoming interface */ rt = rt6_lookup(dev_net(skb->dev), &ipv6_hdr(skb2)->saddr, - NULL, 0, 0); + NULL, 0, skb2, 0); if (rt && rt->dst.dev) skb2->dev = rt->dst.dev; @@ -879,7 +879,7 @@ int ip6_tnl_rcv(struct ip6_tnl *t, struct sk_buff *skb, if (tpi->proto == htons(ETH_P_IP)) dscp_ecn_decapsulate = ip4ip6_dscp_ecn_decapsulate; - return __ip6_tnl_rcv(t, skb, tpi, NULL, dscp_ecn_decapsulate, + return __ip6_tnl_rcv(t, skb, tpi, tun_dst, dscp_ecn_decapsulate, log_ecn_err); } EXPORT_SYMBOL(ip6_tnl_rcv); @@ -998,6 +998,9 @@ int ip6_tnl_xmit_ctl(struct ip6_tnl *t, int ret = 0; struct net *net = t->net; + if (t->parms.collect_md) + return 1; + if ((p->flags & IP6_TNL_F_CAP_XMIT) || ((p->flags & IP6_TNL_F_CAP_PER_PACKET) && (ip6_tnl_get_cap(t, laddr, raddr) & IP6_TNL_F_CAP_XMIT))) { @@ -1461,7 +1464,7 @@ static void ip6_tnl_link_config(struct ip6_tnl *t) struct rt6_info *rt = rt6_lookup(t->net, &p->raddr, &p->laddr, - p->link, strict); + p->link, NULL, strict); if (!rt) return; @@ -2193,17 +2196,16 @@ static struct xfrm6_tunnel ip6ip6_handler __read_mostly = { .priority = 1, }; -static void __net_exit ip6_tnl_destroy_tunnels(struct net *net) +static void __net_exit ip6_tnl_destroy_tunnels(struct net *net, struct list_head *list) { struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id); struct net_device *dev, *aux; int h; struct ip6_tnl *t; - LIST_HEAD(list); for_each_netdev_safe(net, dev, aux) if (dev->rtnl_link_ops == &ip6_link_ops) - unregister_netdevice_queue(dev, &list); + unregister_netdevice_queue(dev, list); for (h = 0; h < IP6_TUNNEL_HASH_SIZE; h++) { t = rtnl_dereference(ip6n->tnls_r_l[h]); @@ -2212,12 +2214,10 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct net *net) * been added to the list by the previous loop. */ if (!net_eq(dev_net(t->dev), net)) - unregister_netdevice_queue(t->dev, &list); + unregister_netdevice_queue(t->dev, list); t = rtnl_dereference(t->next); } } - - unregister_netdevice_many(&list); } static int __net_init ip6_tnl_init_net(struct net *net) @@ -2261,16 +2261,21 @@ err_alloc_dev: return err; } -static void __net_exit ip6_tnl_exit_net(struct net *net) +static void __net_exit ip6_tnl_exit_batch_net(struct list_head *net_list) { + struct net *net; + LIST_HEAD(list); + rtnl_lock(); - ip6_tnl_destroy_tunnels(net); + list_for_each_entry(net, net_list, exit_list) + ip6_tnl_destroy_tunnels(net, &list); + unregister_netdevice_many(&list); rtnl_unlock(); } static struct pernet_operations ip6_tnl_net_ops = { .init = ip6_tnl_init_net, - .exit = ip6_tnl_exit_net, + .exit_batch = ip6_tnl_exit_batch_net, .id = &ip6_tnl_net_id, .size = sizeof(struct ip6_tnl_net), }; diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c index 312c80912401..199067817e79 100644 --- a/net/ipv6/ip6_vti.c +++ b/net/ipv6/ip6_vti.c @@ -657,6 +657,7 @@ static void vti6_link_config(struct ip6_tnl *t) { struct net_device *dev = t->dev; struct __ip6_tnl_parm *p = &t->parms; + struct net_device *tdev = NULL; memcpy(dev->dev_addr, &p->laddr, sizeof(struct in6_addr)); memcpy(dev->broadcast, &p->raddr, sizeof(struct in6_addr)); @@ -669,6 +670,25 @@ static void vti6_link_config(struct ip6_tnl *t) dev->flags |= IFF_POINTOPOINT; else dev->flags &= ~IFF_POINTOPOINT; + + if (p->flags & IP6_TNL_F_CAP_XMIT) { + int strict = (ipv6_addr_type(&p->raddr) & + (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL)); + struct rt6_info *rt = rt6_lookup(t->net, + &p->raddr, &p->laddr, + p->link, NULL, strict); + + if (rt) + tdev = rt->dst.dev; + ip6_rt_put(rt); + } + + if (!tdev && p->link) + tdev = __dev_get_by_index(t->net, p->link); + + if (tdev) + dev->mtu = max_t(int, tdev->mtu - dev->hard_header_len, + IPV6_MIN_MTU); } /** @@ -1086,23 +1106,22 @@ static struct rtnl_link_ops vti6_link_ops __read_mostly = { .get_link_net = ip6_tnl_get_link_net, }; -static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n) +static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n, + struct list_head *list) { int h; struct ip6_tnl *t; - LIST_HEAD(list); for (h = 0; h < IP6_VTI_HASH_SIZE; h++) { t = rtnl_dereference(ip6n->tnls_r_l[h]); while (t) { - unregister_netdevice_queue(t->dev, &list); + unregister_netdevice_queue(t->dev, list); t = rtnl_dereference(t->next); } } t = rtnl_dereference(ip6n->tnls_wc[0]); - unregister_netdevice_queue(t->dev, &list); - unregister_netdevice_many(&list); + unregister_netdevice_queue(t->dev, list); } static int __net_init vti6_init_net(struct net *net) @@ -1142,18 +1161,24 @@ err_alloc_dev: return err; } -static void __net_exit vti6_exit_net(struct net *net) +static void __net_exit vti6_exit_batch_net(struct list_head *net_list) { - struct vti6_net *ip6n = net_generic(net, vti6_net_id); + struct vti6_net *ip6n; + struct net *net; + LIST_HEAD(list); rtnl_lock(); - vti6_destroy_tunnels(ip6n); + list_for_each_entry(net, net_list, exit_list) { + ip6n = net_generic(net, vti6_net_id); + vti6_destroy_tunnels(ip6n, &list); + } + unregister_netdevice_many(&list); rtnl_unlock(); } static struct pernet_operations vti6_net_ops = { .init = vti6_init_net, - .exit = vti6_exit_net, + .exit_batch = vti6_exit_batch_net, .id = &vti6_net_id, .size = sizeof(struct vti6_net), }; diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index f3a291a9b2f8..9eb33abe83b9 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -165,7 +165,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr) if (ifindex == 0) { struct rt6_info *rt; - rt = rt6_lookup(net, addr, NULL, 0, 0); + rt = rt6_lookup(net, addr, NULL, 0, NULL, 0); if (rt) { dev = rt->dst.dev; ip6_rt_put(rt); @@ -254,7 +254,7 @@ static struct inet6_dev *ip6_mc_find_dev_rcu(struct net *net, struct inet6_dev *idev = NULL; if (ifindex == 0) { - struct rt6_info *rt = rt6_lookup(net, group, NULL, 0, 0); + struct rt6_info *rt = rt6_lookup(net, group, NULL, 0, NULL, 0); if (rt) { dev = rt->dst.dev; diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c index 2874a95e848c..96feb57d9ef8 100644 --- a/net/ipv6/ndisc.c +++ b/net/ipv6/ndisc.c @@ -1169,7 +1169,8 @@ static void ndisc_router_discovery(struct sk_buff *skb) struct ra_msg *ra_msg = (struct ra_msg *)skb_transport_header(skb); struct neighbour *neigh = NULL; struct inet6_dev *in6_dev; - struct rt6_info *rt = NULL; + struct fib6_info *rt = NULL; + struct net *net; int lifetime; struct ndisc_options ndopts; int optlen; @@ -1267,9 +1268,9 @@ static void ndisc_router_discovery(struct sk_buff *skb) /* Do not accept RA with source-addr found on local machine unless * accept_ra_from_local is set to true. */ + net = dev_net(in6_dev->dev); if (!in6_dev->cnf.accept_ra_from_local && - ipv6_chk_addr(dev_net(in6_dev->dev), &ipv6_hdr(skb)->saddr, - in6_dev->dev, 0)) { + ipv6_chk_addr(net, &ipv6_hdr(skb)->saddr, in6_dev->dev, 0)) { ND_PRINTK(2, info, "RA from local address detected on dev: %s: default router ignored\n", skb->dev->name); @@ -1286,20 +1287,22 @@ static void ndisc_router_discovery(struct sk_buff *skb) pref = ICMPV6_ROUTER_PREF_MEDIUM; #endif - rt = rt6_get_dflt_router(&ipv6_hdr(skb)->saddr, skb->dev); + rt = rt6_get_dflt_router(net, &ipv6_hdr(skb)->saddr, skb->dev); if (rt) { - neigh = dst_neigh_lookup(&rt->dst, &ipv6_hdr(skb)->saddr); + neigh = ip6_neigh_lookup(&rt->fib6_nh.nh_gw, + rt->fib6_nh.nh_dev, NULL, + &ipv6_hdr(skb)->saddr); if (!neigh) { ND_PRINTK(0, err, "RA: %s got default router without neighbour\n", __func__); - ip6_rt_put(rt); + fib6_info_release(rt); return; } } if (rt && lifetime == 0) { - ip6_del_rt(rt); + ip6_del_rt(net, rt); rt = NULL; } @@ -1308,7 +1311,8 @@ static void ndisc_router_discovery(struct sk_buff *skb) if (!rt && lifetime) { ND_PRINTK(3, info, "RA: adding default router\n"); - rt = rt6_add_dflt_router(&ipv6_hdr(skb)->saddr, skb->dev, pref); + rt = rt6_add_dflt_router(net, &ipv6_hdr(skb)->saddr, + skb->dev, pref); if (!rt) { ND_PRINTK(0, err, "RA: %s failed to add default route\n", @@ -1316,28 +1320,29 @@ static void ndisc_router_discovery(struct sk_buff *skb) return; } - neigh = dst_neigh_lookup(&rt->dst, &ipv6_hdr(skb)->saddr); + neigh = ip6_neigh_lookup(&rt->fib6_nh.nh_gw, + rt->fib6_nh.nh_dev, NULL, + &ipv6_hdr(skb)->saddr); if (!neigh) { ND_PRINTK(0, err, "RA: %s got default router without neighbour\n", __func__); - ip6_rt_put(rt); + fib6_info_release(rt); return; } neigh->flags |= NTF_ROUTER; } else if (rt) { - rt->rt6i_flags = (rt->rt6i_flags & ~RTF_PREF_MASK) | RTF_PREF(pref); + rt->fib6_flags = (rt->fib6_flags & ~RTF_PREF_MASK) | RTF_PREF(pref); } if (rt) - rt6_set_expires(rt, jiffies + (HZ * lifetime)); + fib6_set_expires(rt, jiffies + (HZ * lifetime)); if (in6_dev->cnf.accept_ra_min_hop_limit < 256 && ra_msg->icmph.icmp6_hop_limit) { if (in6_dev->cnf.accept_ra_min_hop_limit <= ra_msg->icmph.icmp6_hop_limit) { in6_dev->cnf.hop_limit = ra_msg->icmph.icmp6_hop_limit; - if (rt) - dst_metric_set(&rt->dst, RTAX_HOPLIMIT, - ra_msg->icmph.icmp6_hop_limit); + fib6_metric_set(rt, RTAX_HOPLIMIT, + ra_msg->icmph.icmp6_hop_limit); } else { ND_PRINTK(2, warn, "RA: Got route advertisement with lower hop_limit than minimum\n"); } @@ -1489,10 +1494,7 @@ skip_routeinfo: ND_PRINTK(2, warn, "RA: invalid mtu: %d\n", mtu); } else if (in6_dev->cnf.mtu6 != mtu) { in6_dev->cnf.mtu6 = mtu; - - if (rt) - dst_metric_set(&rt->dst, RTAX_MTU, mtu); - + fib6_metric_set(rt, RTAX_MTU, mtu); rt6_mtu_change(skb->dev, mtu); } } @@ -1511,7 +1513,7 @@ skip_routeinfo: ND_PRINTK(2, warn, "RA: invalid RA options\n"); } out: - ip6_rt_put(rt); + fib6_info_release(rt); if (neigh) neigh_release(neigh); } diff --git a/net/ipv6/netfilter/ip6t_rpfilter.c b/net/ipv6/netfilter/ip6t_rpfilter.c index 2a7c8f864fa7..5262d4e68dec 100644 --- a/net/ipv6/netfilter/ip6t_rpfilter.c +++ b/net/ipv6/netfilter/ip6t_rpfilter.c @@ -63,7 +63,7 @@ static bool rpfilter_lookup_reverse6(struct net *net, const struct sk_buff *skb, (flags & XT_RPFILTER_LOOSE) == 0) fl6.flowi6_oif = dev->ifindex; - rt = (void *) ip6_route_lookup(net, &fl6, lookup_flags); + rt = (void *)ip6_route_lookup(net, &fl6, skb, lookup_flags); if (rt->dst.error) goto out; diff --git a/net/ipv6/netfilter/nft_fib_ipv6.c b/net/ipv6/netfilter/nft_fib_ipv6.c index bcc673e433e6..bb5eafc282ac 100644 --- a/net/ipv6/netfilter/nft_fib_ipv6.c +++ b/net/ipv6/netfilter/nft_fib_ipv6.c @@ -185,7 +185,8 @@ void nft_fib6_eval(const struct nft_expr *expr, struct nft_regs *regs, } *dest = 0; - rt = (void *)ip6_route_lookup(nft_net(pkt), &fl6, lookup_flags); + rt = (void *)ip6_route_lookup(nft_net(pkt), &fl6, pkt->skb, + lookup_flags); if (rt->dst.error) goto put_rt_err; diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c index b2f7a335a12b..60dfd0d11851 100644 --- a/net/ipv6/reassembly.c +++ b/net/ipv6/reassembly.c @@ -375,7 +375,7 @@ static int ipv6_frag_rcv(struct sk_buff *skb) spin_unlock(&fq->q.lock); inet_frag_put(&fq->q); if (prob_offset) { - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), + __IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INHDRERRORS); /* icmpv6_param_prob() calls kfree_skb(skb) */ icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, prob_offset); @@ -388,7 +388,7 @@ static int ipv6_frag_rcv(struct sk_buff *skb) return -1; fail_hdr: - __IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)), + __IP6_INC_STATS(net, __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INHDRERRORS); icmpv6_param_prob(skb, ICMPV6_HDR_FIELD, skb_network_header_len(skb)); return -1; diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 69288fc80a02..d8cc24ed39fb 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -44,6 +44,7 @@ #include #include #include +#include #include #include #include @@ -77,11 +78,11 @@ enum rt6_nud_state { RT6_NUD_SUCCEED = 1 }; -static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort); static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie); static unsigned int ip6_default_advmss(const struct dst_entry *dst); static unsigned int ip6_mtu(const struct dst_entry *dst); -static struct dst_entry *ip6_negative_advice(struct dst_entry *); +static void ip6_negative_advice(struct sock *sk, + struct dst_entry *dst); static void ip6_dst_destroy(struct dst_entry *); static void ip6_dst_ifdown(struct dst_entry *, struct net_device *dev, int how); @@ -97,22 +98,24 @@ static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk, bool confirm_neigh); static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buff *skb); -static void rt6_dst_from_metrics_check(struct rt6_info *rt); -static int rt6_score_route(struct rt6_info *rt, int oif, int strict); -static size_t rt6_nlmsg_size(struct rt6_info *rt); -static int rt6_fill_node(struct net *net, - struct sk_buff *skb, struct rt6_info *rt, - struct in6_addr *dst, struct in6_addr *src, +static int rt6_score_route(struct fib6_info *rt, int oif, int strict); +static size_t rt6_nlmsg_size(struct fib6_info *rt); +static int rt6_fill_node(struct net *net, struct sk_buff *skb, + struct fib6_info *rt, struct dst_entry *dst, + struct in6_addr *dest, struct in6_addr *src, int iif, int type, u32 portid, u32 seq, unsigned int flags); +static struct rt6_info *rt6_find_cached_rt(struct fib6_info *rt, + struct in6_addr *daddr, + struct in6_addr *saddr); #ifdef CONFIG_IPV6_ROUTE_INFO -static struct rt6_info *rt6_add_route_info(struct net *net, +static struct fib6_info *rt6_add_route_info(struct net *net, const struct in6_addr *prefix, int prefixlen, const struct in6_addr *gwaddr, struct net_device *dev, unsigned int pref); -static struct rt6_info *rt6_get_route_info(struct net *net, +static struct fib6_info *rt6_get_route_info(struct net *net, const struct in6_addr *prefix, int prefixlen, const struct in6_addr *gwaddr, struct net_device *dev); @@ -140,9 +143,11 @@ static void rt6_uncached_list_del(struct rt6_info *rt) { if (!list_empty(&rt->rt6i_uncached)) { struct uncached_list *ul = rt->rt6i_uncached_list; + struct net *net = dev_net(rt->dst.dev); spin_lock_bh(&ul->lock); list_del(&rt->rt6i_uncached); + atomic_dec(&net->ipv6.rt6_stats->fib_rt_uncache); spin_unlock_bh(&ul->lock); } } @@ -179,29 +184,10 @@ static void rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev) } } -static u32 *rt6_pcpu_cow_metrics(struct rt6_info *rt) -{ - return dst_metrics_write_ptr(rt->dst.from); -} - -static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old) -{ - struct rt6_info *rt = (struct rt6_info *)dst; - - if (rt->rt6i_flags & RTF_PCPU) - return rt6_pcpu_cow_metrics(rt); - else if (rt->rt6i_flags & RTF_CACHE) - return NULL; - else - return dst_cow_metrics_generic(dst, old); -} - -static inline const void *choose_neigh_daddr(struct rt6_info *rt, +static inline const void *choose_neigh_daddr(const struct in6_addr *p, struct sk_buff *skb, const void *daddr) { - struct in6_addr *p = &rt->rt6i_gateway; - if (!ipv6_addr_any(p)) return (const void *) p; else if (skb) @@ -209,18 +195,27 @@ static inline const void *choose_neigh_daddr(struct rt6_info *rt, return daddr; } -static struct neighbour *ip6_neigh_lookup(const struct dst_entry *dst, - struct sk_buff *skb, - const void *daddr) +struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw, + struct net_device *dev, + struct sk_buff *skb, + const void *daddr) { - struct rt6_info *rt = (struct rt6_info *) dst; struct neighbour *n; - daddr = choose_neigh_daddr(rt, skb, daddr); - n = __ipv6_neigh_lookup(dst->dev, daddr); + daddr = choose_neigh_daddr(gw, skb, daddr); + n = __ipv6_neigh_lookup(dev, daddr); if (n) return n; - return neigh_create(&nd_tbl, daddr, dst->dev); + return neigh_create(&nd_tbl, daddr, dev); +} + +static struct neighbour *ip6_dst_neigh_lookup(const struct dst_entry *dst, + struct sk_buff *skb, + const void *daddr) +{ + const struct rt6_info *rt = container_of(dst, struct rt6_info, dst); + + return ip6_neigh_lookup(&rt->rt6i_gateway, dst->dev, skb, daddr); } static void ip6_confirm_neigh(const struct dst_entry *dst, const void *daddr) @@ -228,7 +223,7 @@ static void ip6_confirm_neigh(const struct dst_entry *dst, const void *daddr) struct net_device *dev = dst->dev; struct rt6_info *rt = (struct rt6_info *)dst; - daddr = choose_neigh_daddr(rt, NULL, daddr); + daddr = choose_neigh_daddr(&rt->rt6i_gateway, NULL, daddr); if (!daddr) return; if (dev->flags & (IFF_NOARP | IFF_LOOPBACK)) @@ -245,7 +240,7 @@ static struct dst_ops ip6_dst_ops_template = { .check = ip6_dst_check, .default_advmss = ip6_default_advmss, .mtu = ip6_mtu, - .cow_metrics = ipv6_cow_metrics, + .cow_metrics = dst_cow_metrics_generic, .destroy = ip6_dst_destroy, .ifdown = ip6_dst_ifdown, .negative_advice = ip6_negative_advice, @@ -253,7 +248,7 @@ static struct dst_ops ip6_dst_ops_template = { .update_pmtu = ip6_rt_update_pmtu, .redirect = rt6_do_redirect, .local_out = __ip6_local_out, - .neigh_lookup = ip6_neigh_lookup, + .neigh_lookup = ip6_dst_neigh_lookup, .confirm_neigh = ip6_confirm_neigh, }; @@ -284,13 +279,22 @@ static struct dst_ops ip6_dst_blackhole_ops = { .update_pmtu = ip6_rt_blackhole_update_pmtu, .redirect = ip6_rt_blackhole_redirect, .cow_metrics = dst_cow_metrics_generic, - .neigh_lookup = ip6_neigh_lookup, + .neigh_lookup = ip6_dst_neigh_lookup, }; static const u32 ip6_template_metrics[RTAX_MAX] = { [RTAX_HOPLIMIT - 1] = 0, }; +static const struct fib6_info fib6_null_entry_template = { + .fib6_flags = (RTF_REJECT | RTF_NONEXTHOP), + .fib6_protocol = RTPROT_KERNEL, + .fib6_metric = ~(u32)0, + .fib6_ref = ATOMIC_INIT(1), + .fib6_type = RTN_UNREACHABLE, + .fib6_metrics = (struct dst_metrics *)&dst_default_metrics, +}; + static const struct rt6_info ip6_null_entry_template = { .dst = { .__refcnt = ATOMIC_INIT(1), @@ -301,9 +305,6 @@ static const struct rt6_info ip6_null_entry_template = { .output = ip6_pkt_discard_out, }, .rt6i_flags = (RTF_REJECT | RTF_NONEXTHOP), - .rt6i_protocol = RTPROT_KERNEL, - .rt6i_metric = ~(u32) 0, - .rt6i_ref = ATOMIC_INIT(1), }; #ifdef CONFIG_IPV6_MULTIPLE_TABLES @@ -318,9 +319,6 @@ static const struct rt6_info ip6_prohibit_entry_template = { .output = ip6_pkt_prohibit_out, }, .rt6i_flags = (RTF_REJECT | RTF_NONEXTHOP), - .rt6i_protocol = RTPROT_KERNEL, - .rt6i_metric = ~(u32) 0, - .rt6i_ref = ATOMIC_INIT(1), }; static const struct rt6_info ip6_blk_hole_entry_template = { @@ -333,9 +331,6 @@ static const struct rt6_info ip6_blk_hole_entry_template = { .output = dst_discard_out, }, .rt6i_flags = (RTF_REJECT | RTF_NONEXTHOP), - .rt6i_protocol = RTPROT_KERNEL, - .rt6i_metric = ~(u32) 0, - .rt6i_ref = ATOMIC_INIT(1), }; #endif @@ -345,46 +340,19 @@ static void rt6_info_init(struct rt6_info *rt) struct dst_entry *dst = &rt->dst; memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst)); - INIT_LIST_HEAD(&rt->rt6i_siblings); INIT_LIST_HEAD(&rt->rt6i_uncached); } /* allocate dst with ip6_dst_ops */ -static struct rt6_info *__ip6_dst_alloc(struct net *net, - struct net_device *dev, - int flags) +struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev, + int flags) { struct rt6_info *rt = dst_alloc(&net->ipv6.ip6_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK, flags); - if (rt) - rt6_info_init(rt); - - return rt; -} - -struct rt6_info *ip6_dst_alloc(struct net *net, - struct net_device *dev, - int flags) -{ - struct rt6_info *rt = __ip6_dst_alloc(net, dev, flags); - if (rt) { - rt->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, GFP_ATOMIC); - if (rt->rt6i_pcpu) { - int cpu; - - for_each_possible_cpu(cpu) { - struct rt6_info **p; - - p = per_cpu_ptr(rt->rt6i_pcpu, cpu); - /* no one shares rt */ - *p = NULL; - } - } else { - dst_release_immediate(&rt->dst); - return NULL; - } + rt6_info_init(rt); + atomic_inc(&net->ipv6.rt6_stats->fib_rt_alloc); } return rt; @@ -394,11 +362,10 @@ EXPORT_SYMBOL(ip6_dst_alloc); static void ip6_dst_destroy(struct dst_entry *dst) { struct rt6_info *rt = (struct rt6_info *)dst; - struct dst_entry *from = dst->from; + struct fib6_info *from; struct inet6_dev *idev; dst_destroy_metrics_generic(dst); - free_percpu(rt->rt6i_pcpu); rt6_uncached_list_del(rt); idev = rt->rt6i_idev; @@ -407,8 +374,11 @@ static void ip6_dst_destroy(struct dst_entry *dst) in6_dev_put(idev); } - dst->from = NULL; - dst_release(from); + rcu_read_lock(); + from = rcu_dereference(rt->from); + rcu_assign_pointer(rt->from, NULL); + fib6_info_release(from); + rcu_read_unlock(); } static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev, @@ -438,80 +408,78 @@ static bool __rt6_check_expired(const struct rt6_info *rt) static bool rt6_check_expired(const struct rt6_info *rt) { + struct fib6_info *from; + + from = rcu_dereference(rt->from); + if (rt->rt6i_flags & RTF_EXPIRES) { if (time_after(jiffies, rt->dst.expires)) return true; - } else if (rt->dst.from) { + } else if (from) { return rt->dst.obsolete != DST_OBSOLETE_FORCE_CHK || - rt6_check_expired((struct rt6_info *)rt->dst.from); + fib6_check_expired(from); } return false; } -static struct rt6_info *rt6_multipath_select(struct rt6_info *match, - struct flowi6 *fl6, int oif, - int strict) +struct fib6_info *fib6_multipath_select(const struct net *net, + struct fib6_info *match, + struct flowi6 *fl6, int oif, + const struct sk_buff *skb, + int strict) { - struct rt6_info *sibling, *next_sibling; - int route_choosen; + struct fib6_info *sibling, *next_sibling; /* We might have already computed the hash for ICMPv6 errors. In such * case it will always be non-zero. Otherwise now is the time to do it. */ if (!fl6->mp_hash) - fl6->mp_hash = rt6_multipath_hash(fl6, NULL); + fl6->mp_hash = rt6_multipath_hash(net, fl6, skb, NULL); + + if (fl6->mp_hash <= atomic_read(&match->fib6_nh.nh_upper_bound)) + return match; + + list_for_each_entry_safe(sibling, next_sibling, &match->fib6_siblings, + fib6_siblings) { + int nh_upper_bound; + + nh_upper_bound = atomic_read(&sibling->fib6_nh.nh_upper_bound); + if (fl6->mp_hash > nh_upper_bound) + continue; + if (rt6_score_route(sibling, oif, strict) < 0) + break; + match = sibling; + break; + } - route_choosen = fl6->mp_hash % (match->rt6i_nsiblings + 1); - /* Don't change the route, if route_choosen == 0 - * (siblings does not include ourself) - */ - if (route_choosen) - list_for_each_entry_safe(sibling, next_sibling, - &match->rt6i_siblings, rt6i_siblings) { - route_choosen--; - if (route_choosen == 0) { - if (rt6_score_route(sibling, oif, strict) < 0) - break; - match = sibling; - break; - } - } return match; } /* - * Route lookup. Any table->tb6_lock is implied. + * Route lookup. rcu_read_lock() should be held. */ -static inline struct rt6_info *rt6_device_match(struct net *net, - struct rt6_info *rt, +static inline struct fib6_info *rt6_device_match(struct net *net, + struct fib6_info *rt, const struct in6_addr *saddr, int oif, int flags) { - struct rt6_info *local = NULL; - struct rt6_info *sprt; + struct fib6_info *sprt; - if (!oif && ipv6_addr_any(saddr)) - goto out; + if (!oif && ipv6_addr_any(saddr) && + !(rt->fib6_nh.nh_flags & RTNH_F_DEAD)) + return rt; - for (sprt = rt; sprt; sprt = sprt->dst.rt6_next) { - struct net_device *dev = sprt->dst.dev; + for (sprt = rt; sprt; sprt = rcu_dereference(sprt->fib6_next)) { + const struct net_device *dev = sprt->fib6_nh.nh_dev; + + if (sprt->fib6_nh.nh_flags & RTNH_F_DEAD) + continue; if (oif) { if (dev->ifindex == oif) return sprt; - if (dev->flags & IFF_LOOPBACK) { - if (!sprt->rt6i_idev || - sprt->rt6i_idev->dev->ifindex != oif) { - if (flags & RT6_LOOKUP_F_IFACE) - continue; - if (local && - local->rt6i_idev->dev->ifindex == oif) - continue; - } - local = sprt; - } } else { if (ipv6_chk_addr(net, saddr, dev, flags & RT6_LOOKUP_F_IFACE)) @@ -519,15 +487,10 @@ static inline struct rt6_info *rt6_device_match(struct net *net, } } - if (oif) { - if (local) - return local; + if (oif && flags & RT6_LOOKUP_F_IFACE) + return net->ipv6.fib6_null_entry; - if (flags & RT6_LOOKUP_F_IFACE) - return net->ipv6.ip6_null_entry; - } -out: - return rt; + return rt->fib6_nh.nh_flags & RTNH_F_DEAD ? net->ipv6.fib6_null_entry : rt; } #ifdef CONFIG_IPV6_ROUTER_PREF @@ -549,10 +512,14 @@ static void rt6_probe_deferred(struct work_struct *w) kfree(work); } -static void rt6_probe(struct rt6_info *rt) +static void rt6_probe(struct fib6_info *rt) { - struct __rt6_probe_work *work; + struct __rt6_probe_work *work = NULL; + const struct in6_addr *nh_gw; struct neighbour *neigh; + struct net_device *dev; + struct inet6_dev *idev; + /* * Okay, this does not seem to be appropriate * for now, however, we need to check if it @@ -561,34 +528,38 @@ static void rt6_probe(struct rt6_info *rt) * Router Reachability Probe MUST be rate-limited * to no more than one per minute. */ - if (!rt || !(rt->rt6i_flags & RTF_GATEWAY)) + if (!rt || !(rt->fib6_flags & RTF_GATEWAY)) return; + + nh_gw = &rt->fib6_nh.nh_gw; + dev = rt->fib6_nh.nh_dev; rcu_read_lock_bh(); - neigh = __ipv6_neigh_lookup_noref(rt->dst.dev, &rt->rt6i_gateway); + idev = __in6_dev_get(dev); + neigh = __ipv6_neigh_lookup_noref(dev, nh_gw); if (neigh) { if (neigh->nud_state & NUD_VALID) goto out; - work = NULL; write_lock(&neigh->lock); if (!(neigh->nud_state & NUD_VALID) && time_after(jiffies, - neigh->updated + - rt->rt6i_idev->cnf.rtr_probe_interval)) { + neigh->updated + idev->cnf.rtr_probe_interval)) { work = kmalloc(sizeof(*work), GFP_ATOMIC); if (work) __neigh_set_probe_once(neigh); } write_unlock(&neigh->lock); - } else { + } else if (time_after(jiffies, rt->last_probe + + idev->cnf.rtr_probe_interval)) { work = kmalloc(sizeof(*work), GFP_ATOMIC); } if (work) { + rt->last_probe = jiffies; INIT_WORK(&work->work, rt6_probe_deferred); - work->target = rt->rt6i_gateway; - dev_hold(rt->dst.dev); - work->dev = rt->dst.dev; + work->target = *nh_gw; + dev_hold(dev); + work->dev = dev; schedule_work(&work->work); } @@ -596,7 +567,7 @@ out: rcu_read_unlock_bh(); } #else -static inline void rt6_probe(struct rt6_info *rt) +static inline void rt6_probe(struct fib6_info *rt) { } #endif @@ -604,28 +575,27 @@ static inline void rt6_probe(struct rt6_info *rt) /* * Default Router Selection (RFC 2461 6.3.6) */ -static inline int rt6_check_dev(struct rt6_info *rt, int oif) +static inline int rt6_check_dev(struct fib6_info *rt, int oif) { - struct net_device *dev = rt->dst.dev; + const struct net_device *dev = rt->fib6_nh.nh_dev; + if (!oif || dev->ifindex == oif) return 2; - if ((dev->flags & IFF_LOOPBACK) && - rt->rt6i_idev && rt->rt6i_idev->dev->ifindex == oif) - return 1; return 0; } -static inline enum rt6_nud_state rt6_check_neigh(struct rt6_info *rt) +static inline enum rt6_nud_state rt6_check_neigh(struct fib6_info *rt) { - struct neighbour *neigh; enum rt6_nud_state ret = RT6_NUD_FAIL_HARD; + struct neighbour *neigh; - if (rt->rt6i_flags & RTF_NONEXTHOP || - !(rt->rt6i_flags & RTF_GATEWAY)) + if (rt->fib6_flags & RTF_NONEXTHOP || + !(rt->fib6_flags & RTF_GATEWAY)) return RT6_NUD_SUCCEED; rcu_read_lock_bh(); - neigh = __ipv6_neigh_lookup_noref(rt->dst.dev, &rt->rt6i_gateway); + neigh = __ipv6_neigh_lookup_noref(rt->fib6_nh.nh_dev, + &rt->fib6_nh.nh_gw); if (neigh) { read_lock(&neigh->lock); if (neigh->nud_state & NUD_VALID) @@ -646,8 +616,7 @@ static inline enum rt6_nud_state rt6_check_neigh(struct rt6_info *rt) return ret; } -static int rt6_score_route(struct rt6_info *rt, int oif, - int strict) +static int rt6_score_route(struct fib6_info *rt, int oif, int strict) { int m; @@ -655,7 +624,7 @@ static int rt6_score_route(struct rt6_info *rt, int oif, if (!m && (strict & RT6_LOOKUP_F_IFACE)) return RT6_NUD_FAIL_HARD; #ifdef CONFIG_IPV6_ROUTER_PREF - m |= IPV6_DECODE_PREF(IPV6_EXTRACT_PREF(rt->rt6i_flags)) << 2; + m |= IPV6_DECODE_PREF(IPV6_EXTRACT_PREF(rt->fib6_flags)) << 2; #endif if (strict & RT6_LOOKUP_F_REACHABLE) { int n = rt6_check_neigh(rt); @@ -665,21 +634,37 @@ static int rt6_score_route(struct rt6_info *rt, int oif, return m; } -static struct rt6_info *find_match(struct rt6_info *rt, int oif, int strict, - int *mpri, struct rt6_info *match, +/* called with rc_read_lock held */ +static inline bool fib6_ignore_linkdown(const struct fib6_info *f6i) +{ + const struct net_device *dev = fib6_info_nh_dev(f6i); + bool rc = false; + + if (dev) { + const struct inet6_dev *idev = __in6_dev_get(dev); + + rc = !!idev->cnf.ignore_routes_with_linkdown; + } + + return rc; +} + +static struct fib6_info *find_match(struct fib6_info *rt, int oif, int strict, + int *mpri, struct fib6_info *match, bool *do_rr) { int m; bool match_do_rr = false; - struct inet6_dev *idev = rt->rt6i_idev; - struct net_device *dev = rt->dst.dev; - if (dev && !netif_carrier_ok(dev) && - idev->cnf.ignore_routes_with_linkdown && + if (rt->fib6_nh.nh_flags & RTNH_F_DEAD) + goto out; + + if (fib6_ignore_linkdown(rt) && + rt->fib6_nh.nh_flags & RTNH_F_LINKDOWN && !(strict & RT6_LOOKUP_F_IGNORE_LINKSTATE)) goto out; - if (rt6_check_expired(rt)) + if (fib6_check_expired(rt)) goto out; m = rt6_score_route(rt, oif, strict); @@ -703,18 +688,19 @@ out: return match; } -static struct rt6_info *find_rr_leaf(struct fib6_node *fn, - struct rt6_info *rr_head, +static struct fib6_info *find_rr_leaf(struct fib6_node *fn, + struct fib6_info *leaf, + struct fib6_info *rr_head, u32 metric, int oif, int strict, bool *do_rr) { - struct rt6_info *rt, *match, *cont; + struct fib6_info *rt, *match, *cont; int mpri = -1; match = NULL; cont = NULL; - for (rt = rr_head; rt; rt = rt->dst.rt6_next) { - if (rt->rt6i_metric != metric) { + for (rt = rr_head; rt; rt = rcu_dereference(rt->fib6_next)) { + if (rt->fib6_metric != metric) { cont = rt; break; } @@ -722,8 +708,9 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn, match = find_match(rt, oif, strict, &mpri, match, do_rr); } - for (rt = fn->leaf; rt && rt != rr_head; rt = rt->dst.rt6_next) { - if (rt->rt6i_metric != metric) { + for (rt = leaf; rt && rt != rr_head; + rt = rcu_dereference(rt->fib6_next)) { + if (rt->fib6_metric != metric) { cont = rt; break; } @@ -734,43 +721,65 @@ static struct rt6_info *find_rr_leaf(struct fib6_node *fn, if (match || !cont) return match; - for (rt = cont; rt; rt = rt->dst.rt6_next) + for (rt = cont; rt; rt = rcu_dereference(rt->fib6_next)) match = find_match(rt, oif, strict, &mpri, match, do_rr); return match; } -static struct rt6_info *rt6_select(struct fib6_node *fn, int oif, int strict) +static struct fib6_info *rt6_select(struct net *net, struct fib6_node *fn, + int oif, int strict) { - struct rt6_info *match, *rt0; - struct net *net; + struct fib6_info *leaf = rcu_dereference(fn->leaf); + struct fib6_info *match, *rt0; bool do_rr = false; + int key_plen; - rt0 = fn->rr_ptr; + if (!leaf || leaf == net->ipv6.fib6_null_entry) + return net->ipv6.fib6_null_entry; + + rt0 = rcu_dereference(fn->rr_ptr); if (!rt0) - fn->rr_ptr = rt0 = fn->leaf; + rt0 = leaf; - match = find_rr_leaf(fn, rt0, rt0->rt6i_metric, oif, strict, + /* Double check to make sure fn is not an intermediate node + * and fn->leaf does not points to its child's leaf + * (This might happen if all routes under fn are deleted from + * the tree and fib6_repair_tree() is called on the node.) + */ + key_plen = rt0->fib6_dst.plen; +#ifdef CONFIG_IPV6_SUBTREES + if (rt0->fib6_src.plen) + key_plen = rt0->fib6_src.plen; +#endif + if (fn->fn_bit != key_plen) + return net->ipv6.fib6_null_entry; + + match = find_rr_leaf(fn, leaf, rt0, rt0->fib6_metric, oif, strict, &do_rr); if (do_rr) { - struct rt6_info *next = rt0->dst.rt6_next; + struct fib6_info *next = rcu_dereference(rt0->fib6_next); /* no entries matched; do round-robin */ - if (!next || next->rt6i_metric != rt0->rt6i_metric) - next = fn->leaf; + if (!next || next->fib6_metric != rt0->fib6_metric) + next = leaf; - if (next != rt0) - fn->rr_ptr = next; + if (next != rt0) { + spin_lock_bh(&leaf->fib6_table->tb6_lock); + /* make sure next is not being deleted from the tree */ + if (next->fib6_node) + rcu_assign_pointer(fn->rr_ptr, next); + spin_unlock_bh(&leaf->fib6_table->tb6_lock); + } } - net = dev_net(rt0->dst.dev); - return match ? match : net->ipv6.ip6_null_entry; + return match ? match : net->ipv6.fib6_null_entry; } -static bool rt6_is_gw_or_nonexthop(const struct rt6_info *rt) +static bool rt6_is_gw_or_nonexthop(const struct fib6_info *rt) { - return (rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)); + return (rt->fib6_flags & (RTF_NONEXTHOP | RTF_GATEWAY)); } #ifdef CONFIG_IPV6_ROUTE_INFO @@ -782,7 +791,7 @@ int rt6_route_rcv(struct net_device *dev, u8 *opt, int len, struct in6_addr prefix_buf, *prefix; unsigned int pref; unsigned long lifetime; - struct rt6_info *rt; + struct fib6_info *rt; if (len < sizeof(struct route_info)) { return -EINVAL; @@ -820,13 +829,13 @@ int rt6_route_rcv(struct net_device *dev, u8 *opt, int len, } if (rinfo->prefix_len == 0) - rt = rt6_get_dflt_router(gwaddr, dev); + rt = rt6_get_dflt_router(net, gwaddr, dev); else rt = rt6_get_route_info(net, prefix, rinfo->prefix_len, gwaddr, dev); if (rt && !lifetime) { - ip6_del_rt(rt); + ip6_del_rt(net, rt); rt = NULL; } @@ -834,31 +843,174 @@ int rt6_route_rcv(struct net_device *dev, u8 *opt, int len, rt = rt6_add_route_info(net, prefix, rinfo->prefix_len, gwaddr, dev, pref); else if (rt) - rt->rt6i_flags = RTF_ROUTEINFO | - (rt->rt6i_flags & ~RTF_PREF_MASK) | RTF_PREF(pref); + rt->fib6_flags = RTF_ROUTEINFO | + (rt->fib6_flags & ~RTF_PREF_MASK) | RTF_PREF(pref); if (rt) { if (!addrconf_finite_timeout(lifetime)) - rt6_clean_expires(rt); + fib6_clean_expires(rt); else - rt6_set_expires(rt, jiffies + HZ * lifetime); + fib6_set_expires(rt, jiffies + HZ * lifetime); - ip6_rt_put(rt); + fib6_info_release(rt); } return 0; } #endif +/* + * Misc support functions + */ + +/* called with rcu_lock held */ +static struct net_device *ip6_rt_get_dev_rcu(struct fib6_info *rt) +{ + struct net_device *dev = rt->fib6_nh.nh_dev; + + if (rt->fib6_flags & (RTF_LOCAL | RTF_ANYCAST)) { + /* for copies of local routes, dst->dev needs to be the + * device if it is a master device, the master device if + * device is enslaved, and the loopback as the default + */ + if (netif_is_l3_slave(dev) && + !rt6_need_strict(&rt->fib6_dst.addr)) + dev = l3mdev_master_dev_rcu(dev); + else if (!netif_is_l3_master(dev)) + dev = dev_net(dev)->loopback_dev; + /* last case is netif_is_l3_master(dev) is true in which + * case we want dev returned to be dev + */ + } + + return dev; +} + +static const int fib6_prop[RTN_MAX + 1] = { + [RTN_UNSPEC] = 0, + [RTN_UNICAST] = 0, + [RTN_LOCAL] = 0, + [RTN_BROADCAST] = 0, + [RTN_ANYCAST] = 0, + [RTN_MULTICAST] = 0, + [RTN_BLACKHOLE] = -EINVAL, + [RTN_UNREACHABLE] = -EHOSTUNREACH, + [RTN_PROHIBIT] = -EACCES, + [RTN_THROW] = -EAGAIN, + [RTN_NAT] = -EINVAL, + [RTN_XRESOLVE] = -EINVAL, +}; + +static int ip6_rt_type_to_error(u8 fib6_type) +{ + return fib6_prop[fib6_type]; +} + +static unsigned short fib6_info_dst_flags(struct fib6_info *rt) +{ + unsigned short flags = 0; + + if (rt->dst_nocount) + flags |= DST_NOCOUNT; + if (rt->dst_nopolicy) + flags |= DST_NOPOLICY; + if (rt->dst_host) + flags |= DST_HOST; + + return flags; +} + +static void ip6_rt_init_dst_reject(struct rt6_info *rt, struct fib6_info *ort) +{ + rt->dst.error = ip6_rt_type_to_error(ort->fib6_type); + + switch (ort->fib6_type) { + case RTN_BLACKHOLE: + rt->dst.output = dst_discard_out; + rt->dst.input = dst_discard; + break; + case RTN_PROHIBIT: + rt->dst.output = ip6_pkt_prohibit_out; + rt->dst.input = ip6_pkt_prohibit; + break; + case RTN_THROW: + case RTN_UNREACHABLE: + default: + rt->dst.output = ip6_pkt_discard_out; + rt->dst.input = ip6_pkt_discard; + break; + } +} + +static void ip6_rt_init_dst(struct rt6_info *rt, struct fib6_info *ort) +{ + rt->dst.flags |= fib6_info_dst_flags(ort); + + if (ort->fib6_flags & RTF_REJECT) { + ip6_rt_init_dst_reject(rt, ort); + return; + } + + rt->dst.error = 0; + rt->dst.output = ip6_output; + + if (ort->fib6_type == RTN_LOCAL) { + rt->dst.input = ip6_input; + } else if (ipv6_addr_type(&ort->fib6_dst.addr) & IPV6_ADDR_MULTICAST) { + rt->dst.input = ip6_mc_input; + } else { + rt->dst.input = ip6_forward; + } + + if (ort->fib6_nh.nh_lwtstate) { + rt->dst.lwtstate = lwtstate_get(ort->fib6_nh.nh_lwtstate); + lwtunnel_set_redirect(&rt->dst); + } + + rt->dst.lastuse = jiffies; +} + +/* Caller must already hold reference to @from */ +static void rt6_set_from(struct rt6_info *rt, struct fib6_info *from) +{ + rt->rt6i_flags &= ~RTF_EXPIRES; + rcu_assign_pointer(rt->from, from); + dst_init_metrics(&rt->dst, from->fib6_metrics->metrics, true); + if (from->fib6_metrics != &dst_default_metrics) { + rt->dst._metrics |= DST_METRICS_REFCOUNTED; + refcount_inc(&from->fib6_metrics->refcnt); + } +} + +/* Caller must already hold reference to @ort */ +static void ip6_rt_copy_init(struct rt6_info *rt, struct fib6_info *ort) +{ + struct net_device *dev = fib6_info_nh_dev(ort); + + ip6_rt_init_dst(rt, ort); + + rt->rt6i_dst = ort->fib6_dst; + rt->rt6i_idev = dev ? in6_dev_get(dev) : NULL; + rt->rt6i_gateway = ort->fib6_nh.nh_gw; + rt->rt6i_flags = ort->fib6_flags; + rt6_set_from(rt, ort); +#ifdef CONFIG_IPV6_SUBTREES + rt->rt6i_src = ort->fib6_src; +#endif + rt->rt6i_prefsrc = ort->fib6_prefsrc; + rt->dst.lwtstate = lwtstate_get(ort->fib6_nh.nh_lwtstate); +} + static struct fib6_node* fib6_backtrack(struct fib6_node *fn, struct in6_addr *saddr) { - struct fib6_node *pn; + struct fib6_node *pn, *sn; while (1) { if (fn->fn_flags & RTN_TL_ROOT) return NULL; - pn = fn->parent; - if (FIB6_SUBTREE(pn) && FIB6_SUBTREE(pn) != fn) - fn = fib6_lookup(FIB6_SUBTREE(pn), NULL, saddr); + pn = rcu_dereference(fn->parent); + sn = FIB6_SUBTREE(pn); + if (sn && sn != fn) + fn = fib6_node_lookup(sn, NULL, saddr); else fn = pn; if (fn->fn_flags & RTN_RTINFO) @@ -866,46 +1018,108 @@ static struct fib6_node* fib6_backtrack(struct fib6_node *fn, } } +static bool ip6_hold_safe(struct net *net, struct rt6_info **prt, + bool null_fallback) +{ + struct rt6_info *rt = *prt; + + if (dst_hold_safe(&rt->dst)) + return true; + if (null_fallback) { + rt = net->ipv6.ip6_null_entry; + dst_hold(&rt->dst); + } else { + rt = NULL; + } + *prt = rt; + return false; +} + +/* called with rcu_lock held */ +static struct rt6_info *ip6_create_rt_rcu(struct fib6_info *rt) +{ + unsigned short flags = fib6_info_dst_flags(rt); + struct net_device *dev = rt->fib6_nh.nh_dev; + struct rt6_info *nrt; + + if (!fib6_info_hold_safe(rt)) + return NULL; + + nrt = ip6_dst_alloc(dev_net(dev), dev, flags); + if (nrt) + ip6_rt_copy_init(nrt, rt); + else + fib6_info_release(rt); + + return nrt; +} + static struct rt6_info *ip6_pol_route_lookup(struct net *net, struct fib6_table *table, - struct flowi6 *fl6, int flags) + struct flowi6 *fl6, + const struct sk_buff *skb, + int flags) { + struct fib6_info *f6i; struct fib6_node *fn; struct rt6_info *rt; if (fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF) flags &= ~RT6_LOOKUP_F_IFACE; - read_lock_bh(&table->tb6_lock); - fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr); + rcu_read_lock(); + fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr); restart: - rt = fn->leaf; - rt = rt6_device_match(net, rt, &fl6->saddr, fl6->flowi6_oif, flags); - if (rt->rt6i_nsiblings && fl6->flowi6_oif == 0) - rt = rt6_multipath_select(rt, fl6, fl6->flowi6_oif, flags); - if (rt == net->ipv6.ip6_null_entry) { + f6i = rcu_dereference(fn->leaf); + if (!f6i) { + f6i = net->ipv6.fib6_null_entry; + } else { + f6i = rt6_device_match(net, f6i, &fl6->saddr, + fl6->flowi6_oif, flags); + if (f6i->fib6_nsiblings && fl6->flowi6_oif == 0) + f6i = fib6_multipath_select(net, f6i, fl6, + fl6->flowi6_oif, skb, + flags); + } + if (f6i == net->ipv6.fib6_null_entry) { fn = fib6_backtrack(fn, &fl6->saddr); if (fn) goto restart; } - dst_use(&rt->dst, jiffies); - read_unlock_bh(&table->tb6_lock); - trace_fib6_table_lookup(net, rt, table->tb6_id, fl6); + /* Search through exception table */ + rt = rt6_find_cached_rt(f6i, &fl6->daddr, &fl6->saddr); + if (rt) { + if (ip6_hold_safe(net, &rt, true)) + dst_use_noref(&rt->dst, jiffies); + } else if (f6i == net->ipv6.fib6_null_entry) { + rt = net->ipv6.ip6_null_entry; + dst_hold(&rt->dst); + } else { + rt = ip6_create_rt_rcu(f6i); + if (!rt) { + rt = net->ipv6.ip6_null_entry; + dst_hold(&rt->dst); + } + } + + rcu_read_unlock(); + + trace_fib6_table_lookup(net, rt, table, fl6); return rt; - } struct dst_entry *ip6_route_lookup(struct net *net, struct flowi6 *fl6, - int flags) + const struct sk_buff *skb, int flags) { - return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_lookup); + return fib6_rule_lookup(net, fl6, skb, flags, ip6_pol_route_lookup); } EXPORT_SYMBOL_GPL(ip6_route_lookup); struct rt6_info *rt6_lookup(struct net *net, const struct in6_addr *daddr, - const struct in6_addr *saddr, int oif, int strict) + const struct in6_addr *saddr, int oif, + const struct sk_buff *skb, int strict) { struct flowi6 fl6 = { .flowi6_oif = oif, @@ -919,7 +1133,7 @@ struct rt6_info *rt6_lookup(struct net *net, const struct in6_addr *daddr, flags |= RT6_LOOKUP_F_HAS_SADDR; } - dst = fib6_rule_lookup(net, &fl6, flags, ip6_pol_route_lookup); + dst = fib6_rule_lookup(net, &fl6, skb, flags, ip6_pol_route_lookup); if (dst->error == 0) return (struct rt6_info *) dst; @@ -935,55 +1149,28 @@ EXPORT_SYMBOL(rt6_lookup); * Caller must hold dst before calling it. */ -static int __ip6_ins_rt(struct rt6_info *rt, struct nl_info *info, - struct mx6_config *mxc, +static int __ip6_ins_rt(struct fib6_info *rt, struct nl_info *info, struct netlink_ext_ack *extack) { int err; struct fib6_table *table; - table = rt->rt6i_table; - write_lock_bh(&table->tb6_lock); - err = fib6_add(&table->tb6_root, rt, info, mxc, extack); - write_unlock_bh(&table->tb6_lock); + table = rt->fib6_table; + spin_lock_bh(&table->tb6_lock); + err = fib6_add(&table->tb6_root, rt, info, extack); + spin_unlock_bh(&table->tb6_lock); return err; } -int ip6_ins_rt(struct rt6_info *rt) +int ip6_ins_rt(struct net *net, struct fib6_info *rt) { - struct nl_info info = { .nl_net = dev_net(rt->dst.dev), }; - struct mx6_config mxc = { .mx = NULL, }; + struct nl_info info = { .nl_net = net, }; - /* Hold dst to account for the reference from the fib6 tree */ - dst_hold(&rt->dst); - return __ip6_ins_rt(rt, &info, &mxc, NULL); + return __ip6_ins_rt(rt, &info, NULL); } -/* called with rcu_lock held */ -static struct net_device *ip6_rt_get_dev_rcu(struct rt6_info *rt) -{ - struct net_device *dev = rt->dst.dev; - - if (rt->rt6i_flags & (RTF_LOCAL | RTF_ANYCAST)) { - /* for copies of local routes, dst->dev needs to be the - * device if it is a master device, the master device if - * device is enslaved, and the loopback as the default - */ - if (netif_is_l3_slave(dev) && - !rt6_need_strict(&rt->rt6i_dst.addr)) - dev = l3mdev_master_dev_rcu(dev); - else if (!netif_is_l3_master(dev)) - dev = dev_net(dev)->loopback_dev; - /* last case is netif_is_l3_master(dev) is true in which - * case we want dev returned to be dev - */ - } - - return dev; -} - -static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort, +static struct rt6_info *ip6_rt_cache_alloc(struct fib6_info *ort, const struct in6_addr *daddr, const struct in6_addr *saddr) { @@ -994,26 +1181,25 @@ static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort, * Clone the route. */ - if (ort->rt6i_flags & (RTF_CACHE | RTF_PCPU)) - ort = (struct rt6_info *)ort->dst.from; - - rcu_read_lock(); - dev = ip6_rt_get_dev_rcu(ort); - rt = __ip6_dst_alloc(dev_net(dev), dev, 0); - rcu_read_unlock(); - if (!rt) + if (!fib6_info_hold_safe(ort)) return NULL; + dev = ip6_rt_get_dev_rcu(ort); + rt = ip6_dst_alloc(dev_net(dev), dev, 0); + if (!rt) { + fib6_info_release(ort); + return NULL; + } + ip6_rt_copy_init(rt, ort); rt->rt6i_flags |= RTF_CACHE; - rt->rt6i_metric = 0; rt->dst.flags |= DST_HOST; rt->rt6i_dst.addr = *daddr; rt->rt6i_dst.plen = 128; if (!rt6_is_gw_or_nonexthop(ort)) { - if (ort->rt6i_dst.plen != 128 && - ipv6_addr_equal(&ort->rt6i_dst.addr, daddr)) + if (ort->fib6_dst.plen != 128 && + ipv6_addr_equal(&ort->fib6_dst.addr, daddr)) rt->rt6i_flags |= RTF_ANYCAST; #ifdef CONFIG_IPV6_SUBTREES if (rt->rt6i_src.plen && saddr) { @@ -1026,101 +1212,622 @@ static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort, return rt; } -static struct rt6_info *ip6_rt_pcpu_alloc(struct rt6_info *rt) +static struct rt6_info *ip6_rt_pcpu_alloc(struct fib6_info *rt) { + unsigned short flags = fib6_info_dst_flags(rt); struct net_device *dev; struct rt6_info *pcpu_rt; + if (!fib6_info_hold_safe(rt)) + return NULL; + rcu_read_lock(); dev = ip6_rt_get_dev_rcu(rt); - pcpu_rt = __ip6_dst_alloc(dev_net(dev), dev, rt->dst.flags); + pcpu_rt = ip6_dst_alloc(dev_net(dev), dev, flags); rcu_read_unlock(); - if (!pcpu_rt) + if (!pcpu_rt) { + fib6_info_release(rt); return NULL; + } ip6_rt_copy_init(pcpu_rt, rt); - pcpu_rt->rt6i_protocol = rt->rt6i_protocol; pcpu_rt->rt6i_flags |= RTF_PCPU; return pcpu_rt; } -/* It should be called with read_lock_bh(&tb6_lock) acquired */ -static struct rt6_info *rt6_get_pcpu_route(struct rt6_info *rt) +/* It should be called with rcu_read_lock() acquired */ +static struct rt6_info *rt6_get_pcpu_route(struct fib6_info *rt) { struct rt6_info *pcpu_rt, **p; p = this_cpu_ptr(rt->rt6i_pcpu); pcpu_rt = *p; - if (pcpu_rt) { - dst_hold(&pcpu_rt->dst); - rt6_dst_from_metrics_check(pcpu_rt); - } + if (pcpu_rt) + ip6_hold_safe(NULL, &pcpu_rt, false); + return pcpu_rt; } -static struct rt6_info *rt6_make_pcpu_route(struct rt6_info *rt) +static struct rt6_info *rt6_make_pcpu_route(struct net *net, + struct fib6_info *rt) { - struct fib6_table *table = rt->rt6i_table; struct rt6_info *pcpu_rt, *prev, **p; pcpu_rt = ip6_rt_pcpu_alloc(rt); if (!pcpu_rt) { - struct net *net = dev_net(rt->dst.dev); - dst_hold(&net->ipv6.ip6_null_entry->dst); return net->ipv6.ip6_null_entry; } - read_lock_bh(&table->tb6_lock); - if (rt->rt6i_pcpu) { - p = this_cpu_ptr(rt->rt6i_pcpu); - prev = cmpxchg(p, NULL, pcpu_rt); - if (prev) { - /* If someone did it before us, return prev instead */ - dst_release_immediate(&pcpu_rt->dst); - pcpu_rt = prev; - } - } else { - /* rt has been removed from the fib6 tree - * before we have a chance to acquire the read_lock. - * In this case, don't brother to create a pcpu rt - * since rt is going away anyway. The next - * dst_check() will trigger a re-lookup. - */ - dst_release_immediate(&pcpu_rt->dst); - pcpu_rt = rt; - } dst_hold(&pcpu_rt->dst); - rt6_dst_from_metrics_check(pcpu_rt); - read_unlock_bh(&table->tb6_lock); + p = this_cpu_ptr(rt->rt6i_pcpu); + prev = cmpxchg(p, NULL, pcpu_rt); + BUG_ON(prev); + + if (rt->fib6_destroying) { + struct fib6_info *from; + + from = xchg((__force struct fib6_info **)&pcpu_rt->from, NULL); + fib6_info_release(from); + } + return pcpu_rt; } -struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, - int oif, struct flowi6 *fl6, int flags) +/* exception hash table implementation + */ +static DEFINE_SPINLOCK(rt6_exception_lock); + +/* Remove rt6_ex from hash table and free the memory + * Caller must hold rt6_exception_lock + */ +static void rt6_remove_exception(struct rt6_exception_bucket *bucket, + struct rt6_exception *rt6_ex) +{ + struct net *net; + + if (!bucket || !rt6_ex) + return; + + net = dev_net(rt6_ex->rt6i->dst.dev); + hlist_del_rcu(&rt6_ex->hlist); + dst_release(&rt6_ex->rt6i->dst); + kfree_rcu(rt6_ex, rcu); + WARN_ON_ONCE(!bucket->depth); + bucket->depth--; + net->ipv6.rt6_stats->fib_rt_cache--; +} + +/* Remove oldest rt6_ex in bucket and free the memory + * Caller must hold rt6_exception_lock + */ +static void rt6_exception_remove_oldest(struct rt6_exception_bucket *bucket) +{ + struct rt6_exception *rt6_ex, *oldest = NULL; + + if (!bucket) + return; + + hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) { + if (!oldest || time_before(rt6_ex->stamp, oldest->stamp)) + oldest = rt6_ex; + } + rt6_remove_exception(bucket, oldest); +} + +static u32 rt6_exception_hash(const struct in6_addr *dst, + const struct in6_addr *src) +{ + static u32 seed __read_mostly; + u32 val; + + net_get_random_once(&seed, sizeof(seed)); + val = jhash(dst, sizeof(*dst), seed); + +#ifdef CONFIG_IPV6_SUBTREES + if (src) + val = jhash(src, sizeof(*src), val); +#endif + return hash_32(val, FIB6_EXCEPTION_BUCKET_SIZE_SHIFT); +} + +/* Helper function to find the cached rt in the hash table + * and update bucket pointer to point to the bucket for this + * (daddr, saddr) pair + * Caller must hold rt6_exception_lock + */ +static struct rt6_exception * +__rt6_find_exception_spinlock(struct rt6_exception_bucket **bucket, + const struct in6_addr *daddr, + const struct in6_addr *saddr) +{ + struct rt6_exception *rt6_ex; + u32 hval; + + if (!(*bucket) || !daddr) + return NULL; + + hval = rt6_exception_hash(daddr, saddr); + *bucket += hval; + + hlist_for_each_entry(rt6_ex, &(*bucket)->chain, hlist) { + struct rt6_info *rt6 = rt6_ex->rt6i; + bool matched = ipv6_addr_equal(daddr, &rt6->rt6i_dst.addr); + +#ifdef CONFIG_IPV6_SUBTREES + if (matched && saddr) + matched = ipv6_addr_equal(saddr, &rt6->rt6i_src.addr); +#endif + if (matched) + return rt6_ex; + } + return NULL; +} + +/* Helper function to find the cached rt in the hash table + * and update bucket pointer to point to the bucket for this + * (daddr, saddr) pair + * Caller must hold rcu_read_lock() + */ +static struct rt6_exception * +__rt6_find_exception_rcu(struct rt6_exception_bucket **bucket, + const struct in6_addr *daddr, + const struct in6_addr *saddr) +{ + struct rt6_exception *rt6_ex; + u32 hval; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + if (!(*bucket) || !daddr) + return NULL; + + hval = rt6_exception_hash(daddr, saddr); + *bucket += hval; + + hlist_for_each_entry_rcu(rt6_ex, &(*bucket)->chain, hlist) { + struct rt6_info *rt6 = rt6_ex->rt6i; + bool matched = ipv6_addr_equal(daddr, &rt6->rt6i_dst.addr); + +#ifdef CONFIG_IPV6_SUBTREES + if (matched && saddr) + matched = ipv6_addr_equal(saddr, &rt6->rt6i_src.addr); +#endif + if (matched) + return rt6_ex; + } + return NULL; +} + +static unsigned int fib6_mtu(const struct fib6_info *rt) +{ + unsigned int mtu; + + if (rt->fib6_pmtu) { + mtu = rt->fib6_pmtu; + } else { + struct net_device *dev = fib6_info_nh_dev(rt); + struct inet6_dev *idev; + + rcu_read_lock(); + idev = __in6_dev_get(dev); + mtu = idev->cnf.mtu6; + rcu_read_unlock(); + } + + mtu = min_t(unsigned int, mtu, IP6_MAX_MTU); + + return mtu - lwtunnel_headroom(rt->fib6_nh.nh_lwtstate, mtu); +} + +static int rt6_insert_exception(struct rt6_info *nrt, + struct fib6_info *ort) +{ + struct net *net = dev_net(nrt->dst.dev); + struct rt6_exception_bucket *bucket; + struct in6_addr *src_key = NULL; + struct rt6_exception *rt6_ex; + int err = 0; + + spin_lock_bh(&rt6_exception_lock); + + if (ort->exception_bucket_flushed) { + err = -EINVAL; + goto out; + } + + bucket = rcu_dereference_protected(ort->rt6i_exception_bucket, + lockdep_is_held(&rt6_exception_lock)); + if (!bucket) { + bucket = kcalloc(FIB6_EXCEPTION_BUCKET_SIZE, sizeof(*bucket), + GFP_ATOMIC); + if (!bucket) { + err = -ENOMEM; + goto out; + } + rcu_assign_pointer(ort->rt6i_exception_bucket, bucket); + } + +#ifdef CONFIG_IPV6_SUBTREES + /* rt6i_src.plen != 0 indicates ort is in subtree + * and exception table is indexed by a hash of + * both rt6i_dst and rt6i_src. + * Otherwise, the exception table is indexed by + * a hash of only rt6i_dst. + */ + if (ort->fib6_src.plen) + src_key = &nrt->rt6i_src.addr; +#endif + + /* Update rt6i_prefsrc as it could be changed + * in rt6_remove_prefsrc() + */ + nrt->rt6i_prefsrc = ort->fib6_prefsrc; + /* rt6_mtu_change() might lower mtu on ort. + * Only insert this exception route if its mtu + * is less than ort's mtu value. + */ + if (dst_metric_raw(&nrt->dst, RTAX_MTU) >= fib6_mtu(ort)) { + err = -EINVAL; + goto out; + } + + rt6_ex = __rt6_find_exception_spinlock(&bucket, &nrt->rt6i_dst.addr, + src_key); + if (rt6_ex) + rt6_remove_exception(bucket, rt6_ex); + + rt6_ex = kzalloc(sizeof(*rt6_ex), GFP_ATOMIC); + if (!rt6_ex) { + err = -ENOMEM; + goto out; + } + rt6_ex->rt6i = nrt; + rt6_ex->stamp = jiffies; + hlist_add_head_rcu(&rt6_ex->hlist, &bucket->chain); + bucket->depth++; + net->ipv6.rt6_stats->fib_rt_cache++; + + if (bucket->depth > FIB6_MAX_DEPTH) + rt6_exception_remove_oldest(bucket); + +out: + spin_unlock_bh(&rt6_exception_lock); + + /* Update fn->fn_sernum to invalidate all cached dst */ + if (!err) { + spin_lock_bh(&ort->fib6_table->tb6_lock); + fib6_update_sernum(net, ort); + spin_unlock_bh(&ort->fib6_table->tb6_lock); + fib6_force_start_gc(net); + } + + return err; +} + +void rt6_flush_exceptions(struct fib6_info *rt) +{ + struct rt6_exception_bucket *bucket; + struct rt6_exception *rt6_ex; + struct hlist_node *tmp; + int i; + + spin_lock_bh(&rt6_exception_lock); + /* Prevent rt6_insert_exception() to recreate the bucket list */ + rt->exception_bucket_flushed = 1; + + bucket = rcu_dereference_protected(rt->rt6i_exception_bucket, + lockdep_is_held(&rt6_exception_lock)); + if (!bucket) + goto out; + + for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) { + hlist_for_each_entry_safe(rt6_ex, tmp, &bucket->chain, hlist) + rt6_remove_exception(bucket, rt6_ex); + WARN_ON_ONCE(bucket->depth); + bucket++; + } + +out: + spin_unlock_bh(&rt6_exception_lock); +} + +/* Find cached rt in the hash table inside passed in rt + * Caller has to hold rcu_read_lock() + */ +static struct rt6_info *rt6_find_cached_rt(struct fib6_info *rt, + struct in6_addr *daddr, + struct in6_addr *saddr) +{ + struct rt6_exception_bucket *bucket; + struct in6_addr *src_key = NULL; + struct rt6_exception *rt6_ex; + struct rt6_info *res = NULL; + + bucket = rcu_dereference(rt->rt6i_exception_bucket); + +#ifdef CONFIG_IPV6_SUBTREES + /* rt6i_src.plen != 0 indicates rt is in subtree + * and exception table is indexed by a hash of + * both rt6i_dst and rt6i_src. + * Otherwise, the exception table is indexed by + * a hash of only rt6i_dst. + */ + if (rt->fib6_src.plen) + src_key = saddr; +#endif + rt6_ex = __rt6_find_exception_rcu(&bucket, daddr, src_key); + + if (rt6_ex && !rt6_check_expired(rt6_ex->rt6i)) + res = rt6_ex->rt6i; + + return res; +} + +/* Remove the passed in cached rt from the hash table that contains it */ +static int rt6_remove_exception_rt(struct rt6_info *rt) +{ + struct rt6_exception_bucket *bucket; + struct in6_addr *src_key = NULL; + struct rt6_exception *rt6_ex; + struct fib6_info *from; + int err; + + from = rcu_dereference_protected(rt->from, + lockdep_is_held(&rt6_exception_lock)); + if (!from || + !(rt->rt6i_flags | RTF_CACHE)) + return -EINVAL; + + if (!rcu_access_pointer(from->rt6i_exception_bucket)) + return -ENOENT; + + spin_lock_bh(&rt6_exception_lock); + bucket = rcu_dereference_protected(from->rt6i_exception_bucket, + lockdep_is_held(&rt6_exception_lock)); +#ifdef CONFIG_IPV6_SUBTREES + /* rt6i_src.plen != 0 indicates 'from' is in subtree + * and exception table is indexed by a hash of + * both rt6i_dst and rt6i_src. + * Otherwise, the exception table is indexed by + * a hash of only rt6i_dst. + */ + if (from->fib6_src.plen) + src_key = &rt->rt6i_src.addr; +#endif + rt6_ex = __rt6_find_exception_spinlock(&bucket, + &rt->rt6i_dst.addr, + src_key); + if (rt6_ex) { + rt6_remove_exception(bucket, rt6_ex); + err = 0; + } else { + err = -ENOENT; + } + + spin_unlock_bh(&rt6_exception_lock); + return err; +} + +/* Find rt6_ex which contains the passed in rt cache and + * refresh its stamp + */ +static void rt6_update_exception_stamp_rt(struct rt6_info *rt) +{ + struct rt6_exception_bucket *bucket; + struct fib6_info *from = rt->from; + struct in6_addr *src_key = NULL; + struct rt6_exception *rt6_ex; + + if (!from || + !(rt->rt6i_flags | RTF_CACHE)) + return; + + rcu_read_lock(); + bucket = rcu_dereference(from->rt6i_exception_bucket); + +#ifdef CONFIG_IPV6_SUBTREES + /* rt6i_src.plen != 0 indicates 'from' is in subtree + * and exception table is indexed by a hash of + * both rt6i_dst and rt6i_src. + * Otherwise, the exception table is indexed by + * a hash of only rt6i_dst. + */ + if (from->fib6_src.plen) + src_key = &rt->rt6i_src.addr; +#endif + rt6_ex = __rt6_find_exception_rcu(&bucket, + &rt->rt6i_dst.addr, + src_key); + if (rt6_ex) + rt6_ex->stamp = jiffies; + + rcu_read_unlock(); +} + +static void rt6_exceptions_remove_prefsrc(struct fib6_info *rt) +{ + struct rt6_exception_bucket *bucket; + struct rt6_exception *rt6_ex; + int i; + + bucket = rcu_dereference_protected(rt->rt6i_exception_bucket, + lockdep_is_held(&rt6_exception_lock)); + + if (bucket) { + for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) { + hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) { + rt6_ex->rt6i->rt6i_prefsrc.plen = 0; + } + bucket++; + } + } +} + +static bool rt6_mtu_change_route_allowed(struct inet6_dev *idev, + struct rt6_info *rt, int mtu) +{ + /* If the new MTU is lower than the route PMTU, this new MTU will be the + * lowest MTU in the path: always allow updating the route PMTU to + * reflect PMTU decreases. + * + * If the new MTU is higher, and the route PMTU is equal to the local + * MTU, this means the old MTU is the lowest in the path, so allow + * updating it: if other nodes now have lower MTUs, PMTU discovery will + * handle this. + */ + + if (dst_mtu(&rt->dst) >= mtu) + return true; + + if (dst_mtu(&rt->dst) == idev->cnf.mtu6) + return true; + + return false; +} + +static void rt6_exceptions_update_pmtu(struct inet6_dev *idev, + struct fib6_info *rt, int mtu) +{ + struct rt6_exception_bucket *bucket; + struct rt6_exception *rt6_ex; + int i; + + bucket = rcu_dereference_protected(rt->rt6i_exception_bucket, + lockdep_is_held(&rt6_exception_lock)); + + if (!bucket) + return; + + for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) { + hlist_for_each_entry(rt6_ex, &bucket->chain, hlist) { + struct rt6_info *entry = rt6_ex->rt6i; + + /* For RTF_CACHE with rt6i_pmtu == 0 (i.e. a redirected + * route), the metrics of its rt->from have already + * been updated. + */ + if (dst_metric_raw(&entry->dst, RTAX_MTU) && + rt6_mtu_change_route_allowed(idev, entry, mtu)) + dst_metric_set(&entry->dst, RTAX_MTU, mtu); + } + bucket++; + } +} + +#define RTF_CACHE_GATEWAY (RTF_GATEWAY | RTF_CACHE) + +static void rt6_exceptions_clean_tohost(struct fib6_info *rt, + struct in6_addr *gateway) +{ + struct rt6_exception_bucket *bucket; + struct rt6_exception *rt6_ex; + struct hlist_node *tmp; + int i; + + if (!rcu_access_pointer(rt->rt6i_exception_bucket)) + return; + + spin_lock_bh(&rt6_exception_lock); + bucket = rcu_dereference_protected(rt->rt6i_exception_bucket, + lockdep_is_held(&rt6_exception_lock)); + + if (bucket) { + for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) { + hlist_for_each_entry_safe(rt6_ex, tmp, + &bucket->chain, hlist) { + struct rt6_info *entry = rt6_ex->rt6i; + + if ((entry->rt6i_flags & RTF_CACHE_GATEWAY) == + RTF_CACHE_GATEWAY && + ipv6_addr_equal(gateway, + &entry->rt6i_gateway)) { + rt6_remove_exception(bucket, rt6_ex); + } + } + bucket++; + } + } + + spin_unlock_bh(&rt6_exception_lock); +} + +static void rt6_age_examine_exception(struct rt6_exception_bucket *bucket, + struct rt6_exception *rt6_ex, + struct fib6_gc_args *gc_args, + unsigned long now) +{ + struct rt6_info *rt = rt6_ex->rt6i; + + if (atomic_read(&rt->dst.__refcnt) == 1 && + time_after_eq(now, rt->dst.lastuse + gc_args->timeout)) { + RT6_TRACE("aging clone %p\n", rt); + rt6_remove_exception(bucket, rt6_ex); + return; + } else if (rt->rt6i_flags & RTF_GATEWAY) { + struct neighbour *neigh; + __u8 neigh_flags = 0; + + neigh = dst_neigh_lookup(&rt->dst, &rt->rt6i_gateway); + if (neigh) { + neigh_flags = neigh->flags; + neigh_release(neigh); + } + if (!(neigh_flags & NTF_ROUTER)) { + RT6_TRACE("purging route %p via non-router but gateway\n", + rt); + rt6_remove_exception(bucket, rt6_ex); + return; + } + } + gc_args->more++; +} + +void rt6_age_exceptions(struct fib6_info *rt, + struct fib6_gc_args *gc_args, + unsigned long now) +{ + struct rt6_exception_bucket *bucket; + struct rt6_exception *rt6_ex; + struct hlist_node *tmp; + int i; + + if (!rcu_access_pointer(rt->rt6i_exception_bucket)) + return; + + spin_lock_bh(&rt6_exception_lock); + bucket = rcu_dereference_protected(rt->rt6i_exception_bucket, + lockdep_is_held(&rt6_exception_lock)); + + if (bucket) { + for (i = 0; i < FIB6_EXCEPTION_BUCKET_SIZE; i++) { + hlist_for_each_entry_safe(rt6_ex, tmp, + &bucket->chain, hlist) { + rt6_age_examine_exception(bucket, rt6_ex, + gc_args, now); + } + bucket++; + } + } + spin_unlock_bh(&rt6_exception_lock); +} + +/* must be called with rcu lock held */ +struct fib6_info *fib6_table_lookup(struct net *net, struct fib6_table *table, + int oif, struct flowi6 *fl6, int strict) { struct fib6_node *fn, *saved_fn; - struct rt6_info *rt; - int strict = 0; + struct fib6_info *f6i; - strict |= flags & RT6_LOOKUP_F_IFACE; - strict |= flags & RT6_LOOKUP_F_IGNORE_LINKSTATE; - if (net->ipv6.devconf_all->forwarding == 0) - strict |= RT6_LOOKUP_F_REACHABLE; - - read_lock_bh(&table->tb6_lock); - - fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr); + fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr); saved_fn = fn; if (fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF) oif = 0; redo_rt6_select: - rt = rt6_select(fn, oif, strict); - if (rt->rt6i_nsiblings) - rt = rt6_multipath_select(rt, fl6, oif, strict); - if (rt == net->ipv6.ip6_null_entry) { + f6i = rt6_select(net, fn, oif, strict); + if (f6i == net->ipv6.fib6_null_entry) { fn = fib6_backtrack(fn, &fl6->saddr); if (fn) goto redo_rt6_select; @@ -1132,42 +1839,70 @@ redo_rt6_select: } } + return f6i; +} - if (rt == net->ipv6.ip6_null_entry || (rt->rt6i_flags & RTF_CACHE)) { - dst_use(&rt->dst, jiffies); - read_unlock_bh(&table->tb6_lock); +struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, + int oif, struct flowi6 *fl6, + const struct sk_buff *skb, int flags) +{ + struct fib6_info *f6i; + struct rt6_info *rt; + int strict = 0; - rt6_dst_from_metrics_check(rt); + strict |= flags & RT6_LOOKUP_F_IFACE; + strict |= flags & RT6_LOOKUP_F_IGNORE_LINKSTATE; + if (net->ipv6.devconf_all->forwarding == 0) + strict |= RT6_LOOKUP_F_REACHABLE; - trace_fib6_table_lookup(net, rt, table->tb6_id, fl6); + rcu_read_lock(); + + f6i = fib6_table_lookup(net, table, oif, fl6, strict); + if (f6i->fib6_nsiblings) + f6i = fib6_multipath_select(net, f6i, fl6, oif, skb, strict); + + if (f6i == net->ipv6.fib6_null_entry) { + rt = net->ipv6.ip6_null_entry; + rcu_read_unlock(); + dst_hold(&rt->dst); + trace_fib6_table_lookup(net, rt, table, fl6); + return rt; + } + + /*Search through exception table */ + rt = rt6_find_cached_rt(f6i, &fl6->daddr, &fl6->saddr); + if (rt) { + if (ip6_hold_safe(net, &rt, true)) + dst_use_noref(&rt->dst, jiffies); + + rcu_read_unlock(); + trace_fib6_table_lookup(net, rt, table, fl6); return rt; } else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) && - !(rt->rt6i_flags & RTF_GATEWAY))) { + !(f6i->fib6_flags & RTF_GATEWAY))) { /* Create a RTF_CACHE clone which will not be * owned by the fib6 tree. It is for the special case where * the daddr in the skb during the neighbor look-up is different * from the fl6->daddr used to look-up route here. */ - struct rt6_info *uncached_rt; - dst_use(&rt->dst, jiffies); - read_unlock_bh(&table->tb6_lock); + uncached_rt = ip6_rt_cache_alloc(f6i, &fl6->daddr, NULL); - uncached_rt = ip6_rt_cache_alloc(rt, &fl6->daddr, NULL); - dst_release(&rt->dst); + rcu_read_unlock(); if (uncached_rt) { /* Uncached_rt's refcnt is taken during ip6_rt_cache_alloc() * No need for another dst_hold() */ rt6_uncached_list_add(uncached_rt); + atomic_inc(&net->ipv6.rt6_stats->fib_rt_uncache); } else { uncached_rt = net->ipv6.ip6_null_entry; dst_hold(&uncached_rt->dst); } - trace_fib6_table_lookup(net, uncached_rt, table->tb6_id, fl6); + trace_fib6_table_lookup(net, uncached_rt, table, fl6); return uncached_rt; } else { @@ -1175,52 +1910,49 @@ redo_rt6_select: struct rt6_info *pcpu_rt; - rt->dst.lastuse = jiffies; - rt->dst.__use++; - pcpu_rt = rt6_get_pcpu_route(rt); + local_bh_disable(); + pcpu_rt = rt6_get_pcpu_route(f6i); - if (pcpu_rt) { - read_unlock_bh(&table->tb6_lock); - } else { - /* We have to do the read_unlock first - * because rt6_make_pcpu_route() may trigger - * ip6_dst_gc() which will take the write_lock. - */ - dst_hold(&rt->dst); - read_unlock_bh(&table->tb6_lock); - pcpu_rt = rt6_make_pcpu_route(rt); - dst_release(&rt->dst); - } + if (!pcpu_rt) + pcpu_rt = rt6_make_pcpu_route(net, f6i); - trace_fib6_table_lookup(net, pcpu_rt, table->tb6_id, fl6); + local_bh_enable(); + rcu_read_unlock(); + trace_fib6_table_lookup(net, pcpu_rt, table, fl6); return pcpu_rt; - } } EXPORT_SYMBOL_GPL(ip6_pol_route); -static struct rt6_info *ip6_pol_route_input(struct net *net, struct fib6_table *table, - struct flowi6 *fl6, int flags) +static struct rt6_info *ip6_pol_route_input(struct net *net, + struct fib6_table *table, + struct flowi6 *fl6, + const struct sk_buff *skb, + int flags) { - return ip6_pol_route(net, table, fl6->flowi6_iif, fl6, flags); + return ip6_pol_route(net, table, fl6->flowi6_iif, fl6, skb, flags); } struct dst_entry *ip6_route_input_lookup(struct net *net, struct net_device *dev, - struct flowi6 *fl6, int flags) + struct flowi6 *fl6, + const struct sk_buff *skb, + int flags) { if (rt6_need_strict(&fl6->daddr) && dev->type != ARPHRD_PIMREG) flags |= RT6_LOOKUP_F_IFACE; - return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_input); + return fib6_rule_lookup(net, fl6, skb, flags, ip6_pol_route_input); } EXPORT_SYMBOL_GPL(ip6_route_input_lookup); static void ip6_multipath_l3_keys(const struct sk_buff *skb, - struct flow_keys *keys) + struct flow_keys *keys, + struct flow_keys *flkeys) { const struct ipv6hdr *outer_iph = ipv6_hdr(skb); const struct ipv6hdr *key_iph = outer_iph; + struct flow_keys *_flkeys = flkeys; const struct ipv6hdr *inner_iph; const struct icmp6hdr *icmph; struct ipv6hdr _inner_iph; @@ -1247,26 +1979,78 @@ static void ip6_multipath_l3_keys(const struct sk_buff *skb, goto out; key_iph = inner_iph; + _flkeys = NULL; out: memset(keys, 0, sizeof(*keys)); keys->control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; - keys->addrs.v6addrs.src = key_iph->saddr; - keys->addrs.v6addrs.dst = key_iph->daddr; - keys->tags.flow_label = ip6_flowlabel(key_iph); - keys->basic.ip_proto = key_iph->nexthdr; + if (_flkeys) { + keys->addrs.v6addrs.src = _flkeys->addrs.v6addrs.src; + keys->addrs.v6addrs.dst = _flkeys->addrs.v6addrs.dst; + keys->tags.flow_label = _flkeys->tags.flow_label; + keys->basic.ip_proto = _flkeys->basic.ip_proto; + } else { + keys->addrs.v6addrs.src = key_iph->saddr; + keys->addrs.v6addrs.dst = key_iph->daddr; + keys->tags.flow_label = ip6_flowlabel(key_iph); + keys->basic.ip_proto = key_iph->nexthdr; + } } /* if skb is set it will be used and fl6 can be NULL */ -u32 rt6_multipath_hash(const struct flowi6 *fl6, const struct sk_buff *skb) +u32 rt6_multipath_hash(const struct net *net, const struct flowi6 *fl6, + const struct sk_buff *skb, struct flow_keys *flkeys) { struct flow_keys hash_keys; + u32 mhash; - if (skb) { - ip6_multipath_l3_keys(skb, &hash_keys); - return flow_hash_from_keys(&hash_keys); + switch (net->ipv6.sysctl.multipath_hash_policy) { + case 0: + memset(&hash_keys, 0, sizeof(hash_keys)); + hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; + if (skb) { + ip6_multipath_l3_keys(skb, &hash_keys, flkeys); + } else { + hash_keys.addrs.v6addrs.src = fl6->saddr; + hash_keys.addrs.v6addrs.dst = fl6->daddr; + hash_keys.tags.flow_label = (__force u32)fl6->flowlabel; + hash_keys.basic.ip_proto = fl6->flowi6_proto; + } + break; + case 1: + if (skb) { + unsigned int flag = FLOW_DISSECTOR_F_STOP_AT_ENCAP; + struct flow_keys keys; + + /* short-circuit if we already have L4 hash present */ + if (skb->l4_hash) + return skb_get_hash_raw(skb) >> 1; + + memset(&hash_keys, 0, sizeof(hash_keys)); + + if (!flkeys) { + skb_flow_dissect_flow_keys(skb, &keys, flag); + flkeys = &keys; + } + hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; + hash_keys.addrs.v6addrs.src = flkeys->addrs.v6addrs.src; + hash_keys.addrs.v6addrs.dst = flkeys->addrs.v6addrs.dst; + hash_keys.ports.src = flkeys->ports.src; + hash_keys.ports.dst = flkeys->ports.dst; + hash_keys.basic.ip_proto = flkeys->basic.ip_proto; + } else { + memset(&hash_keys, 0, sizeof(hash_keys)); + hash_keys.control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; + hash_keys.addrs.v6addrs.src = fl6->saddr; + hash_keys.addrs.v6addrs.dst = fl6->daddr; + hash_keys.ports.src = fl6->fl6_sport; + hash_keys.ports.dst = fl6->fl6_dport; + hash_keys.basic.ip_proto = fl6->flowi6_proto; + } + break; } + mhash = flow_hash_from_keys(&hash_keys); - return get_hash_from_flowi6(fl6); + return mhash >> 1; } void ip6_route_input(struct sk_buff *skb) @@ -1283,20 +2067,29 @@ void ip6_route_input(struct sk_buff *skb) .flowi6_mark = skb->mark, .flowi6_proto = iph->nexthdr, }; + struct flow_keys *flkeys = NULL, _flkeys; tun_info = skb_tunnel_info(skb); if (tun_info && !(tun_info->mode & IP_TUNNEL_INFO_TX)) fl6.flowi6_tun_key.tun_id = tun_info->key.tun_id; + + if (fib6_rules_early_flow_dissect(net, skb, &fl6, &_flkeys)) + flkeys = &_flkeys; + if (unlikely(fl6.flowi6_proto == IPPROTO_ICMPV6)) - fl6.mp_hash = rt6_multipath_hash(&fl6, skb); + fl6.mp_hash = rt6_multipath_hash(net, &fl6, skb, flkeys); skb_dst_drop(skb); - skb_dst_set(skb, ip6_route_input_lookup(net, skb->dev, &fl6, flags)); + skb_dst_set(skb, + ip6_route_input_lookup(net, skb->dev, &fl6, skb, flags)); } -static struct rt6_info *ip6_pol_route_output(struct net *net, struct fib6_table *table, - struct flowi6 *fl6, int flags) +static struct rt6_info *ip6_pol_route_output(struct net *net, + struct fib6_table *table, + struct flowi6 *fl6, + const struct sk_buff *skb, + int flags) { - return ip6_pol_route(net, table, fl6->flowi6_oif, fl6, flags); + return ip6_pol_route(net, table, fl6->flowi6_oif, fl6, skb, flags); } struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock *sk, @@ -1324,7 +2117,7 @@ struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock *sk, else if (sk) flags |= rt6_srcprefs2flags(inet6_sk(sk)->srcprefs); - return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_output); + return fib6_rule_lookup(net, fl6, NULL, flags, ip6_pol_route_output); } EXPORT_SYMBOL_GPL(ip6_route_output_flags); @@ -1338,6 +2131,7 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori DST_OBSOLETE_DEAD, 0); if (rt) { rt6_info_init(rt); + atomic_inc(&net->ipv6.rt6_stats->fib_rt_alloc); new = &rt->dst; new->__use = 1; @@ -1349,7 +2143,6 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori rt->rt6i_idev = in6_dev_get(loopback_dev); rt->rt6i_gateway = ort->rt6i_gateway; rt->rt6i_flags = ort->rt6i_flags & ~RTF_PCPU; - rt->rt6i_metric = 0; memcpy(&rt->rt6i_dst, &ort->rt6i_dst, sizeof(struct rt6key)); #ifdef CONFIG_IPV6_SUBTREES @@ -1365,18 +2158,28 @@ struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_ori * Destination cache support functions */ -static void rt6_dst_from_metrics_check(struct rt6_info *rt) -{ - if (rt->dst.from && - dst_metrics_ptr(&rt->dst) != dst_metrics_ptr(rt->dst.from)) - dst_init_metrics(&rt->dst, dst_metrics_ptr(rt->dst.from), true); -} - -static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie) +static bool fib6_check(struct fib6_info *f6i, u32 cookie) { u32 rt_cookie = 0; - if (!rt6_get_cookie_safe(rt, &rt_cookie) || rt_cookie != cookie) + if ((f6i && !fib6_get_cookie_safe(f6i, &rt_cookie)) || + rt_cookie != cookie) + return false; + + if (fib6_check_expired(f6i)) + return false; + + return true; +} + +static struct dst_entry *rt6_check(struct rt6_info *rt, + struct fib6_info *from, + u32 cookie) +{ + u32 rt_cookie = 0; + + if ((from && !fib6_get_cookie_safe(from, &rt_cookie)) || + rt_cookie != cookie) return NULL; if (rt6_check_expired(rt)) @@ -1385,11 +2188,13 @@ static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie) return &rt->dst; } -static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, u32 cookie) +static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, + struct fib6_info *from, + u32 cookie) { if (!__rt6_check_expired(rt) && rt->dst.obsolete == DST_OBSOLETE_FORCE_CHK && - rt6_check((struct rt6_info *)(rt->dst.from), cookie)) + fib6_check(from, cookie)) return &rt->dst; else return NULL; @@ -1397,40 +2202,48 @@ static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, u32 cookie) static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie) { + struct dst_entry *dst_ret; + struct fib6_info *from; struct rt6_info *rt; - rt = (struct rt6_info *) dst; + rt = container_of(dst, struct rt6_info, dst); + + rcu_read_lock(); /* All IPV6 dsts are created with ->obsolete set to the value * DST_OBSOLETE_FORCE_CHK which forces validation calls down * into this function always. */ - rt6_dst_from_metrics_check(rt); + from = rcu_dereference(rt->from); - if (rt->rt6i_flags & RTF_PCPU || - (unlikely(!list_empty(&rt->rt6i_uncached)) && rt->dst.from)) - return rt6_dst_from_check(rt, cookie); + if (from && (rt->rt6i_flags & RTF_PCPU || + unlikely(!list_empty(&rt->rt6i_uncached)))) + dst_ret = rt6_dst_from_check(rt, from, cookie); else - return rt6_check(rt, cookie); + dst_ret = rt6_check(rt, from, cookie); + + rcu_read_unlock(); + + return dst_ret; } -static struct dst_entry *ip6_negative_advice(struct dst_entry *dst) +static void ip6_negative_advice(struct sock *sk, + struct dst_entry *dst) { struct rt6_info *rt = (struct rt6_info *) dst; - if (rt) { - if (rt->rt6i_flags & RTF_CACHE) { - if (rt6_check_expired(rt)) { - ip6_del_rt(rt); - dst = NULL; - } - } else { - dst_release(dst); - dst = NULL; + if (rt->rt6i_flags & RTF_CACHE) { + if (rt6_check_expired(rt)) { + /* counteract the dst_release() in sk_dst_reset() */ + dst_hold(dst); + sk_dst_reset(sk); + + rt6_remove_exception_rt(rt); } + return; } - return dst; + sk_dst_reset(sk); } static void ip6_link_failure(struct sk_buff *skb) @@ -1441,35 +2254,60 @@ static void ip6_link_failure(struct sk_buff *skb) rt = (struct rt6_info *) skb_dst(skb); if (rt) { + rcu_read_lock(); if (rt->rt6i_flags & RTF_CACHE) { if (dst_hold_safe(&rt->dst)) - ip6_del_rt(rt); + rt6_remove_exception_rt(rt); } else { + struct fib6_info *from; struct fib6_node *fn; - rcu_read_lock(); - fn = rcu_dereference(rt->rt6i_node); - if (fn && (rt->rt6i_flags & RTF_DEFAULT)) - fn->fn_sernum = -1; - rcu_read_unlock(); + from = rcu_dereference(rt->from); + if (from) { + fn = rcu_dereference(from->fib6_node); + if (fn && (rt->rt6i_flags & RTF_DEFAULT)) + WRITE_ONCE(fn->fn_sernum, -1); + } } + rcu_read_unlock(); } } +static void rt6_update_expires(struct rt6_info *rt0, int timeout) +{ + if (!(rt0->rt6i_flags & RTF_EXPIRES)) { + struct fib6_info *from; + + rcu_read_lock(); + from = rcu_dereference(rt0->from); + if (from) + rt0->dst.expires = from->expires; + rcu_read_unlock(); + } + + dst_set_expires(&rt0->dst, timeout); + rt0->rt6i_flags |= RTF_EXPIRES; +} + static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu) { struct net *net = dev_net(rt->dst.dev); + dst_metric_set(&rt->dst, RTAX_MTU, mtu); rt->rt6i_flags |= RTF_MODIFIED; - rt->rt6i_pmtu = mtu; rt6_update_expires(rt, net->ipv6.sysctl.ip6_rt_mtu_expires); } static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt) { + bool from_set; + + rcu_read_lock(); + from_set = !!rcu_dereference(rt->from); + rcu_read_unlock(); + return !(rt->rt6i_flags & RTF_CACHE) && - (rt->rt6i_flags & RTF_PCPU || - rcu_access_pointer(rt->rt6i_node)); + (rt->rt6i_flags & RTF_PCPU || from_set); } static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk, @@ -1504,24 +2342,22 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk, if (!rt6_cache_allowed_for_pmtu(rt6)) { rt6_do_update_pmtu(rt6, mtu); + /* update rt6_ex->stamp for cache */ + if (rt6->rt6i_flags & RTF_CACHE) + rt6_update_exception_stamp_rt(rt6); } else if (daddr) { + struct fib6_info *from; struct rt6_info *nrt6; - nrt6 = ip6_rt_cache_alloc(rt6, daddr, saddr); + rcu_read_lock(); + from = rcu_dereference(rt6->from); + nrt6 = ip6_rt_cache_alloc(from, daddr, saddr); if (nrt6) { rt6_do_update_pmtu(nrt6, mtu); - - /* ip6_ins_rt(nrt6) will bump the - * rt6->rt6i_node->fn_sernum - * which will fail the next rt6_check() and - * invalidate the sk->sk_dst_cache. - */ - ip6_ins_rt(nrt6); - /* Release the reference taken in - * ip6_rt_cache_alloc() - */ - dst_release(&nrt6->dst); + if (rt6_insert_exception(nrt6, from)) + dst_release_immediate(&nrt6->dst); } + rcu_read_unlock(); } } @@ -1586,10 +2422,12 @@ struct ip6rd_flowi { static struct rt6_info *__ip6_route_redirect(struct net *net, struct fib6_table *table, struct flowi6 *fl6, + const struct sk_buff *skb, int flags) { struct ip6rd_flowi *rdfl = (struct ip6rd_flowi *)fl6; - struct rt6_info *rt; + struct rt6_info *ret = NULL, *rt_cache; + struct fib6_info *rt; struct fib6_node *fn; /* Get the "current" route for this destination and @@ -1602,48 +2440,69 @@ static struct rt6_info *__ip6_route_redirect(struct net *net, * routes. */ - read_lock_bh(&table->tb6_lock); - fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr); + rcu_read_lock(); + fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr); restart: - for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) { - if (rt6_check_expired(rt)) + for_each_fib6_node_rt_rcu(fn) { + if (rt->fib6_nh.nh_flags & RTNH_F_DEAD) continue; - if (rt->dst.error) + if (fib6_check_expired(rt)) + continue; + if (rt->fib6_flags & RTF_REJECT) break; - if (!(rt->rt6i_flags & RTF_GATEWAY)) + if (!(rt->fib6_flags & RTF_GATEWAY)) continue; - if (fl6->flowi6_oif != rt->dst.dev->ifindex) + if (fl6->flowi6_oif != rt->fib6_nh.nh_dev->ifindex) continue; - if (!ipv6_addr_equal(&rdfl->gateway, &rt->rt6i_gateway)) + /* rt_cache's gateway might be different from its 'parent' + * in the case of an ip redirect. + * So we keep searching in the exception table if the gateway + * is different. + */ + if (!ipv6_addr_equal(&rdfl->gateway, &rt->fib6_nh.nh_gw)) { + rt_cache = rt6_find_cached_rt(rt, + &fl6->daddr, + &fl6->saddr); + if (rt_cache && + ipv6_addr_equal(&rdfl->gateway, + &rt_cache->rt6i_gateway)) { + ret = rt_cache; + break; + } continue; + } break; } if (!rt) - rt = net->ipv6.ip6_null_entry; - else if (rt->dst.error) { - rt = net->ipv6.ip6_null_entry; + rt = net->ipv6.fib6_null_entry; + else if (rt->fib6_flags & RTF_REJECT) { + ret = net->ipv6.ip6_null_entry; goto out; } - if (rt == net->ipv6.ip6_null_entry) { + if (rt == net->ipv6.fib6_null_entry) { fn = fib6_backtrack(fn, &fl6->saddr); if (fn) goto restart; } out: - dst_hold(&rt->dst); + if (ret) + ip6_hold_safe(net, &ret, true); + else + ret = ip6_create_rt_rcu(rt); - read_unlock_bh(&table->tb6_lock); + rcu_read_unlock(); - trace_fib6_table_lookup(net, rt, table->tb6_id, fl6); - return rt; + trace_fib6_table_lookup(net, ret, table, fl6); + return ret; }; static struct dst_entry *ip6_route_redirect(struct net *net, - const struct flowi6 *fl6, - const struct in6_addr *gateway) + const struct flowi6 *fl6, + const struct sk_buff *skb, + const struct in6_addr *gateway) { int flags = RT6_LOOKUP_F_HAS_SADDR; struct ip6rd_flowi rdfl; @@ -1651,7 +2510,7 @@ static struct dst_entry *ip6_route_redirect(struct net *net, rdfl.fl6 = *fl6; rdfl.gateway = *gateway; - return fib6_rule_lookup(net, &rdfl.fl6, + return fib6_rule_lookup(net, &rdfl.fl6, skb, flags, __ip6_route_redirect); } @@ -1671,7 +2530,7 @@ void ip6_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark, fl6.flowlabel = ip6_flowinfo(iph); fl6.flowi6_uid = uid; - dst = ip6_route_redirect(net, &fl6, &ipv6_hdr(skb)->saddr); + dst = ip6_route_redirect(net, &fl6, skb, &ipv6_hdr(skb)->saddr); rt6_do_redirect(dst, NULL, skb); dst_release(dst); } @@ -1693,7 +2552,7 @@ void ip6_redirect_no_header(struct sk_buff *skb, struct net *net, int oif, fl6.saddr = iph->daddr; fl6.flowi6_uid = sock_net_uid(net, NULL); - dst = ip6_route_redirect(net, &fl6, &iph->saddr); + dst = ip6_route_redirect(net, &fl6, skb, &iph->saddr); rt6_do_redirect(dst, NULL, skb); dst_release(dst); } @@ -1729,12 +2588,8 @@ static unsigned int ip6_default_advmss(const struct dst_entry *dst) static unsigned int ip6_mtu(const struct dst_entry *dst) { - const struct rt6_info *rt = (const struct rt6_info *)dst; - unsigned int mtu = rt->rt6i_pmtu; struct inet6_dev *idev; - - if (mtu) - goto out; + unsigned int mtu; mtu = dst_metric_raw(dst, RTAX_MTU); if (mtu) @@ -1754,6 +2609,54 @@ out: return mtu - lwtunnel_headroom(dst->lwtstate, mtu); } +/* MTU selection: + * 1. mtu on route is locked - use it + * 2. mtu from nexthop exception + * 3. mtu from egress device + * + * based on ip6_dst_mtu_forward and exception logic of + * rt6_find_cached_rt; called with rcu_read_lock + */ +u32 ip6_mtu_from_fib6(struct fib6_info *f6i, struct in6_addr *daddr, + struct in6_addr *saddr) +{ + struct rt6_exception_bucket *bucket; + struct rt6_exception *rt6_ex; + struct in6_addr *src_key; + struct inet6_dev *idev; + u32 mtu = 0; + + if (unlikely(fib6_metric_locked(f6i, RTAX_MTU))) { + mtu = f6i->fib6_pmtu; + if (mtu) + goto out; + } + + src_key = NULL; +#ifdef CONFIG_IPV6_SUBTREES + if (f6i->fib6_src.plen) + src_key = saddr; +#endif + + bucket = rcu_dereference(f6i->rt6i_exception_bucket); + rt6_ex = __rt6_find_exception_rcu(&bucket, daddr, src_key); + if (rt6_ex && !rt6_check_expired(rt6_ex->rt6i)) + mtu = dst_metric_raw(&rt6_ex->rt6i->dst, RTAX_MTU); + + if (likely(!mtu)) { + struct net_device *dev = fib6_info_nh_dev(f6i); + + mtu = IPV6_MIN_MTU; + idev = __in6_dev_get(dev); + if (idev && idev->cnf.mtu6 > mtu) + mtu = idev->cnf.mtu6; + } + + mtu = min_t(unsigned int, mtu, IP6_MAX_MTU); +out: + return mtu - lwtunnel_headroom(fib6_info_nh_lwt(f6i), mtu); +} + struct dst_entry *icmp6_dst_alloc(struct net_device *dev, struct flowi6 *fl6) { @@ -1781,10 +2684,11 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev, rt->rt6i_idev = idev; dst_metric_set(&rt->dst, RTAX_HOPLIMIT, 0); - /* Add this dst into uncached_list so that rt6_ifdown() can + /* Add this dst into uncached_list so that rt6_disable_ip() can * do proper release of the net_device */ rt6_uncached_list_add(rt); + atomic_inc(&net->ipv6.rt6_stats->fib_rt_uncache); dst = xfrm_lookup(net, &rt->dst, flowi6_to_flowi(fl6), NULL, 0); @@ -1818,64 +2722,30 @@ out: atomic_set(&net->ipv6.ip6_rt_gc_expire, val - (val >> rt_elasticity)); } -static int ip6_convert_metrics(struct mx6_config *mxc, - const struct fib6_config *cfg) +static int ip6_convert_metrics(struct net *net, struct fib6_info *rt, + struct fib6_config *cfg) { - bool ecn_ca = false; - struct nlattr *nla; - int remaining; - u32 *mp; + int err = 0; - if (!cfg->fc_mx) - return 0; + if (cfg->fc_mx) { + rt->fib6_metrics = kzalloc(sizeof(*rt->fib6_metrics), + GFP_KERNEL); + if (unlikely(!rt->fib6_metrics)) + return -ENOMEM; - mp = kzalloc(sizeof(u32) * RTAX_MAX, GFP_KERNEL); - if (unlikely(!mp)) - return -ENOMEM; + refcount_set(&rt->fib6_metrics->refcnt, 1); - nla_for_each_attr(nla, cfg->fc_mx, cfg->fc_mx_len, remaining) { - int type = nla_type(nla); - u32 val; - - if (!type) - continue; - if (unlikely(type > RTAX_MAX)) - goto err; - - if (type == RTAX_CC_ALGO) { - char tmp[TCP_CA_NAME_MAX]; - - nla_strlcpy(tmp, nla, sizeof(tmp)); - val = tcp_ca_get_key_by_name(tmp, &ecn_ca); - if (val == TCP_CA_UNSPEC) - goto err; - } else { - val = nla_get_u32(nla); - } - if (type == RTAX_HOPLIMIT && val > 255) - val = 255; - if (type == RTAX_FEATURES && (val & ~RTAX_FEATURE_MASK)) - goto err; - - mp[type - 1] = val; - __set_bit(type - 1, mxc->mx_valid); + err = ip_metrics_convert(net, cfg->fc_mx, cfg->fc_mx_len, + rt->fib6_metrics->metrics); } - if (ecn_ca) { - __set_bit(RTAX_FEATURES - 1, mxc->mx_valid); - mp[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA; - } - - mxc->mx = mp; - return 0; - err: - kfree(mp); - return -EINVAL; + return err; } static struct rt6_info *ip6_nh_lookup_table(struct net *net, struct fib6_config *cfg, - const struct in6_addr *gw_addr) + const struct in6_addr *gw_addr, + u32 tbid, int flags) { struct flowi6 fl6 = { .flowi6_oif = cfg->fc_ifindex, @@ -1884,16 +2754,16 @@ static struct rt6_info *ip6_nh_lookup_table(struct net *net, }; struct fib6_table *table; struct rt6_info *rt; - int flags = RT6_LOOKUP_F_IFACE | RT6_LOOKUP_F_IGNORE_LINKSTATE; - table = fib6_get_table(net, cfg->fc_table); + table = fib6_get_table(net, tbid); if (!table) return NULL; if (!ipv6_addr_any(&cfg->fc_prefsrc)) flags |= RT6_LOOKUP_F_HAS_SADDR; - rt = ip6_pol_route(net, table, cfg->fc_ifindex, &fl6, flags); + flags |= RT6_LOOKUP_F_IGNORE_LINKSTATE; + rt = ip6_pol_route(net, table, cfg->fc_ifindex, &fl6, NULL, flags); /* if table lookup failed, fall back to full lookup */ if (rt == net->ipv6.ip6_null_entry) { @@ -1904,11 +2774,150 @@ static struct rt6_info *ip6_nh_lookup_table(struct net *net, return rt; } -static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg, +static int ip6_route_check_nh_onlink(struct net *net, + struct fib6_config *cfg, + const struct net_device *dev, + struct netlink_ext_ack *extack) +{ + u32 tbid = l3mdev_fib_table(dev) ? : RT_TABLE_LOCAL; + const struct in6_addr *gw_addr = &cfg->fc_gateway; + u32 flags = RTF_LOCAL | RTF_ANYCAST | RTF_REJECT; + struct rt6_info *grt; + int err; + + err = 0; + grt = ip6_nh_lookup_table(net, cfg, gw_addr, tbid, 0); + if (grt) { + if (grt->rt6i_flags & flags || dev != grt->dst.dev) { + NL_SET_ERR_MSG(extack, "Nexthop has invalid gateway"); + err = -EINVAL; + } + + ip6_rt_put(grt); + } + + return err; +} + +static int ip6_route_check_nh(struct net *net, + struct fib6_config *cfg, + struct net_device **_dev, + struct inet6_dev **idev) +{ + const struct in6_addr *gw_addr = &cfg->fc_gateway; + struct net_device *dev = _dev ? *_dev : NULL; + struct rt6_info *grt = NULL; + int err = -EHOSTUNREACH; + + if (cfg->fc_table) { + int flags = RT6_LOOKUP_F_IFACE; + + grt = ip6_nh_lookup_table(net, cfg, gw_addr, + cfg->fc_table, flags); + if (grt) { + if (grt->rt6i_flags & RTF_GATEWAY || + (dev && dev != grt->dst.dev)) { + ip6_rt_put(grt); + grt = NULL; + } + } + } + + if (!grt) + grt = rt6_lookup(net, gw_addr, NULL, cfg->fc_ifindex, NULL, 1); + + if (!grt) + goto out; + + if (dev) { + if (dev != grt->dst.dev) { + ip6_rt_put(grt); + goto out; + } + } else { + *_dev = dev = grt->dst.dev; + *idev = grt->rt6i_idev; + dev_hold(dev); + in6_dev_hold(grt->rt6i_idev); + } + + if (!(grt->rt6i_flags & RTF_GATEWAY)) + err = 0; + + ip6_rt_put(grt); + +out: + return err; +} + +static int ip6_validate_gw(struct net *net, struct fib6_config *cfg, + struct net_device **_dev, struct inet6_dev **idev, + struct netlink_ext_ack *extack) +{ + const struct in6_addr *gw_addr = &cfg->fc_gateway; + int gwa_type = ipv6_addr_type(gw_addr); + const struct net_device *dev = *_dev; + int err = -EINVAL; + + /* if gw_addr is local we will fail to detect this in case + * address is still TENTATIVE (DAD in progress). rt6_lookup() + * will return already-added prefix route via interface that + * prefix route was assigned to, which might be non-loopback. + */ + if (ipv6_chk_addr_and_flags(net, gw_addr, + gwa_type & IPV6_ADDR_LINKLOCAL ? + dev : NULL, 0, 0)) { + NL_SET_ERR_MSG(extack, "Invalid gateway address"); + goto out; + } + + if (gwa_type != (IPV6_ADDR_LINKLOCAL | IPV6_ADDR_UNICAST)) { + /* IPv6 strictly inhibits using not link-local + * addresses as nexthop address. + * Otherwise, router will not able to send redirects. + * It is very good, but in some (rare!) circumstances + * (SIT, PtP, NBMA NOARP links) it is handy to allow + * some exceptions. --ANK + * We allow IPv4-mapped nexthops to support RFC4798-type + * addressing + */ + if (!(gwa_type & (IPV6_ADDR_UNICAST | IPV6_ADDR_MAPPED))) { + NL_SET_ERR_MSG(extack, "Invalid gateway address"); + goto out; + } + + if (cfg->fc_flags & RTNH_F_ONLINK) + err = ip6_route_check_nh_onlink(net, cfg, dev, extack); + else + err = ip6_route_check_nh(net, cfg, _dev, idev); + + if (err) + goto out; + } + + /* reload in case device was changed */ + dev = *_dev; + + err = -EINVAL; + if (!dev) { + NL_SET_ERR_MSG(extack, "Egress device not specified"); + goto out; + } else if (dev->flags & IFF_LOOPBACK) { + NL_SET_ERR_MSG(extack, + "Egress device can not be loopback device for this route"); + goto out; + } + err = 0; +out: + return err; +} + +static struct fib6_info *ip6_route_info_create(struct fib6_config *cfg, + gfp_t gfp_flags, struct netlink_ext_ack *extack) { struct net *net = cfg->fc_nlinfo.nl_net; - struct rt6_info *rt = NULL; + struct fib6_info *rt = NULL; struct net_device *dev = NULL; struct inet6_dev *idev = NULL; struct fib6_table *table; @@ -1921,6 +2930,17 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg, goto out; } + /* RTF_CACHE is an internal flag; can not be set by userspace */ + if (cfg->fc_flags & RTF_CACHE) { + NL_SET_ERR_MSG(extack, "Userspace can not set RTF_CACHE"); + goto out; + } + + if (cfg->fc_type > RTN_MAX) { + NL_SET_ERR_MSG(extack, "Invalid route type"); + goto out; + } + if (cfg->fc_dst_len > 128) { NL_SET_ERR_MSG(extack, "Invalid prefix length"); goto out; @@ -1949,6 +2969,21 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg, if (cfg->fc_metric == 0) cfg->fc_metric = IP6_RT_PRIO_USER; + if (cfg->fc_flags & RTNH_F_ONLINK) { + if (!dev) { + NL_SET_ERR_MSG(extack, + "Nexthop device required for onlink"); + err = -ENODEV; + goto out; + } + + if (!(dev->flags & IFF_UP)) { + NL_SET_ERR_MSG(extack, "Nexthop device is not up"); + err = -ENETDOWN; + goto out; + } + } + err = -ENOBUFS; if (cfg->fc_nlinfo.nlh && !(cfg->fc_nlinfo.nlh->nlmsg_flags & NLM_F_CREATE)) { @@ -1964,35 +2999,30 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg, if (!table) goto out; - rt = ip6_dst_alloc(net, NULL, - (cfg->fc_flags & RTF_ADDRCONF) ? 0 : DST_NOCOUNT); - - if (!rt) { - err = -ENOMEM; + err = -ENOMEM; + rt = fib6_info_alloc(gfp_flags); + if (!rt) + goto out; + + if (cfg->fc_flags & RTF_ADDRCONF) + rt->dst_nocount = true; + + err = ip6_convert_metrics(net, rt, cfg); + if (err < 0) goto out; - } if (cfg->fc_flags & RTF_EXPIRES) - rt6_set_expires(rt, jiffies + + fib6_set_expires(rt, jiffies + clock_t_to_jiffies(cfg->fc_expires)); else - rt6_clean_expires(rt); + fib6_clean_expires(rt); if (cfg->fc_protocol == RTPROT_UNSPEC) cfg->fc_protocol = RTPROT_BOOT; - rt->rt6i_protocol = cfg->fc_protocol; + rt->fib6_protocol = cfg->fc_protocol; addr_type = ipv6_addr_type(&cfg->fc_dst); - if (addr_type & IPV6_ADDR_MULTICAST) - rt->dst.input = ip6_mc_input; - else if (cfg->fc_flags & RTF_LOCAL) - rt->dst.input = ip6_input; - else - rt->dst.input = ip6_forward; - - rt->dst.output = ip6_output; - if (cfg->fc_encap) { struct lwtunnel_state *lwtstate; @@ -2001,28 +3031,23 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg, &lwtstate, extack); if (err) goto out; - rt->dst.lwtstate = lwtstate_get(lwtstate); - if (lwtunnel_output_redirect(rt->dst.lwtstate)) { - rt->dst.lwtstate->orig_output = rt->dst.output; - rt->dst.output = lwtunnel_output; - } - if (lwtunnel_input_redirect(rt->dst.lwtstate)) { - rt->dst.lwtstate->orig_input = rt->dst.input; - rt->dst.input = lwtunnel_input; - } + rt->fib6_nh.nh_lwtstate = lwtstate_get(lwtstate); } - ipv6_addr_prefix(&rt->rt6i_dst.addr, &cfg->fc_dst, cfg->fc_dst_len); - rt->rt6i_dst.plen = cfg->fc_dst_len; - if (rt->rt6i_dst.plen == 128) - rt->dst.flags |= DST_HOST; + ipv6_addr_prefix(&rt->fib6_dst.addr, &cfg->fc_dst, cfg->fc_dst_len); + rt->fib6_dst.plen = cfg->fc_dst_len; + if (rt->fib6_dst.plen == 128) + rt->dst_host = true; #ifdef CONFIG_IPV6_SUBTREES - ipv6_addr_prefix(&rt->rt6i_src.addr, &cfg->fc_src, cfg->fc_src_len); - rt->rt6i_src.plen = cfg->fc_src_len; + ipv6_addr_prefix(&rt->fib6_src.addr, &cfg->fc_src, cfg->fc_src_len); + rt->fib6_src.plen = cfg->fc_src_len; #endif - rt->rt6i_metric = cfg->fc_metric; + rt->fib6_metric = cfg->fc_metric; + rt->fib6_nh.nh_weight = 1; + + rt->fib6_type = cfg->fc_type; /* We cannot add true routes via loopback here, they would result in kernel looping; promote them to reject routes @@ -2045,117 +3070,16 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg, goto out; } } - rt->rt6i_flags = RTF_REJECT|RTF_NONEXTHOP; - switch (cfg->fc_type) { - case RTN_BLACKHOLE: - rt->dst.error = -EINVAL; - rt->dst.output = dst_discard_out; - rt->dst.input = dst_discard; - break; - case RTN_PROHIBIT: - rt->dst.error = -EACCES; - rt->dst.output = ip6_pkt_prohibit_out; - rt->dst.input = ip6_pkt_prohibit; - break; - case RTN_THROW: - case RTN_UNREACHABLE: - default: - rt->dst.error = (cfg->fc_type == RTN_THROW) ? -EAGAIN - : (cfg->fc_type == RTN_UNREACHABLE) - ? -EHOSTUNREACH : -ENETUNREACH; - rt->dst.output = ip6_pkt_discard_out; - rt->dst.input = ip6_pkt_discard; - break; - } + rt->fib6_flags = RTF_REJECT|RTF_NONEXTHOP; goto install_route; } if (cfg->fc_flags & RTF_GATEWAY) { - const struct in6_addr *gw_addr; - int gwa_type; - - gw_addr = &cfg->fc_gateway; - gwa_type = ipv6_addr_type(gw_addr); - - /* if gw_addr is local we will fail to detect this in case - * address is still TENTATIVE (DAD in progress). rt6_lookup() - * will return already-added prefix route via interface that - * prefix route was assigned to, which might be non-loopback. - */ - err = -EINVAL; - if (ipv6_chk_addr_and_flags(net, gw_addr, - gwa_type & IPV6_ADDR_LINKLOCAL ? - dev : NULL, 0, 0)) { - NL_SET_ERR_MSG(extack, "Invalid gateway address"); + err = ip6_validate_gw(net, cfg, &dev, &idev, extack); + if (err) goto out; - } - rt->rt6i_gateway = *gw_addr; - if (gwa_type != (IPV6_ADDR_LINKLOCAL|IPV6_ADDR_UNICAST)) { - struct rt6_info *grt = NULL; - - /* IPv6 strictly inhibits using not link-local - addresses as nexthop address. - Otherwise, router will not able to send redirects. - It is very good, but in some (rare!) circumstances - (SIT, PtP, NBMA NOARP links) it is handy to allow - some exceptions. --ANK - We allow IPv4-mapped nexthops to support RFC4798-type - addressing - */ - if (!(gwa_type & (IPV6_ADDR_UNICAST | - IPV6_ADDR_MAPPED))) { - NL_SET_ERR_MSG(extack, - "Invalid gateway address"); - goto out; - } - - if (cfg->fc_table) { - grt = ip6_nh_lookup_table(net, cfg, gw_addr); - - if (grt) { - if (grt->rt6i_flags & RTF_GATEWAY || - (dev && dev != grt->dst.dev)) { - ip6_rt_put(grt); - grt = NULL; - } - } - } - - if (!grt) - grt = rt6_lookup(net, gw_addr, NULL, - cfg->fc_ifindex, 1); - - err = -EHOSTUNREACH; - if (!grt) - goto out; - if (dev) { - if (dev != grt->dst.dev) { - ip6_rt_put(grt); - goto out; - } - } else { - dev = grt->dst.dev; - idev = grt->rt6i_idev; - dev_hold(dev); - in6_dev_hold(grt->rt6i_idev); - } - if (!(grt->rt6i_flags & RTF_GATEWAY)) - err = 0; - ip6_rt_put(grt); - - if (err) - goto out; - } - err = -EINVAL; - if (!dev) { - NL_SET_ERR_MSG(extack, "Egress device not specified"); - goto out; - } else if (dev->flags & IFF_LOOPBACK) { - NL_SET_ERR_MSG(extack, - "Egress device can not be loopback device for this route"); - goto out; - } + rt->fib6_nh.nh_gw = cfg->fc_gateway; } err = -ENODEV; @@ -2168,92 +3092,82 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg, err = -EINVAL; goto out; } - rt->rt6i_prefsrc.addr = cfg->fc_prefsrc; - rt->rt6i_prefsrc.plen = 128; + rt->fib6_prefsrc.addr = cfg->fc_prefsrc; + rt->fib6_prefsrc.plen = 128; } else - rt->rt6i_prefsrc.plen = 0; + rt->fib6_prefsrc.plen = 0; - rt->rt6i_flags = cfg->fc_flags; + rt->fib6_flags = cfg->fc_flags; install_route: - rt->dst.dev = dev; - rt->rt6i_idev = idev; - rt->rt6i_table = table; + if (!(rt->fib6_flags & (RTF_LOCAL | RTF_ANYCAST)) && + !netif_carrier_ok(dev)) + rt->fib6_nh.nh_flags |= RTNH_F_LINKDOWN; + rt->fib6_nh.nh_flags |= (cfg->fc_flags & RTNH_F_ONLINK); + rt->fib6_nh.nh_dev = dev; + rt->fib6_table = table; cfg->fc_nlinfo.nl_net = dev_net(dev); + if (idev) + in6_dev_put(idev); + return rt; out: if (dev) dev_put(dev); if (idev) in6_dev_put(idev); - if (rt) - dst_release_immediate(&rt->dst); + fib6_info_release(rt); return ERR_PTR(err); } -int ip6_route_add(struct fib6_config *cfg, +int ip6_route_add(struct fib6_config *cfg, gfp_t gfp_flags, struct netlink_ext_ack *extack) { - struct mx6_config mxc = { .mx = NULL, }; - struct rt6_info *rt; + struct fib6_info *rt; int err; - rt = ip6_route_info_create(cfg, extack); - if (IS_ERR(rt)) { - err = PTR_ERR(rt); - rt = NULL; - goto out; - } + rt = ip6_route_info_create(cfg, gfp_flags, extack); + if (IS_ERR(rt)) + return PTR_ERR(rt); - err = ip6_convert_metrics(&mxc, cfg); - if (err) - goto out; - - err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, &mxc, extack); - - kfree(mxc.mx); - - return err; -out: - if (rt) - dst_release_immediate(&rt->dst); + err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, extack); + fib6_info_release(rt); return err; } -static int __ip6_del_rt(struct rt6_info *rt, struct nl_info *info) +static int __ip6_del_rt(struct fib6_info *rt, struct nl_info *info) { - int err; + struct net *net = info->nl_net; struct fib6_table *table; - struct net *net = dev_net(rt->dst.dev); + int err; - if (rt == net->ipv6.ip6_null_entry) { + if (rt == net->ipv6.fib6_null_entry) { err = -ENOENT; goto out; } - table = rt->rt6i_table; - write_lock_bh(&table->tb6_lock); + table = rt->fib6_table; + spin_lock_bh(&table->tb6_lock); err = fib6_del(rt, info); - write_unlock_bh(&table->tb6_lock); + spin_unlock_bh(&table->tb6_lock); out: - ip6_rt_put(rt); + fib6_info_release(rt); return err; } -int ip6_del_rt(struct rt6_info *rt) +int ip6_del_rt(struct net *net, struct fib6_info *rt) { - struct nl_info info = { - .nl_net = dev_net(rt->dst.dev), - }; + struct nl_info info = { .nl_net = net }; + return __ip6_del_rt(rt, &info); } -static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg) +static int __ip6_del_rt_siblings(struct fib6_info *rt, struct fib6_config *cfg) { struct nl_info *info = &cfg->fc_nlinfo; struct net *net = info->nl_net; @@ -2261,20 +3175,20 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg) struct fib6_table *table; int err = -ENOENT; - if (rt == net->ipv6.ip6_null_entry) + if (rt == net->ipv6.fib6_null_entry) goto out_put; - table = rt->rt6i_table; - write_lock_bh(&table->tb6_lock); + table = rt->fib6_table; + spin_lock_bh(&table->tb6_lock); - if (rt->rt6i_nsiblings && cfg->fc_delete_all_nh) { - struct rt6_info *sibling, *next_sibling; + if (rt->fib6_nsiblings && cfg->fc_delete_all_nh) { + struct fib6_info *sibling, *next_sibling; /* prefer to send a single notification with all hops */ skb = nlmsg_new(rt6_nlmsg_size(rt), gfp_any()); if (skb) { u32 seq = info->nlh ? info->nlh->nlmsg_seq : 0; - if (rt6_fill_node(net, skb, rt, + if (rt6_fill_node(net, skb, rt, NULL, NULL, NULL, 0, RTM_DELROUTE, info->portid, seq, 0) < 0) { kfree_skb(skb); @@ -2284,8 +3198,8 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg) } list_for_each_entry_safe(sibling, next_sibling, - &rt->rt6i_siblings, - rt6i_siblings) { + &rt->fib6_siblings, + fib6_siblings) { err = fib6_del(sibling, info); if (err) goto out_unlock; @@ -2294,9 +3208,9 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg) err = fib6_del(rt, info); out_unlock: - write_unlock_bh(&table->tb6_lock); + spin_unlock_bh(&table->tb6_lock); out_put: - ip6_rt_put(rt); + fib6_info_release(rt); if (skb) { rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE, @@ -2305,12 +3219,29 @@ out_put: return err; } +static int ip6_del_cached_rt(struct rt6_info *rt, struct fib6_config *cfg) +{ + int rc = -ESRCH; + + if (cfg->fc_ifindex && rt->dst.dev->ifindex != cfg->fc_ifindex) + goto out; + + if (cfg->fc_flags & RTF_GATEWAY && + !ipv6_addr_equal(&cfg->fc_gateway, &rt->rt6i_gateway)) + goto out; + if (dst_hold_safe(&rt->dst)) + rc = rt6_remove_exception_rt(rt); +out: + return rc; +} + static int ip6_route_del(struct fib6_config *cfg, struct netlink_ext_ack *extack) { + struct rt6_info *rt_cache; struct fib6_table *table; + struct fib6_info *rt; struct fib6_node *fn; - struct rt6_info *rt; int err = -ESRCH; table = fib6_get_table(cfg->fc_nlinfo.nl_net, cfg->fc_table); @@ -2319,30 +3250,41 @@ static int ip6_route_del(struct fib6_config *cfg, return err; } - read_lock_bh(&table->tb6_lock); + rcu_read_lock(); fn = fib6_locate(&table->tb6_root, &cfg->fc_dst, cfg->fc_dst_len, - &cfg->fc_src, cfg->fc_src_len); + &cfg->fc_src, cfg->fc_src_len, + !(cfg->fc_flags & RTF_CACHE)); if (fn) { - for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) { - if ((rt->rt6i_flags & RTF_CACHE) && - !(cfg->fc_flags & RTF_CACHE)) + for_each_fib6_node_rt_rcu(fn) { + if (cfg->fc_flags & RTF_CACHE) { + int rc; + + rt_cache = rt6_find_cached_rt(rt, &cfg->fc_dst, + &cfg->fc_src); + if (rt_cache) { + rc = ip6_del_cached_rt(rt_cache, cfg); + if (rc != -ESRCH) + return rc; + } continue; + } if (cfg->fc_ifindex && - (!rt->dst.dev || - rt->dst.dev->ifindex != cfg->fc_ifindex)) + (!rt->fib6_nh.nh_dev || + rt->fib6_nh.nh_dev->ifindex != cfg->fc_ifindex)) continue; if (cfg->fc_flags & RTF_GATEWAY && - !ipv6_addr_equal(&cfg->fc_gateway, &rt->rt6i_gateway)) + !ipv6_addr_equal(&cfg->fc_gateway, &rt->fib6_nh.nh_gw)) continue; - if (cfg->fc_metric && cfg->fc_metric != rt->rt6i_metric) + if (cfg->fc_metric && cfg->fc_metric != rt->fib6_metric) continue; - if (cfg->fc_protocol && cfg->fc_protocol != rt->rt6i_protocol) + if (cfg->fc_protocol && cfg->fc_protocol != rt->fib6_protocol) continue; - dst_hold(&rt->dst); - read_unlock_bh(&table->tb6_lock); + if (!fib6_info_hold_safe(rt)) + continue; + rcu_read_unlock(); /* if gateway was specified only delete the one hop */ if (cfg->fc_flags & RTF_GATEWAY) @@ -2351,7 +3293,7 @@ static int ip6_route_del(struct fib6_config *cfg, return __ip6_del_rt_siblings(rt, cfg); } } - read_unlock_bh(&table->tb6_lock); + rcu_read_unlock(); return err; } @@ -2363,6 +3305,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu struct ndisc_options ndopts; struct inet6_dev *in6_dev; struct neighbour *neigh; + struct fib6_info *from; struct rd_msg *msg; int optlen, on_link; u8 *lladdr; @@ -2444,7 +3387,15 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu NEIGH_UPDATE_F_ISROUTER)), NDISC_REDIRECT, &ndopts); - nrt = ip6_rt_cache_alloc(rt, &msg->dest, NULL); + rcu_read_lock(); + from = rcu_dereference(rt->from); + /* This fib6_info_hold() is safe here because we hold reference to rt + * and rt already holds reference to fib6_info. + */ + fib6_info_hold(from); + rcu_read_unlock(); + + nrt = ip6_rt_cache_alloc(from, &msg->dest, NULL); if (!nrt) goto out; @@ -2452,11 +3403,16 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu if (on_link) nrt->rt6i_flags &= ~RTF_GATEWAY; - nrt->rt6i_protocol = RTPROT_REDIRECT; nrt->rt6i_gateway = *(struct in6_addr *)neigh->primary_key; - if (ip6_ins_rt(nrt)) - goto out_release; + /* No need to remove rt from the exception table if rt is + * a cached route because rt6_insert_exception() will + * takes care of it + */ + if (rt6_insert_exception(nrt, from)) { + dst_release_immediate(&nrt->dst); + goto out; + } netevent.old = &rt->dst; netevent.new = &nrt->dst; @@ -2464,93 +3420,48 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu netevent.neigh = neigh; call_netevent_notifiers(NETEVENT_REDIRECT, &netevent); - if (rt->rt6i_flags & RTF_CACHE) { - rt = (struct rt6_info *) dst_clone(&rt->dst); - ip6_del_rt(rt); - } - -out_release: - /* Release the reference taken in - * ip6_rt_cache_alloc() - */ - dst_release(&nrt->dst); - out: + fib6_info_release(from); neigh_release(neigh); } -/* - * Misc support functions - */ - -static void rt6_set_from(struct rt6_info *rt, struct rt6_info *from) -{ - BUG_ON(from->dst.from); - - rt->rt6i_flags &= ~RTF_EXPIRES; - dst_hold(&from->dst); - rt->dst.from = &from->dst; - dst_init_metrics(&rt->dst, dst_metrics_ptr(&from->dst), true); -} - -static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort) -{ - rt->dst.input = ort->dst.input; - rt->dst.output = ort->dst.output; - rt->rt6i_dst = ort->rt6i_dst; - rt->dst.error = ort->dst.error; - rt->rt6i_idev = ort->rt6i_idev; - if (rt->rt6i_idev) - in6_dev_hold(rt->rt6i_idev); - rt->dst.lastuse = jiffies; - rt->rt6i_gateway = ort->rt6i_gateway; - rt->rt6i_flags = ort->rt6i_flags; - rt6_set_from(rt, ort); - rt->rt6i_metric = ort->rt6i_metric; -#ifdef CONFIG_IPV6_SUBTREES - rt->rt6i_src = ort->rt6i_src; -#endif - rt->rt6i_prefsrc = ort->rt6i_prefsrc; - rt->rt6i_table = ort->rt6i_table; - rt->dst.lwtstate = lwtstate_get(ort->dst.lwtstate); -} - #ifdef CONFIG_IPV6_ROUTE_INFO -static struct rt6_info *rt6_get_route_info(struct net *net, +static struct fib6_info *rt6_get_route_info(struct net *net, const struct in6_addr *prefix, int prefixlen, const struct in6_addr *gwaddr, struct net_device *dev) { u32 tb_id = l3mdev_fib_table(dev) ? : addrconf_rt_table(dev, RT6_TABLE_INFO); struct fib6_node *fn; - struct rt6_info *rt = NULL; + struct fib6_info *rt = NULL; struct fib6_table *table; table = fib6_get_table(net, tb_id); if (!table) return NULL; - read_lock_bh(&table->tb6_lock); - fn = fib6_locate(&table->tb6_root, prefix, prefixlen, NULL, 0); + rcu_read_lock(); + fn = fib6_locate(&table->tb6_root, prefix, prefixlen, NULL, 0, true); if (!fn) goto out; - for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) { - if (rt->dst.dev->ifindex != dev->ifindex) + for_each_fib6_node_rt_rcu(fn) { + if (rt->fib6_nh.nh_dev->ifindex != dev->ifindex) continue; - if ((rt->rt6i_flags & (RTF_ROUTEINFO|RTF_GATEWAY)) != (RTF_ROUTEINFO|RTF_GATEWAY)) + if ((rt->fib6_flags & (RTF_ROUTEINFO|RTF_GATEWAY)) != (RTF_ROUTEINFO|RTF_GATEWAY)) continue; - if (!ipv6_addr_equal(&rt->rt6i_gateway, gwaddr)) + if (!ipv6_addr_equal(&rt->fib6_nh.nh_gw, gwaddr)) + continue; + if (!fib6_info_hold_safe(rt)) continue; - dst_hold(&rt->dst); break; } out: - read_unlock_bh(&table->tb6_lock); + rcu_read_unlock(); return rt; } -static struct rt6_info *rt6_add_route_info(struct net *net, +static struct fib6_info *rt6_add_route_info(struct net *net, const struct in6_addr *prefix, int prefixlen, const struct in6_addr *gwaddr, struct net_device *dev, @@ -2563,6 +3474,7 @@ static struct rt6_info *rt6_add_route_info(struct net *net, .fc_flags = RTF_GATEWAY | RTF_ADDRCONF | RTF_ROUTEINFO | RTF_UP | RTF_PREF(pref), .fc_protocol = RTPROT_RA, + .fc_type = RTN_UNICAST, .fc_nlinfo.portid = 0, .fc_nlinfo.nlh = NULL, .fc_nlinfo.nl_net = net, @@ -2576,36 +3488,39 @@ static struct rt6_info *rt6_add_route_info(struct net *net, if (!prefixlen) cfg.fc_flags |= RTF_DEFAULT; - ip6_route_add(&cfg, NULL); + ip6_route_add(&cfg, GFP_ATOMIC, NULL); return rt6_get_route_info(net, prefix, prefixlen, gwaddr, dev); } #endif -struct rt6_info *rt6_get_dflt_router(const struct in6_addr *addr, struct net_device *dev) +struct fib6_info *rt6_get_dflt_router(struct net *net, + const struct in6_addr *addr, + struct net_device *dev) { u32 tb_id = l3mdev_fib_table(dev) ? : addrconf_rt_table(dev, RT6_TABLE_MAIN); - struct rt6_info *rt; + struct fib6_info *rt; struct fib6_table *table; - table = fib6_get_table(dev_net(dev), tb_id); + table = fib6_get_table(net, tb_id); if (!table) return NULL; - read_lock_bh(&table->tb6_lock); - for (rt = table->tb6_root.leaf; rt; rt = rt->dst.rt6_next) { - if (dev == rt->dst.dev && - ((rt->rt6i_flags & (RTF_ADDRCONF | RTF_DEFAULT)) == (RTF_ADDRCONF | RTF_DEFAULT)) && - ipv6_addr_equal(&rt->rt6i_gateway, addr)) + rcu_read_lock(); + for_each_fib6_node_rt_rcu(&table->tb6_root) { + if (dev == rt->fib6_nh.nh_dev && + ((rt->fib6_flags & (RTF_ADDRCONF | RTF_DEFAULT)) == (RTF_ADDRCONF | RTF_DEFAULT)) && + ipv6_addr_equal(&rt->fib6_nh.nh_gw, addr)) break; } - if (rt) - dst_hold(&rt->dst); - read_unlock_bh(&table->tb6_lock); + if (rt && !fib6_info_hold_safe(rt)) + rt = NULL; + rcu_read_unlock(); return rt; } -struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr, +struct fib6_info *rt6_add_dflt_router(struct net *net, + const struct in6_addr *gwaddr, struct net_device *dev, unsigned int pref) { @@ -2616,14 +3531,15 @@ struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr, .fc_flags = RTF_GATEWAY | RTF_ADDRCONF | RTF_DEFAULT | RTF_UP | RTF_EXPIRES | RTF_PREF(pref), .fc_protocol = RTPROT_RA, + .fc_type = RTN_UNICAST, .fc_nlinfo.portid = 0, .fc_nlinfo.nlh = NULL, - .fc_nlinfo.nl_net = dev_net(dev), + .fc_nlinfo.nl_net = net, }; cfg.fc_gateway = *gwaddr; - if (!ip6_route_add(&cfg, NULL)) { + if (!ip6_route_add(&cfg, GFP_ATOMIC, NULL)) { struct fib6_table *table; table = fib6_get_table(dev_net(dev), cfg.fc_table); @@ -2631,12 +3547,15 @@ struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr, table->flags |= RT6_TABLE_HAS_DFLT_ROUTER; } - return rt6_get_dflt_router(gwaddr, dev); + return rt6_get_dflt_router(net, gwaddr, dev); } -int rt6_addrconf_purge(struct rt6_info *rt, void *arg) { - if (rt->rt6i_flags & (RTF_DEFAULT | RTF_ADDRCONF) && - (!rt->rt6i_idev || rt->rt6i_idev->cnf.accept_ra != 2)) +int rt6_addrconf_purge(struct fib6_info *rt, void *arg) { + struct net_device *dev = fib6_info_nh_dev(rt); + struct inet6_dev *idev = dev ? __in6_dev_get(dev) : NULL; + + if (rt->fib6_flags & (RTF_DEFAULT | RTF_ADDRCONF) && + (!idev || idev->cnf.accept_ra != 2)) return -1; return 0; } @@ -2660,6 +3579,7 @@ static void rtmsg_to_fib6_config(struct net *net, cfg->fc_dst_len = rtmsg->rtmsg_dst_len; cfg->fc_src_len = rtmsg->rtmsg_src_len; cfg->fc_flags = rtmsg->rtmsg_flags; + cfg->fc_type = rtmsg->rtmsg_type; cfg->fc_nlinfo.nl_net = net; @@ -2689,7 +3609,7 @@ int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg) rtnl_lock(); switch (cmd) { case SIOCADDRT: - err = ip6_route_add(&cfg, NULL); + err = ip6_route_add(&cfg, GFP_KERNEL, NULL); break; case SIOCDELRT: err = ip6_route_del(&cfg, NULL); @@ -2717,7 +3637,8 @@ static int ip6_pkt_drop(struct sk_buff *skb, u8 code, int ipstats_mib_noroutes) case IPSTATS_MIB_INNOROUTES: type = ipv6_addr_type(&ipv6_hdr(skb)->daddr); if (type == IPV6_ADDR_ANY) { - IP6_INC_STATS(dev_net(dst->dev), ip6_dst_idev(dst), + IP6_INC_STATS(dev_net(dst->dev), + __in6_dev_get_safely(skb->dev), IPSTATS_MIB_INADDRERRORS); break; } @@ -2758,40 +3679,40 @@ static int ip6_pkt_prohibit_out(struct net *net, struct sock *sk, struct sk_buff * Allocate a dst for local (unicast / anycast) address. */ -struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev, - const struct in6_addr *addr, - bool anycast) +struct fib6_info *addrconf_f6i_alloc(struct net *net, + struct inet6_dev *idev, + const struct in6_addr *addr, + bool anycast, gfp_t gfp_flags) { u32 tb_id; - struct net *net = dev_net(idev->dev); struct net_device *dev = idev->dev; - struct rt6_info *rt; + struct fib6_info *f6i; - rt = ip6_dst_alloc(net, dev, DST_NOCOUNT); - if (!rt) + f6i = fib6_info_alloc(gfp_flags); + if (!f6i) return ERR_PTR(-ENOMEM); - in6_dev_hold(idev); + f6i->dst_nocount = true; + f6i->dst_host = true; + f6i->fib6_protocol = RTPROT_KERNEL; + f6i->fib6_flags = RTF_UP | RTF_NONEXTHOP; + if (anycast) { + f6i->fib6_type = RTN_ANYCAST; + f6i->fib6_flags |= RTF_ANYCAST; + } else { + f6i->fib6_type = RTN_LOCAL; + f6i->fib6_flags |= RTF_LOCAL; + } - rt->dst.flags |= DST_HOST; - rt->dst.input = ip6_input; - rt->dst.output = ip6_output; - rt->rt6i_idev = idev; - - rt->rt6i_protocol = RTPROT_KERNEL; - rt->rt6i_flags = RTF_UP | RTF_NONEXTHOP; - if (anycast) - rt->rt6i_flags |= RTF_ANYCAST; - else - rt->rt6i_flags |= RTF_LOCAL; - - rt->rt6i_gateway = *addr; - rt->rt6i_dst.addr = *addr; - rt->rt6i_dst.plen = 128; + f6i->fib6_nh.nh_gw = *addr; + dev_hold(dev); + f6i->fib6_nh.nh_dev = dev; + f6i->fib6_dst.addr = *addr; + f6i->fib6_dst.plen = 128; tb_id = l3mdev_fib_table(idev->dev) ? : RT6_TABLE_LOCAL; - rt->rt6i_table = fib6_get_table(net, tb_id); + f6i->fib6_table = fib6_get_table(net, tb_id); - return rt; + return f6i; } /* remove deleted ip from prefsrc entries */ @@ -2801,17 +3722,21 @@ struct arg_dev_net_ip { struct in6_addr *addr; }; -static int fib6_remove_prefsrc(struct rt6_info *rt, void *arg) +static int fib6_remove_prefsrc(struct fib6_info *rt, void *arg) { struct net_device *dev = ((struct arg_dev_net_ip *)arg)->dev; struct net *net = ((struct arg_dev_net_ip *)arg)->net; struct in6_addr *addr = ((struct arg_dev_net_ip *)arg)->addr; - if (((void *)rt->dst.dev == dev || !dev) && - rt != net->ipv6.ip6_null_entry && - ipv6_addr_equal(addr, &rt->rt6i_prefsrc.addr)) { + if (((void *)rt->fib6_nh.nh_dev == dev || !dev) && + rt != net->ipv6.fib6_null_entry && + ipv6_addr_equal(addr, &rt->fib6_prefsrc.addr)) { + spin_lock_bh(&rt6_exception_lock); /* remove prefsrc entry */ - rt->rt6i_prefsrc.plen = 0; + rt->fib6_prefsrc.plen = 0; + /* need to update cache as well */ + rt6_exceptions_remove_prefsrc(rt); + spin_unlock_bh(&rt6_exception_lock); } return 0; } @@ -2828,18 +3753,23 @@ void rt6_remove_prefsrc(struct inet6_ifaddr *ifp) } #define RTF_RA_ROUTER (RTF_ADDRCONF | RTF_DEFAULT | RTF_GATEWAY) -#define RTF_CACHE_GATEWAY (RTF_GATEWAY | RTF_CACHE) /* Remove routers and update dst entries when gateway turn into host. */ -static int fib6_clean_tohost(struct rt6_info *rt, void *arg) +static int fib6_clean_tohost(struct fib6_info *rt, void *arg) { struct in6_addr *gateway = (struct in6_addr *)arg; - if ((((rt->rt6i_flags & RTF_RA_ROUTER) == RTF_RA_ROUTER) || - ((rt->rt6i_flags & RTF_CACHE_GATEWAY) == RTF_CACHE_GATEWAY)) && - ipv6_addr_equal(gateway, &rt->rt6i_gateway)) { + if (((rt->fib6_flags & RTF_RA_ROUTER) == RTF_RA_ROUTER) && + ipv6_addr_equal(gateway, &rt->fib6_nh.nh_gw)) { return -1; } + + /* Further clean up cached routes in exception table. + * This is needed because cached route may have a different + * gateway than its 'parent' in the case of an ip redirect. + */ + rt6_exceptions_clean_tohost(rt, gateway); + return 0; } @@ -2848,37 +3778,250 @@ void rt6_clean_tohost(struct net *net, struct in6_addr *gateway) fib6_clean_all(net, fib6_clean_tohost, gateway); } -struct arg_dev_net { - struct net_device *dev; - struct net *net; +struct arg_netdev_event { + const struct net_device *dev; + union { + unsigned int nh_flags; + unsigned long event; + }; }; -/* called with write lock held for table with rt */ -static int fib6_ifdown(struct rt6_info *rt, void *arg) +static struct fib6_info *rt6_multipath_first_sibling(const struct fib6_info *rt) { - const struct arg_dev_net *adn = arg; - const struct net_device *dev = adn->dev; + struct fib6_info *iter; + struct fib6_node *fn; - if ((rt->dst.dev == dev || !dev) && - rt != adn->net->ipv6.ip6_null_entry && - (rt->rt6i_nsiblings == 0 || - (dev && netdev_unregistering(dev)) || - !rt->rt6i_idev->cnf.ignore_routes_with_linkdown)) - return -1; + fn = rcu_dereference_protected(rt->fib6_node, + lockdep_is_held(&rt->fib6_table->tb6_lock)); + iter = rcu_dereference_protected(fn->leaf, + lockdep_is_held(&rt->fib6_table->tb6_lock)); + while (iter) { + if (iter->fib6_metric == rt->fib6_metric && + rt6_qualify_for_ecmp(iter)) + return iter; + iter = rcu_dereference_protected(iter->fib6_next, + lockdep_is_held(&rt->fib6_table->tb6_lock)); + } + + return NULL; +} + +static bool rt6_is_dead(const struct fib6_info *rt) +{ + if (rt->fib6_nh.nh_flags & RTNH_F_DEAD || + (rt->fib6_nh.nh_flags & RTNH_F_LINKDOWN && + fib6_ignore_linkdown(rt))) + return true; + + return false; +} + +static int rt6_multipath_total_weight(const struct fib6_info *rt) +{ + struct fib6_info *iter; + int total = 0; + + if (!rt6_is_dead(rt)) + total += rt->fib6_nh.nh_weight; + + list_for_each_entry(iter, &rt->fib6_siblings, fib6_siblings) { + if (!rt6_is_dead(iter)) + total += iter->fib6_nh.nh_weight; + } + + return total; +} + +static void rt6_upper_bound_set(struct fib6_info *rt, int *weight, int total) +{ + int upper_bound = -1; + + if (!rt6_is_dead(rt)) { + *weight += rt->fib6_nh.nh_weight; + upper_bound = DIV_ROUND_CLOSEST_ULL((u64) (*weight) << 31, + total) - 1; + } + atomic_set(&rt->fib6_nh.nh_upper_bound, upper_bound); +} + +static void rt6_multipath_upper_bound_set(struct fib6_info *rt, int total) +{ + struct fib6_info *iter; + int weight = 0; + + rt6_upper_bound_set(rt, &weight, total); + + list_for_each_entry(iter, &rt->fib6_siblings, fib6_siblings) + rt6_upper_bound_set(iter, &weight, total); +} + +void rt6_multipath_rebalance(struct fib6_info *rt) +{ + struct fib6_info *first; + int total; + + /* In case the entire multipath route was marked for flushing, + * then there is no need to rebalance upon the removal of every + * sibling route. + */ + if (!rt->fib6_nsiblings || rt->should_flush) + return; + + /* During lookup routes are evaluated in order, so we need to + * make sure upper bounds are assigned from the first sibling + * onwards. + */ + first = rt6_multipath_first_sibling(rt); + if (WARN_ON_ONCE(!first)) + return; + + total = rt6_multipath_total_weight(first); + rt6_multipath_upper_bound_set(first, total); +} + +static int fib6_ifup(struct fib6_info *rt, void *p_arg) +{ + const struct arg_netdev_event *arg = p_arg; + struct net *net = dev_net(arg->dev); + + if (rt != net->ipv6.fib6_null_entry && rt->fib6_nh.nh_dev == arg->dev) { + rt->fib6_nh.nh_flags &= ~arg->nh_flags; + fib6_update_sernum_upto_root(net, rt); + rt6_multipath_rebalance(rt); + } return 0; } -void rt6_ifdown(struct net *net, struct net_device *dev) +void rt6_sync_up(struct net_device *dev, unsigned int nh_flags) { - struct arg_dev_net adn = { + struct arg_netdev_event arg = { .dev = dev, - .net = net, + { + .nh_flags = nh_flags, + }, }; - fib6_clean_all(net, fib6_ifdown, &adn); - if (dev) - rt6_uncached_list_flush_dev(net, dev); + if (nh_flags & RTNH_F_DEAD && netif_carrier_ok(dev)) + arg.nh_flags |= RTNH_F_LINKDOWN; + + fib6_clean_all(dev_net(dev), fib6_ifup, &arg); +} + +static bool rt6_multipath_uses_dev(const struct fib6_info *rt, + const struct net_device *dev) +{ + struct fib6_info *iter; + + if (rt->fib6_nh.nh_dev == dev) + return true; + list_for_each_entry(iter, &rt->fib6_siblings, fib6_siblings) + if (iter->fib6_nh.nh_dev == dev) + return true; + + return false; +} + +static void rt6_multipath_flush(struct fib6_info *rt) +{ + struct fib6_info *iter; + + rt->should_flush = 1; + list_for_each_entry(iter, &rt->fib6_siblings, fib6_siblings) + iter->should_flush = 1; +} + +static unsigned int rt6_multipath_dead_count(const struct fib6_info *rt, + const struct net_device *down_dev) +{ + struct fib6_info *iter; + unsigned int dead = 0; + + if (rt->fib6_nh.nh_dev == down_dev || + rt->fib6_nh.nh_flags & RTNH_F_DEAD) + dead++; + list_for_each_entry(iter, &rt->fib6_siblings, fib6_siblings) + if (iter->fib6_nh.nh_dev == down_dev || + iter->fib6_nh.nh_flags & RTNH_F_DEAD) + dead++; + + return dead; +} + +static void rt6_multipath_nh_flags_set(struct fib6_info *rt, + const struct net_device *dev, + unsigned int nh_flags) +{ + struct fib6_info *iter; + + if (rt->fib6_nh.nh_dev == dev) + rt->fib6_nh.nh_flags |= nh_flags; + list_for_each_entry(iter, &rt->fib6_siblings, fib6_siblings) + if (iter->fib6_nh.nh_dev == dev) + iter->fib6_nh.nh_flags |= nh_flags; +} + +/* called with write lock held for table with rt */ +static int fib6_ifdown(struct fib6_info *rt, void *p_arg) +{ + const struct arg_netdev_event *arg = p_arg; + const struct net_device *dev = arg->dev; + struct net *net = dev_net(dev); + + if (rt == net->ipv6.fib6_null_entry) + return 0; + + switch (arg->event) { + case NETDEV_UNREGISTER: + return rt->fib6_nh.nh_dev == dev ? -1 : 0; + case NETDEV_DOWN: + if (rt->should_flush) + return -1; + if (!rt->fib6_nsiblings) + return rt->fib6_nh.nh_dev == dev ? -1 : 0; + if (rt6_multipath_uses_dev(rt, dev)) { + unsigned int count; + + count = rt6_multipath_dead_count(rt, dev); + if (rt->fib6_nsiblings + 1 == count) { + rt6_multipath_flush(rt); + return -1; + } + rt6_multipath_nh_flags_set(rt, dev, RTNH_F_DEAD | + RTNH_F_LINKDOWN); + fib6_update_sernum(net, rt); + rt6_multipath_rebalance(rt); + } + return -2; + case NETDEV_CHANGE: + if (rt->fib6_nh.nh_dev != dev || + rt->fib6_flags & (RTF_LOCAL | RTF_ANYCAST)) + break; + rt->fib6_nh.nh_flags |= RTNH_F_LINKDOWN; + rt6_multipath_rebalance(rt); + break; + } + + return 0; +} + +void rt6_sync_down_dev(struct net_device *dev, unsigned long event) +{ + struct arg_netdev_event arg = { + .dev = dev, + { + .event = event, + }, + }; + + fib6_clean_all(dev_net(dev), fib6_ifdown, &arg); +} + +void rt6_disable_ip(struct net_device *dev, unsigned long event) +{ + rt6_sync_down_dev(dev, event); + rt6_uncached_list_flush_dev(dev_net(dev), dev); + neigh_ifdown(&nd_tbl, dev); } struct rt6_mtu_change_arg { @@ -2886,7 +4029,7 @@ struct rt6_mtu_change_arg { unsigned int mtu; }; -static int rt6_mtu_change_route(struct rt6_info *rt, void *p_arg) +static int rt6_mtu_change_route(struct fib6_info *rt, void *p_arg) { struct rt6_mtu_change_arg *arg = (struct rt6_mtu_change_arg *) p_arg; struct inet6_dev *idev; @@ -2906,31 +4049,17 @@ static int rt6_mtu_change_route(struct rt6_info *rt, void *p_arg) Since RFC 1981 doesn't include administrative MTU increase update PMTU increase is a MUST. (i.e. jumbo frame) */ - /* - If new MTU is less than route PMTU, this new MTU will be the - lowest MTU in the path, update the route PMTU to reflect PMTU - decreases; if new MTU is greater than route PMTU, and the - old MTU is the lowest MTU in the path, update the route PMTU - to reflect the increase. In this case if the other nodes' MTU - also have the lowest MTU, TOO BIG MESSAGE will be lead to - PMTU discovery. - */ - if (rt->dst.dev == arg->dev && - dst_metric_raw(&rt->dst, RTAX_MTU) && - !dst_metric_locked(&rt->dst, RTAX_MTU)) { - if (rt->rt6i_flags & RTF_CACHE) { - /* For RTF_CACHE with rt6i_pmtu == 0 - * (i.e. a redirected route), - * the metrics of its rt->dst.from has already - * been updated. - */ - if (rt->rt6i_pmtu && rt->rt6i_pmtu > arg->mtu) - rt->rt6i_pmtu = arg->mtu; - } else if (dst_mtu(&rt->dst) >= arg->mtu || - (dst_mtu(&rt->dst) < arg->mtu && - dst_mtu(&rt->dst) == idev->cnf.mtu6)) { - dst_metric_set(&rt->dst, RTAX_MTU, arg->mtu); - } + if (rt->fib6_nh.nh_dev == arg->dev && + !fib6_metric_locked(rt, RTAX_MTU)) { + u32 mtu = rt->fib6_pmtu; + + if (mtu >= arg->mtu || + (mtu < arg->mtu && mtu == idev->cnf.mtu6)) + fib6_metric_set(rt, RTAX_MTU, arg->mtu); + + spin_lock_bh(&rt6_exception_lock); + rt6_exceptions_update_pmtu(idev, rt, arg->mtu); + spin_unlock_bh(&rt6_exception_lock); } return 0; } @@ -2999,6 +4128,8 @@ static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh, if (rtm->rtm_flags & RTM_F_CLONED) cfg->fc_flags |= RTF_CACHE; + cfg->fc_flags |= (rtm->rtm_flags & RTNH_F_ONLINK); + cfg->fc_nlinfo.portid = NETLINK_CB(skb).portid; cfg->fc_nlinfo.nlh = nlh; cfg->fc_nlinfo.nl_net = sock_net(skb->sk); @@ -3091,9 +4222,8 @@ errout: } struct rt6_nh { - struct rt6_info *rt6_info; + struct fib6_info *fib6_info; struct fib6_config r_cfg; - struct mx6_config mxc; struct list_head next; }; @@ -3108,23 +4238,25 @@ static void ip6_print_replace_route_err(struct list_head *rt6_nh_list) } } -static int ip6_route_info_append(struct list_head *rt6_nh_list, - struct rt6_info *rt, struct fib6_config *r_cfg) +static int ip6_route_info_append(struct net *net, + struct list_head *rt6_nh_list, + struct fib6_info *rt, + struct fib6_config *r_cfg) { struct rt6_nh *nh; int err = -EEXIST; list_for_each_entry(nh, rt6_nh_list, next) { - /* check if rt6_info already exists */ - if (rt6_duplicate_nexthop(nh->rt6_info, rt)) + /* check if fib6_info already exists */ + if (rt6_duplicate_nexthop(nh->fib6_info, rt)) return err; } nh = kzalloc(sizeof(*nh), GFP_KERNEL); if (!nh) return -ENOMEM; - nh->rt6_info = rt; - err = ip6_convert_metrics(&nh->mxc, r_cfg); + nh->fib6_info = rt; + err = ip6_convert_metrics(net, rt, r_cfg); if (err) { kfree(nh); return err; @@ -3135,8 +4267,8 @@ static int ip6_route_info_append(struct list_head *rt6_nh_list, return 0; } -static void ip6_route_mpath_notify(struct rt6_info *rt, - struct rt6_info *rt_last, +static void ip6_route_mpath_notify(struct fib6_info *rt, + struct fib6_info *rt_last, struct nl_info *info, __u16 nlflags) { @@ -3146,10 +4278,10 @@ static void ip6_route_mpath_notify(struct rt6_info *rt, * nexthop. Since sibling routes are always added at the end of * the list, find the first sibling of the last route appended */ - if ((nlflags & NLM_F_APPEND) && rt_last && rt_last->rt6i_nsiblings) { - rt = list_first_entry(&rt_last->rt6i_siblings, - struct rt6_info, - rt6i_siblings); + if ((nlflags & NLM_F_APPEND) && rt_last && rt_last->fib6_nsiblings) { + rt = list_first_entry(&rt_last->fib6_siblings, + struct fib6_info, + fib6_siblings); } if (rt) @@ -3172,11 +4304,11 @@ static int fib6_gw_from_attr(struct in6_addr *gw, struct nlattr *nla, static int ip6_route_multipath_add(struct fib6_config *cfg, struct netlink_ext_ack *extack) { - struct rt6_info *rt_notif = NULL, *rt_last = NULL; + struct fib6_info *rt_notif = NULL, *rt_last = NULL; struct nl_info *info = &cfg->fc_nlinfo; struct fib6_config r_cfg; struct rtnexthop *rtnh; - struct rt6_info *rt; + struct fib6_info *rt; struct rt6_nh *err_nh; struct rt6_nh *nh, *nh_safe; __u16 nlflags; @@ -3196,7 +4328,7 @@ static int ip6_route_multipath_add(struct fib6_config *cfg, rtnh = (struct rtnexthop *)cfg->fc_mp; /* Parse a Multipath Entry and build a list (rt6_nh_list) of - * rt6_info structs per nexthop + * fib6_info structs per nexthop */ while (rtnh_ok(rtnh, remaining)) { memcpy(&r_cfg, cfg, sizeof(*cfg)); @@ -3222,16 +4354,19 @@ static int ip6_route_multipath_add(struct fib6_config *cfg, r_cfg.fc_encap_type = nla_get_u16(nla); } - rt = ip6_route_info_create(&r_cfg, extack); + rt = ip6_route_info_create(&r_cfg, GFP_KERNEL, extack); if (IS_ERR(rt)) { err = PTR_ERR(rt); rt = NULL; goto cleanup; } - err = ip6_route_info_append(&rt6_nh_list, rt, &r_cfg); + rt->fib6_nh.nh_weight = rtnh->rtnh_hops + 1; + + err = ip6_route_info_append(info->nl_net, &rt6_nh_list, + rt, &r_cfg); if (err) { - dst_release_immediate(&rt->dst); + fib6_info_release(rt); goto cleanup; } @@ -3246,19 +4381,20 @@ static int ip6_route_multipath_add(struct fib6_config *cfg, err_nh = NULL; list_for_each_entry(nh, &rt6_nh_list, next) { - err = __ip6_ins_rt(nh->rt6_info, info, &nh->mxc, extack); + err = __ip6_ins_rt(nh->fib6_info, info, extack); + fib6_info_release(nh->fib6_info); if (!err) { /* save reference to last route successfully inserted */ - rt_last = nh->rt6_info; + rt_last = nh->fib6_info; /* save reference to first route for notification */ if (!rt_notif) - rt_notif = nh->rt6_info; + rt_notif = nh->fib6_info; } - /* nh->rt6_info is used or freed at this point, reset to NULL*/ - nh->rt6_info = NULL; + /* nh->fib6_info is used or freed at this point, reset to NULL*/ + nh->fib6_info = NULL; if (err) { if (replace && nhn) ip6_print_replace_route_err(&rt6_nh_list); @@ -3302,9 +4438,8 @@ add_errout: cleanup: list_for_each_entry_safe(nh, nh_safe, &rt6_nh_list, next) { - if (nh->rt6_info) - dst_release_immediate(&nh->rt6_info->dst); - kfree(nh->mxc.mx); + if (nh->fib6_info) + fib6_info_release(nh->fib6_info); list_del(&nh->next); kfree(nh); } @@ -3388,20 +4523,20 @@ static int inet6_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh, if (cfg.fc_mp) return ip6_route_multipath_add(&cfg, extack); else - return ip6_route_add(&cfg, extack); + return ip6_route_add(&cfg, GFP_KERNEL, extack); } -static size_t rt6_nlmsg_size(struct rt6_info *rt) +static size_t rt6_nlmsg_size(struct fib6_info *rt) { int nexthop_len = 0; - if (rt->rt6i_nsiblings) { + if (rt->fib6_nsiblings) { nexthop_len = nla_total_size(0) /* RTA_MULTIPATH */ + NLA_ALIGN(sizeof(struct rtnexthop)) + nla_total_size(16) /* RTA_GATEWAY */ - + lwtunnel_get_encap_size(rt->dst.lwtstate); + + lwtunnel_get_encap_size(rt->fib6_nh.nh_lwtstate); - nexthop_len *= rt->rt6i_nsiblings; + nexthop_len *= rt->fib6_nsiblings; } return NLMSG_ALIGN(sizeof(struct rtmsg)) @@ -3417,34 +4552,41 @@ static size_t rt6_nlmsg_size(struct rt6_info *rt) + nla_total_size(sizeof(struct rta_cacheinfo)) + nla_total_size(TCP_CA_NAME_MAX) /* RTAX_CC_ALGO */ + nla_total_size(1) /* RTA_PREF */ - + lwtunnel_get_encap_size(rt->dst.lwtstate) + + lwtunnel_get_encap_size(rt->fib6_nh.nh_lwtstate) + nexthop_len; } -static int rt6_nexthop_info(struct sk_buff *skb, struct rt6_info *rt, +static int rt6_nexthop_info(struct sk_buff *skb, struct fib6_info *rt, unsigned int *flags, bool skip_oif) { - if (!netif_running(rt->dst.dev) || !netif_carrier_ok(rt->dst.dev)) { + if (rt->fib6_nh.nh_flags & RTNH_F_DEAD) + *flags |= RTNH_F_DEAD; + + if (rt->fib6_nh.nh_flags & RTNH_F_LINKDOWN) { *flags |= RTNH_F_LINKDOWN; - if (rt->rt6i_idev->cnf.ignore_routes_with_linkdown) + + rcu_read_lock(); + if (fib6_ignore_linkdown(rt)) *flags |= RTNH_F_DEAD; + rcu_read_unlock(); } - if (rt->rt6i_flags & RTF_GATEWAY) { - if (nla_put_in6_addr(skb, RTA_GATEWAY, &rt->rt6i_gateway) < 0) + if (rt->fib6_flags & RTF_GATEWAY) { + if (nla_put_in6_addr(skb, RTA_GATEWAY, &rt->fib6_nh.nh_gw) < 0) goto nla_put_failure; } - if (rt->rt6i_nh_flags & RTNH_F_OFFLOAD) + *flags |= (rt->fib6_nh.nh_flags & RTNH_F_ONLINK); + if (rt->fib6_nh.nh_flags & RTNH_F_OFFLOAD) *flags |= RTNH_F_OFFLOAD; /* not needed for multipath encoding b/c it has a rtnexthop struct */ - if (!skip_oif && rt->dst.dev && - nla_put_u32(skb, RTA_OIF, rt->dst.dev->ifindex)) + if (!skip_oif && rt->fib6_nh.nh_dev && + nla_put_u32(skb, RTA_OIF, rt->fib6_nh.nh_dev->ifindex)) goto nla_put_failure; - if (rt->dst.lwtstate && - lwtunnel_fill_encap(skb, rt->dst.lwtstate) < 0) + if (rt->fib6_nh.nh_lwtstate && + lwtunnel_fill_encap(skb, rt->fib6_nh.nh_lwtstate) < 0) goto nla_put_failure; return 0; @@ -3454,8 +4596,9 @@ nla_put_failure: } /* add multipath next hop */ -static int rt6_add_nexthop(struct sk_buff *skb, struct rt6_info *rt) +static int rt6_add_nexthop(struct sk_buff *skb, struct fib6_info *rt) { + const struct net_device *dev = rt->fib6_nh.nh_dev; struct rtnexthop *rtnh; unsigned int flags = 0; @@ -3463,8 +4606,8 @@ static int rt6_add_nexthop(struct sk_buff *skb, struct rt6_info *rt) if (!rtnh) goto nla_put_failure; - rtnh->rtnh_hops = 0; - rtnh->rtnh_ifindex = rt->dst.dev ? rt->dst.dev->ifindex : 0; + rtnh->rtnh_hops = rt->fib6_nh.nh_weight - 1; + rtnh->rtnh_ifindex = dev ? dev->ifindex : 0; if (rt6_nexthop_info(skb, rt, &flags, true) < 0) goto nla_put_failure; @@ -3480,16 +4623,16 @@ nla_put_failure: return -EMSGSIZE; } -static int rt6_fill_node(struct net *net, - struct sk_buff *skb, struct rt6_info *rt, - struct in6_addr *dst, struct in6_addr *src, +static int rt6_fill_node(struct net *net, struct sk_buff *skb, + struct fib6_info *rt, struct dst_entry *dst, + struct in6_addr *dest, struct in6_addr *src, int iif, int type, u32 portid, u32 seq, unsigned int flags) { - u32 metrics[RTAX_MAX]; struct rtmsg *rtm; struct nlmsghdr *nlh; - long expires; + long expires = 0; + u32 *pmetrics; u32 table; nlh = nlmsg_put(skb, portid, seq, type, sizeof(*rtm), flags); @@ -3498,53 +4641,31 @@ static int rt6_fill_node(struct net *net, rtm = nlmsg_data(nlh); rtm->rtm_family = AF_INET6; - rtm->rtm_dst_len = rt->rt6i_dst.plen; - rtm->rtm_src_len = rt->rt6i_src.plen; + rtm->rtm_dst_len = rt->fib6_dst.plen; + rtm->rtm_src_len = rt->fib6_src.plen; rtm->rtm_tos = 0; - if (rt->rt6i_table) - table = rt->rt6i_table->tb6_id; + if (rt->fib6_table) + table = rt->fib6_table->tb6_id; else table = RT6_TABLE_UNSPEC; rtm->rtm_table = table < 256 ? table : RT_TABLE_COMPAT; if (nla_put_u32(skb, RTA_TABLE, table)) goto nla_put_failure; - if (rt->rt6i_flags & RTF_REJECT) { - switch (rt->dst.error) { - case -EINVAL: - rtm->rtm_type = RTN_BLACKHOLE; - break; - case -EACCES: - rtm->rtm_type = RTN_PROHIBIT; - break; - case -EAGAIN: - rtm->rtm_type = RTN_THROW; - break; - default: - rtm->rtm_type = RTN_UNREACHABLE; - break; - } - } - else if (rt->rt6i_flags & RTF_LOCAL) - rtm->rtm_type = RTN_LOCAL; - else if (rt->rt6i_flags & RTF_ANYCAST) - rtm->rtm_type = RTN_ANYCAST; - else if (rt->dst.dev && (rt->dst.dev->flags & IFF_LOOPBACK)) - rtm->rtm_type = RTN_LOCAL; - else - rtm->rtm_type = RTN_UNICAST; + + rtm->rtm_type = rt->fib6_type; rtm->rtm_flags = 0; rtm->rtm_scope = RT_SCOPE_UNIVERSE; - rtm->rtm_protocol = rt->rt6i_protocol; + rtm->rtm_protocol = rt->fib6_protocol; - if (rt->rt6i_flags & RTF_CACHE) + if (rt->fib6_flags & RTF_CACHE) rtm->rtm_flags |= RTM_F_CLONED; - if (dst) { - if (nla_put_in6_addr(skb, RTA_DST, dst)) + if (dest) { + if (nla_put_in6_addr(skb, RTA_DST, dest)) goto nla_put_failure; rtm->rtm_dst_len = 128; } else if (rtm->rtm_dst_len) - if (nla_put_in6_addr(skb, RTA_DST, &rt->rt6i_dst.addr)) + if (nla_put_in6_addr(skb, RTA_DST, &rt->fib6_dst.addr)) goto nla_put_failure; #ifdef CONFIG_IPV6_SUBTREES if (src) { @@ -3552,12 +4673,12 @@ static int rt6_fill_node(struct net *net, goto nla_put_failure; rtm->rtm_src_len = 128; } else if (rtm->rtm_src_len && - nla_put_in6_addr(skb, RTA_SRC, &rt->rt6i_src.addr)) + nla_put_in6_addr(skb, RTA_SRC, &rt->fib6_src.addr)) goto nla_put_failure; #endif if (iif) { #ifdef CONFIG_IPV6_MROUTE - if (ipv6_addr_is_multicast(&rt->rt6i_dst.addr)) { + if (ipv6_addr_is_multicast(&rt->fib6_dst.addr)) { int err = ip6mr_get_route(net, skb, rtm, portid); if (err == 0) @@ -3568,34 +4689,32 @@ static int rt6_fill_node(struct net *net, #endif if (nla_put_u32(skb, RTA_IIF, iif)) goto nla_put_failure; - } else if (dst) { + } else if (dest) { struct in6_addr saddr_buf; - if (ip6_route_get_saddr(net, rt, dst, 0, &saddr_buf) == 0 && + if (ip6_route_get_saddr(net, rt, dest, 0, &saddr_buf) == 0 && nla_put_in6_addr(skb, RTA_PREFSRC, &saddr_buf)) goto nla_put_failure; } - if (rt->rt6i_prefsrc.plen) { + if (rt->fib6_prefsrc.plen) { struct in6_addr saddr_buf; - saddr_buf = rt->rt6i_prefsrc.addr; + saddr_buf = rt->fib6_prefsrc.addr; if (nla_put_in6_addr(skb, RTA_PREFSRC, &saddr_buf)) goto nla_put_failure; } - memcpy(metrics, dst_metrics_ptr(&rt->dst), sizeof(metrics)); - if (rt->rt6i_pmtu) - metrics[RTAX_MTU - 1] = rt->rt6i_pmtu; - if (rtnetlink_put_metrics(skb, metrics) < 0) + pmetrics = dst ? dst_metrics_ptr(dst) : rt->fib6_metrics->metrics; + if (rtnetlink_put_metrics(skb, pmetrics) < 0) goto nla_put_failure; - if (nla_put_u32(skb, RTA_PRIORITY, rt->rt6i_metric)) + if (nla_put_u32(skb, RTA_PRIORITY, rt->fib6_metric)) goto nla_put_failure; /* For multipath routes, walk the siblings list and add * each as a nexthop within RTA_MULTIPATH. */ - if (rt->rt6i_nsiblings) { - struct rt6_info *sibling, *next_sibling; + if (rt->fib6_nsiblings) { + struct fib6_info *sibling, *next_sibling; struct nlattr *mp; mp = nla_nest_start(skb, RTA_MULTIPATH); @@ -3606,7 +4725,7 @@ static int rt6_fill_node(struct net *net, goto nla_put_failure; list_for_each_entry_safe(sibling, next_sibling, - &rt->rt6i_siblings, rt6i_siblings) { + &rt->fib6_siblings, fib6_siblings) { if (rt6_add_nexthop(skb, sibling) < 0) goto nla_put_failure; } @@ -3617,12 +4736,15 @@ static int rt6_fill_node(struct net *net, goto nla_put_failure; } - expires = (rt->rt6i_flags & RTF_EXPIRES) ? rt->dst.expires - jiffies : 0; + if (rt->fib6_flags & RTF_EXPIRES) { + expires = dst ? dst->expires : rt->expires; + expires -= jiffies; + } - if (rtnl_put_cacheinfo(skb, &rt->dst, 0, expires, rt->dst.error) < 0) + if (rtnl_put_cacheinfo(skb, dst, 0, expires, dst ? dst->error : 0) < 0) goto nla_put_failure; - if (nla_put_u8(skb, RTA_PREF, IPV6_EXTRACT_PREF(rt->rt6i_flags))) + if (nla_put_u8(skb, RTA_PREF, IPV6_EXTRACT_PREF(rt->fib6_flags))) goto nla_put_failure; @@ -3634,12 +4756,12 @@ nla_put_failure: return -EMSGSIZE; } -int rt6_dump_route(struct rt6_info *rt, void *p_arg) +int rt6_dump_route(struct fib6_info *rt, void *p_arg) { struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg; struct net *net = arg->net; - if (rt == net->ipv6.ip6_null_entry) + if (rt == net->ipv6.fib6_null_entry) return 0; if (nlmsg_len(arg->cb->nlh) >= sizeof(struct rtmsg)) { @@ -3647,16 +4769,15 @@ int rt6_dump_route(struct rt6_info *rt, void *p_arg) /* user wants prefix routes only */ if (rtm->rtm_flags & RTM_F_PREFIX && - !(rt->rt6i_flags & RTF_PREFIX_RT)) { + !(rt->fib6_flags & RTF_PREFIX_RT)) { /* success since this is not a prefix route */ return 1; } } - return rt6_fill_node(net, - arg->skb, rt, NULL, NULL, 0, RTM_NEWROUTE, - NETLINK_CB(arg->cb->skb).portid, arg->cb->nlh->nlmsg_seq, - NLM_F_MULTI); + return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0, + RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid, + arg->cb->nlh->nlmsg_seq, NLM_F_MULTI); } static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, @@ -3665,6 +4786,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, struct net *net = sock_net(in_skb->sk); struct nlattr *tb[RTA_MAX+1]; int err, iif = 0, oif = 0; + struct fib6_info *from; struct dst_entry *dst; struct rt6_info *rt; struct sk_buff *skb; @@ -3730,7 +4852,7 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, if (!ipv6_addr_any(&fl6.saddr)) flags |= RT6_LOOKUP_F_HAS_SADDR; - dst = ip6_route_input_lookup(net, dev, &fl6, flags); + dst = ip6_route_input_lookup(net, dev, &fl6, NULL, flags); rcu_read_unlock(); } else { @@ -3753,15 +4875,6 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, goto errout; } - if (fibmatch && rt->dst.from) { - struct rt6_info *ort = container_of(rt->dst.from, - struct rt6_info, dst); - - dst_hold(&ort->dst); - ip6_rt_put(rt); - rt = ort; - } - skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL); if (!skb) { ip6_rt_put(rt); @@ -3770,14 +4883,21 @@ static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, } skb_dst_set(skb, &rt->dst); + + rcu_read_lock(); + from = rcu_dereference(rt->from); + if (fibmatch) - err = rt6_fill_node(net, skb, rt, NULL, NULL, iif, + err = rt6_fill_node(net, skb, from, NULL, NULL, NULL, iif, RTM_NEWROUTE, NETLINK_CB(in_skb).portid, nlh->nlmsg_seq, 0); else - err = rt6_fill_node(net, skb, rt, &fl6.daddr, &fl6.saddr, iif, - RTM_NEWROUTE, NETLINK_CB(in_skb).portid, - nlh->nlmsg_seq, 0); + err = rt6_fill_node(net, skb, from, dst, &fl6.daddr, + &fl6.saddr, iif, RTM_NEWROUTE, + NETLINK_CB(in_skb).portid, nlh->nlmsg_seq, + 0); + rcu_read_unlock(); + if (err < 0) { kfree_skb(skb); goto errout; @@ -3788,7 +4908,7 @@ errout: return err; } -void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info, +void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info, unsigned int nlm_flags) { struct sk_buff *skb; @@ -3803,8 +4923,8 @@ void inet6_rt_notify(int event, struct rt6_info *rt, struct nl_info *info, if (!skb) goto errout; - err = rt6_fill_node(net, skb, rt, NULL, NULL, 0, - event, info->portid, seq, nlm_flags); + err = rt6_fill_node(net, skb, rt, NULL, NULL, NULL, 0, + event, info->portid, seq, nlm_flags); if (err < 0) { /* -EMSGSIZE implies BUG in rt6_nlmsg_size() */ WARN_ON(err == -EMSGSIZE); @@ -3829,6 +4949,7 @@ static int ip6_route_dev_notify(struct notifier_block *this, return NOTIFY_OK; if (event == NETDEV_REGISTER) { + net->ipv6.fib6_null_entry->fib6_nh.nh_dev = dev; net->ipv6.ip6_null_entry->dst.dev = dev; net->ipv6.ip6_null_entry->rt6i_idev = in6_dev_get(dev); #ifdef CONFIG_IPV6_MULTIPLE_TABLES @@ -3872,7 +4993,7 @@ static int rt6_stats_seq_show(struct seq_file *seq, void *v) seq_printf(seq, "%04x %04x %04x %04x %04x %04x %04x\n", net->ipv6.rt6_stats->fib_nodes, net->ipv6.rt6_stats->fib_route_nodes, - net->ipv6.rt6_stats->fib_rt_alloc, + atomic_read(&net->ipv6.rt6_stats->fib_rt_alloc), net->ipv6.rt6_stats->fib_rt_entries, net->ipv6.rt6_stats->fib_rt_cache, dst_entries_get_slow(&net->ipv6.ip6_dst_ops), @@ -4031,13 +5152,17 @@ static int __net_init ip6_route_net_init(struct net *net) if (dst_entries_init(&net->ipv6.ip6_dst_ops) < 0) goto out_ip6_dst_ops; + net->ipv6.fib6_null_entry = kmemdup(&fib6_null_entry_template, + sizeof(*net->ipv6.fib6_null_entry), + GFP_KERNEL); + if (!net->ipv6.fib6_null_entry) + goto out_ip6_dst_entries; + net->ipv6.ip6_null_entry = kmemdup(&ip6_null_entry_template, sizeof(*net->ipv6.ip6_null_entry), GFP_KERNEL); if (!net->ipv6.ip6_null_entry) - goto out_ip6_dst_entries; - net->ipv6.ip6_null_entry->dst.path = - (struct dst_entry *)net->ipv6.ip6_null_entry; + goto out_fib6_null_entry; net->ipv6.ip6_null_entry->dst.ops = &net->ipv6.ip6_dst_ops; dst_init_metrics(&net->ipv6.ip6_null_entry->dst, ip6_template_metrics, true); @@ -4049,8 +5174,6 @@ static int __net_init ip6_route_net_init(struct net *net) GFP_KERNEL); if (!net->ipv6.ip6_prohibit_entry) goto out_ip6_null_entry; - net->ipv6.ip6_prohibit_entry->dst.path = - (struct dst_entry *)net->ipv6.ip6_prohibit_entry; net->ipv6.ip6_prohibit_entry->dst.ops = &net->ipv6.ip6_dst_ops; dst_init_metrics(&net->ipv6.ip6_prohibit_entry->dst, ip6_template_metrics, true); @@ -4060,8 +5183,6 @@ static int __net_init ip6_route_net_init(struct net *net) GFP_KERNEL); if (!net->ipv6.ip6_blk_hole_entry) goto out_ip6_prohibit_entry; - net->ipv6.ip6_blk_hole_entry->dst.path = - (struct dst_entry *)net->ipv6.ip6_blk_hole_entry; net->ipv6.ip6_blk_hole_entry->dst.ops = &net->ipv6.ip6_dst_ops; dst_init_metrics(&net->ipv6.ip6_blk_hole_entry->dst, ip6_template_metrics, true); @@ -4088,6 +5209,8 @@ out_ip6_prohibit_entry: out_ip6_null_entry: kfree(net->ipv6.ip6_null_entry); #endif +out_fib6_null_entry: + kfree(net->ipv6.fib6_null_entry); out_ip6_dst_entries: dst_entries_destroy(&net->ipv6.ip6_dst_ops); out_ip6_dst_ops: @@ -4096,6 +5219,7 @@ out_ip6_dst_ops: static void __net_exit ip6_route_net_exit(struct net *net) { + kfree(net->ipv6.fib6_null_entry); kfree(net->ipv6.ip6_null_entry); #ifdef CONFIG_IPV6_MULTIPLE_TABLES kfree(net->ipv6.ip6_prohibit_entry); @@ -4166,6 +5290,7 @@ void __init ip6_route_init_special_entries(void) /* Registering of the loopback is done before this portion of code, * the loopback reference in rt6_info will not be taken, do it * manually for init_net */ + init_net.ipv6.fib6_null_entry->fib6_nh.nh_dev = init_net.loopback_dev; init_net.ipv6.ip6_null_entry->dst.dev = init_net.loopback_dev; init_net.ipv6.ip6_null_entry->rt6i_idev = in6_dev_get(init_net.loopback_dev); #ifdef CONFIG_IPV6_MULTIPLE_TABLES diff --git a/net/ipv6/seg6_local.c b/net/ipv6/seg6_local.c index 8f8ea7a76b99..4d7087aeef5d 100644 --- a/net/ipv6/seg6_local.c +++ b/net/ipv6/seg6_local.c @@ -1,8 +1,9 @@ /* * SR-IPv6 implementation * - * Author: + * Authors: * David Lebrun + * eBPF support: Mathieu Xhonneux * * * This program is free software; you can redistribute it and/or @@ -31,7 +32,9 @@ #ifdef CONFIG_IPV6_SEG6_HMAC #include #endif +#include #include +#include struct seg6_local_lwt; @@ -42,6 +45,11 @@ struct seg6_action_desc { int static_headroom; }; +struct bpf_lwt_prog { + struct bpf_prog *prog; + char *name; +}; + struct seg6_local_lwt { int action; struct ipv6_sr_hdr *srh; @@ -50,6 +58,7 @@ struct seg6_local_lwt { struct in6_addr nh6; int iif; int oif; + struct bpf_lwt_prog bpf; int headroom; struct seg6_action_desc *desc; @@ -142,8 +151,8 @@ static void advance_nextseg(struct ipv6_sr_hdr *srh, struct in6_addr *daddr) *daddr = *addr; } -static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr, - u32 tbl_id) +int seg6_lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr, + u32 tbl_id) { struct net *net = dev_net(skb->dev); struct ipv6hdr *hdr = ipv6_hdr(skb); @@ -163,7 +172,7 @@ static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr, fl6.flowi6_flags = FLOWI_FLAG_KNOWN_NH; if (!tbl_id) { - dst = ip6_route_input_lookup(net, skb->dev, &fl6, flags); + dst = ip6_route_input_lookup(net, skb->dev, &fl6, skb, flags); } else { struct fib6_table *table; @@ -171,7 +180,7 @@ static void lookup_nexthop(struct sk_buff *skb, struct in6_addr *nhaddr, if (!table) goto out; - rt = ip6_pol_route(net, table, 0, &fl6, flags); + rt = ip6_pol_route(net, table, 0, &fl6, skb, flags); dst = &rt->dst; } @@ -189,6 +198,7 @@ out: skb_dst_drop(skb); skb_dst_set(skb, dst); + return dst->error; } /* regular endpoint function */ @@ -202,7 +212,7 @@ static int input_action_end(struct sk_buff *skb, struct seg6_local_lwt *slwt) advance_nextseg(srh, &ipv6_hdr(skb)->daddr); - lookup_nexthop(skb, NULL, 0); + seg6_lookup_nexthop(skb, NULL, 0); return dst_input(skb); @@ -222,7 +232,7 @@ static int input_action_end_x(struct sk_buff *skb, struct seg6_local_lwt *slwt) advance_nextseg(srh, &ipv6_hdr(skb)->daddr); - lookup_nexthop(skb, &slwt->nh6, 0); + seg6_lookup_nexthop(skb, &slwt->nh6, 0); return dst_input(skb); @@ -241,7 +251,7 @@ static int input_action_end_t(struct sk_buff *skb, struct seg6_local_lwt *slwt) advance_nextseg(srh, &ipv6_hdr(skb)->daddr); - lookup_nexthop(skb, NULL, slwt->table); + seg6_lookup_nexthop(skb, NULL, slwt->table); return dst_input(skb); @@ -333,7 +343,7 @@ static int input_action_end_dx6(struct sk_buff *skb, if (!ipv6_addr_any(&slwt->nh6)) nhaddr = &slwt->nh6; - lookup_nexthop(skb, nhaddr, 0); + seg6_lookup_nexthop(skb, nhaddr, 0); return dst_input(skb); drop: @@ -382,7 +392,7 @@ static int input_action_end_dt6(struct sk_buff *skb, if (!pskb_may_pull(skb, sizeof(struct ipv6hdr))) goto drop; - lookup_nexthop(skb, NULL, slwt->table); + seg6_lookup_nexthop(skb, NULL, slwt->table); return dst_input(skb); @@ -407,7 +417,7 @@ static int input_action_end_b6(struct sk_buff *skb, struct seg6_local_lwt *slwt) skb_set_transport_header(skb, sizeof(struct ipv6hdr)); - lookup_nexthop(skb, NULL, 0); + seg6_lookup_nexthop(skb, NULL, 0); return dst_input(skb); @@ -438,7 +448,7 @@ static int input_action_end_b6_encap(struct sk_buff *skb, skb_set_transport_header(skb, sizeof(struct ipv6hdr)); - lookup_nexthop(skb, NULL, 0); + seg6_lookup_nexthop(skb, NULL, 0); return dst_input(skb); @@ -447,6 +457,85 @@ drop: return err; } +DEFINE_PER_CPU(struct seg6_bpf_srh_state, seg6_bpf_srh_states); + +bool seg6_bpf_has_valid_srh(struct sk_buff *skb) +{ + struct seg6_bpf_srh_state *srh_state = + this_cpu_ptr(&seg6_bpf_srh_states); + struct ipv6_sr_hdr *srh = srh_state->srh; + + if (unlikely(srh == NULL)) + return false; + + if (unlikely(!srh_state->valid)) { + if ((srh_state->hdrlen & 7) != 0) + return false; + + srh->hdrlen = (u8)(srh_state->hdrlen >> 3); + if (!seg6_validate_srh(srh, (srh->hdrlen + 1) << 3)) + return false; + + srh_state->valid = true; + } + + return true; +} + +static int input_action_end_bpf(struct sk_buff *skb, + struct seg6_local_lwt *slwt) +{ + struct seg6_bpf_srh_state *srh_state = + this_cpu_ptr(&seg6_bpf_srh_states); + struct ipv6_sr_hdr *srh; + int ret; + + srh = get_and_validate_srh(skb); + if (!srh) { + kfree_skb(skb); + return -EINVAL; + } + advance_nextseg(srh, &ipv6_hdr(skb)->daddr); + + /* preempt_disable is needed to protect the per-CPU buffer srh_state, + * which is also accessed by the bpf_lwt_seg6_* helpers + */ + preempt_disable(); + srh_state->srh = srh; + srh_state->hdrlen = srh->hdrlen << 3; + srh_state->valid = true; + + rcu_read_lock(); + bpf_compute_data_pointers(skb); + ret = bpf_prog_run_save_cb(slwt->bpf.prog, skb); + rcu_read_unlock(); + + switch (ret) { + case BPF_OK: + case BPF_REDIRECT: + break; + case BPF_DROP: + goto drop; + default: + pr_warn_once("bpf-seg6local: Illegal return value %u\n", ret); + goto drop; + } + + if (srh_state->srh && !seg6_bpf_has_valid_srh(skb)) + goto drop; + + preempt_enable(); + if (ret != BPF_REDIRECT) + seg6_lookup_nexthop(skb, NULL, 0); + + return dst_input(skb); + +drop: + preempt_enable(); + kfree_skb(skb); + return -EINVAL; +} + static struct seg6_action_desc seg6_action_table[] = { { .action = SEG6_LOCAL_ACTION_END, @@ -493,7 +582,13 @@ static struct seg6_action_desc seg6_action_table[] = { .attrs = (1 << SEG6_LOCAL_SRH), .input = input_action_end_b6_encap, .static_headroom = sizeof(struct ipv6hdr), - } + }, + { + .action = SEG6_LOCAL_ACTION_END_BPF, + .attrs = (1 << SEG6_LOCAL_BPF), + .input = input_action_end_bpf, + }, + }; static struct seg6_action_desc *__get_action_desc(int action) @@ -538,6 +633,7 @@ static const struct nla_policy seg6_local_policy[SEG6_LOCAL_MAX + 1] = { .len = sizeof(struct in6_addr) }, [SEG6_LOCAL_IIF] = { .type = NLA_U32 }, [SEG6_LOCAL_OIF] = { .type = NLA_U32 }, + [SEG6_LOCAL_BPF] = { .type = NLA_NESTED }, }; static int parse_nla_srh(struct nlattr **attrs, struct seg6_local_lwt *slwt) @@ -715,6 +811,75 @@ static int cmp_nla_oif(struct seg6_local_lwt *a, struct seg6_local_lwt *b) return 0; } +#define MAX_PROG_NAME 256 +static const struct nla_policy bpf_prog_policy[SEG6_LOCAL_BPF_PROG_MAX + 1] = { + [SEG6_LOCAL_BPF_PROG] = { .type = NLA_U32, }, + [SEG6_LOCAL_BPF_PROG_NAME] = { .type = NLA_NUL_STRING, + .len = MAX_PROG_NAME }, +}; + +static int parse_nla_bpf(struct nlattr **attrs, struct seg6_local_lwt *slwt) +{ + struct nlattr *tb[SEG6_LOCAL_BPF_PROG_MAX + 1]; + struct bpf_prog *p; + int ret; + u32 fd; + + ret = nla_parse_nested(tb, SEG6_LOCAL_BPF_PROG_MAX, + attrs[SEG6_LOCAL_BPF], bpf_prog_policy, NULL); + if (ret < 0) + return ret; + + if (!tb[SEG6_LOCAL_BPF_PROG] || !tb[SEG6_LOCAL_BPF_PROG_NAME]) + return -EINVAL; + + slwt->bpf.name = nla_memdup(tb[SEG6_LOCAL_BPF_PROG_NAME], GFP_KERNEL); + if (!slwt->bpf.name) + return -ENOMEM; + + fd = nla_get_u32(tb[SEG6_LOCAL_BPF_PROG]); + p = bpf_prog_get_type(fd, BPF_PROG_TYPE_LWT_SEG6LOCAL); + if (IS_ERR(p)) { + kfree(slwt->bpf.name); + return PTR_ERR(p); + } + + slwt->bpf.prog = p; + return 0; +} + +static int put_nla_bpf(struct sk_buff *skb, struct seg6_local_lwt *slwt) +{ + struct nlattr *nest; + + if (!slwt->bpf.prog) + return 0; + + nest = nla_nest_start(skb, SEG6_LOCAL_BPF); + if (!nest) + return -EMSGSIZE; + + if (nla_put_u32(skb, SEG6_LOCAL_BPF_PROG, slwt->bpf.prog->aux->id)) + return -EMSGSIZE; + + if (slwt->bpf.name && + nla_put_string(skb, SEG6_LOCAL_BPF_PROG_NAME, slwt->bpf.name)) + return -EMSGSIZE; + + return nla_nest_end(skb, nest); +} + +static int cmp_nla_bpf(struct seg6_local_lwt *a, struct seg6_local_lwt *b) +{ + if (!a->bpf.name && !b->bpf.name) + return 0; + + if (!a->bpf.name || !b->bpf.name) + return 1; + + return strcmp(a->bpf.name, b->bpf.name); +} + struct seg6_action_param { int (*parse)(struct nlattr **attrs, struct seg6_local_lwt *slwt); int (*put)(struct sk_buff *skb, struct seg6_local_lwt *slwt); @@ -745,6 +910,11 @@ static struct seg6_action_param seg6_action_params[SEG6_LOCAL_MAX + 1] = { [SEG6_LOCAL_OIF] = { .parse = parse_nla_oif, .put = put_nla_oif, .cmp = cmp_nla_oif }, + + [SEG6_LOCAL_BPF] = { .parse = parse_nla_bpf, + .put = put_nla_bpf, + .cmp = cmp_nla_bpf }, + }; static int parse_nla_action(struct nlattr **attrs, struct seg6_local_lwt *slwt) @@ -830,6 +1000,13 @@ static void seg6_local_destroy_state(struct lwtunnel_state *lwt) struct seg6_local_lwt *slwt = seg6_local_lwtunnel(lwt); kfree(slwt->srh); + + if (slwt->desc->attrs & (1 << SEG6_LOCAL_BPF)) { + kfree(slwt->bpf.name); + bpf_prog_put(slwt->bpf.prog); + } + + return; } static int seg6_local_fill_encap(struct sk_buff *skb, @@ -882,6 +1059,11 @@ static int seg6_local_get_encap_size(struct lwtunnel_state *lwt) if (attrs & (1 << SEG6_LOCAL_OIF)) nlsize += nla_total_size(4); + if (attrs & (1 << SEG6_LOCAL_BPF)) + nlsize += nla_total_size(sizeof(struct nlattr)) + + nla_total_size(MAX_PROG_NAME) + + nla_total_size(4); + return nlsize; } diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index a64d292fd4c7..5f6471225d31 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -1868,19 +1868,22 @@ err_alloc_dev: return err; } -static void __net_exit sit_exit_net(struct net *net) +static void __net_exit sit_exit_batch_net(struct list_head *net_list) { LIST_HEAD(list); + struct net *net; rtnl_lock(); - sit_destroy_tunnels(net, &list); + list_for_each_entry(net, net_list, exit_list) + sit_destroy_tunnels(net, &list); + unregister_netdevice_many(&list); rtnl_unlock(); } static struct pernet_operations sit_net_ops = { .init = sit_init_net, - .exit = sit_exit_net, + .exit_batch = sit_exit_batch_net, .id = &sit_net_id, .size = sizeof(struct sit_net), }; diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c index f7051ba5b8af..6fbdef630152 100644 --- a/net/ipv6/sysctl_net_ipv6.c +++ b/net/ipv6/sysctl_net_ipv6.c @@ -16,14 +16,31 @@ #include #include #include +#include #ifdef CONFIG_NETLABEL #include #endif +static int zero; static int one = 1; static int auto_flowlabels_min; static int auto_flowlabels_max = IP6_AUTO_FLOW_LABEL_MAX; +static int proc_rt6_multipath_hash_policy(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + struct net *net; + int ret; + + net = container_of(table->data, struct net, + ipv6.sysctl.multipath_hash_policy); + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (write && ret == 0) + call_netevent_notifiers(NETEVENT_IPV6_MPATH_HASH_UPDATE, net); + + return ret; +} static struct ctl_table ipv6_table_template[] = { { @@ -98,6 +115,43 @@ static struct ctl_table ipv6_table_template[] = { .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "max_dst_opts_number", + .data = &init_net.ipv6.sysctl.max_dst_opts_cnt, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "max_hbh_opts_number", + .data = &init_net.ipv6.sysctl.max_hbh_opts_cnt, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "max_dst_opts_length", + .data = &init_net.ipv6.sysctl.max_dst_opts_len, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "max_hbh_length", + .data = &init_net.ipv6.sysctl.max_hbh_opts_len, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, + { + .procname = "fib_multipath_hash_policy", + .data = &init_net.ipv6.sysctl.multipath_hash_policy, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_rt6_multipath_hash_policy, + .extra1 = &zero, + .extra2 = &one, + }, { } }; @@ -158,6 +212,11 @@ static int __net_init ipv6_sysctl_net_init(struct net *net) ipv6_table[7].data = &net->ipv6.sysctl.flowlabel_state_ranges; ipv6_table[8].data = &net->ipv6.sysctl.ip_nonlocal_bind; ipv6_table[9].data = &net->ipv6.sysctl.flowlabel_reflect; + ipv6_table[10].data = &net->ipv6.sysctl.max_dst_opts_cnt; + ipv6_table[11].data = &net->ipv6.sysctl.max_hbh_opts_cnt; + ipv6_table[12].data = &net->ipv6.sysctl.max_dst_opts_len; + ipv6_table[13].data = &net->ipv6.sysctl.max_hbh_opts_len; + ipv6_table[14].data = &net->ipv6.sysctl.multipath_hash_policy, ipv6_route_table = ipv6_route_sysctl_init(net); if (!ipv6_route_table) diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index e2f90a5cbd5e..4b9d5e509075 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1041,6 +1041,21 @@ static struct sock *tcp_v6_cookie_check(struct sock *sk, struct sk_buff *skb) return sk; } +u16 tcp_v6_get_syncookie(struct sock *sk, struct ipv6hdr *iph, + struct tcphdr *th, u32 *cookie) +{ + u16 mss = 0; +#ifdef CONFIG_SYN_COOKIES + mss = tcp_get_syncookie_mss(&tcp6_request_sock_ops, + &tcp_request_sock_ipv6_ops, sk, th); + if (mss) { + *cookie = __cookie_v6_init_sequence(iph, th, &mss); + tcp_synq_overflow(sk); + } +#endif + return mss; +} + static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb) { if (skb->protocol == htons(ETH_P_IP)) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index f0405d98444d..a503ca8d0800 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -102,28 +102,12 @@ static u32 udp6_ehashfn(const struct net *net, udp6_ehash_secret + net_hash_mix(net)); } -static u32 udp6_portaddr_hash(const struct net *net, - const struct in6_addr *addr6, - unsigned int port) -{ - unsigned int hash, mix = net_hash_mix(net); - - if (ipv6_addr_any(addr6)) - hash = jhash_1word(0, mix); - else if (ipv6_addr_v4mapped(addr6)) - hash = jhash_1word((__force u32)addr6->s6_addr32[3], mix); - else - hash = jhash2((__force u32 *)addr6->s6_addr32, 4, mix); - - return hash ^ port; -} - int udp_v6_get_port(struct sock *sk, unsigned short snum) { unsigned int hash2_nulladdr = - udp6_portaddr_hash(sock_net(sk), &in6addr_any, snum); + ipv6_portaddr_hash(sock_net(sk), &in6addr_any, snum); unsigned int hash2_partial = - udp6_portaddr_hash(sock_net(sk), &sk->sk_v6_rcv_saddr, 0); + ipv6_portaddr_hash(sock_net(sk), &sk->sk_v6_rcv_saddr, 0); /* precompute partial secondary hash */ udp_sk(sk)->udp_portaddr_hash = hash2_partial; @@ -132,7 +116,7 @@ int udp_v6_get_port(struct sock *sk, unsigned short snum) static void udp_v6_rehash(struct sock *sk) { - u16 new_hash = udp6_portaddr_hash(sock_net(sk), + u16 new_hash = ipv6_portaddr_hash(sock_net(sk), &sk->sk_v6_rcv_saddr, inet_sk(sk)->inet_num); @@ -197,7 +181,7 @@ static struct sock *udp6_lib_lookup2(struct net *net, struct udp_hslot *hslot2, struct sk_buff *skb) { struct sock *sk, *result; - int score, badness, matches = 0, reuseport = 0; + int score, badness; u32 hash = 0; result = NULL; @@ -206,8 +190,7 @@ static struct sock *udp6_lib_lookup2(struct net *net, score = compute_score(sk, net, saddr, sport, daddr, hnum, dif, sdif, exact_dif); if (score > badness) { - reuseport = sk->sk_reuseport; - if (reuseport) { + if (sk->sk_reuseport) { hash = udp6_ehashfn(net, daddr, hnum, saddr, sport); @@ -215,15 +198,9 @@ static struct sock *udp6_lib_lookup2(struct net *net, sizeof(struct udphdr)); if (result) return result; - matches = 1; } result = sk; badness = score; - } else if (score == badness && reuseport) { - matches++; - if (reciprocal_scale(hash, matches) == 0) - result = sk; - hash = next_pseudo_random32(hash); } } return result; @@ -241,11 +218,11 @@ struct sock *__udp6_lib_lookup(struct net *net, unsigned int hash2, slot2, slot = udp_hashfn(net, hnum, udptable->mask); struct udp_hslot *hslot2, *hslot = &udptable->hash[slot]; bool exact_dif = udp6_lib_exact_dif_match(net, skb); - int score, badness, matches = 0, reuseport = 0; + int score, badness; u32 hash = 0; if (hslot->count > 10) { - hash2 = udp6_portaddr_hash(net, daddr, hnum); + hash2 = ipv6_portaddr_hash(net, daddr, hnum); slot2 = hash2 & udptable->mask; hslot2 = &udptable->hash2[slot2]; if (hslot->count < hslot2->count) @@ -256,7 +233,7 @@ struct sock *__udp6_lib_lookup(struct net *net, hslot2, skb); if (!result) { unsigned int old_slot2 = slot2; - hash2 = udp6_portaddr_hash(net, &in6addr_any, hnum); + hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum); slot2 = hash2 & udptable->mask; /* avoid searching the same slot again. */ if (unlikely(slot2 == old_slot2)) @@ -271,6 +248,8 @@ struct sock *__udp6_lib_lookup(struct net *net, exact_dif, hslot2, skb); } + if (unlikely(IS_ERR(result))) + return NULL; return result; } begin: @@ -280,23 +259,18 @@ begin: score = compute_score(sk, net, saddr, sport, daddr, hnum, dif, sdif, exact_dif); if (score > badness) { - reuseport = sk->sk_reuseport; - if (reuseport) { + if (sk->sk_reuseport) { hash = udp6_ehashfn(net, daddr, hnum, saddr, sport); result = reuseport_select_sock(sk, hash, skb, sizeof(struct udphdr)); + if (unlikely(IS_ERR(result))) + return NULL; if (result) return result; - matches = 1; } result = sk; badness = score; - } else if (score == badness && reuseport) { - matches++; - if (reciprocal_scale(hash, matches) == 0) - result = sk; - hash = next_pseudo_random32(hash); } } return result; @@ -761,9 +735,9 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb, struct sk_buff *nskb; if (use_hash2) { - hash2_any = udp6_portaddr_hash(net, &in6addr_any, hnum) & + hash2_any = ipv6_portaddr_hash(net, &in6addr_any, hnum) & udptable->mask; - hash2 = udp6_portaddr_hash(net, daddr, hnum) & udptable->mask; + hash2 = ipv6_portaddr_hash(net, daddr, hnum) & udptable->mask; start_lookup: hslot = &udptable->hash2[hash2]; offset = offsetof(typeof(*sk), __sk_common.skc_portaddr_node); @@ -958,7 +932,7 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net, int dif, int sdif) { unsigned short hnum = ntohs(loc_port); - unsigned int hash2 = udp6_portaddr_hash(net, loc_addr, hnum); + unsigned int hash2 = ipv6_portaddr_hash(net, loc_addr, hnum); unsigned int slot2 = hash2 & udp_table.mask; struct udp_hslot *hslot2 = &udp_table.hash2[slot2]; const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum); diff --git a/net/ipv6/xfrm6_mode_tunnel.c b/net/ipv6/xfrm6_mode_tunnel.c index 49bbc69e6f4d..e66b94f46532 100644 --- a/net/ipv6/xfrm6_mode_tunnel.c +++ b/net/ipv6/xfrm6_mode_tunnel.c @@ -59,7 +59,7 @@ static int xfrm6_mode_tunnel_output(struct xfrm_state *x, struct sk_buff *skb) if (x->props.flags & XFRM_STATE_NOECN) dsfield &= ~INET_ECN_MASK; ipv6_change_dsfield(top_iph, 0, dsfield); - top_iph->hop_limit = ip6_dst_hoplimit(dst->child); + top_iph->hop_limit = ip6_dst_hoplimit(xfrm_dst_child(dst)); top_iph->saddr = *(struct in6_addr *)&x->props.saddr; top_iph->daddr = *(struct in6_addr *)&x->id.daddr; return 0; @@ -112,8 +112,10 @@ static void xfrm6_mode_tunnel_xmit(struct xfrm_state *x, struct sk_buff *skb) { struct xfrm_offload *xo = xfrm_offload(skb); - if (xo->flags & XFRM_GSO_SEGMENT) + if (xo->flags & XFRM_GSO_SEGMENT) { + skb->network_header = skb->network_header - x->props.header_len; skb->transport_header = skb->network_header + sizeof(struct ipv6hdr); + } skb_reset_mac_len(skb); pskb_pull(skb, skb->mac_len + x->props.header_len); diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c index 3e4aa4def9f8..1bad876be17f 100644 --- a/net/ipv6/xfrm6_policy.c +++ b/net/ipv6/xfrm6_policy.c @@ -113,8 +113,6 @@ static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev, * it was magically lost, so this code needs audit */ xdst->u.rt6.rt6i_flags = rt->rt6i_flags & (RTF_ANYCAST | RTF_LOCAL); - xdst->u.rt6.rt6i_metric = rt->rt6i_metric; - xdst->u.rt6.rt6i_node = rt->rt6i_node; xdst->route_cookie = rt6_get_cookie(rt); xdst->u.rt6.rt6i_gateway = rt->rt6i_gateway; xdst->u.rt6.rt6i_dst = rt->rt6i_dst; @@ -271,7 +269,7 @@ static void xfrm6_dst_ifdown(struct dst_entry *dst, struct net_device *dev, in6_dev_put(xdst->u.rt6.rt6i_idev); xdst->u.rt6.rt6i_idev = loopback_idev; in6_dev_hold(loopback_idev); - xdst = (struct xfrm_dst *)xdst->u.dst.child; + xdst = (struct xfrm_dst *)xfrm_dst_child(&xdst->u.dst); } while (xdst->u.dst.xfrm); __in6_dev_put(loopback_idev); diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c index 8792cad28e29..f6fec281ba45 100644 --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -266,12 +266,13 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs, /* check and decrement ttl */ if (ipv6_hdr(skb)->hop_limit <= 1) { + struct inet6_dev *idev = __in6_dev_get_safely(skb->dev); + /* Force OUTPUT device used as source address */ skb->dev = dst->dev; icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT, 0); - IP6_INC_STATS(net, ip6_dst_idev(dst), - IPSTATS_MIB_INHDRERRORS); + IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); return false; } diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c index 6a34cad6eab6..72dc268aaa7f 100644 --- a/net/netfilter/nft_meta.c +++ b/net/netfilter/nft_meta.c @@ -212,7 +212,7 @@ void nft_meta_get_eval(const struct nft_expr *expr, } #ifdef CONFIG_XFRM case NFT_META_SECPATH: - nft_reg_store8(dest, secpath_exists(skb)); + nft_reg_store8(dest, !!skb->sp); break; #endif default: diff --git a/net/netfilter/xt_bpf.c b/net/netfilter/xt_bpf.c index 7908b67e4b6a..a2cf8a6236d6 100644 --- a/net/netfilter/xt_bpf.c +++ b/net/netfilter/xt_bpf.c @@ -62,7 +62,6 @@ static int __bpf_mt_check_path(const char *path, struct bpf_prog **ret) *ret = bpf_prog_get_type_path(path, BPF_PROG_TYPE_SOCKET_FILTER); return PTR_ERR_OR_ZERO(*ret); - } static int bpf_mt_check(const struct xt_mtchk_param *par) diff --git a/net/netfilter/xt_policy.c b/net/netfilter/xt_policy.c index 2b4ab189bba7..5639fb03bdd9 100644 --- a/net/netfilter/xt_policy.c +++ b/net/netfilter/xt_policy.c @@ -93,7 +93,8 @@ match_policy_out(const struct sk_buff *skb, const struct xt_policy_info *info, if (dst->xfrm == NULL) return -1; - for (i = 0; dst && dst->xfrm; dst = dst->child, i++) { + for (i = 0; dst && dst->xfrm; + dst = ((struct xfrm_dst *)dst)->child, i++) { pos = strict ? i : 0; if (pos >= info->len) return 0; diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c index 3d74e33bf829..3117b0c143d0 100644 --- a/net/openvswitch/conntrack.c +++ b/net/openvswitch/conntrack.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright (c) 2015 Nicira, Inc. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h index bc7efd1867ab..e2ba460ec91e 100644 --- a/net/openvswitch/conntrack.h +++ b/net/openvswitch/conntrack.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright (c) 2015 Nicira, Inc. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef OVS_CONNTRACK_H diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c index a4a4260615d6..01281514aec3 100644 --- a/net/openvswitch/flow_netlink.c +++ b/net/openvswitch/flow_netlink.c @@ -48,6 +48,7 @@ #include #include #include +#include #include "flow_netlink.h" @@ -315,7 +316,8 @@ size_t ovs_tun_key_attr_size(void) + nla_total_size(0) /* OVS_TUNNEL_KEY_ATTR_CSUM */ + nla_total_size(0) /* OVS_TUNNEL_KEY_ATTR_OAM */ + nla_total_size(256) /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */ - /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with + /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and + * OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS is mutually exclusive with * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it. */ + nla_total_size(2) /* OVS_TUNNEL_KEY_ATTR_TP_SRC */ @@ -371,6 +373,7 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] .next = ovs_vxlan_ext_key_lens }, [OVS_TUNNEL_KEY_ATTR_IPV6_SRC] = { .len = sizeof(struct in6_addr) }, [OVS_TUNNEL_KEY_ATTR_IPV6_DST] = { .len = sizeof(struct in6_addr) }, + [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS] = { .len = OVS_ATTR_VARIABLE }, }; /* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute. */ @@ -593,6 +596,33 @@ static int vxlan_tun_opt_from_nlattr(const struct nlattr *attr, return 0; } +static int erspan_tun_opt_from_nlattr(const struct nlattr *a, + struct sw_flow_match *match, bool is_mask, + bool log) +{ + unsigned long opt_key_offset; + + BUILD_BUG_ON(sizeof(struct erspan_metadata) > + sizeof(match->key->tun_opts)); + + if (nla_len(a) > sizeof(match->key->tun_opts)) { + OVS_NLERR(log, "ERSPAN option length err (len %d, max %zu).", + nla_len(a), sizeof(match->key->tun_opts)); + return -EINVAL; + } + + if (!is_mask) + SW_FLOW_KEY_PUT(match, tun_opts_len, + sizeof(struct erspan_metadata), false); + else + SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true); + + opt_key_offset = TUN_METADATA_OFFSET(nla_len(a)); + SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, nla_data(a), + nla_len(a), is_mask); + return 0; +} + static int ip_tun_from_nlattr(const struct nlattr *attr, struct sw_flow_match *match, bool is_mask, bool log) @@ -700,6 +730,20 @@ static int ip_tun_from_nlattr(const struct nlattr *attr, break; case OVS_TUNNEL_KEY_ATTR_PAD: break; + case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS: + if (opts_type) { + OVS_NLERR(log, "Multiple metadata blocks provided"); + return -EINVAL; + } + + err = erspan_tun_opt_from_nlattr(a, match, is_mask, + log); + if (err) + return err; + + tun_flags |= TUNNEL_ERSPAN_OPT; + opts_type = type; + break; default: OVS_NLERR(log, "Unknown IP tunnel attribute %d", type); @@ -824,6 +868,10 @@ static int __ip_tun_to_nlattr(struct sk_buff *skb, else if (output->tun_flags & TUNNEL_VXLAN_OPT && vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len)) return -EMSGSIZE; + else if (output->tun_flags & TUNNEL_ERSPAN_OPT && + nla_put(skb, OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS, + swkey_tun_opts_len, tun_opts)) + return -EMSGSIZE; } return 0; @@ -2177,7 +2225,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr, struct ovs_tunnel_info *ovs_tun; struct nlattr *a; int err = 0, start, opts_type; + __be16 dst_opt_type; + dst_opt_type = 0; ovs_match_init(&match, &key, true, NULL); opts_type = ip_tun_from_nlattr(nla_data(attr), &match, false, log); if (opts_type < 0) @@ -2189,8 +2239,13 @@ static int validate_and_copy_set_tun(const struct nlattr *attr, err = validate_geneve_opts(&key); if (err < 0) return err; + dst_opt_type = TUNNEL_GENEVE_OPT; break; case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS: + dst_opt_type = TUNNEL_VXLAN_OPT; + break; + case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS: + dst_opt_type = TUNNEL_ERSPAN_OPT; break; } }; @@ -2233,7 +2288,7 @@ static int validate_and_copy_set_tun(const struct nlattr *attr, */ ip_tunnel_info_opts_set(tun_info, TUN_METADATA_OPTS(&key, key.tun_opts_len), - key.tun_opts_len); + key.tun_opts_len, dst_opt_type); add_nested_action_end(*sfa, start); return err; diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 5d8ab4d411aa..e3edf0c7e2e4 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -250,14 +250,12 @@ static int packet_direct_xmit(struct sk_buff *skb) struct sk_buff *orig_skb = skb; struct netdev_queue *txq; int ret = NETDEV_TX_BUSY; - bool again = false; - if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev))) goto drop; - skb = validate_xmit_skb_list(skb, dev, &again); + skb = validate_xmit_skb_list(skb, dev); if (skb != orig_skb) goto drop; diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c index 1c98a026b41a..159cea6d26b6 100644 --- a/net/rxrpc/call_object.c +++ b/net/rxrpc/call_object.c @@ -412,7 +412,7 @@ void rxrpc_incoming_call(struct rxrpc_sock *rx, bool rxrpc_queue_call(struct rxrpc_call *call) { const void *here = __builtin_return_address(0); - int n = __atomic_add_unless(&call->usage, 1, 0); + int n = atomic_fetch_add_unless(&call->usage, 1, 0); if (n == 0) return false; if (rxrpc_queue_work(&call->processor)) diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c index 0e5087b9e07c..26b9f1614728 100644 --- a/net/rxrpc/conn_object.c +++ b/net/rxrpc/conn_object.c @@ -250,7 +250,7 @@ void rxrpc_kill_connection(struct rxrpc_connection *conn) bool rxrpc_queue_conn(struct rxrpc_connection *conn) { const void *here = __builtin_return_address(0); - int n = __atomic_add_unless(&conn->usage, 1, 0); + int n = atomic_fetch_add_unless(&conn->usage, 1, 0); if (n == 0) return false; if (rxrpc_queue_work(&conn->processor)) @@ -293,7 +293,7 @@ rxrpc_get_connection_maybe(struct rxrpc_connection *conn) const void *here = __builtin_return_address(0); if (conn) { - int n = __atomic_add_unless(&conn->usage, 1, 0); + int n = atomic_fetch_add_unless(&conn->usage, 1, 0); if (n > 0) trace_rxrpc_conn(conn, rxrpc_conn_got, n + 1, here); else diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index de0cd73a5a5d..b0285f15b2ee 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -46,17 +46,20 @@ static int tcf_connmark(struct sk_buff *skb, const struct tc_action *a, tcf_lastuse_update(&ca->tcf_tm); bstats_update(&ca->tcf_bstats, skb); - if (skb->protocol == htons(ETH_P_IP)) { + switch (skb_protocol(skb, true)) { + case htons(ETH_P_IP): if (skb->len < sizeof(struct iphdr)) goto out; proto = NFPROTO_IPV4; - } else if (skb->protocol == htons(ETH_P_IPV6)) { + break; + case htons(ETH_P_IPV6): if (skb->len < sizeof(struct ipv6hdr)) goto out; proto = NFPROTO_IPV6; - } else { + break; + default: goto out; } diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c index a449594553d0..10587d0aac10 100644 --- a/net/sched/act_csum.c +++ b/net/sched/act_csum.c @@ -552,7 +552,7 @@ static int tcf_csum(struct sk_buff *skb, const struct tc_action *a, if (unlikely(action == TC_ACT_SHOT)) goto drop; - switch (tc_skb_protocol(skb)) { + switch (skb_protocol(skb, false)) { case cpu_to_be16(ETH_P_IP): if (!tcf_csum_ipv4(skb, update_flags)) goto drop; diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c index 1a8a49e33320..d9a3f5d01da6 100644 --- a/net/sched/act_skbedit.c +++ b/net/sched/act_skbedit.c @@ -23,6 +23,9 @@ #include #include #include +#include +#include +#include #include #include @@ -41,6 +44,25 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, if (d->flags & SKBEDIT_F_PRIORITY) skb->priority = d->priority; + if (d->flags & SKBEDIT_F_INHERITDSFIELD) { + int wlen = skb_network_offset(skb); + + switch (skb_protocol(skb, true)) { + case htons(ETH_P_IP): + wlen += sizeof(struct iphdr); + if (!pskb_may_pull(skb, wlen)) + goto err; + skb->priority = ipv4_get_dsfield(ip_hdr(skb)) >> 2; + break; + + case htons(ETH_P_IPV6): + wlen += sizeof(struct ipv6hdr); + if (!pskb_may_pull(skb, wlen)) + goto err; + skb->priority = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2; + break; + } + } if (d->flags & SKBEDIT_F_QUEUE_MAPPING && skb->dev->real_num_tx_queues > d->queue_mapping) skb_set_queue_mapping(skb, d->queue_mapping); @@ -53,6 +75,11 @@ static int tcf_skbedit(struct sk_buff *skb, const struct tc_action *a, spin_unlock(&d->tcf_lock); return d->tcf_action; + +err: + d->tcf_qstats.drops++; + spin_unlock(&d->tcf_lock); + return TC_ACT_SHOT; } static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = { @@ -62,6 +89,7 @@ static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = { [TCA_SKBEDIT_MARK] = { .len = sizeof(u32) }, [TCA_SKBEDIT_PTYPE] = { .len = sizeof(u16) }, [TCA_SKBEDIT_MASK] = { .len = sizeof(u32) }, + [TCA_SKBEDIT_FLAGS] = { .len = sizeof(u64) }, }; static int tcf_skbedit_init(struct net *net, struct nlattr *nla, @@ -114,6 +142,13 @@ static int tcf_skbedit_init(struct net *net, struct nlattr *nla, mask = nla_data(tb[TCA_SKBEDIT_MASK]); } + if (tb[TCA_SKBEDIT_FLAGS] != NULL) { + u64 *pure_flags = nla_data(tb[TCA_SKBEDIT_FLAGS]); + + if (*pure_flags & SKBEDIT_F_INHERITDSFIELD) + flags |= SKBEDIT_F_INHERITDSFIELD; + } + parm = nla_data(tb[TCA_SKBEDIT_PARMS]); exists = tcf_idr_check(tn, parm->index, a, bind); @@ -178,6 +213,7 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct tc_action *a, .action = d->tcf_action, }; struct tcf_t t; + u64 pure_flags = 0; if (nla_put(skb, TCA_SKBEDIT_PARMS, sizeof(opt), &opt)) goto nla_put_failure; @@ -196,6 +232,11 @@ static int tcf_skbedit_dump(struct sk_buff *skb, struct tc_action *a, if ((d->flags & SKBEDIT_F_MASK) && nla_put_u32(skb, TCA_SKBEDIT_MASK, d->mask)) goto nla_put_failure; + if (d->flags & SKBEDIT_F_INHERITDSFIELD) + pure_flags |= SKBEDIT_F_INHERITDSFIELD; + if (pure_flags != 0 && + nla_put(skb, TCA_SKBEDIT_FLAGS, sizeof(pure_flags), &pure_flags)) + goto nla_put_failure; tcf_tm_dump(&t, &d->tcf_tm); if (nla_put_64bit(skb, TCA_SKBEDIT_TM, sizeof(t), &t, TCA_SKBEDIT_PAD)) diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 8808133e78a3..14d93ed68e21 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -325,7 +325,7 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp, reclassify: #endif for (; tp; tp = rcu_dereference_bh(tp->next)) { - __be16 protocol = tc_skb_protocol(skb); + __be16 protocol = skb_protocol(skb, false); int err; if (tp->protocol != protocol && diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c index 052073137987..b41718504205 100644 --- a/net/sched/cls_bpf.c +++ b/net/sched/cls_bpf.c @@ -367,15 +367,17 @@ static int cls_bpf_prog_from_ops(struct nlattr **tb, struct cls_bpf_prog *prog) } static int cls_bpf_prog_from_efd(struct nlattr **tb, struct cls_bpf_prog *prog, - const struct tcf_proto *tp) + u32 gen_flags, const struct tcf_proto *tp) { struct bpf_prog *fp; char *name = NULL; + bool skip_sw; u32 bpf_fd; bpf_fd = nla_get_u32(tb[TCA_BPF_FD]); + skip_sw = gen_flags & TCA_CLS_FLAGS_SKIP_SW; - fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_SCHED_CLS); + fp = bpf_prog_get_type_dev(bpf_fd, BPF_PROG_TYPE_SCHED_CLS, skip_sw); if (IS_ERR(fp)) return PTR_ERR(fp); @@ -433,7 +435,7 @@ static int cls_bpf_set_parms(struct net *net, struct tcf_proto *tp, prog->gen_flags = gen_flags; ret = is_bpf ? cls_bpf_prog_from_ops(tb, prog) : - cls_bpf_prog_from_efd(tb, prog, tp); + cls_bpf_prog_from_efd(tb, prog, gen_flags, tp); if (ret < 0) return ret; diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c index 0d364166ccc9..ef5af64d9452 100644 --- a/net/sched/cls_flow.c +++ b/net/sched/cls_flow.c @@ -87,7 +87,7 @@ static u32 flow_get_dst(const struct sk_buff *skb, const struct flow_keys *flow) if (dst) return ntohl(dst); - return addr_fold(skb_dst(skb)) ^ (__force u16) tc_skb_protocol(skb); + return addr_fold(skb_dst(skb)) ^ (__force u16)skb_protocol(skb, true); } static u32 flow_get_proto(const struct sk_buff *skb, @@ -111,7 +111,7 @@ static u32 flow_get_proto_dst(const struct sk_buff *skb, if (flow->ports.ports) return ntohs(flow->ports.dst); - return addr_fold(skb_dst(skb)) ^ (__force u16) tc_skb_protocol(skb); + return addr_fold(skb_dst(skb)) ^ (__force u16)skb_protocol(skb, true); } static u32 flow_get_iif(const struct sk_buff *skb) @@ -158,7 +158,7 @@ static u32 flow_get_nfct(const struct sk_buff *skb) static u32 flow_get_nfct_src(const struct sk_buff *skb, const struct flow_keys *flow) { - switch (tc_skb_protocol(skb)) { + switch (skb_protocol(skb, true)) { case htons(ETH_P_IP): return ntohl(CTTUPLE(skb, src.u3.ip)); case htons(ETH_P_IPV6): @@ -171,7 +171,7 @@ fallback: static u32 flow_get_nfct_dst(const struct sk_buff *skb, const struct flow_keys *flow) { - switch (tc_skb_protocol(skb)) { + switch (skb_protocol(skb, true)) { case htons(ETH_P_IP): return ntohl(CTTUPLE(skb, dst.u3.ip)); case htons(ETH_P_IPV6): diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 8974bd25c71e..3cc9dfe7fb02 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -191,7 +191,7 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp, /* skb_flow_dissect() does not set n_proto in case an unknown protocol, * so do it rather here. */ - skb_key.basic.n_proto = skb->protocol; + skb_key.basic.n_proto = skb_protocol(skb, false); skb_flow_dissect(skb, &head->dissector, &skb_key, 0); fl_set_masked_key(&skb_mkey, &skb_key, &head->mask); diff --git a/net/sched/em_ipset.c b/net/sched/em_ipset.c index c1b23e3060b8..ef3b6b66c26a 100644 --- a/net/sched/em_ipset.c +++ b/net/sched/em_ipset.c @@ -62,7 +62,7 @@ static int em_ipset_match(struct sk_buff *skb, struct tcf_ematch *em, }; int ret, network_offset; - switch (tc_skb_protocol(skb)) { + switch (skb_protocol(skb, true)) { case htons(ETH_P_IP): state.pf = NFPROTO_IPV4; if (!pskb_network_may_pull(skb, sizeof(struct iphdr))) diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c index d6e97115500b..e36fa9272259 100644 --- a/net/sched/em_meta.c +++ b/net/sched/em_meta.c @@ -199,7 +199,7 @@ META_COLLECTOR(int_priority) META_COLLECTOR(int_protocol) { /* Let userspace take care of the byte ordering */ - dst->value = tc_skb_protocol(skb); + dst->value = skb_protocol(skb, false); } META_COLLECTOR(int_pkttype) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 05c7196ae849..4a76ceeca6fd 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -30,7 +30,6 @@ #include #include #include -#include /* Qdisc to use by default */ const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops; @@ -120,8 +119,6 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate, if (unlikely(skb)) { /* skb in gso_skb were already validated */ *validate = false; - if (xfrm_offload(skb)) - *validate = true; /* check the reason of requeuing without tx lock first */ txq = skb_get_tx_queue(txq->dev, skb); if (!netif_xmit_frozen_or_stopped(txq)) { @@ -175,23 +172,13 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q, spinlock_t *root_lock, bool validate) { int ret = NETDEV_TX_BUSY; - bool again = false; /* And release qdisc */ spin_unlock(root_lock); /* Note that we validate skb (GSO, checksum, ...) outside of locks */ if (validate) - skb = validate_xmit_skb_list(skb, dev, &again); - -#ifdef CONFIG_XFRM_OFFLOAD - if (unlikely(again)) { - if (root_lock) - spin_lock(root_lock); - dev_requeue_skb(skb, q); - return false; - } -#endif + skb = validate_xmit_skb_list(skb, dev); if (likely(skb)) { HARD_TX_LOCK(dev, txq, smp_processor_id()); diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c index 5d8c095ec5ec..b55ddc3c8aa4 100644 --- a/net/sched/sch_teql.c +++ b/net/sched/sch_teql.c @@ -245,7 +245,7 @@ __teql_resolve(struct sk_buff *skb, struct sk_buff *skb_res, char haddr[MAX_ADDR_LEN]; neigh_ha_snapshot(haddr, n, dev); - err = dev_hard_header(skb, dev, ntohs(tc_skb_protocol(skb)), + err = dev_hard_header(skb, dev, ntohs(skb_protocol(skb, false)), haddr, NULL, skb->len); if (err < 0) diff --git a/net/socket.c b/net/socket.c index 415b918c036c..f4dad9c24336 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1885,9 +1885,9 @@ static int __sys_setsockopt(int fd, int level, int optname, err = BPF_CGROUP_RUN_PROG_SETSOCKOPT(sock->sk, &level, &optname, optval, &optlen, &kernel_optval); + if (err < 0) { goto out_put; - } else if (err > 0) { err = 0; goto out_put; @@ -1911,7 +1911,6 @@ static int __sys_setsockopt(int fd, int level, int optname, set_fs(oldfs); kfree(kernel_optval); } - out_put: fput_light(sock->file, fput_needed); } @@ -1956,7 +1955,6 @@ static int __sys_getsockopt(int fd, int level, int optname, err = BPF_CGROUP_RUN_PROG_GETSOCKOPT(sock->sk, level, optname, optval, optlen, max_optlen, err); - out_put: fput_light(sock->file, fput_needed); } diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c index a68c754e84ea..32288e67b22f 100644 --- a/net/strparser/strparser.c +++ b/net/strparser/strparser.c @@ -29,18 +29,10 @@ static struct workqueue_struct *strp_wq; -struct _strp_msg { - /* Internal cb structure. struct strp_msg must be first for passing - * to upper layer. - */ - struct strp_msg strp; - int accum_len; -}; - static inline struct _strp_msg *_strp_msg(struct sk_buff *skb) { return (struct _strp_msg *)((void *)skb->cb + - offsetof(struct qdisc_skb_cb, data)); + offsetof(struct sk_skb_cb, strp)); } /* Lower lock held */ @@ -497,6 +489,19 @@ int strp_init(struct strparser *strp, struct sock *sk, } EXPORT_SYMBOL_GPL(strp_init); +/* Sock process lock held (lock_sock) */ +void __strp_unpause(struct strparser *strp) +{ + strp->paused = 0; + + if (strp->need_bytes) { + if (strp_peek_len(strp) < strp->need_bytes) + return; + } + strp_read_sock(strp); +} +EXPORT_SYMBOL_GPL(__strp_unpause); + void strp_unpause(struct strparser *strp) { strp->paused = 0; diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c index 556989b0b5fc..09c9b4ca84bc 100644 --- a/net/sunrpc/cache.c +++ b/net/sunrpc/cache.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * net/sunrpc/cache.c * @@ -5,9 +6,6 @@ * used by sunrpc clients and servers. * * Copyright (C) 2002 Neil Brown - * - * Released under terms in GPL version 2. See COPYING. - * */ #include diff --git a/net/tls/Kconfig b/net/tls/Kconfig index eb583038c67e..99c1a19c17b1 100644 --- a/net/tls/Kconfig +++ b/net/tls/Kconfig @@ -7,9 +7,21 @@ config TLS select CRYPTO select CRYPTO_AES select CRYPTO_GCM + select STREAM_PARSER + select NET_SOCK_MSG default n ---help--- Enable kernel support for TLS protocol. This allows symmetric encryption handling of the TLS protocol to be done in-kernel. If unsure, say N. + +config TLS_DEVICE + bool "Transport Layer Security HW offload" + depends on TLS + select SOCK_VALIDATE_XMIT + default n + help + Enable kernel support for HW offload of the TLS protocol. + + If unsure, say N. diff --git a/net/tls/Makefile b/net/tls/Makefile index a930fd1c4f7b..4d6b728a67d0 100644 --- a/net/tls/Makefile +++ b/net/tls/Makefile @@ -5,3 +5,5 @@ obj-$(CONFIG_TLS) += tls.o tls-y := tls_main.o tls_sw.o + +tls-$(CONFIG_TLS_DEVICE) += tls_device.o tls_device_fallback.o diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c new file mode 100644 index 000000000000..b290eb3ae155 --- /dev/null +++ b/net/tls/tls_device.c @@ -0,0 +1,1054 @@ +/* Copyright (c) 2018, Mellanox Technologies All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +/* device_offload_lock is used to synchronize tls_dev_add + * against NETDEV_DOWN notifications. + */ +static DECLARE_RWSEM(device_offload_lock); + +static void tls_device_gc_task(struct work_struct *work); + +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task); +static LIST_HEAD(tls_device_gc_list); +static LIST_HEAD(tls_device_list); +static DEFINE_SPINLOCK(tls_device_lock); + +static void tls_device_free_ctx(struct tls_context *ctx) +{ + if (ctx->tx_conf == TLS_HW) { + kfree(tls_offload_ctx_tx(ctx)); + kfree(ctx->tx.rec_seq); + kfree(ctx->tx.iv); + } + + if (ctx->rx_conf == TLS_HW) + kfree(tls_offload_ctx_rx(ctx)); + + tls_ctx_free(ctx); +} + +static void tls_device_gc_task(struct work_struct *work) +{ + struct tls_context *ctx, *tmp; + unsigned long flags; + LIST_HEAD(gc_list); + + spin_lock_irqsave(&tls_device_lock, flags); + list_splice_init(&tls_device_gc_list, &gc_list); + spin_unlock_irqrestore(&tls_device_lock, flags); + + list_for_each_entry_safe(ctx, tmp, &gc_list, list) { + struct net_device *netdev = ctx->netdev; + + if (netdev && ctx->tx_conf == TLS_HW) { + netdev->tlsdev_ops->tls_dev_del(netdev, ctx, + TLS_OFFLOAD_CTX_DIR_TX); + dev_put(netdev); + ctx->netdev = NULL; + } + + list_del(&ctx->list); + tls_device_free_ctx(ctx); + } +} + +static void tls_device_attach(struct tls_context *ctx, struct sock *sk, + struct net_device *netdev) +{ + if (sk->sk_destruct != tls_device_sk_destruct) { + refcount_set(&ctx->refcount, 1); + dev_hold(netdev); + ctx->netdev = netdev; + spin_lock_irq(&tls_device_lock); + list_add_tail(&ctx->list, &tls_device_list); + spin_unlock_irq(&tls_device_lock); + + ctx->sk_destruct = sk->sk_destruct; + sk->sk_destruct = tls_device_sk_destruct; + } +} + +static void tls_device_queue_ctx_destruction(struct tls_context *ctx) +{ + unsigned long flags; + + spin_lock_irqsave(&tls_device_lock, flags); + if (unlikely(!refcount_dec_and_test(&ctx->refcount))) + goto unlock; + + list_move_tail(&ctx->list, &tls_device_gc_list); + + /* schedule_work inside the spinlock + * to make sure tls_device_down waits for that work. + */ + schedule_work(&tls_device_gc_work); +unlock: + spin_unlock_irqrestore(&tls_device_lock, flags); +} + +/* We assume that the socket is already connected */ +static struct net_device *get_netdev_for_sock(struct sock *sk) +{ + struct dst_entry *dst = sk_dst_get(sk); + struct net_device *netdev = NULL; + + if (likely(dst)) { + netdev = dst->dev; + dev_hold(netdev); + } + + dst_release(dst); + + return netdev; +} + +static void destroy_record(struct tls_record_info *record) +{ + int nr_frags = record->num_frags; + skb_frag_t *frag; + + while (nr_frags-- > 0) { + frag = &record->frags[nr_frags]; + __skb_frag_unref(frag); + } + kfree(record); +} + +static void delete_all_records(struct tls_offload_context_tx *offload_ctx) +{ + struct tls_record_info *info, *temp; + + list_for_each_entry_safe(info, temp, &offload_ctx->records_list, list) { + list_del(&info->list); + destroy_record(info); + } + + offload_ctx->retransmit_hint = NULL; +} + +static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_record_info *info, *temp; + struct tls_offload_context_tx *ctx; + u64 deleted_records = 0; + unsigned long flags; + + if (!tls_ctx) + return; + + ctx = tls_offload_ctx_tx(tls_ctx); + + spin_lock_irqsave(&ctx->lock, flags); + info = ctx->retransmit_hint; + if (info && !before(acked_seq, info->end_seq)) { + ctx->retransmit_hint = NULL; + list_del(&info->list); + destroy_record(info); + deleted_records++; + } + + list_for_each_entry_safe(info, temp, &ctx->records_list, list) { + if (before(acked_seq, info->end_seq)) + break; + list_del(&info->list); + + destroy_record(info); + deleted_records++; + } + + ctx->unacked_record_sn += deleted_records; + spin_unlock_irqrestore(&ctx->lock, flags); +} + +/* At this point, there should be no references on this + * socket and no in-flight SKBs associated with this + * socket, so it is safe to free all the resources. + */ +void tls_device_sk_destruct(struct sock *sk) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx); + + tls_ctx->sk_destruct(sk); + + if (tls_ctx->tx_conf == TLS_HW) { + if (ctx->open_record) + destroy_record(ctx->open_record); + delete_all_records(ctx); + crypto_free_aead(ctx->aead_send); + clean_acked_data_disable(inet_csk(sk)); + } + + tls_device_queue_ctx_destruction(tls_ctx); +} +EXPORT_SYMBOL(tls_device_sk_destruct); + +static void tls_append_frag(struct tls_record_info *record, + struct page_frag *pfrag, + int size) +{ + skb_frag_t *frag; + + frag = &record->frags[record->num_frags - 1]; + if (frag->page.p == pfrag->page && + frag->page_offset + frag->size == pfrag->offset) { + frag->size += size; + } else { + ++frag; + frag->page.p = pfrag->page; + frag->page_offset = pfrag->offset; + frag->size = size; + ++record->num_frags; + get_page(pfrag->page); + } + + pfrag->offset += size; + record->len += size; +} + +static int tls_push_record(struct sock *sk, + struct tls_context *ctx, + struct tls_offload_context_tx *offload_ctx, + struct tls_record_info *record, + struct page_frag *pfrag, + int flags, + unsigned char record_type) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct page_frag dummy_tag_frag; + skb_frag_t *frag; + int i; + + /* fill prepend */ + frag = &record->frags[0]; + tls_fill_prepend(ctx, + skb_frag_address(frag), + record->len - ctx->tx.prepend_size, + record_type); + + /* HW doesn't care about the data in the tag, because it fills it. */ + dummy_tag_frag.page = skb_frag_page(frag); + dummy_tag_frag.offset = 0; + + tls_append_frag(record, &dummy_tag_frag, ctx->tx.tag_size); + record->end_seq = tp->write_seq + record->len; + spin_lock_irq(&offload_ctx->lock); + list_add_tail(&record->list, &offload_ctx->records_list); + spin_unlock_irq(&offload_ctx->lock); + offload_ctx->open_record = NULL; + set_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags); + tls_advance_record_sn(sk, &ctx->tx); + + for (i = 0; i < record->num_frags; i++) { + frag = &record->frags[i]; + sg_unmark_end(&offload_ctx->sg_tx_data[i]); + sg_set_page(&offload_ctx->sg_tx_data[i], skb_frag_page(frag), + frag->size, frag->page_offset); + sk_mem_charge(sk, frag->size); + get_page(skb_frag_page(frag)); + } + sg_mark_end(&offload_ctx->sg_tx_data[record->num_frags - 1]); + + /* all ready, send */ + return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags); +} + +static int tls_create_new_record(struct tls_offload_context_tx *offload_ctx, + struct page_frag *pfrag, + size_t prepend_size) +{ + struct tls_record_info *record; + skb_frag_t *frag; + + record = kmalloc(sizeof(*record), GFP_KERNEL); + if (!record) + return -ENOMEM; + + frag = &record->frags[0]; + __skb_frag_set_page(frag, pfrag->page); + frag->page_offset = pfrag->offset; + skb_frag_size_set(frag, prepend_size); + + get_page(pfrag->page); + pfrag->offset += prepend_size; + + record->num_frags = 1; + record->len = prepend_size; + offload_ctx->open_record = record; + return 0; +} + +static int tls_do_allocation(struct sock *sk, + struct tls_offload_context_tx *offload_ctx, + struct page_frag *pfrag, + size_t prepend_size) +{ + int ret; + + if (!offload_ctx->open_record) { + if (unlikely(!skb_page_frag_refill(prepend_size, pfrag, + sk->sk_allocation))) { + sk->sk_prot->enter_memory_pressure(sk); + sk_stream_moderate_sndbuf(sk); + return -ENOMEM; + } + + ret = tls_create_new_record(offload_ctx, pfrag, prepend_size); + if (ret) + return ret; + + if (pfrag->size > pfrag->offset) + return 0; + } + + if (!sk_page_frag_refill(sk, pfrag)) + return -ENOMEM; + + return 0; +} + +static int tls_push_data(struct sock *sk, + struct iov_iter *msg_iter, + size_t size, int flags, + unsigned char record_type) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx); + int tls_push_record_flags = flags | MSG_SENDPAGE_NOTLAST; + struct tls_record_info *record = ctx->open_record; + struct page_frag *pfrag; + size_t orig_size = size; + u32 max_open_record_len; + bool more = false; + bool done = false; + int copy, rc = 0; + long timeo; + + if (flags & + ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_SENDPAGE_NOTLAST)) + return -ENOTSUPP; + + if (sk->sk_err) + return -sk->sk_err; + + timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); + rc = tls_complete_pending_work(sk, tls_ctx, flags, &timeo); + if (rc < 0) + return rc; + + pfrag = sk_page_frag(sk); + + /* TLS_HEADER_SIZE is not counted as part of the TLS record, and + * we need to leave room for an authentication tag. + */ + max_open_record_len = TLS_MAX_PAYLOAD_SIZE + + tls_ctx->tx.prepend_size; + do { + rc = tls_do_allocation(sk, ctx, pfrag, + tls_ctx->tx.prepend_size); + if (rc) { + rc = sk_stream_wait_memory(sk, &timeo); + if (!rc) + continue; + + record = ctx->open_record; + if (!record) + break; +handle_error: + if (record_type != TLS_RECORD_TYPE_DATA) { + /* avoid sending partial + * record with type != + * application_data + */ + size = orig_size; + destroy_record(record); + ctx->open_record = NULL; + } else if (record->len > tls_ctx->tx.prepend_size) { + goto last_record; + } + + break; + } + + record = ctx->open_record; + copy = min_t(size_t, size, (pfrag->size - pfrag->offset)); + copy = min_t(size_t, copy, (max_open_record_len - record->len)); + + if (copy_from_iter_nocache(page_address(pfrag->page) + + pfrag->offset, + copy, msg_iter) != copy) { + rc = -EFAULT; + goto handle_error; + } + tls_append_frag(record, pfrag, copy); + + size -= copy; + if (!size) { +last_record: + tls_push_record_flags = flags; + if (flags & (MSG_SENDPAGE_NOTLAST | MSG_MORE)) { + more = true; + break; + } + + done = true; + } + + if (done || record->len >= max_open_record_len || + (record->num_frags >= MAX_SKB_FRAGS - 1)) { + rc = tls_push_record(sk, + tls_ctx, + ctx, + record, + pfrag, + tls_push_record_flags, + record_type); + if (rc < 0) + break; + } + } while (!done); + + tls_ctx->pending_open_record_frags = more; + + if (orig_size - size > 0) + rc = orig_size - size; + + return rc; +} + +int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) +{ + unsigned char record_type = TLS_RECORD_TYPE_DATA; + int rc; + + lock_sock(sk); + + if (unlikely(msg->msg_controllen)) { + rc = tls_proccess_cmsg(sk, msg, &record_type); + if (rc) + goto out; + } + + rc = tls_push_data(sk, &msg->msg_iter, size, + msg->msg_flags, record_type); + +out: + release_sock(sk); + return rc; +} + +int tls_device_sendpage(struct sock *sk, struct page *page, + int offset, size_t size, int flags) +{ + struct iov_iter msg_iter; + char *kaddr; + struct kvec iov; + int rc; + + if (flags & MSG_SENDPAGE_NOTLAST) + flags |= MSG_MORE; + + lock_sock(sk); + + if (flags & MSG_OOB) { + rc = -ENOTSUPP; + goto out; + } + + kaddr = kmap(page); + iov.iov_base = kaddr + offset; + iov.iov_len = size; + iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, &iov, 1, size); + rc = tls_push_data(sk, &msg_iter, size, + flags, TLS_RECORD_TYPE_DATA); + kunmap(page); + +out: + release_sock(sk); + return rc; +} + +struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context, + u32 seq, u64 *p_record_sn) +{ + u64 record_sn = context->hint_record_sn; + struct tls_record_info *info, *last; + + info = context->retransmit_hint; + if (!info || + before(seq, info->end_seq - info->len)) { + /* if retransmit_hint is irrelevant start + * from the beggining of the list + */ + info = list_first_entry(&context->records_list, + struct tls_record_info, list); + + /* send the start_marker record if seq number is before the + * tls offload start marker sequence number. This record is + * required to handle TCP packets which are before TLS offload + * started. + * And if it's not start marker, look if this seq number + * belongs to the list. + */ + if (likely(!tls_record_is_start_marker(info))) { + /* we have the first record, get the last record to see + * if this seq number belongs to the list. + */ + last = list_last_entry(&context->records_list, + struct tls_record_info, list); + + if (!between(seq, tls_record_start_seq(info), + last->end_seq)) + return NULL; + } + record_sn = context->unacked_record_sn; + } + + list_for_each_entry_from(info, &context->records_list, list) { + if (before(seq, info->end_seq)) { + if (!context->retransmit_hint || + after(info->end_seq, + context->retransmit_hint->end_seq)) { + context->hint_record_sn = record_sn; + context->retransmit_hint = info; + } + *p_record_sn = record_sn; + return info; + } + record_sn++; + } + + return NULL; +} +EXPORT_SYMBOL(tls_get_record); + +static int tls_device_push_pending_record(struct sock *sk, int flags) +{ + struct iov_iter msg_iter; + + iov_iter_kvec(&msg_iter, WRITE | ITER_KVEC, NULL, 0, 0); + return tls_push_data(sk, &msg_iter, 0, flags, TLS_RECORD_TYPE_DATA); +} + +static void tls_device_resync_rx(struct tls_context *tls_ctx, + struct sock *sk, u32 seq, u64 rcd_sn) +{ + struct net_device *netdev; + + if (WARN_ON(test_and_set_bit(TLS_RX_SYNC_RUNNING, &tls_ctx->flags))) + return; + netdev = READ_ONCE(tls_ctx->netdev); + if (netdev) + netdev->tlsdev_ops->tls_dev_resync_rx(netdev, sk, seq, rcd_sn); + clear_bit_unlock(TLS_RX_SYNC_RUNNING, &tls_ctx->flags); +} + +void handle_device_resync(struct sock *sk, u32 seq, u64 rcd_sn) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_offload_context_rx *rx_ctx; + u32 is_req_pending; + s64 resync_req; + u32 req_seq; + + if (tls_ctx->rx_conf != TLS_HW) + return; + + rx_ctx = tls_offload_ctx_rx(tls_ctx); + resync_req = atomic64_read(&rx_ctx->resync_req); + req_seq = ntohl(resync_req >> 32) - ((u32)TLS_HEADER_SIZE - 1); + is_req_pending = resync_req; + + if (unlikely(is_req_pending) && req_seq == seq && + atomic64_try_cmpxchg(&rx_ctx->resync_req, &resync_req, 0)) { + seq += TLS_HEADER_SIZE - 1; + tls_device_resync_rx(tls_ctx, sk, seq, rcd_sn); + } +} + +static int tls_device_reencrypt(struct sock *sk, struct sk_buff *skb) +{ + struct strp_msg *rxm = strp_msg(skb); + int err = 0, offset = rxm->offset, copy, nsg, data_len, pos; + struct sk_buff *skb_iter, *unused; + struct scatterlist sg[1]; + char *orig_buf, *buf; + + orig_buf = kmalloc(rxm->full_len + TLS_HEADER_SIZE + + TLS_CIPHER_AES_GCM_128_IV_SIZE, sk->sk_allocation); + if (!orig_buf) + return -ENOMEM; + buf = orig_buf; + + nsg = skb_cow_data(skb, 0, &unused); + if (unlikely(nsg < 0)) { + err = nsg; + goto free_buf; + } + + sg_init_table(sg, 1); + sg_set_buf(&sg[0], buf, + rxm->full_len + TLS_HEADER_SIZE + + TLS_CIPHER_AES_GCM_128_IV_SIZE); + skb_copy_bits(skb, offset, buf, + TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE); + + /* We are interested only in the decrypted data not the auth */ + err = decrypt_skb(sk, skb, sg); + if (err != -EBADMSG) + goto free_buf; + else + err = 0; + + data_len = rxm->full_len - TLS_CIPHER_AES_GCM_128_TAG_SIZE; + + if (skb_pagelen(skb) > offset) { + copy = min_t(int, skb_pagelen(skb) - offset, data_len); + + if (skb->decrypted) + skb_store_bits(skb, offset, buf, copy); + + offset += copy; + buf += copy; + } + + pos = skb_pagelen(skb); + skb_walk_frags(skb, skb_iter) { + int frag_pos; + + /* Practically all frags must belong to msg if reencrypt + * is needed with current strparser and coalescing logic, + * but strparser may "get optimized", so let's be safe. + */ + if (pos + skb_iter->len <= offset) + goto done_with_frag; + if (pos >= data_len + rxm->offset) + break; + + frag_pos = offset - pos; + copy = min_t(int, skb_iter->len - frag_pos, + data_len + rxm->offset - offset); + + if (skb_iter->decrypted) + skb_store_bits(skb_iter, frag_pos, buf, copy); + + offset += copy; + buf += copy; +done_with_frag: + pos += skb_iter->len; + } + +free_buf: + kfree(orig_buf); + return err; +} + +int tls_device_decrypted(struct sock *sk, struct sk_buff *skb) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_offload_context_rx *ctx = tls_offload_ctx_rx(tls_ctx); + int is_decrypted = skb->decrypted; + int is_encrypted = !is_decrypted; + struct sk_buff *skb_iter; + + /* Skip if it is already decrypted */ + if (ctx->sw.decrypted) + return 0; + + /* Check if all the data is decrypted already */ + skb_walk_frags(skb, skb_iter) { + is_decrypted &= skb_iter->decrypted; + is_encrypted &= !skb_iter->decrypted; + } + + ctx->sw.decrypted |= is_decrypted; + + /* Return immedeatly if the record is either entirely plaintext or + * entirely ciphertext. Otherwise handle reencrypt partially decrypted + * record. + */ + return (is_encrypted || is_decrypted) ? 0 : + tls_device_reencrypt(sk, skb); +} + +int tls_set_device_offload(struct sock *sk, struct tls_context *ctx) +{ + u16 nonce_size, tag_size, iv_size, rec_seq_size; + struct tls_record_info *start_marker_record; + struct tls_offload_context_tx *offload_ctx; + struct tls_crypto_info *crypto_info; + struct net_device *netdev; + char *iv, *rec_seq; + struct sk_buff *skb; + int rc = -EINVAL; + __be64 rcd_sn; + + if (!ctx) + goto out; + + if (ctx->priv_ctx_tx) { + rc = -EEXIST; + goto out; + } + + start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL); + if (!start_marker_record) { + rc = -ENOMEM; + goto out; + } + + offload_ctx = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE_TX, GFP_KERNEL); + if (!offload_ctx) { + rc = -ENOMEM; + goto free_marker_record; + } + + crypto_info = &ctx->crypto_send.info; + switch (crypto_info->cipher_type) { + case TLS_CIPHER_AES_GCM_128: + nonce_size = TLS_CIPHER_AES_GCM_128_IV_SIZE; + tag_size = TLS_CIPHER_AES_GCM_128_TAG_SIZE; + iv_size = TLS_CIPHER_AES_GCM_128_IV_SIZE; + iv = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->iv; + rec_seq_size = TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE; + rec_seq = + ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->rec_seq; + break; + default: + rc = -EINVAL; + goto free_offload_ctx; + } + + ctx->tx.prepend_size = TLS_HEADER_SIZE + nonce_size; + ctx->tx.tag_size = tag_size; + ctx->tx.overhead_size = ctx->tx.prepend_size + ctx->tx.tag_size; + ctx->tx.iv_size = iv_size; + ctx->tx.iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE, + GFP_KERNEL); + if (!ctx->tx.iv) { + rc = -ENOMEM; + goto free_offload_ctx; + } + + memcpy(ctx->tx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size); + + ctx->tx.rec_seq_size = rec_seq_size; + ctx->tx.rec_seq = kmemdup(rec_seq, rec_seq_size, GFP_KERNEL); + if (!ctx->tx.rec_seq) { + rc = -ENOMEM; + goto free_iv; + } + + rc = tls_sw_fallback_init(sk, offload_ctx, crypto_info); + if (rc) + goto free_rec_seq; + + /* start at rec_seq - 1 to account for the start marker record */ + memcpy(&rcd_sn, ctx->tx.rec_seq, sizeof(rcd_sn)); + offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1; + + start_marker_record->end_seq = tcp_sk(sk)->write_seq; + start_marker_record->len = 0; + start_marker_record->num_frags = 0; + + INIT_LIST_HEAD(&offload_ctx->records_list); + list_add_tail(&start_marker_record->list, &offload_ctx->records_list); + spin_lock_init(&offload_ctx->lock); + sg_init_table(offload_ctx->sg_tx_data, + ARRAY_SIZE(offload_ctx->sg_tx_data)); + + clean_acked_data_enable(inet_csk(sk), &tls_icsk_clean_acked); + ctx->push_pending_record = tls_device_push_pending_record; + + /* TLS offload is greatly simplified if we don't send + * SKBs where only part of the payload needs to be encrypted. + * So mark the last skb in the write queue as end of record. + */ + skb = tcp_write_queue_tail(sk); + if (skb) + TCP_SKB_CB(skb)->eor = 1; + + /* We support starting offload on multiple sockets + * concurrently, so we only need a read lock here. + * This lock must precede get_netdev_for_sock to prevent races between + * NETDEV_DOWN and setsockopt. + */ + down_read(&device_offload_lock); + netdev = get_netdev_for_sock(sk); + if (!netdev) { + pr_err_ratelimited("%s: netdev not found\n", __func__); + rc = -EINVAL; + goto release_lock; + } + + if (!(netdev->features & NETIF_F_HW_TLS_TX)) { + rc = -ENOTSUPP; + goto release_netdev; + } + + /* Avoid offloading if the device is down + * We don't want to offload new flows after + * the NETDEV_DOWN event + */ + if (!(netdev->flags & IFF_UP)) { + rc = -EINVAL; + goto release_netdev; + } + + ctx->priv_ctx_tx = offload_ctx; + rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX, + &ctx->crypto_send.info, + tcp_sk(sk)->write_seq); + if (rc) + goto release_netdev; + + tls_device_attach(ctx, sk, netdev); + + /* following this assignment tls_is_sk_tx_device_offloaded + * will return true and the context might be accessed + * by the netdev's xmit function. + */ + smp_store_release(&sk->sk_validate_xmit_skb, tls_validate_xmit_skb); + dev_put(netdev); + up_read(&device_offload_lock); + goto out; + +release_netdev: + dev_put(netdev); +release_lock: + up_read(&device_offload_lock); + clean_acked_data_disable(inet_csk(sk)); + crypto_free_aead(offload_ctx->aead_send); +free_rec_seq: + kfree(ctx->tx.rec_seq); +free_iv: + kfree(ctx->tx.iv); +free_offload_ctx: + kfree(offload_ctx); + ctx->priv_ctx_tx = NULL; +free_marker_record: + kfree(start_marker_record); +out: + return rc; +} + +int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx) +{ + struct tls_offload_context_rx *context; + struct net_device *netdev; + int rc = 0; + + /* We support starting offload on multiple sockets + * concurrently, so we only need a read lock here. + * This lock must precede get_netdev_for_sock to prevent races between + * NETDEV_DOWN and setsockopt. + */ + down_read(&device_offload_lock); + netdev = get_netdev_for_sock(sk); + if (!netdev) { + pr_err_ratelimited("%s: netdev not found\n", __func__); + rc = -EINVAL; + goto release_lock; + } + + if (!(netdev->features & NETIF_F_HW_TLS_RX)) { + pr_err_ratelimited("%s: netdev %s with no TLS offload\n", + __func__, netdev->name); + rc = -ENOTSUPP; + goto release_netdev; + } + + /* Avoid offloading if the device is down + * We don't want to offload new flows after + * the NETDEV_DOWN event + */ + if (!(netdev->flags & IFF_UP)) { + rc = -EINVAL; + goto release_netdev; + } + + context = kzalloc(TLS_OFFLOAD_CONTEXT_SIZE_RX, GFP_KERNEL); + if (!context) { + rc = -ENOMEM; + goto release_netdev; + } + + ctx->priv_ctx_rx = context; + rc = tls_set_sw_offload(sk, ctx, 0); + if (rc) + goto release_ctx; + + rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX, + &ctx->crypto_recv.info, + tcp_sk(sk)->copied_seq); + if (rc) { + pr_err_ratelimited("%s: The netdev has refused to offload this socket\n", + __func__); + goto free_sw_resources; + } + + tls_device_attach(ctx, sk, netdev); + goto release_netdev; + +free_sw_resources: + up_read(&device_offload_lock); + tls_sw_free_resources_rx(sk); + down_read(&device_offload_lock); +release_ctx: + ctx->priv_ctx_rx = NULL; +release_netdev: + dev_put(netdev); +release_lock: + up_read(&device_offload_lock); + return rc; +} + +void tls_device_offload_cleanup_rx(struct sock *sk) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct net_device *netdev; + + down_read(&device_offload_lock); + netdev = tls_ctx->netdev; + if (!netdev) + goto out; + + netdev->tlsdev_ops->tls_dev_del(netdev, tls_ctx, + TLS_OFFLOAD_CTX_DIR_RX); + + if (tls_ctx->tx_conf != TLS_HW) { + dev_put(netdev); + tls_ctx->netdev = NULL; + } else { + set_bit(TLS_RX_DEV_CLOSED, &tls_ctx->flags); + } +out: + up_read(&device_offload_lock); + tls_sw_release_resources_rx(sk); +} + +static int tls_device_down(struct net_device *netdev) +{ + struct tls_context *ctx, *tmp; + unsigned long flags; + LIST_HEAD(list); + + /* Request a write lock to block new offload attempts */ + down_write(&device_offload_lock); + + spin_lock_irqsave(&tls_device_lock, flags); + list_for_each_entry_safe(ctx, tmp, &tls_device_list, list) { + if (ctx->netdev != netdev || + !refcount_inc_not_zero(&ctx->refcount)) + continue; + + list_move(&ctx->list, &list); + } + spin_unlock_irqrestore(&tls_device_lock, flags); + + list_for_each_entry_safe(ctx, tmp, &list, list) { + if (ctx->tx_conf == TLS_HW) + netdev->tlsdev_ops->tls_dev_del(netdev, ctx, + TLS_OFFLOAD_CTX_DIR_TX); + if (ctx->rx_conf == TLS_HW && + !test_bit(TLS_RX_DEV_CLOSED, &ctx->flags)) + netdev->tlsdev_ops->tls_dev_del(netdev, ctx, + TLS_OFFLOAD_CTX_DIR_RX); + WRITE_ONCE(ctx->netdev, NULL); + smp_mb__before_atomic(); /* pairs with test_and_set_bit() */ + while (test_bit(TLS_RX_SYNC_RUNNING, &ctx->flags)) + usleep_range(10, 200); + dev_put(netdev); + list_del_init(&ctx->list); + + if (refcount_dec_and_test(&ctx->refcount)) + tls_device_free_ctx(ctx); + } + + up_write(&device_offload_lock); + + flush_work(&tls_device_gc_work); + + return NOTIFY_DONE; +} + +static int tls_dev_event(struct notifier_block *this, unsigned long event, + void *ptr) +{ + struct net_device *dev = netdev_notifier_info_to_dev(ptr); + + if (!dev->tlsdev_ops && + !(dev->features & (NETIF_F_HW_TLS_RX | NETIF_F_HW_TLS_TX))) + return NOTIFY_DONE; + + switch (event) { + case NETDEV_REGISTER: + case NETDEV_FEAT_CHANGE: + if ((dev->features & NETIF_F_HW_TLS_RX) && + !dev->tlsdev_ops->tls_dev_resync_rx) + return NOTIFY_BAD; + + if (dev->tlsdev_ops && + dev->tlsdev_ops->tls_dev_add && + dev->tlsdev_ops->tls_dev_del) + return NOTIFY_DONE; + else + return NOTIFY_BAD; + case NETDEV_DOWN: + return tls_device_down(dev); + } + return NOTIFY_DONE; +} + +static struct notifier_block tls_dev_notifier = { + .notifier_call = tls_dev_event, +}; + +void __init tls_device_init(void) +{ + register_netdevice_notifier(&tls_dev_notifier); +} + +void __exit tls_device_cleanup(void) +{ + unregister_netdevice_notifier(&tls_dev_notifier); + flush_work(&tls_device_gc_work); +} diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c new file mode 100644 index 000000000000..6cf832891b53 --- /dev/null +++ b/net/tls/tls_device_fallback.c @@ -0,0 +1,463 @@ +/* Copyright (c) 2018, Mellanox Technologies All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include + +static void chain_to_walk(struct scatterlist *sg, struct scatter_walk *walk) +{ + struct scatterlist *src = walk->sg; + int diff = walk->offset - src->offset; + + sg_set_page(sg, sg_page(src), + src->length - diff, walk->offset); + + scatterwalk_crypto_chain(sg, sg_next(src), 2); +} + +static int tls_enc_record(struct aead_request *aead_req, + struct crypto_aead *aead, char *aad, + char *iv, __be64 rcd_sn, + struct scatter_walk *in, + struct scatter_walk *out, int *in_len) +{ + unsigned char buf[TLS_HEADER_SIZE + TLS_CIPHER_AES_GCM_128_IV_SIZE]; + struct scatterlist sg_in[3]; + struct scatterlist sg_out[3]; + u16 len; + int rc; + + len = min_t(int, *in_len, ARRAY_SIZE(buf)); + + scatterwalk_copychunks(buf, in, len, 0); + scatterwalk_copychunks(buf, out, len, 1); + + *in_len -= len; + if (!*in_len) + return 0; + + scatterwalk_pagedone(in, 0, 1); + scatterwalk_pagedone(out, 1, 1); + + len = buf[4] | (buf[3] << 8); + len -= TLS_CIPHER_AES_GCM_128_IV_SIZE; + + tls_make_aad(aad, len - TLS_CIPHER_AES_GCM_128_TAG_SIZE, + (char *)&rcd_sn, sizeof(rcd_sn), buf[0]); + + memcpy(iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, buf + TLS_HEADER_SIZE, + TLS_CIPHER_AES_GCM_128_IV_SIZE); + + sg_init_table(sg_in, ARRAY_SIZE(sg_in)); + sg_init_table(sg_out, ARRAY_SIZE(sg_out)); + sg_set_buf(sg_in, aad, TLS_AAD_SPACE_SIZE); + sg_set_buf(sg_out, aad, TLS_AAD_SPACE_SIZE); + chain_to_walk(sg_in + 1, in); + chain_to_walk(sg_out + 1, out); + + *in_len -= len; + if (*in_len < 0) { + *in_len += TLS_CIPHER_AES_GCM_128_TAG_SIZE; + /* the input buffer doesn't contain the entire record. + * trim len accordingly. The resulting authentication tag + * will contain garbage, but we don't care, so we won't + * include any of it in the output skb + * Note that we assume the output buffer length + * is larger then input buffer length + tag size + */ + if (*in_len < 0) + len += *in_len; + + *in_len = 0; + } + + if (*in_len) { + scatterwalk_copychunks(NULL, in, len, 2); + scatterwalk_pagedone(in, 0, 1); + scatterwalk_copychunks(NULL, out, len, 2); + scatterwalk_pagedone(out, 1, 1); + } + + len -= TLS_CIPHER_AES_GCM_128_TAG_SIZE; + aead_request_set_crypt(aead_req, sg_in, sg_out, len, iv); + + rc = crypto_aead_encrypt(aead_req); + + return rc; +} + +static void tls_init_aead_request(struct aead_request *aead_req, + struct crypto_aead *aead) +{ + aead_request_set_tfm(aead_req, aead); + aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE); +} + +static struct aead_request *tls_alloc_aead_request(struct crypto_aead *aead, + gfp_t flags) +{ + unsigned int req_size = sizeof(struct aead_request) + + crypto_aead_reqsize(aead); + struct aead_request *aead_req; + + aead_req = kzalloc(req_size, flags); + if (aead_req) + tls_init_aead_request(aead_req, aead); + return aead_req; +} + +static int tls_enc_records(struct aead_request *aead_req, + struct crypto_aead *aead, struct scatterlist *sg_in, + struct scatterlist *sg_out, char *aad, char *iv, + u64 rcd_sn, int len) +{ + struct scatter_walk out, in; + int rc; + + scatterwalk_start(&in, sg_in); + scatterwalk_start(&out, sg_out); + + do { + rc = tls_enc_record(aead_req, aead, aad, iv, + cpu_to_be64(rcd_sn), &in, &out, &len); + rcd_sn++; + + } while (rc == 0 && len); + + scatterwalk_done(&in, 0, 0); + scatterwalk_done(&out, 1, 0); + + return rc; +} + +/* Can't use icsk->icsk_af_ops->send_check here because the ip addresses + * might have been changed by NAT. + */ +static void update_chksum(struct sk_buff *skb, int headln) +{ + struct tcphdr *th = tcp_hdr(skb); + int datalen = skb->len - headln; + const struct ipv6hdr *ipv6h; + const struct iphdr *iph; + + /* We only changed the payload so if we are using partial we don't + * need to update anything. + */ + if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) + return; + + skb->ip_summed = CHECKSUM_PARTIAL; + skb->csum_start = skb_transport_header(skb) - skb->head; + skb->csum_offset = offsetof(struct tcphdr, check); + + if (skb->sk->sk_family == AF_INET6) { + ipv6h = ipv6_hdr(skb); + th->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr, + datalen, IPPROTO_TCP, 0); + } else { + iph = ip_hdr(skb); + th->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, datalen, + IPPROTO_TCP, 0); + } +} + +static void complete_skb(struct sk_buff *nskb, struct sk_buff *skb, int headln) +{ + struct sock *sk = skb->sk; + int delta; + + skb_copy_header(nskb, skb); + + skb_put(nskb, skb->len); + memcpy(nskb->data, skb->data, headln); + + nskb->destructor = skb->destructor; + nskb->sk = sk; + skb->destructor = NULL; + skb->sk = NULL; + + update_chksum(nskb, headln); + + /* sock_efree means skb must gone through skb_orphan_partial() */ + if (nskb->destructor == sock_efree) + return; + + delta = nskb->truesize - skb->truesize; + if (likely(delta < 0)) + WARN_ON_ONCE(refcount_sub_and_test(-delta, &sk->sk_wmem_alloc)); + else if (delta) + refcount_add(delta, &sk->sk_wmem_alloc); +} + +/* This function may be called after the user socket is already + * closed so make sure we don't use anything freed during + * tls_sk_proto_close here + */ + +static int fill_sg_in(struct scatterlist *sg_in, + struct sk_buff *skb, + struct tls_offload_context_tx *ctx, + u64 *rcd_sn, + s32 *sync_size, + int *resync_sgs) +{ + int tcp_payload_offset = skb_transport_offset(skb) + tcp_hdrlen(skb); + int payload_len = skb->len - tcp_payload_offset; + u32 tcp_seq = ntohl(tcp_hdr(skb)->seq); + struct tls_record_info *record; + unsigned long flags; + int remaining; + int i; + + spin_lock_irqsave(&ctx->lock, flags); + record = tls_get_record(ctx, tcp_seq, rcd_sn); + if (!record) { + spin_unlock_irqrestore(&ctx->lock, flags); + WARN(1, "Record not found for seq %u\n", tcp_seq); + return -EINVAL; + } + + *sync_size = tcp_seq - tls_record_start_seq(record); + if (*sync_size < 0) { + int is_start_marker = tls_record_is_start_marker(record); + + spin_unlock_irqrestore(&ctx->lock, flags); + /* This should only occur if the relevant record was + * already acked. In that case it should be ok + * to drop the packet and avoid retransmission. + * + * There is a corner case where the packet contains + * both an acked and a non-acked record. + * We currently don't handle that case and rely + * on TCP to retranmit a packet that doesn't contain + * already acked payload. + */ + if (!is_start_marker) + *sync_size = 0; + return -EINVAL; + } + + remaining = *sync_size; + for (i = 0; remaining > 0; i++) { + skb_frag_t *frag = &record->frags[i]; + + __skb_frag_ref(frag); + sg_set_page(sg_in + i, skb_frag_page(frag), + skb_frag_size(frag), frag->page_offset); + + remaining -= skb_frag_size(frag); + + if (remaining < 0) + sg_in[i].length += remaining; + } + *resync_sgs = i; + + spin_unlock_irqrestore(&ctx->lock, flags); + if (skb_to_sgvec(skb, &sg_in[i], tcp_payload_offset, payload_len) < 0) + return -EINVAL; + + return 0; +} + +static void fill_sg_out(struct scatterlist sg_out[3], void *buf, + struct tls_context *tls_ctx, + struct sk_buff *nskb, + int tcp_payload_offset, + int payload_len, + int sync_size, + void *dummy_buf) +{ + sg_set_buf(&sg_out[0], dummy_buf, sync_size); + sg_set_buf(&sg_out[1], nskb->data + tcp_payload_offset, payload_len); + /* Add room for authentication tag produced by crypto */ + dummy_buf += sync_size; + sg_set_buf(&sg_out[2], dummy_buf, TLS_CIPHER_AES_GCM_128_TAG_SIZE); +} + +static struct sk_buff *tls_enc_skb(struct tls_context *tls_ctx, + struct scatterlist sg_out[3], + struct scatterlist *sg_in, + struct sk_buff *skb, + s32 sync_size, u64 rcd_sn) +{ + int tcp_payload_offset = skb_transport_offset(skb) + tcp_hdrlen(skb); + struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx); + int payload_len = skb->len - tcp_payload_offset; + void *buf, *iv, *aad, *dummy_buf; + struct aead_request *aead_req; + struct sk_buff *nskb = NULL; + int buf_len; + + aead_req = tls_alloc_aead_request(ctx->aead_send, GFP_ATOMIC); + if (!aead_req) + return NULL; + + buf_len = TLS_CIPHER_AES_GCM_128_SALT_SIZE + + TLS_CIPHER_AES_GCM_128_IV_SIZE + + TLS_AAD_SPACE_SIZE + + sync_size + + TLS_CIPHER_AES_GCM_128_TAG_SIZE; + buf = kmalloc(buf_len, GFP_ATOMIC); + if (!buf) + goto free_req; + + iv = buf; + memcpy(iv, tls_ctx->crypto_send.aes_gcm_128.salt, + TLS_CIPHER_AES_GCM_128_SALT_SIZE); + aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE + + TLS_CIPHER_AES_GCM_128_IV_SIZE; + dummy_buf = aad + TLS_AAD_SPACE_SIZE; + + nskb = alloc_skb(skb_headroom(skb) + skb->len, GFP_ATOMIC); + if (!nskb) + goto free_buf; + + skb_reserve(nskb, skb_headroom(skb)); + + fill_sg_out(sg_out, buf, tls_ctx, nskb, tcp_payload_offset, + payload_len, sync_size, dummy_buf); + + if (tls_enc_records(aead_req, ctx->aead_send, sg_in, sg_out, aad, iv, + rcd_sn, sync_size + payload_len) < 0) + goto free_nskb; + + complete_skb(nskb, skb, tcp_payload_offset); + + /* validate_xmit_skb_list assumes that if the skb wasn't segmented + * nskb->prev will point to the skb itself + */ + nskb->prev = nskb; + +free_buf: + kfree(buf); +free_req: + kfree(aead_req); + return nskb; +free_nskb: + kfree_skb(nskb); + nskb = NULL; + goto free_buf; +} + +static struct sk_buff *tls_sw_fallback(struct sock *sk, struct sk_buff *skb) +{ + int tcp_payload_offset = skb_transport_offset(skb) + tcp_hdrlen(skb); + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx); + int payload_len = skb->len - tcp_payload_offset; + struct scatterlist *sg_in, sg_out[3]; + struct sk_buff *nskb = NULL; + int sg_in_max_elements; + int resync_sgs = 0; + s32 sync_size = 0; + u64 rcd_sn; + + /* worst case is: + * MAX_SKB_FRAGS in tls_record_info + * MAX_SKB_FRAGS + 1 in SKB head and frags. + */ + sg_in_max_elements = 2 * MAX_SKB_FRAGS + 1; + + if (!payload_len) + return skb; + + sg_in = kmalloc_array(sg_in_max_elements, sizeof(*sg_in), GFP_ATOMIC); + if (!sg_in) + goto free_orig; + + sg_init_table(sg_in, sg_in_max_elements); + sg_init_table(sg_out, ARRAY_SIZE(sg_out)); + + if (fill_sg_in(sg_in, skb, ctx, &rcd_sn, &sync_size, &resync_sgs)) { + /* bypass packets before kernel TLS socket option was set */ + if (sync_size < 0 && payload_len <= -sync_size) + nskb = skb_get(skb); + goto put_sg; + } + + nskb = tls_enc_skb(tls_ctx, sg_out, sg_in, skb, sync_size, rcd_sn); + +put_sg: + while (resync_sgs) + put_page(sg_page(&sg_in[--resync_sgs])); + kfree(sg_in); +free_orig: + kfree_skb(skb); + return nskb; +} + +struct sk_buff *tls_validate_xmit_skb(struct sock *sk, + struct net_device *dev, + struct sk_buff *skb) +{ + if (dev == tls_get_ctx(sk)->netdev) + return skb; + + return tls_sw_fallback(sk, skb); +} +EXPORT_SYMBOL_GPL(tls_validate_xmit_skb); + +int tls_sw_fallback_init(struct sock *sk, + struct tls_offload_context_tx *offload_ctx, + struct tls_crypto_info *crypto_info) +{ + const u8 *key; + int rc; + + offload_ctx->aead_send = + crypto_alloc_aead("gcm(aes)", 0, CRYPTO_ALG_ASYNC); + if (IS_ERR(offload_ctx->aead_send)) { + rc = PTR_ERR(offload_ctx->aead_send); + pr_err_ratelimited("crypto_alloc_aead failed rc=%d\n", rc); + offload_ctx->aead_send = NULL; + goto err_out; + } + + key = ((struct tls12_crypto_info_aes_gcm_128 *)crypto_info)->key; + + rc = crypto_aead_setkey(offload_ctx->aead_send, key, + TLS_CIPHER_AES_GCM_128_KEY_SIZE); + if (rc) + goto free_aead; + + rc = crypto_aead_setauthsize(offload_ctx->aead_send, + TLS_CIPHER_AES_GCM_128_TAG_SIZE); + if (rc) + goto free_aead; + + return 0; +free_aead: + crypto_free_aead(offload_ctx->aead_send); +err_out: + return rc; +} diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c index 3205335bc7b9..0ebe0be35322 100644 --- a/net/tls/tls_main.c +++ b/net/tls/tls_main.c @@ -38,6 +38,7 @@ #include #include #include +#include #include @@ -52,21 +53,18 @@ enum { TLS_NUM_PROTS, }; -enum { - TLS_BASE_TX, - TLS_SW_TX, - TLS_NUM_CONFIG, -}; - static struct proto *saved_tcpv6_prot; static DEFINE_MUTEX(tcpv6_prot_mutex); -static struct proto tls_prots[TLS_NUM_PROTS][TLS_NUM_CONFIG]; +static LIST_HEAD(device_list); +static DEFINE_MUTEX(device_mutex); +static struct proto tls_prots[TLS_NUM_PROTS][TLS_NUM_CONFIG][TLS_NUM_CONFIG]; +static struct proto_ops tls_sw_proto_ops; -static inline void update_sk_prot(struct sock *sk, struct tls_context *ctx) +static void update_sk_prot(struct sock *sk, struct tls_context *ctx) { int ip_ver = sk->sk_family == AF_INET6 ? TLSV6 : TLSV4; - sk->sk_prot = &tls_prots[ip_ver][ctx->tx_conf]; + sk->sk_prot = &tls_prots[ip_ver][ctx->tx_conf][ctx->rx_conf]; } int wait_on_pending_writer(struct sock *sk, long *timeo) @@ -143,7 +141,6 @@ retry: size = sg->length; } - clear_bit(TLS_PENDING_CLOSED_RECORD, &ctx->flags); ctx->in_tcp_sendpages = false; ctx->sk_write_space(sk); @@ -195,15 +192,12 @@ int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg, return rc; } -int tls_push_pending_closed_record(struct sock *sk, struct tls_context *ctx, - int flags, long *timeo) +int tls_push_partial_record(struct sock *sk, struct tls_context *ctx, + int flags) { struct scatterlist *sg; u16 offset; - if (!tls_is_partially_sent_record(ctx)) - return ctx->push_pending_record(sk, flags); - sg = ctx->partially_sent_record; offset = ctx->partially_sent_offset; @@ -211,9 +205,23 @@ int tls_push_pending_closed_record(struct sock *sk, struct tls_context *ctx, return tls_push_sg(sk, ctx, sg, offset, flags); } +int tls_push_pending_closed_record(struct sock *sk, + struct tls_context *tls_ctx, + int flags, long *timeo) +{ + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + + if (tls_is_partially_sent_record(tls_ctx) || + !list_empty(&ctx->tx_list)) + return tls_tx_records(sk, flags); + else + return tls_ctx->push_pending_record(sk, flags); +} + static void tls_write_space(struct sock *sk) { struct tls_context *ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *tx_ctx = tls_sw_ctx_tx(ctx); /* If in_tcp_sendpages call lower protocol write space handler * to ensure we wake up any waiting operations there. For example @@ -224,31 +232,23 @@ static void tls_write_space(struct sock *sk) return; } - if (!sk->sk_write_pending && tls_is_pending_closed_record(ctx)) { - gfp_t sk_allocation = sk->sk_allocation; - int rc; - long timeo = 0; - - sk->sk_allocation = GFP_ATOMIC; - rc = tls_push_pending_closed_record(sk, ctx, - MSG_DONTWAIT | - MSG_NOSIGNAL, - &timeo); - sk->sk_allocation = sk_allocation; - - if (rc < 0) - return; + /* Schedule the transmission if tx list is ready */ + if (is_tx_ready(tx_ctx) && !sk->sk_write_pending) { + /* Schedule the transmission */ + if (!test_and_set_bit(BIT_TX_SCHEDULED, &tx_ctx->tx_bitmask)) + schedule_delayed_work(&tx_ctx->tx_work.work, 0); } ctx->sk_write_space(sk); } -static void tls_ctx_free(struct tls_context *ctx) +void tls_ctx_free(struct tls_context *ctx) { if (!ctx) return; memzero_explicit(&ctx->crypto_send, sizeof(ctx->crypto_send)); + memzero_explicit(&ctx->crypto_recv, sizeof(ctx->crypto_recv)); kfree(ctx); } @@ -257,42 +257,52 @@ static void tls_sk_proto_close(struct sock *sk, long timeout) struct tls_context *ctx = tls_get_ctx(sk); long timeo = sock_sndtimeo(sk, 0); void (*sk_proto_close)(struct sock *sk, long timeout); + bool free_ctx = false; lock_sock(sk); sk_proto_close = ctx->sk_proto_close; - if (ctx->tx_conf == TLS_BASE_TX) { - tls_ctx_free(ctx); + if ((ctx->tx_conf == TLS_HW_RECORD && ctx->rx_conf == TLS_HW_RECORD) || + (ctx->tx_conf == TLS_BASE && ctx->rx_conf == TLS_BASE)) { + free_ctx = true; goto skip_tx_cleanup; } if (!tls_complete_pending_work(sk, ctx, 0, &timeo)) tls_handle_open_record(sk, 0); - if (ctx->partially_sent_record) { - struct scatterlist *sg = ctx->partially_sent_record; - - while (1) { - put_page(sg_page(sg)); - sk_mem_uncharge(sk, sg->length); - - if (sg_is_last(sg)) - break; - sg++; - } + /* We need these for tls_sw_fallback handling of other packets */ + if (ctx->tx_conf == TLS_SW) { + kfree(ctx->tx.rec_seq); + kfree(ctx->tx.iv); + tls_sw_free_resources_tx(sk); } - kfree(ctx->rec_seq); - kfree(ctx->iv); + if (ctx->rx_conf == TLS_SW) + tls_sw_free_resources_rx(sk); - if (ctx->tx_conf == TLS_SW_TX) { - tls_sw_free_tx_resources(sk); +#ifdef CONFIG_TLS_DEVICE + if (ctx->rx_conf == TLS_HW) + tls_device_offload_cleanup_rx(sk); + + if (ctx->tx_conf != TLS_HW && ctx->rx_conf != TLS_HW) { +#else + { +#endif + if (sk->sk_write_space == tls_write_space) + sk->sk_write_space = ctx->sk_write_space; tls_ctx_free(ctx); + ctx = NULL; } skip_tx_cleanup: release_sock(sk); sk_proto_close(sk, timeout); + /* free ctx for TLS_HW_RECORD, used by tcp_set_state + * for sk->sk_prot->unhash [tls_hw_unhash] + */ + if (free_ctx) + tls_ctx_free(ctx); } static int do_tls_getsockopt_tx(struct sock *sk, char __user *optval, @@ -344,8 +354,10 @@ static int do_tls_getsockopt_tx(struct sock *sk, char __user *optval, } lock_sock(sk); memcpy(crypto_info_aes_gcm_128->iv, - ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, + ctx->tx.iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, TLS_CIPHER_AES_GCM_128_IV_SIZE); + memcpy(crypto_info_aes_gcm_128->rec_seq, ctx->tx.rec_seq, + TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE); release_sock(sk); if (copy_to_user(optval, crypto_info_aes_gcm_128, @@ -388,20 +400,24 @@ static int tls_getsockopt(struct sock *sk, int level, int optname, return do_tls_getsockopt(sk, optname, optval, optlen); } -static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval, - unsigned int optlen) +static int do_tls_setsockopt_conf(struct sock *sk, char __user *optval, + unsigned int optlen, int tx) { struct tls_crypto_info *crypto_info; struct tls_context *ctx = tls_get_ctx(sk); int rc = 0; - int tx_conf; + int conf; if (!optval || (optlen < sizeof(*crypto_info))) { rc = -EINVAL; goto out; } - crypto_info = &ctx->crypto_send.info; + if (tx) + crypto_info = &ctx->crypto_send.info; + else + crypto_info = &ctx->crypto_recv.info; + /* Currently we don't support set crypto info more than one time */ if (TLS_CRYPTO_INFO_READY(crypto_info)) { rc = -EBUSY; @@ -411,7 +427,7 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval, rc = copy_from_user(crypto_info, optval, sizeof(*crypto_info)); if (rc) { rc = -EFAULT; - goto out; + goto err_crypto_info; } /* check version */ @@ -439,16 +455,44 @@ static int do_tls_setsockopt_tx(struct sock *sk, char __user *optval, goto err_crypto_info; } - /* currently SW is default, we will have ethtool in future */ - rc = tls_set_sw_offload(sk, ctx); - tx_conf = TLS_SW_TX; + if (tx) { +#ifdef CONFIG_TLS_DEVICE + rc = tls_set_device_offload(sk, ctx); + conf = TLS_HW; + if (rc) { +#else + { +#endif + rc = tls_set_sw_offload(sk, ctx, 1); + conf = TLS_SW; + } + } else { +#ifdef CONFIG_TLS_DEVICE + rc = tls_set_device_offload_rx(sk, ctx); + conf = TLS_HW; + if (rc) { +#else + { +#endif + rc = tls_set_sw_offload(sk, ctx, 0); + conf = TLS_SW; + } + } + if (rc) goto err_crypto_info; - ctx->tx_conf = tx_conf; + if (tx) + ctx->tx_conf = conf; + else + ctx->rx_conf = conf; update_sk_prot(sk, ctx); - ctx->sk_write_space = sk->sk_write_space; - sk->sk_write_space = tls_write_space; + if (tx) { + ctx->sk_write_space = sk->sk_write_space; + sk->sk_write_space = tls_write_space; + } else { + sk->sk_socket->ops = &tls_sw_proto_ops; + } goto out; err_crypto_info: @@ -464,8 +508,10 @@ static int do_tls_setsockopt(struct sock *sk, int optname, switch (optname) { case TLS_TX: + case TLS_RX: lock_sock(sk); - rc = do_tls_setsockopt_tx(sk, optval, optlen); + rc = do_tls_setsockopt_conf(sk, optval, optlen, + optname == TLS_TX); release_sock(sk); break; default: @@ -486,25 +532,136 @@ static int tls_setsockopt(struct sock *sk, int level, int optname, return do_tls_setsockopt(sk, optname, optval, optlen); } -static void build_protos(struct proto *prot, struct proto *base) +static struct tls_context *create_ctx(struct sock *sk) { - prot[TLS_BASE_TX] = *base; - prot[TLS_BASE_TX].setsockopt = tls_setsockopt; - prot[TLS_BASE_TX].getsockopt = tls_getsockopt; - prot[TLS_BASE_TX].close = tls_sk_proto_close; + struct inet_connection_sock *icsk = inet_csk(sk); + struct tls_context *ctx; - prot[TLS_SW_TX] = prot[TLS_BASE_TX]; - prot[TLS_SW_TX].sendmsg = tls_sw_sendmsg; - prot[TLS_SW_TX].sendpage = tls_sw_sendpage; + ctx = kzalloc(sizeof(*ctx), GFP_ATOMIC); + if (!ctx) + return NULL; + + icsk->icsk_ulp_data = ctx; + ctx->setsockopt = sk->sk_prot->setsockopt; + ctx->getsockopt = sk->sk_prot->getsockopt; + ctx->sk_proto_close = sk->sk_prot->close; + return ctx; +} + +static int tls_hw_prot(struct sock *sk) +{ + struct tls_context *ctx; + struct tls_device *dev; + int rc = 0; + + mutex_lock(&device_mutex); + list_for_each_entry(dev, &device_list, dev_list) { + if (dev->feature && dev->feature(dev)) { + ctx = create_ctx(sk); + if (!ctx) + goto out; + + ctx->hash = sk->sk_prot->hash; + ctx->unhash = sk->sk_prot->unhash; + ctx->sk_proto_close = sk->sk_prot->close; + ctx->rx_conf = TLS_HW_RECORD; + ctx->tx_conf = TLS_HW_RECORD; + update_sk_prot(sk, ctx); + rc = 1; + break; + } + } +out: + mutex_unlock(&device_mutex); + return rc; +} + +static void tls_hw_unhash(struct sock *sk) +{ + struct tls_context *ctx = tls_get_ctx(sk); + struct tls_device *dev; + + mutex_lock(&device_mutex); + list_for_each_entry(dev, &device_list, dev_list) { + if (dev->unhash) + dev->unhash(dev, sk); + } + mutex_unlock(&device_mutex); + ctx->unhash(sk); +} + +static int tls_hw_hash(struct sock *sk) +{ + struct tls_context *ctx = tls_get_ctx(sk); + struct tls_device *dev; + int err; + + err = ctx->hash(sk); + mutex_lock(&device_mutex); + list_for_each_entry(dev, &device_list, dev_list) { + if (dev->hash) + err |= dev->hash(dev, sk); + } + mutex_unlock(&device_mutex); + + if (err) + tls_hw_unhash(sk); + return err; +} + +static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG], + struct proto *base) +{ + prot[TLS_BASE][TLS_BASE] = *base; + prot[TLS_BASE][TLS_BASE].setsockopt = tls_setsockopt; + prot[TLS_BASE][TLS_BASE].getsockopt = tls_getsockopt; + prot[TLS_BASE][TLS_BASE].close = tls_sk_proto_close; + + prot[TLS_SW][TLS_BASE] = prot[TLS_BASE][TLS_BASE]; + prot[TLS_SW][TLS_BASE].sendmsg = tls_sw_sendmsg; + prot[TLS_SW][TLS_BASE].sendpage = tls_sw_sendpage; + + prot[TLS_BASE][TLS_SW] = prot[TLS_BASE][TLS_BASE]; + prot[TLS_BASE][TLS_SW].recvmsg = tls_sw_recvmsg; + prot[TLS_BASE][TLS_SW].stream_memory_read = tls_sw_stream_read; + prot[TLS_BASE][TLS_SW].close = tls_sk_proto_close; + + prot[TLS_SW][TLS_SW] = prot[TLS_SW][TLS_BASE]; + prot[TLS_SW][TLS_SW].recvmsg = tls_sw_recvmsg; + prot[TLS_SW][TLS_SW].stream_memory_read = tls_sw_stream_read; + prot[TLS_SW][TLS_SW].close = tls_sk_proto_close; + +#ifdef CONFIG_TLS_DEVICE + prot[TLS_HW][TLS_BASE] = prot[TLS_BASE][TLS_BASE]; + prot[TLS_HW][TLS_BASE].sendmsg = tls_device_sendmsg; + prot[TLS_HW][TLS_BASE].sendpage = tls_device_sendpage; + + prot[TLS_HW][TLS_SW] = prot[TLS_BASE][TLS_SW]; + prot[TLS_HW][TLS_SW].sendmsg = tls_device_sendmsg; + prot[TLS_HW][TLS_SW].sendpage = tls_device_sendpage; + + prot[TLS_BASE][TLS_HW] = prot[TLS_BASE][TLS_SW]; + + prot[TLS_SW][TLS_HW] = prot[TLS_SW][TLS_SW]; + + prot[TLS_HW][TLS_HW] = prot[TLS_HW][TLS_SW]; +#endif + + prot[TLS_HW_RECORD][TLS_HW_RECORD] = *base; + prot[TLS_HW_RECORD][TLS_HW_RECORD].hash = tls_hw_hash; + prot[TLS_HW_RECORD][TLS_HW_RECORD].unhash = tls_hw_unhash; + prot[TLS_HW_RECORD][TLS_HW_RECORD].close = tls_sk_proto_close; } static int tls_init(struct sock *sk) { int ip_ver = sk->sk_family == AF_INET6 ? TLSV6 : TLSV4; - struct inet_connection_sock *icsk = inet_csk(sk); struct tls_context *ctx; int rc = 0; + if (tls_hw_prot(sk)) + goto out; + /* The TLS ulp is currently supported only for TCP sockets * in ESTABLISHED state. * Supporting sockets in LISTEN state will require us @@ -515,17 +672,13 @@ static int tls_init(struct sock *sk) return -ENOTSUPP; /* allocate tls context */ - ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + ctx = create_ctx(sk); if (!ctx) { rc = -ENOMEM; goto out; } - icsk->icsk_ulp_data = ctx; - ctx->setsockopt = sk->sk_prot->setsockopt; - ctx->getsockopt = sk->sk_prot->getsockopt; - ctx->sk_proto_close = sk->sk_prot->close; - /* Build IPv6 TLS whenever the address of tcpv6_prot changes */ + /* Build IPv6 TLS whenever the address of tcpv6 _prot changes */ if (ip_ver == TLSV6 && unlikely(sk->sk_prot != smp_load_acquire(&saved_tcpv6_prot))) { mutex_lock(&tcpv6_prot_mutex); @@ -536,12 +689,29 @@ static int tls_init(struct sock *sk) mutex_unlock(&tcpv6_prot_mutex); } - ctx->tx_conf = TLS_BASE_TX; + ctx->tx_conf = TLS_BASE; + ctx->rx_conf = TLS_BASE; update_sk_prot(sk, ctx); out: return rc; } +void tls_register_device(struct tls_device *device) +{ + mutex_lock(&device_mutex); + list_add_tail(&device->dev_list, &device_list); + mutex_unlock(&device_mutex); +} +EXPORT_SYMBOL(tls_register_device); + +void tls_unregister_device(struct tls_device *device) +{ + mutex_lock(&device_mutex); + list_del(&device->dev_list); + mutex_unlock(&device_mutex); +} +EXPORT_SYMBOL(tls_unregister_device); + static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = { .name = "tls", .uid = TCP_ULP_TLS, @@ -554,6 +724,12 @@ static int __init tls_register(void) { build_protos(tls_prots[TLSV4], &tcp_prot); + tls_sw_proto_ops = inet_stream_ops; + tls_sw_proto_ops.splice_read = tls_sw_splice_read; + +#ifdef CONFIG_TLS_DEVICE + tls_device_init(); +#endif tcp_register_ulp(&tcp_tls_ulp_ops); return 0; @@ -562,6 +738,9 @@ static int __init tls_register(void) static void __exit tls_unregister(void) { tcp_unregister_ulp(&tcp_tls_ulp_ops); +#ifdef CONFIG_TLS_DEVICE + tls_device_cleanup(); +#endif } module_init(tls_register); diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c index 5db3b495782f..52ab489c8b2a 100644 --- a/net/tls/tls_sw.c +++ b/net/tls/tls_sw.c @@ -4,6 +4,7 @@ * Copyright (c) 2016-2017, Lance Chao . All rights reserved. * Copyright (c) 2016, Fridolin Pokorny . All rights reserved. * Copyright (c) 2016, Nikos Mavrogiannopoulos . All rights reserved. + * Copyright (c) 2018, Covalent IO, Inc. http://covalent.io * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -34,233 +35,1038 @@ * SOFTWARE. */ +#include #include #include +#include #include -static inline void tls_make_aad(int recv, - char *buf, - size_t size, - char *record_sequence, - int record_sequence_size, - unsigned char record_type) -{ - memcpy(buf, record_sequence, record_sequence_size); +#define MAX_IV_SIZE TLS_CIPHER_AES_GCM_128_IV_SIZE - buf[8] = record_type; - buf[9] = TLS_1_2_VERSION_MAJOR; - buf[10] = TLS_1_2_VERSION_MINOR; - buf[11] = size >> 8; - buf[12] = size & 0xFF; -} - -static void trim_sg(struct sock *sk, struct scatterlist *sg, - int *sg_num_elem, unsigned int *sg_size, int target_size) -{ - int i = *sg_num_elem - 1; - int trim = *sg_size - target_size; - - if (trim <= 0) { - WARN_ON(trim < 0); - return; - } - - *sg_size = target_size; - while (trim >= sg[i].length) { - trim -= sg[i].length; - sk_mem_uncharge(sk, sg[i].length); - put_page(sg_page(&sg[i])); - i--; - - if (i < 0) - goto out; - } - - sg[i].length -= trim; - sk_mem_uncharge(sk, trim); - -out: - *sg_num_elem = i + 1; -} - -static void trim_both_sgl(struct sock *sk, int target_size) +static int tls_do_decryption(struct sock *sk, + struct scatterlist *sgin, + struct scatterlist *sgout, + char *iv_recv, + size_t data_len, + struct aead_request *aead_req) { struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + int ret; - trim_sg(sk, ctx->sg_plaintext_data, - &ctx->sg_plaintext_num_elem, - &ctx->sg_plaintext_size, - target_size); + aead_request_set_tfm(aead_req, ctx->aead_recv); + aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE); + aead_request_set_crypt(aead_req, sgin, sgout, + data_len + tls_ctx->rx.tag_size, + (u8 *)iv_recv); + aead_request_set_callback(aead_req, CRYPTO_TFM_REQ_MAY_BACKLOG, + crypto_req_done, &ctx->async_wait); + ret = crypto_wait_req(crypto_aead_decrypt(aead_req), &ctx->async_wait); + return ret; +} + +static void tls_trim_both_msgs(struct sock *sk, int target_size) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct tls_rec *rec = ctx->open_rec; + + sk_msg_trim(sk, &rec->msg_plaintext, target_size); if (target_size > 0) - target_size += tls_ctx->overhead_size; - - trim_sg(sk, ctx->sg_encrypted_data, - &ctx->sg_encrypted_num_elem, - &ctx->sg_encrypted_size, - target_size); + target_size += tls_ctx->tx.overhead_size; + sk_msg_trim(sk, &rec->msg_encrypted, target_size); } -static int alloc_encrypted_sg(struct sock *sk, int len) +static int tls_alloc_encrypted_msg(struct sock *sk, int len) { struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); - int rc = 0; + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct tls_rec *rec = ctx->open_rec; + struct sk_msg *msg_en = &rec->msg_encrypted; - rc = alloc_sg(sk, len, ctx->sg_encrypted_data, - &ctx->sg_encrypted_num_elem, &ctx->sg_encrypted_size, 0); - - rc = sk_alloc_sg(sk, len, - ctx->sg_encrypted_data, 0, - &ctx->sg_encrypted_num_elem, - &ctx->sg_encrypted_size, 0); - - return rc; + return sk_msg_alloc(sk, msg_en, len, 0); } -static int alloc_plaintext_sg(struct sock *sk, int len) +static int tls_clone_plaintext_msg(struct sock *sk, int required) { struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); - int rc = 0; + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct tls_rec *rec = ctx->open_rec; + struct sk_msg *msg_pl = &rec->msg_plaintext; + struct sk_msg *msg_en = &rec->msg_encrypted; + int skip, len; - rc = sk_alloc_sg(sk, len, ctx->sg_plaintext_data, 0, - &ctx->sg_plaintext_num_elem, &ctx->sg_plaintext_size, - tls_ctx->pending_open_record_frags); + /* We add page references worth len bytes from encrypted sg + * at the end of plaintext sg. It is guaranteed that msg_en + * has enough required room (ensured by caller). + */ + len = required - msg_pl->sg.size; - if (rc == -ENOSPC) - ctx->sg_plaintext_num_elem = ARRAY_SIZE(ctx->sg_plaintext_data); + /* Skip initial bytes in msg_en's data to be able to use + * same offset of both plain and encrypted data. + */ + skip = tls_ctx->tx.prepend_size + msg_pl->sg.size; - return rc; + return sk_msg_clone(sk, msg_pl, msg_en, skip, len); } -static void free_sg(struct sock *sk, struct scatterlist *sg, - int *sg_num_elem, unsigned int *sg_size) +static struct tls_rec *tls_get_rec(struct sock *sk) { - int i, n = *sg_num_elem; + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct sk_msg *msg_pl, *msg_en; + struct tls_rec *rec; + int mem_size; - for (i = 0; i < n; ++i) { - sk_mem_uncharge(sk, sg[i].length); - put_page(sg_page(&sg[i])); + mem_size = sizeof(struct tls_rec) + crypto_aead_reqsize(ctx->aead_send); + + rec = kzalloc(mem_size, sk->sk_allocation); + if (!rec) + return NULL; + + msg_pl = &rec->msg_plaintext; + msg_en = &rec->msg_encrypted; + + sk_msg_init(msg_pl); + sk_msg_init(msg_en); + + sg_init_table(rec->sg_aead_in, 2); + sg_set_buf(&rec->sg_aead_in[0], rec->aad_space, + sizeof(rec->aad_space)); + sg_unmark_end(&rec->sg_aead_in[1]); + + sg_init_table(rec->sg_aead_out, 2); + sg_set_buf(&rec->sg_aead_out[0], rec->aad_space, + sizeof(rec->aad_space)); + sg_unmark_end(&rec->sg_aead_out[1]); + + return rec; +} + +static void tls_free_rec(struct sock *sk, struct tls_rec *rec) +{ + sk_msg_free(sk, &rec->msg_encrypted); + sk_msg_free(sk, &rec->msg_plaintext); + kfree(rec); +} + +static void tls_free_open_rec(struct sock *sk) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct tls_rec *rec = ctx->open_rec; + + if (rec) { + tls_free_rec(sk, rec); + ctx->open_rec = NULL; } - *sg_num_elem = 0; - *sg_size = 0; } -static void tls_free_both_sg(struct sock *sk) +int tls_tx_records(struct sock *sk, int flags) { struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct tls_rec *rec, *tmp; + struct sk_msg *msg_en; + int tx_flags, rc = 0; - free_sg(sk, ctx->sg_encrypted_data, &ctx->sg_encrypted_num_elem, - &ctx->sg_encrypted_size); + if (tls_is_partially_sent_record(tls_ctx)) { + rec = list_first_entry(&ctx->tx_list, + struct tls_rec, list); - free_sg(sk, ctx->sg_plaintext_data, &ctx->sg_plaintext_num_elem, - &ctx->sg_plaintext_size); + if (flags == -1) + tx_flags = rec->tx_flags; + else + tx_flags = flags; + + rc = tls_push_partial_record(sk, tls_ctx, tx_flags); + if (rc) + goto tx_err; + + /* Full record has been transmitted. + * Remove the head of tx_list + */ + list_del(&rec->list); + sk_msg_free(sk, &rec->msg_plaintext); + kfree(rec); + } + + /* Tx all ready records */ + list_for_each_entry_safe(rec, tmp, &ctx->tx_list, list) { + if (READ_ONCE(rec->tx_ready)) { + if (flags == -1) + tx_flags = rec->tx_flags; + else + tx_flags = flags; + + msg_en = &rec->msg_encrypted; + rc = tls_push_sg(sk, tls_ctx, + &msg_en->sg.data[msg_en->sg.curr], + 0, tx_flags); + if (rc) + goto tx_err; + + list_del(&rec->list); + sk_msg_free(sk, &rec->msg_plaintext); + kfree(rec); + } else { + break; + } + } + +tx_err: + if (rc < 0 && rc != -EAGAIN) + tls_err_abort(sk, EBADMSG); + + return rc; } -static int tls_do_encryption(struct tls_context *tls_ctx, - struct tls_sw_context *ctx, - struct aead_request *aead_req, - size_t data_len) +static void tls_encrypt_done(struct crypto_async_request *req, int err) { + struct aead_request *aead_req = (struct aead_request *)req; + struct sock *sk = req->data; + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct scatterlist *sge; + struct sk_msg *msg_en; + struct tls_rec *rec; + bool ready = false; + int pending; + + rec = container_of(aead_req, struct tls_rec, aead_req); + msg_en = &rec->msg_encrypted; + + sge = sk_msg_elem(msg_en, msg_en->sg.curr); + sge->offset -= tls_ctx->tx.prepend_size; + sge->length += tls_ctx->tx.prepend_size; + + /* Check if error is previously set on socket */ + if (err || sk->sk_err) { + rec = NULL; + + /* If err is already set on socket, return the same code */ + if (sk->sk_err) { + ctx->async_wait.err = sk->sk_err; + } else { + ctx->async_wait.err = err; + tls_err_abort(sk, err); + } + } + + if (rec) { + struct tls_rec *first_rec; + + /* Mark the record as ready for transmission */ + smp_store_mb(rec->tx_ready, true); + + /* If received record is at head of tx_list, schedule tx */ + first_rec = list_first_entry(&ctx->tx_list, + struct tls_rec, list); + if (rec == first_rec) + ready = true; + } + + pending = atomic_dec_return(&ctx->encrypt_pending); + + if (!pending && READ_ONCE(ctx->async_notify)) + complete(&ctx->async_wait.completion); + + if (!ready) + return; + + /* Schedule the transmission */ + if (!test_and_set_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) + schedule_delayed_work(&ctx->tx_work.work, 1); +} + +static int tls_do_encryption(struct sock *sk, + struct tls_context *tls_ctx, + struct tls_sw_context_tx *ctx, + struct aead_request *aead_req, + size_t data_len, u32 start) +{ + struct tls_rec *rec = ctx->open_rec; + struct sk_msg *msg_en = &rec->msg_encrypted; + struct scatterlist *sge = sk_msg_elem(msg_en, start); int rc; - ctx->sg_encrypted_data[0].offset += tls_ctx->prepend_size; - ctx->sg_encrypted_data[0].length -= tls_ctx->prepend_size; + sge->offset += tls_ctx->tx.prepend_size; + sge->length -= tls_ctx->tx.prepend_size; + + msg_en->sg.curr = start; aead_request_set_tfm(aead_req, ctx->aead_send); aead_request_set_ad(aead_req, TLS_AAD_SPACE_SIZE); - aead_request_set_crypt(aead_req, ctx->sg_aead_in, ctx->sg_aead_out, - data_len, tls_ctx->iv); + aead_request_set_crypt(aead_req, rec->sg_aead_in, + rec->sg_aead_out, + data_len, tls_ctx->tx.iv); + + aead_request_set_callback(aead_req, CRYPTO_TFM_REQ_MAY_BACKLOG, + tls_encrypt_done, sk); + + /* Add the record in tx_list */ + list_add_tail((struct list_head *)&rec->list, &ctx->tx_list); + atomic_inc(&ctx->encrypt_pending); + rc = crypto_aead_encrypt(aead_req); + if (!rc || rc != -EINPROGRESS) { + atomic_dec(&ctx->encrypt_pending); + sge->offset -= tls_ctx->tx.prepend_size; + sge->length += tls_ctx->tx.prepend_size; + } - ctx->sg_encrypted_data[0].offset -= tls_ctx->prepend_size; - ctx->sg_encrypted_data[0].length += tls_ctx->prepend_size; + if (!rc) { + WRITE_ONCE(rec->tx_ready, true); + } else if (rc != -EINPROGRESS) { + list_del(&rec->list); + return rc; + } + /* Unhook the record from context if encryption is not failure */ + ctx->open_rec = NULL; + tls_advance_record_sn(sk, &tls_ctx->tx); return rc; } +static int tls_split_open_record(struct sock *sk, struct tls_rec *from, + struct tls_rec **to, struct sk_msg *msg_opl, + struct sk_msg *msg_oen, u32 split_point, + u32 tx_overhead_size, u32 *orig_end) +{ + u32 i, j, bytes = 0, apply = msg_opl->apply_bytes; + struct scatterlist *sge, *osge, *nsge; + u32 orig_size = msg_opl->sg.size; + struct scatterlist tmp = { }; + struct sk_msg *msg_npl; + struct tls_rec *new; + int ret; + + new = tls_get_rec(sk); + if (!new) + return -ENOMEM; + ret = sk_msg_alloc(sk, &new->msg_encrypted, msg_opl->sg.size + + tx_overhead_size, 0); + if (ret < 0) { + tls_free_rec(sk, new); + return ret; + } + + *orig_end = msg_opl->sg.end; + i = msg_opl->sg.start; + sge = sk_msg_elem(msg_opl, i); + while (apply && sge->length) { + if (sge->length > apply) { + u32 len = sge->length - apply; + + get_page(sg_page(sge)); + sg_set_page(&tmp, sg_page(sge), len, + sge->offset + apply); + sge->length = apply; + bytes += apply; + apply = 0; + } else { + apply -= sge->length; + bytes += sge->length; + } + + sk_msg_iter_var_next(i); + if (i == msg_opl->sg.end) + break; + sge = sk_msg_elem(msg_opl, i); + } + + msg_opl->sg.end = i; + msg_opl->sg.curr = i; + msg_opl->sg.copybreak = 0; + msg_opl->apply_bytes = 0; + msg_opl->sg.size = bytes; + + msg_npl = &new->msg_plaintext; + msg_npl->apply_bytes = apply; + msg_npl->sg.size = orig_size - bytes; + + j = msg_npl->sg.start; + nsge = sk_msg_elem(msg_npl, j); + if (tmp.length) { + memcpy(nsge, &tmp, sizeof(*nsge)); + sk_msg_iter_var_next(j); + nsge = sk_msg_elem(msg_npl, j); + } + + osge = sk_msg_elem(msg_opl, i); + while (osge->length) { + memcpy(nsge, osge, sizeof(*nsge)); + sg_unmark_end(nsge); + sk_msg_iter_var_next(i); + sk_msg_iter_var_next(j); + if (i == *orig_end) + break; + osge = sk_msg_elem(msg_opl, i); + nsge = sk_msg_elem(msg_npl, j); + } + + msg_npl->sg.end = j; + msg_npl->sg.curr = j; + msg_npl->sg.copybreak = 0; + + *to = new; + return 0; +} + +static void tls_merge_open_record(struct sock *sk, struct tls_rec *to, + struct tls_rec *from, u32 orig_end) +{ + struct sk_msg *msg_npl = &from->msg_plaintext; + struct sk_msg *msg_opl = &to->msg_plaintext; + struct scatterlist *osge, *nsge; + u32 i, j; + + i = msg_opl->sg.end; + sk_msg_iter_var_prev(i); + j = msg_npl->sg.start; + + osge = sk_msg_elem(msg_opl, i); + nsge = sk_msg_elem(msg_npl, j); + + if (sg_page(osge) == sg_page(nsge) && + osge->offset + osge->length == nsge->offset) { + osge->length += nsge->length; + put_page(sg_page(nsge)); + } + + msg_opl->sg.end = orig_end; + msg_opl->sg.curr = orig_end; + msg_opl->sg.copybreak = 0; + msg_opl->apply_bytes = msg_opl->sg.size + msg_npl->sg.size; + msg_opl->sg.size += msg_npl->sg.size; + + sk_msg_free(sk, &to->msg_encrypted); + sk_msg_xfer_full(&to->msg_encrypted, &from->msg_encrypted); + + kfree(from); +} + static int tls_push_record(struct sock *sk, int flags, unsigned char record_type) { struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct tls_rec *rec = ctx->open_rec, *tmp = NULL; + u32 i, split_point, uninitialized_var(orig_end); + struct sk_msg *msg_pl, *msg_en; struct aead_request *req; + bool split; int rc; - req = kzalloc(sizeof(struct aead_request) + - crypto_aead_reqsize(ctx->aead_send), sk->sk_allocation); - if (!req) - return -ENOMEM; + if (!rec) + return 0; - sg_mark_end(ctx->sg_plaintext_data + ctx->sg_plaintext_num_elem - 1); - sg_mark_end(ctx->sg_encrypted_data + ctx->sg_encrypted_num_elem - 1); + msg_pl = &rec->msg_plaintext; + msg_en = &rec->msg_encrypted; - tls_make_aad(0, ctx->aad_space, ctx->sg_plaintext_size, - tls_ctx->rec_seq, tls_ctx->rec_seq_size, + split_point = msg_pl->apply_bytes; + split = split_point && split_point < msg_pl->sg.size; + if (split) { + rc = tls_split_open_record(sk, rec, &tmp, msg_pl, msg_en, + split_point, tls_ctx->tx.overhead_size, + &orig_end); + if (rc < 0) + return rc; + sk_msg_trim(sk, msg_en, msg_pl->sg.size + + tls_ctx->tx.overhead_size); + } + + rec->tx_flags = flags; + req = &rec->aead_req; + + i = msg_pl->sg.end; + sk_msg_iter_var_prev(i); + sg_mark_end(sk_msg_elem(msg_pl, i)); + + i = msg_pl->sg.start; + sg_chain(rec->sg_aead_in, 2, rec->inplace_crypto ? + &msg_en->sg.data[i] : &msg_pl->sg.data[i]); + + i = msg_en->sg.end; + sk_msg_iter_var_prev(i); + sg_mark_end(sk_msg_elem(msg_en, i)); + + i = msg_en->sg.start; + sg_chain(rec->sg_aead_out, 2, &msg_en->sg.data[i]); + + tls_make_aad(rec->aad_space, msg_pl->sg.size, + tls_ctx->tx.rec_seq, tls_ctx->tx.rec_seq_size, record_type); tls_fill_prepend(tls_ctx, - page_address(sg_page(&ctx->sg_encrypted_data[0])) + - ctx->sg_encrypted_data[0].offset, - ctx->sg_plaintext_size, record_type); + page_address(sg_page(&msg_en->sg.data[i])) + + msg_en->sg.data[i].offset, msg_pl->sg.size, + record_type); - tls_ctx->pending_open_record_frags = 0; - set_bit(TLS_PENDING_CLOSED_RECORD, &tls_ctx->flags); + tls_ctx->pending_open_record_frags = false; - rc = tls_do_encryption(tls_ctx, ctx, req, ctx->sg_plaintext_size); + rc = tls_do_encryption(sk, tls_ctx, ctx, req, msg_pl->sg.size, i); if (rc < 0) { - /* If we are called from write_space and - * we fail, we need to set this SOCK_NOSPACE - * to trigger another write_space in the future. - */ - set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); - goto out_req; + if (rc != -EINPROGRESS) { + tls_err_abort(sk, EBADMSG); + if (split) { + tls_ctx->pending_open_record_frags = true; + tls_merge_open_record(sk, rec, tmp, orig_end); + } + } + return rc; + } else if (split) { + msg_pl = &tmp->msg_plaintext; + msg_en = &tmp->msg_encrypted; + sk_msg_trim(sk, msg_en, msg_pl->sg.size + + tls_ctx->tx.overhead_size); + tls_ctx->pending_open_record_frags = true; + ctx->open_rec = tmp; } - free_sg(sk, ctx->sg_plaintext_data, &ctx->sg_plaintext_num_elem, - &ctx->sg_plaintext_size); + return tls_tx_records(sk, flags); +} - ctx->sg_encrypted_num_elem = 0; - ctx->sg_encrypted_size = 0; +static int bpf_exec_tx_verdict(struct sk_msg *msg, struct sock *sk, + bool full_record, u8 record_type, + size_t *copied, int flags) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct sk_msg msg_redir = { }; + struct sk_psock *psock; + struct sock *sk_redir; + struct tls_rec *rec; + int err = 0, send; + u32 delta = 0; + bool enospc; - /* Only pass through MSG_DONTWAIT and MSG_NOSIGNAL flags */ - rc = tls_push_sg(sk, tls_ctx, ctx->sg_encrypted_data, 0, flags); - if (rc < 0 && rc != -EAGAIN) - tls_err_abort(sk); + psock = sk_psock_get(sk); + if (!psock) + return tls_push_record(sk, flags, record_type); +more_data: + enospc = sk_msg_full(msg); + if (psock->eval == __SK_NONE) { + delta = msg->sg.size; + psock->eval = sk_psock_msg_verdict(sk, psock, msg); + if (delta < msg->sg.size) + delta -= msg->sg.size; + else + delta = 0; + } + if (msg->cork_bytes && msg->cork_bytes > msg->sg.size && + !enospc && !full_record) { + err = -ENOSPC; + goto out_err; + } + msg->cork_bytes = 0; + send = msg->sg.size; + if (msg->apply_bytes && msg->apply_bytes < send) + send = msg->apply_bytes; - tls_advance_record_sn(sk, tls_ctx); -out_req: - kfree(req); - return rc; + switch (psock->eval) { + case __SK_PASS: + err = tls_push_record(sk, flags, record_type); + if (err < 0) { + *copied -= sk_msg_free(sk, msg); + tls_free_open_rec(sk); + goto out_err; + } + break; + case __SK_REDIRECT: + sk_redir = psock->sk_redir; + memcpy(&msg_redir, msg, sizeof(*msg)); + if (msg->apply_bytes < send) + msg->apply_bytes = 0; + else + msg->apply_bytes -= send; + sk_msg_return_zero(sk, msg, send); + msg->sg.size -= send; + release_sock(sk); + err = tcp_bpf_sendmsg_redir(sk_redir, &msg_redir, send, flags); + lock_sock(sk); + if (err < 0) { + *copied -= sk_msg_free_nocharge(sk, &msg_redir); + msg->sg.size = 0; + } + if (msg->sg.size == 0) + tls_free_open_rec(sk); + break; + case __SK_DROP: + default: + sk_msg_free_partial(sk, msg, send); + if (msg->apply_bytes < send) + msg->apply_bytes = 0; + else + msg->apply_bytes -= send; + if (msg->sg.size == 0) + tls_free_open_rec(sk); + *copied -= (send + delta); + err = -EACCES; + } + + if (likely(!err)) { + bool reset_eval = !ctx->open_rec; + + rec = ctx->open_rec; + if (rec) { + msg = &rec->msg_plaintext; + if (!msg->apply_bytes) + reset_eval = true; + } + if (reset_eval) { + psock->eval = __SK_NONE; + if (psock->sk_redir) { + sock_put(psock->sk_redir); + psock->sk_redir = NULL; + } + } + if (rec) + goto more_data; + } + out_err: + sk_psock_put(sk, psock); + return err; } static int tls_sw_push_pending_record(struct sock *sk, int flags) { - return tls_push_record(sk, flags, TLS_RECORD_TYPE_DATA); + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct tls_rec *rec = ctx->open_rec; + struct sk_msg *msg_pl; + size_t copied; + + if (!rec) + return 0; + + msg_pl = &rec->msg_plaintext; + copied = msg_pl->sg.size; + if (!copied) + return 0; + + return bpf_exec_tx_verdict(msg_pl, sk, true, TLS_RECORD_TYPE_DATA, + &copied, flags); } -static int zerocopy_from_iter(struct sock *sk, struct iov_iter *from, - int length) +int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) +{ + long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct crypto_tfm *tfm = crypto_aead_tfm(ctx->aead_send); + bool async_capable = tfm->__crt_alg->cra_flags & CRYPTO_ALG_ASYNC; + unsigned char record_type = TLS_RECORD_TYPE_DATA; + bool is_kvec = msg->msg_iter.type & ITER_KVEC; + bool eor = !(msg->msg_flags & MSG_MORE); + size_t try_to_copy, copied = 0; + struct sk_msg *msg_pl, *msg_en; + struct tls_rec *rec; + int required_size; + int num_async = 0; + bool full_record; + int record_room; + int num_zc = 0; + int orig_size; + int ret = 0; + + if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) + return -ENOTSUPP; + + lock_sock(sk); + + /* Wait till there is any pending write on socket */ + if (unlikely(sk->sk_write_pending)) { + ret = wait_on_pending_writer(sk, &timeo); + if (unlikely(ret)) + goto send_end; + } + + if (unlikely(msg->msg_controllen)) { + ret = tls_proccess_cmsg(sk, msg, &record_type); + if (ret) { + if (ret == -EINPROGRESS) + num_async++; + else if (ret != -EAGAIN) + goto send_end; + } + } + + while (msg_data_left(msg)) { + if (sk->sk_err) { + ret = -sk->sk_err; + goto send_end; + } + + if (ctx->open_rec) + rec = ctx->open_rec; + else + rec = ctx->open_rec = tls_get_rec(sk); + if (!rec) { + ret = -ENOMEM; + goto send_end; + } + + msg_pl = &rec->msg_plaintext; + msg_en = &rec->msg_encrypted; + + orig_size = msg_pl->sg.size; + full_record = false; + try_to_copy = msg_data_left(msg); + record_room = TLS_MAX_PAYLOAD_SIZE - msg_pl->sg.size; + if (try_to_copy >= record_room) { + try_to_copy = record_room; + full_record = true; + } + + required_size = msg_pl->sg.size + try_to_copy + + tls_ctx->tx.overhead_size; + + if (!sk_stream_memory_free(sk)) + goto wait_for_sndbuf; + +alloc_encrypted: + ret = tls_alloc_encrypted_msg(sk, required_size); + if (ret) { + if (ret != -ENOSPC) + goto wait_for_memory; + + /* Adjust try_to_copy according to the amount that was + * actually allocated. The difference is due + * to max sg elements limit + */ + try_to_copy -= required_size - msg_en->sg.size; + full_record = true; + } + + if (!is_kvec && (full_record || eor) && !async_capable) { + u32 first = msg_pl->sg.end; + + ret = sk_msg_zerocopy_from_iter(sk, &msg->msg_iter, + msg_pl, try_to_copy); + if (ret) + goto fallback_to_reg_send; + + rec->inplace_crypto = 0; + + num_zc++; + copied += try_to_copy; + + sk_msg_sg_copy_set(msg_pl, first); + ret = bpf_exec_tx_verdict(msg_pl, sk, full_record, + record_type, &copied, + msg->msg_flags); + if (ret) { + if (ret == -EINPROGRESS) + num_async++; + else if (ret == -ENOMEM) + goto wait_for_memory; + else if (ret == -ENOSPC) + goto rollback_iter; + else if (ret != -EAGAIN) + goto send_end; + } + continue; +rollback_iter: + copied -= try_to_copy; + sk_msg_sg_copy_clear(msg_pl, first); + iov_iter_revert(&msg->msg_iter, + msg_pl->sg.size - orig_size); +fallback_to_reg_send: + sk_msg_trim(sk, msg_pl, orig_size); + } + + required_size = msg_pl->sg.size + try_to_copy; + + ret = tls_clone_plaintext_msg(sk, required_size); + if (ret) { + if (ret != -ENOSPC) + goto send_end; + + /* Adjust try_to_copy according to the amount that was + * actually allocated. The difference is due + * to max sg elements limit + */ + try_to_copy -= required_size - msg_pl->sg.size; + full_record = true; + sk_msg_trim(sk, msg_en, msg_pl->sg.size + + tls_ctx->tx.overhead_size); + } + + ret = sk_msg_memcopy_from_iter(sk, &msg->msg_iter, msg_pl, + try_to_copy); + if (ret < 0) + goto trim_sgl; + + /* Open records defined only if successfully copied, otherwise + * we would trim the sg but not reset the open record frags. + */ + tls_ctx->pending_open_record_frags = true; + copied += try_to_copy; + if (full_record || eor) { + ret = bpf_exec_tx_verdict(msg_pl, sk, full_record, + record_type, &copied, + msg->msg_flags); + if (ret) { + if (ret == -EINPROGRESS) + num_async++; + else if (ret == -ENOMEM) + goto wait_for_memory; + else if (ret != -EAGAIN) { + if (ret == -ENOSPC) + ret = 0; + goto send_end; + } + } + } + + continue; + +wait_for_sndbuf: + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); +wait_for_memory: + ret = sk_stream_wait_memory(sk, &timeo); + if (ret) { +trim_sgl: + tls_trim_both_msgs(sk, orig_size); + goto send_end; + } + + if (msg_en->sg.size < required_size) + goto alloc_encrypted; + } + + if (!num_async) { + goto send_end; + } else if (num_zc) { + /* Wait for pending encryptions to get completed */ + smp_store_mb(ctx->async_notify, true); + + if (atomic_read(&ctx->encrypt_pending)) + crypto_wait_req(-EINPROGRESS, &ctx->async_wait); + else + reinit_completion(&ctx->async_wait.completion); + + WRITE_ONCE(ctx->async_notify, false); + + if (ctx->async_wait.err) { + ret = ctx->async_wait.err; + copied = 0; + } + } + + /* Transmit if any encryptions have completed */ + if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) { + cancel_delayed_work(&ctx->tx_work.work); + tls_tx_records(sk, msg->msg_flags); + } + +send_end: + ret = sk_stream_error(sk, msg->msg_flags, ret); + + release_sock(sk); + return copied ? copied : ret; +} + +int tls_sw_sendpage(struct sock *sk, struct page *page, + int offset, size_t size, int flags) +{ + long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + unsigned char record_type = TLS_RECORD_TYPE_DATA; + struct sk_msg *msg_pl; + struct tls_rec *rec; + int num_async = 0; + size_t copied = 0; + bool full_record; + int record_room; + int ret = 0; + bool eor; + + if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | + MSG_SENDPAGE_NOTLAST)) + return -ENOTSUPP; + + /* No MSG_EOR from splice, only look at MSG_MORE */ + eor = !(flags & (MSG_MORE | MSG_SENDPAGE_NOTLAST)); + + lock_sock(sk); + + sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); + + /* Wait till there is any pending write on socket */ + if (unlikely(sk->sk_write_pending)) { + ret = wait_on_pending_writer(sk, &timeo); + if (unlikely(ret)) + goto sendpage_end; + } + + /* Call the sk_stream functions to manage the sndbuf mem. */ + while (size > 0) { + size_t copy, required_size; + + if (sk->sk_err) { + ret = -sk->sk_err; + goto sendpage_end; + } + + if (ctx->open_rec) + rec = ctx->open_rec; + else + rec = ctx->open_rec = tls_get_rec(sk); + if (!rec) { + ret = -ENOMEM; + goto sendpage_end; + } + + msg_pl = &rec->msg_plaintext; + + full_record = false; + record_room = TLS_MAX_PAYLOAD_SIZE - msg_pl->sg.size; + copied = 0; + copy = size; + if (copy >= record_room) { + copy = record_room; + full_record = true; + } + + required_size = msg_pl->sg.size + copy + + tls_ctx->tx.overhead_size; + + if (!sk_stream_memory_free(sk)) + goto wait_for_sndbuf; +alloc_payload: + ret = tls_alloc_encrypted_msg(sk, required_size); + if (ret) { + if (ret != -ENOSPC) + goto wait_for_memory; + + /* Adjust copy according to the amount that was + * actually allocated. The difference is due + * to max sg elements limit + */ + copy -= required_size - msg_pl->sg.size; + full_record = true; + } + + sk_msg_page_add(msg_pl, page, copy, offset); + sk_mem_charge(sk, copy); + + offset += copy; + size -= copy; + copied += copy; + + tls_ctx->pending_open_record_frags = true; + if (full_record || eor || sk_msg_full(msg_pl)) { + rec->inplace_crypto = 0; + ret = bpf_exec_tx_verdict(msg_pl, sk, full_record, + record_type, &copied, flags); + if (ret) { + if (ret == -EINPROGRESS) + num_async++; + else if (ret == -ENOMEM) + goto wait_for_memory; + else if (ret != -EAGAIN) { + if (ret == -ENOSPC) + ret = 0; + goto sendpage_end; + } + } + } + continue; +wait_for_sndbuf: + set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); +wait_for_memory: + ret = sk_stream_wait_memory(sk, &timeo); + if (ret) { + tls_trim_both_msgs(sk, msg_pl->sg.size); + goto sendpage_end; + } + + goto alloc_payload; + } + + if (num_async) { + /* Transmit if any encryptions have completed */ + if (test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) { + cancel_delayed_work(&ctx->tx_work.work); + tls_tx_records(sk, flags); + } + } +sendpage_end: + ret = sk_stream_error(sk, flags, ret); + release_sock(sk); + return copied ? copied : ret; +} + +static struct sk_buff *tls_wait_data(struct sock *sk, struct sk_psock *psock, + int flags, long timeo, int *err) { struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); - struct page *pages[MAX_SKB_FRAGS]; + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + struct sk_buff *skb; + DEFINE_WAIT_FUNC(wait, woken_wake_function); - size_t offset; + while (!(skb = ctx->recv_pkt) && sk_psock_queue_empty(psock)) { + if (sk->sk_err) { + *err = sock_error(sk); + return NULL; + } + + if (!skb_queue_empty(&sk->sk_receive_queue)) { + __strp_unpause(&ctx->strp); + if (ctx->recv_pkt) + return ctx->recv_pkt; + } + + if (sk->sk_shutdown & RCV_SHUTDOWN) + return NULL; + + if (sock_flag(sk, SOCK_DONE)) + return NULL; + + if ((flags & MSG_DONTWAIT) || !timeo) { + *err = -EAGAIN; + return NULL; + } + + add_wait_queue(sk_sleep(sk), &wait); + sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); + sk_wait_event(sk, &timeo, + ctx->recv_pkt != skb || + !sk_psock_queue_empty(psock), + &wait); + sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk); + remove_wait_queue(sk_sleep(sk), &wait); + + /* Handle signals */ + if (signal_pending(current)) { + *err = sock_intr_errno(timeo); + return NULL; + } + } + + return skb; +} + +static int tls_setup_from_iter(struct sock *sk, struct iov_iter *from, + int length, int *pages_used, + unsigned int *size_used, + struct scatterlist *to, + int to_max_pages) +{ + int rc = 0, i = 0, num_elem = *pages_used, maxpages; + struct page *pages[MAX_SKB_FRAGS]; + unsigned int size = *size_used; ssize_t copied, use; - int i = 0; - unsigned int size = ctx->sg_plaintext_size; - int num_elem = ctx->sg_plaintext_num_elem; - int rc = 0; - int maxpages; + size_t offset; while (length > 0) { i = 0; - maxpages = ARRAY_SIZE(ctx->sg_plaintext_data) - num_elem; + maxpages = to_max_pages - num_elem; if (maxpages == 0) { rc = -EFAULT; goto out; @@ -280,336 +1086,608 @@ static int zerocopy_from_iter(struct sock *sk, struct iov_iter *from, while (copied) { use = min_t(int, copied, PAGE_SIZE - offset); - sg_set_page(&ctx->sg_plaintext_data[num_elem], + sg_set_page(&to[num_elem], pages[i], use, offset); - sg_unmark_end(&ctx->sg_plaintext_data[num_elem]); - sk_mem_charge(sk, use); + sg_unmark_end(&to[num_elem]); + /* We do not uncharge memory from this API */ offset = 0; copied -= use; - ++i; - ++num_elem; + i++; + num_elem++; } } - + /* Mark the end in the last sg entry if newly added */ + if (num_elem > *pages_used) + sg_mark_end(&to[num_elem - 1]); out: - ctx->sg_plaintext_size = size; - ctx->sg_plaintext_num_elem = num_elem; + if (rc) + iov_iter_revert(from, size - *size_used); + *size_used = size; + *pages_used = num_elem; + return rc; } -static int memcopy_from_iter(struct sock *sk, struct iov_iter *from, - int bytes) +/* This function decrypts the input skb into either out_iov or in out_sg + * or in skb buffers itself. The input parameter 'zc' indicates if + * zero-copy mode needs to be tried or not. With zero-copy mode, either + * out_iov or out_sg must be non-NULL. In case both out_iov and out_sg are + * NULL, then the decryption happens inside skb buffers itself, i.e. + * zero-copy gets disabled and 'zc' is updated. + */ + +static int decrypt_internal(struct sock *sk, struct sk_buff *skb, + struct iov_iter *out_iov, + struct scatterlist *out_sg, + int *chunk, bool *zc) { struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); - struct scatterlist *sg = ctx->sg_plaintext_data; - int copy, i, rc = 0; + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + struct strp_msg *rxm = strp_msg(skb); + int n_sgin, n_sgout, nsg, mem_size, aead_size, err, pages = 0; + struct aead_request *aead_req; + struct sk_buff *unused; + u8 *aad, *iv, *mem = NULL; + struct scatterlist *sgin = NULL; + struct scatterlist *sgout = NULL; + const int data_len = rxm->full_len - tls_ctx->rx.overhead_size; - for (i = tls_ctx->pending_open_record_frags; - i < ctx->sg_plaintext_num_elem; ++i) { - copy = sg[i].length; - if (copy_from_iter( - page_address(sg_page(&sg[i])) + sg[i].offset, - copy, from) != copy) { - rc = -EFAULT; - goto out; + if (*zc && (out_iov || out_sg)) { + if (out_iov) + n_sgout = iov_iter_npages(out_iov, INT_MAX) + 1; + else + n_sgout = sg_nents(out_sg); + } else { + n_sgout = 0; + *zc = false; + } + + n_sgin = skb_cow_data(skb, 0, &unused); + if (n_sgin < 1) + return -EBADMSG; + + /* Increment to accommodate AAD */ + n_sgin = n_sgin + 1; + + nsg = n_sgin + n_sgout; + + aead_size = sizeof(*aead_req) + crypto_aead_reqsize(ctx->aead_recv); + mem_size = aead_size + (nsg * sizeof(struct scatterlist)); + mem_size = mem_size + TLS_AAD_SPACE_SIZE; + mem_size = mem_size + crypto_aead_ivsize(ctx->aead_recv); + + /* Allocate a single block of memory which contains + * aead_req || sgin[] || sgout[] || aad || iv. + * This order achieves correct alignment for aead_req, sgin, sgout. + */ + mem = kmalloc(mem_size, sk->sk_allocation); + if (!mem) + return -ENOMEM; + + /* Segment the allocated memory */ + aead_req = (struct aead_request *)mem; + sgin = (struct scatterlist *)(mem + aead_size); + sgout = sgin + n_sgin; + aad = (u8 *)(sgout + n_sgout); + iv = aad + TLS_AAD_SPACE_SIZE; + + /* Prepare IV */ + err = skb_copy_bits(skb, rxm->offset + TLS_HEADER_SIZE, + iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, + tls_ctx->rx.iv_size); + if (err < 0) { + kfree(mem); + return err; + } + memcpy(iv, tls_ctx->rx.iv, TLS_CIPHER_AES_GCM_128_SALT_SIZE); + + /* Prepare AAD */ + tls_make_aad(aad, rxm->full_len - tls_ctx->rx.overhead_size, + tls_ctx->rx.rec_seq, tls_ctx->rx.rec_seq_size, + ctx->control); + + /* Prepare sgin */ + sg_init_table(sgin, n_sgin); + sg_set_buf(&sgin[0], aad, TLS_AAD_SPACE_SIZE); + err = skb_to_sgvec(skb, &sgin[1], + rxm->offset + tls_ctx->rx.prepend_size, + rxm->full_len - tls_ctx->rx.prepend_size); + if (err < 0) { + kfree(mem); + return err; + } + + if (n_sgout) { + if (out_iov) { + sg_init_table(sgout, n_sgout); + sg_set_buf(&sgout[0], aad, TLS_AAD_SPACE_SIZE); + + *chunk = 0; + err = tls_setup_from_iter(sk, out_iov, data_len, + &pages, chunk, &sgout[1], + (n_sgout - 1)); + if (err < 0) + goto fallback_to_reg_recv; + } else if (out_sg) { + memcpy(sgout, out_sg, n_sgout * sizeof(*sgout)); + } else { + goto fallback_to_reg_recv; } - bytes -= copy; + } else { +fallback_to_reg_recv: + sgout = sgin; + pages = 0; + *chunk = 0; + *zc = false; + } - ++tls_ctx->pending_open_record_frags; + /* Prepare and submit AEAD request */ + err = tls_do_decryption(sk, sgin, sgout, iv, data_len, aead_req); - if (!bytes) + /* Release the pages in case iov was mapped to pages */ + for (; pages > 0; pages--) + put_page(sg_page(&sgout[pages])); + + kfree(mem); + return err; +} + +static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb, + struct iov_iter *dest, int *chunk, bool *zc) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + struct strp_msg *rxm = strp_msg(skb); + int err = 0; + +#ifdef CONFIG_TLS_DEVICE + err = tls_device_decrypted(sk, skb); + if (err < 0) + return err; +#endif + if (!ctx->decrypted) { + err = decrypt_internal(sk, skb, dest, NULL, chunk, zc); + if (err < 0) + return err; + } else { + *zc = false; + } + + rxm->offset += tls_ctx->rx.prepend_size; + rxm->full_len -= tls_ctx->rx.overhead_size; + tls_advance_record_sn(sk, &tls_ctx->rx); + ctx->decrypted = true; + ctx->saved_data_ready(sk); + + return err; +} + +int decrypt_skb(struct sock *sk, struct sk_buff *skb, + struct scatterlist *sgout) +{ + bool zc = true; + int chunk; + + return decrypt_internal(sk, skb, NULL, sgout, &chunk, &zc); +} + +static bool tls_sw_advance_skb(struct sock *sk, struct sk_buff *skb, + unsigned int len) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + struct strp_msg *rxm = strp_msg(skb); + + if (len < rxm->full_len) { + rxm->offset += len; + rxm->full_len -= len; + + return false; + } + + /* Finished with message */ + ctx->recv_pkt = NULL; + kfree_skb(skb); + __strp_unpause(&ctx->strp); + + return true; +} + +int tls_sw_recvmsg(struct sock *sk, + struct msghdr *msg, + size_t len, + int nonblock, + int flags, + int *addr_len) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + struct sk_psock *psock; + unsigned char control; + struct strp_msg *rxm; + struct sk_buff *skb; + ssize_t copied = 0; + bool cmsg = false; + int target, err = 0; + long timeo; + bool is_kvec = msg->msg_iter.type & ITER_KVEC; + + flags |= nonblock; + + if (unlikely(flags & MSG_ERRQUEUE)) + return sock_recv_errqueue(sk, msg, len, SOL_IP, IP_RECVERR); + + psock = sk_psock_get(sk); + lock_sock(sk); + + target = sock_rcvlowat(sk, flags & MSG_WAITALL, len); + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + do { + bool zc = false; + int chunk = 0; + + skb = tls_wait_data(sk, psock, flags, timeo, &err); + if (!skb) { + if (psock) { + int ret = __tcp_bpf_recvmsg(sk, psock, + msg, len, flags); + + if (ret > 0) { + copied += ret; + len -= ret; + continue; + } + } + goto recv_end; + } + + rxm = strp_msg(skb); + if (!cmsg) { + int cerr; + + cerr = put_cmsg(msg, SOL_TLS, TLS_GET_RECORD_TYPE, + sizeof(ctx->control), &ctx->control); + cmsg = true; + control = ctx->control; + if (ctx->control != TLS_RECORD_TYPE_DATA) { + if (cerr || msg->msg_flags & MSG_CTRUNC) { + err = -EIO; + goto recv_end; + } + } + } else if (control != ctx->control) { + goto recv_end; + } + + if (!ctx->decrypted) { + int to_copy = rxm->full_len - tls_ctx->rx.overhead_size; + + if (!is_kvec && to_copy <= len && + likely(!(flags & MSG_PEEK))) + zc = true; + + err = decrypt_skb_update(sk, skb, &msg->msg_iter, + &chunk, &zc); + if (err < 0) { + tls_err_abort(sk, EBADMSG); + goto recv_end; + } + ctx->decrypted = true; + } + + if (!zc) { + chunk = min_t(unsigned int, rxm->full_len, len); + err = skb_copy_datagram_msg(skb, rxm->offset, msg, + chunk); + if (err < 0) + goto recv_end; + } + + copied += chunk; + len -= chunk; + if (likely(!(flags & MSG_PEEK))) { + u8 control = ctx->control; + + if (tls_sw_advance_skb(sk, skb, chunk)) { + /* Return full control message to + * userspace before trying to parse + * another message type + */ + msg->msg_flags |= MSG_EOR; + if (control != TLS_RECORD_TYPE_DATA) + goto recv_end; + } + } else { + /* MSG_PEEK right now cannot look beyond current skb + * from strparser, meaning we cannot advance skb here + * and thus unpause strparser since we'd loose original + * one. + */ break; - } + } -out: - return rc; + /* If we have a new message from strparser, continue now. */ + if (copied >= target && !ctx->recv_pkt) + break; + } while (len); + +recv_end: + release_sock(sk); + if (psock) + sk_psock_put(sk, psock); + return copied ? : err; } -int tls_sw_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) +ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, + size_t len, unsigned int flags) { - struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); - int ret; - int required_size; - long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); - bool eor = !(msg->msg_flags & MSG_MORE); - size_t try_to_copy, copied = 0; - unsigned char record_type = TLS_RECORD_TYPE_DATA; - int record_room; - bool full_record; - int orig_size; - - if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) - return -ENOTSUPP; + struct tls_context *tls_ctx = tls_get_ctx(sock->sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + struct strp_msg *rxm = NULL; + struct sock *sk = sock->sk; + struct sk_buff *skb; + ssize_t copied = 0; + int err = 0; + long timeo; + int chunk; + bool zc = false; lock_sock(sk); - ret = tls_complete_pending_work(sk, tls_ctx, msg->msg_flags, &timeo); - if (ret) - goto send_end; + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); - if (unlikely(msg->msg_controllen)) { - ret = tls_proccess_cmsg(sk, msg, &record_type); - if (ret) - goto send_end; + skb = tls_wait_data(sk, NULL, flags, timeo, &err); + if (!skb) + goto splice_read_end; + + /* splice does not support reading control messages */ + if (ctx->control != TLS_RECORD_TYPE_DATA) { + err = -ENOTSUPP; + goto splice_read_end; } - while (msg_data_left(msg)) { - if (sk->sk_err) { - ret = -sk->sk_err; - goto send_end; + if (!ctx->decrypted) { + err = decrypt_skb_update(sk, skb, NULL, &chunk, &zc); + + if (err < 0) { + tls_err_abort(sk, EBADMSG); + goto splice_read_end; } - - orig_size = ctx->sg_plaintext_size; - full_record = false; - try_to_copy = msg_data_left(msg); - record_room = TLS_MAX_PAYLOAD_SIZE - ctx->sg_plaintext_size; - if (try_to_copy >= record_room) { - try_to_copy = record_room; - full_record = true; - } - - required_size = ctx->sg_plaintext_size + try_to_copy + - tls_ctx->overhead_size; - - if (!sk_stream_memory_free(sk)) - goto wait_for_sndbuf; -alloc_encrypted: - ret = alloc_encrypted_sg(sk, required_size); - if (ret) { - if (ret != -ENOSPC) - goto wait_for_memory; - - /* Adjust try_to_copy according to the amount that was - * actually allocated. The difference is due - * to max sg elements limit - */ - try_to_copy -= required_size - ctx->sg_encrypted_size; - full_record = true; - } - - if (full_record || eor) { - ret = zerocopy_from_iter(sk, &msg->msg_iter, - try_to_copy); - if (ret) - goto fallback_to_reg_send; - - copied += try_to_copy; - ret = tls_push_record(sk, msg->msg_flags, record_type); - if (!ret) - continue; - if (ret < 0) - goto send_end; - - copied -= try_to_copy; -fallback_to_reg_send: - iov_iter_revert(&msg->msg_iter, - ctx->sg_plaintext_size - orig_size); - trim_sg(sk, ctx->sg_plaintext_data, - &ctx->sg_plaintext_num_elem, - &ctx->sg_plaintext_size, - orig_size); - } - - required_size = ctx->sg_plaintext_size + try_to_copy; -alloc_plaintext: - ret = alloc_plaintext_sg(sk, required_size); - if (ret) { - if (ret != -ENOSPC) - goto wait_for_memory; - - /* Adjust try_to_copy according to the amount that was - * actually allocated. The difference is due - * to max sg elements limit - */ - try_to_copy -= required_size - ctx->sg_plaintext_size; - full_record = true; - - trim_sg(sk, ctx->sg_encrypted_data, - &ctx->sg_encrypted_num_elem, - &ctx->sg_encrypted_size, - ctx->sg_plaintext_size + - tls_ctx->overhead_size); - } - - ret = memcopy_from_iter(sk, &msg->msg_iter, try_to_copy); - if (ret) - goto trim_sgl; - - copied += try_to_copy; - if (full_record || eor) { -push_record: - ret = tls_push_record(sk, msg->msg_flags, record_type); - if (ret) { - if (ret == -ENOMEM) - goto wait_for_memory; - - goto send_end; - } - } - - continue; - -wait_for_sndbuf: - set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); -wait_for_memory: - ret = sk_stream_wait_memory(sk, &timeo); - if (ret) { -trim_sgl: - trim_both_sgl(sk, orig_size); - goto send_end; - } - - if (tls_is_pending_closed_record(tls_ctx)) - goto push_record; - - if (ctx->sg_encrypted_size < required_size) - goto alloc_encrypted; - - goto alloc_plaintext; + ctx->decrypted = true; } + rxm = strp_msg(skb); -send_end: - ret = sk_stream_error(sk, msg->msg_flags, ret); + chunk = min_t(unsigned int, rxm->full_len, len); + copied = skb_splice_bits(skb, sk, rxm->offset, pipe, chunk, flags); + if (copied < 0) + goto splice_read_end; + if (likely(!(flags & MSG_PEEK))) + tls_sw_advance_skb(sk, skb, copied); + +splice_read_end: release_sock(sk); - return copied ? copied : ret; + return copied ? : err; } -int tls_sw_sendpage(struct sock *sk, struct page *page, - int offset, size_t size, int flags) +bool tls_sw_stream_read(const struct sock *sk) { struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + bool ingress_empty = true; + struct sk_psock *psock; + + rcu_read_lock(); + psock = sk_psock(sk); + if (psock) + ingress_empty = list_empty(&psock->ingress_msg); + rcu_read_unlock(); + + return !ingress_empty || ctx->recv_pkt; +} + +static int tls_read_size(struct strparser *strp, struct sk_buff *skb) +{ + struct tls_context *tls_ctx = tls_get_ctx(strp->sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + char header[TLS_HEADER_SIZE + MAX_IV_SIZE]; + struct strp_msg *rxm = strp_msg(skb); + size_t cipher_overhead; + size_t data_len = 0; int ret; - long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); - bool eor; - size_t orig_size = size; - unsigned char record_type = TLS_RECORD_TYPE_DATA; - struct scatterlist *sg; - bool full_record; - int record_room; - if (flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | - MSG_SENDPAGE_NOTLAST)) - return -ENOTSUPP; + /* Verify that we have a full TLS header, or wait for more data */ + if (rxm->offset + tls_ctx->rx.prepend_size > skb->len) + return 0; - /* No MSG_EOR from splice, only look at MSG_MORE */ - eor = !(flags & (MSG_MORE | MSG_SENDPAGE_NOTLAST)); - - lock_sock(sk); - - sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk); - - ret = tls_complete_pending_work(sk, tls_ctx, flags, &timeo); - if (ret) - goto sendpage_end; - - /* Call the sk_stream functions to manage the sndbuf mem. */ - while (size > 0) { - size_t copy, required_size; - - if (sk->sk_err) { - ret = -sk->sk_err; - goto sendpage_end; - } - - full_record = false; - record_room = TLS_MAX_PAYLOAD_SIZE - ctx->sg_plaintext_size; - copy = size; - if (copy >= record_room) { - copy = record_room; - full_record = true; - } - required_size = ctx->sg_plaintext_size + copy + - tls_ctx->overhead_size; - - if (!sk_stream_memory_free(sk)) - goto wait_for_sndbuf; -alloc_payload: - ret = alloc_encrypted_sg(sk, required_size); - if (ret) { - if (ret != -ENOSPC) - goto wait_for_memory; - - /* Adjust copy according to the amount that was - * actually allocated. The difference is due - * to max sg elements limit - */ - copy -= required_size - ctx->sg_plaintext_size; - full_record = true; - } - - get_page(page); - sg = ctx->sg_plaintext_data + ctx->sg_plaintext_num_elem; - sg_set_page(sg, page, copy, offset); - ctx->sg_plaintext_num_elem++; - - sk_mem_charge(sk, copy); - offset += copy; - size -= copy; - ctx->sg_plaintext_size += copy; - tls_ctx->pending_open_record_frags = ctx->sg_plaintext_num_elem; - - if (full_record || eor || - ctx->sg_plaintext_num_elem == - ARRAY_SIZE(ctx->sg_plaintext_data)) { -push_record: - ret = tls_push_record(sk, flags, record_type); - if (ret) { - if (ret == -ENOMEM) - goto wait_for_memory; - - goto sendpage_end; - } - } - continue; -wait_for_sndbuf: - set_bit(SOCK_NOSPACE, &sk->sk_socket->flags); -wait_for_memory: - ret = sk_stream_wait_memory(sk, &timeo); - if (ret) { - trim_both_sgl(sk, ctx->sg_plaintext_size); - goto sendpage_end; - } - - if (tls_is_pending_closed_record(tls_ctx)) - goto push_record; - - goto alloc_payload; + /* Sanity-check size of on-stack buffer. */ + if (WARN_ON(tls_ctx->rx.prepend_size > sizeof(header))) { + ret = -EINVAL; + goto read_failure; } -sendpage_end: - if (orig_size > size) - ret = orig_size - size; - else - ret = sk_stream_error(sk, flags, ret); + /* Linearize header to local buffer */ + ret = skb_copy_bits(skb, rxm->offset, header, tls_ctx->rx.prepend_size); + + if (ret < 0) + goto read_failure; + + ctx->control = header[0]; + + data_len = ((header[4] & 0xFF) | (header[3] << 8)); + + cipher_overhead = tls_ctx->rx.tag_size + tls_ctx->rx.iv_size; + + if (data_len > TLS_MAX_PAYLOAD_SIZE + cipher_overhead) { + ret = -EMSGSIZE; + goto read_failure; + } + if (data_len < cipher_overhead) { + ret = -EBADMSG; + goto read_failure; + } + + if (header[1] != TLS_VERSION_MINOR(tls_ctx->crypto_recv.info.version) || + header[2] != TLS_VERSION_MAJOR(tls_ctx->crypto_recv.info.version)) { + ret = -EINVAL; + goto read_failure; + } + +#ifdef CONFIG_TLS_DEVICE + handle_device_resync(strp->sk, TCP_SKB_CB(skb)->seq + rxm->offset, + *(u64*)tls_ctx->rx.rec_seq); +#endif + return data_len + TLS_HEADER_SIZE; + +read_failure: + tls_err_abort(strp->sk, ret); - release_sock(sk); return ret; } -void tls_sw_free_tx_resources(struct sock *sk) +static void tls_queue(struct strparser *strp, struct sk_buff *skb) +{ + struct tls_context *tls_ctx = tls_get_ctx(strp->sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + + ctx->decrypted = false; + + ctx->recv_pkt = skb; + strp_pause(strp); + + ctx->saved_data_ready(strp->sk); +} + +static void tls_data_ready(struct sock *sk) { struct tls_context *tls_ctx = tls_get_ctx(sk); - struct tls_sw_context *ctx = tls_sw_ctx(tls_ctx); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + struct sk_psock *psock; - if (ctx->aead_send) - crypto_free_aead(ctx->aead_send); + strp_data_ready(&ctx->strp); - tls_free_both_sg(sk); + psock = sk_psock_get(sk); + if (psock && !list_empty(&psock->ingress_msg)) { + ctx->saved_data_ready(sk); + sk_psock_put(sk, psock); + } +} + +void tls_sw_free_resources_tx(struct sock *sk) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + struct tls_rec *rec, *tmp; + + /* Wait for any pending async encryptions to complete */ + smp_store_mb(ctx->async_notify, true); + if (atomic_read(&ctx->encrypt_pending)) + crypto_wait_req(-EINPROGRESS, &ctx->async_wait); + + cancel_delayed_work_sync(&ctx->tx_work.work); + + /* Tx whatever records we can transmit and abandon the rest */ + tls_tx_records(sk, -1); + + /* Free up un-sent records in tx_list. First, free + * the partially sent record if any at head of tx_list. + */ + if (tls_ctx->partially_sent_record) { + struct scatterlist *sg = tls_ctx->partially_sent_record; + + while (1) { + put_page(sg_page(sg)); + sk_mem_uncharge(sk, sg->length); + + if (sg_is_last(sg)) + break; + sg++; + } + + tls_ctx->partially_sent_record = NULL; + + rec = list_first_entry(&ctx->tx_list, + struct tls_rec, list); + list_del(&rec->list); + sk_msg_free(sk, &rec->msg_plaintext); + kfree(rec); + } + + list_for_each_entry_safe(rec, tmp, &ctx->tx_list, list) { + list_del(&rec->list); + sk_msg_free(sk, &rec->msg_encrypted); + sk_msg_free(sk, &rec->msg_plaintext); + kfree(rec); + } + + crypto_free_aead(ctx->aead_send); + tls_free_open_rec(sk); kfree(ctx); } -int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx) +void tls_sw_release_resources_rx(struct sock *sk) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + + kfree(tls_ctx->rx.rec_seq); + kfree(tls_ctx->rx.iv); + + if (ctx->aead_recv) { + kfree_skb(ctx->recv_pkt); + ctx->recv_pkt = NULL; + crypto_free_aead(ctx->aead_recv); + strp_stop(&ctx->strp); + write_lock_bh(&sk->sk_callback_lock); + sk->sk_data_ready = ctx->saved_data_ready; + write_unlock_bh(&sk->sk_callback_lock); + release_sock(sk); + strp_done(&ctx->strp); + lock_sock(sk); + } +} + +void tls_sw_free_resources_rx(struct sock *sk) +{ + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx); + + tls_sw_release_resources_rx(sk); + + kfree(ctx); +} + +/* The work handler to transmitt the encrypted records in tx_list */ +static void tx_work_handler(struct work_struct *work) +{ + struct delayed_work *delayed_work = to_delayed_work(work); + struct tx_work *tx_work = container_of(delayed_work, + struct tx_work, work); + struct sock *sk = tx_work->sk; + struct tls_context *tls_ctx = tls_get_ctx(sk); + struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx); + + if (!test_and_clear_bit(BIT_TX_SCHEDULED, &ctx->tx_bitmask)) + return; + + lock_sock(sk); + tls_tx_records(sk, -1); + release_sock(sk); +} + +int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx) { struct tls_crypto_info *crypto_info; struct tls12_crypto_info_aes_gcm_128 *gcm_128_info; - struct tls_sw_context *sw_ctx; + struct tls_sw_context_tx *sw_ctx_tx = NULL; + struct tls_sw_context_rx *sw_ctx_rx = NULL; + struct cipher_context *cctx; + struct crypto_aead **aead; + struct strp_callbacks cb; u16 nonce_size, tag_size, iv_size, rec_seq_size; char *iv, *rec_seq; int rc = 0; @@ -619,20 +1697,47 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx) goto out; } - if (ctx->priv_ctx) { - rc = -EEXIST; - goto out; + if (tx) { + if (!ctx->priv_ctx_tx) { + sw_ctx_tx = kzalloc(sizeof(*sw_ctx_tx), GFP_KERNEL); + if (!sw_ctx_tx) { + rc = -ENOMEM; + goto out; + } + ctx->priv_ctx_tx = sw_ctx_tx; + } else { + sw_ctx_tx = + (struct tls_sw_context_tx *)ctx->priv_ctx_tx; + } + } else { + if (!ctx->priv_ctx_rx) { + sw_ctx_rx = kzalloc(sizeof(*sw_ctx_rx), GFP_KERNEL); + if (!sw_ctx_rx) { + rc = -ENOMEM; + goto out; + } + ctx->priv_ctx_rx = sw_ctx_rx; + } else { + sw_ctx_rx = + (struct tls_sw_context_rx *)ctx->priv_ctx_rx; + } } - sw_ctx = kzalloc(sizeof(*sw_ctx), GFP_KERNEL); - if (!sw_ctx) { - rc = -ENOMEM; - goto out; + if (tx) { + crypto_init_wait(&sw_ctx_tx->async_wait); + crypto_info = &ctx->crypto_send.info; + cctx = &ctx->tx; + aead = &sw_ctx_tx->aead_send; + INIT_LIST_HEAD(&sw_ctx_tx->tx_list); + INIT_DELAYED_WORK(&sw_ctx_tx->tx_work.work, tx_work_handler); + sw_ctx_tx->tx_work.sk = sk; + } else { + crypto_init_wait(&sw_ctx_rx->async_wait); + crypto_info = &ctx->crypto_recv.info; + cctx = &ctx->rx; + aead = &sw_ctx_rx->aead_recv; } - ctx->priv_ctx = (struct tls_offload_context *)sw_ctx; - - crypto_info = &ctx->crypto_send.info; switch (crypto_info->cipher_type) { case TLS_CIPHER_AES_GCM_128: { nonce_size = TLS_CIPHER_AES_GCM_128_IV_SIZE; @@ -651,73 +1756,86 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx) goto free_priv; } - ctx->prepend_size = TLS_HEADER_SIZE + nonce_size; - ctx->tag_size = tag_size; - ctx->overhead_size = ctx->prepend_size + ctx->tag_size; - ctx->iv_size = iv_size; - ctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE, GFP_KERNEL); - if (!ctx->iv) { + /* Sanity-check the IV size for stack allocations. */ + if (iv_size > MAX_IV_SIZE || nonce_size > MAX_IV_SIZE) { + rc = -EINVAL; + goto free_priv; + } + + cctx->prepend_size = TLS_HEADER_SIZE + nonce_size; + cctx->tag_size = tag_size; + cctx->overhead_size = cctx->prepend_size + cctx->tag_size; + cctx->iv_size = iv_size; + cctx->iv = kmalloc(iv_size + TLS_CIPHER_AES_GCM_128_SALT_SIZE, + GFP_KERNEL); + if (!cctx->iv) { rc = -ENOMEM; goto free_priv; } - memcpy(ctx->iv, gcm_128_info->salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE); - memcpy(ctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size); - ctx->rec_seq_size = rec_seq_size; - ctx->rec_seq = kmalloc(rec_seq_size, GFP_KERNEL); - if (!ctx->rec_seq) { + memcpy(cctx->iv, gcm_128_info->salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE); + memcpy(cctx->iv + TLS_CIPHER_AES_GCM_128_SALT_SIZE, iv, iv_size); + cctx->rec_seq_size = rec_seq_size; + cctx->rec_seq = kmemdup(rec_seq, rec_seq_size, GFP_KERNEL); + if (!cctx->rec_seq) { rc = -ENOMEM; goto free_iv; } - memcpy(ctx->rec_seq, rec_seq, rec_seq_size); - sg_init_table(sw_ctx->sg_encrypted_data, - ARRAY_SIZE(sw_ctx->sg_encrypted_data)); - sg_init_table(sw_ctx->sg_plaintext_data, - ARRAY_SIZE(sw_ctx->sg_plaintext_data)); - - sg_init_table(sw_ctx->sg_aead_in, 2); - sg_set_buf(&sw_ctx->sg_aead_in[0], sw_ctx->aad_space, - sizeof(sw_ctx->aad_space)); - sg_unmark_end(&sw_ctx->sg_aead_in[1]); - sg_chain(sw_ctx->sg_aead_in, 2, sw_ctx->sg_plaintext_data); - sg_init_table(sw_ctx->sg_aead_out, 2); - sg_set_buf(&sw_ctx->sg_aead_out[0], sw_ctx->aad_space, - sizeof(sw_ctx->aad_space)); - sg_unmark_end(&sw_ctx->sg_aead_out[1]); - sg_chain(sw_ctx->sg_aead_out, 2, sw_ctx->sg_encrypted_data); - - if (!sw_ctx->aead_send) { - sw_ctx->aead_send = crypto_alloc_aead("gcm(aes)", 0, 0); - if (IS_ERR(sw_ctx->aead_send)) { - rc = PTR_ERR(sw_ctx->aead_send); - sw_ctx->aead_send = NULL; + if (!*aead) { + *aead = crypto_alloc_aead("gcm(aes)", 0, 0); + if (IS_ERR(*aead)) { + rc = PTR_ERR(*aead); + *aead = NULL; goto free_rec_seq; } } ctx->push_pending_record = tls_sw_push_pending_record; - rc = crypto_aead_setkey(sw_ctx->aead_send, gcm_128_info->key, + rc = crypto_aead_setkey(*aead, gcm_128_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE); if (rc) goto free_aead; - rc = crypto_aead_setauthsize(sw_ctx->aead_send, ctx->tag_size); - if (!rc) - return 0; + rc = crypto_aead_setauthsize(*aead, cctx->tag_size); + if (rc) + goto free_aead; + + if (sw_ctx_rx) { + /* Set up strparser */ + memset(&cb, 0, sizeof(cb)); + cb.rcv_msg = tls_queue; + cb.parse_msg = tls_read_size; + + strp_init(&sw_ctx_rx->strp, sk, &cb); + + write_lock_bh(&sk->sk_callback_lock); + sw_ctx_rx->saved_data_ready = sk->sk_data_ready; + sk->sk_data_ready = tls_data_ready; + write_unlock_bh(&sk->sk_callback_lock); + + strp_check_rcv(&sw_ctx_rx->strp); + } + + goto out; free_aead: - crypto_free_aead(sw_ctx->aead_send); - sw_ctx->aead_send = NULL; + crypto_free_aead(*aead); + *aead = NULL; free_rec_seq: - kfree(ctx->rec_seq); - ctx->rec_seq = NULL; + kfree(cctx->rec_seq); + cctx->rec_seq = NULL; free_iv: - kfree(ctx->iv); - ctx->iv = NULL; + kfree(cctx->iv); + cctx->iv = NULL; free_priv: - kfree(ctx->priv_ctx); - ctx->priv_ctx = NULL; + if (tx) { + kfree(ctx->priv_ctx_tx); + ctx->priv_ctx_tx = NULL; + } else { + kfree(ctx->priv_ctx_rx); + ctx->priv_ctx_rx = NULL; + } out: return rc; } diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig new file mode 100644 index 000000000000..90e4a7152854 --- /dev/null +++ b/net/xdp/Kconfig @@ -0,0 +1,7 @@ +config XDP_SOCKETS + bool "XDP sockets" + depends on BPF_SYSCALL + default n + help + XDP sockets allows a channel between XDP programs and + userspace applications. diff --git a/net/xdp/Makefile b/net/xdp/Makefile new file mode 100644 index 000000000000..04f073146256 --- /dev/null +++ b/net/xdp/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c new file mode 100644 index 000000000000..e183f8c1ba34 --- /dev/null +++ b/net/xdp/xdp_umem.c @@ -0,0 +1,402 @@ +// SPDX-License-Identifier: GPL-2.0 +/* XDP user-space packet buffer + * Copyright(c) 2018 Intel Corporation. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "xdp_umem.h" +#include "xsk_queue.h" + +#define XDP_UMEM_MIN_CHUNK_SIZE 2048 + +void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs) +{ + unsigned long flags; + + if (!xs->tx) + return; + + spin_lock_irqsave(&umem->xsk_list_lock, flags); + list_add_rcu(&xs->list, &umem->xsk_list); + spin_unlock_irqrestore(&umem->xsk_list_lock, flags); +} + +void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs) +{ + unsigned long flags; + + if (!xs->tx) + return; + + spin_lock_irqsave(&umem->xsk_list_lock, flags); + list_del_rcu(&xs->list); + spin_unlock_irqrestore(&umem->xsk_list_lock, flags); +} + +/* The umem is stored both in the _rx struct and the _tx struct as we do + * not know if the device has more tx queues than rx, or the opposite. + * This might also change during run time. + */ +static void xdp_reg_umem_at_qid(struct net_device *dev, struct xdp_umem *umem, + u16 queue_id) +{ + if (queue_id < dev->real_num_rx_queues) + dev->_rx[queue_id].umem = umem; + if (queue_id < dev->real_num_tx_queues) + dev->_tx[queue_id].umem = umem; +} + +static struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, + u16 queue_id) +{ + if (queue_id < dev->real_num_rx_queues) + return dev->_rx[queue_id].umem; + if (queue_id < dev->real_num_tx_queues) + return dev->_tx[queue_id].umem; + + return NULL; +} + +static void xdp_clear_umem_at_qid(struct net_device *dev, u16 queue_id) +{ + /* Zero out the entry independent on how many queues are configured + * at this point in time, as it might be used in the future. + */ + if (queue_id < dev->num_rx_queues) + dev->_rx[queue_id].umem = NULL; + if (queue_id < dev->num_tx_queues) + dev->_tx[queue_id].umem = NULL; +} + +int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev, + u16 queue_id, u16 flags) +{ + bool force_zc, force_copy; + struct netdev_bpf bpf; + int err = 0; + + force_zc = flags & XDP_ZEROCOPY; + force_copy = flags & XDP_COPY; + + if (force_zc && force_copy) + return -EINVAL; + + rtnl_lock(); + if (xdp_get_umem_from_qid(dev, queue_id)) { + err = -EBUSY; + goto out_rtnl_unlock; + } + + xdp_reg_umem_at_qid(dev, umem, queue_id); + umem->dev = dev; + umem->queue_id = queue_id; + if (force_copy) + /* For copy-mode, we are done. */ + goto out_rtnl_unlock; + + if (!dev->netdev_ops->ndo_bpf || + !dev->netdev_ops->ndo_xsk_async_xmit) { + err = -EOPNOTSUPP; + goto err_unreg_umem; + } + + bpf.command = XDP_SETUP_XSK_UMEM; + bpf.xsk.umem = umem; + bpf.xsk.queue_id = queue_id; + + err = dev->netdev_ops->ndo_bpf(dev, &bpf); + if (err) + goto err_unreg_umem; + rtnl_unlock(); + + dev_hold(dev); + umem->zc = true; + return 0; + +err_unreg_umem: + xdp_clear_umem_at_qid(dev, queue_id); + if (!force_zc) + err = 0; /* fallback to copy mode */ +out_rtnl_unlock: + rtnl_unlock(); + return err; +} + +void xdp_umem_clear_dev(struct xdp_umem *umem) +{ + struct netdev_bpf bpf; + int err; + + ASSERT_RTNL(); + + if (!umem->dev) + return; + + if (umem->zc) { + bpf.command = XDP_SETUP_XSK_UMEM; + bpf.xsk.umem = NULL; + bpf.xsk.queue_id = umem->queue_id; + + err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf); + + if (err) + WARN(1, "failed to disable umem!\n"); + } + + xdp_clear_umem_at_qid(umem->dev, umem->queue_id); + + if (umem->zc) { + dev_put(umem->dev); + umem->zc = false; + } +} + +static void xdp_umem_unpin_pages(struct xdp_umem *umem) +{ + unsigned int i; + + for (i = 0; i < umem->npgs; i++) { + struct page *page = umem->pgs[i]; + + set_page_dirty_lock(page); + put_page(page); + } + + kfree(umem->pgs); + umem->pgs = NULL; +} + +static void xdp_umem_unaccount_pages(struct xdp_umem *umem) +{ + if (umem->user) { + atomic_long_sub(umem->npgs, &umem->user->locked_vm); + free_uid(umem->user); + } +} + +static void xdp_umem_release(struct xdp_umem *umem) +{ + rtnl_lock(); + xdp_umem_clear_dev(umem); + rtnl_unlock(); + + if (umem->fq) { + xskq_destroy(umem->fq); + umem->fq = NULL; + } + + if (umem->cq) { + xskq_destroy(umem->cq); + umem->cq = NULL; + } + + xsk_reuseq_destroy(umem); + + xdp_umem_unpin_pages(umem); + + kfree(umem->pages); + umem->pages = NULL; + + xdp_umem_unaccount_pages(umem); + kfree(umem); +} + +static void xdp_umem_release_deferred(struct work_struct *work) +{ + struct xdp_umem *umem = container_of(work, struct xdp_umem, work); + + xdp_umem_release(umem); +} + +void xdp_get_umem(struct xdp_umem *umem) +{ + refcount_inc(&umem->users); +} + +void xdp_put_umem(struct xdp_umem *umem) +{ + if (!umem) + return; + + if (refcount_dec_and_test(&umem->users)) { + INIT_WORK(&umem->work, xdp_umem_release_deferred); + schedule_work(&umem->work); + } +} + +static int xdp_umem_pin_pages(struct xdp_umem *umem) +{ + unsigned int gup_flags = FOLL_WRITE; + long npgs; + int err; + + umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), + GFP_KERNEL | __GFP_NOWARN); + if (!umem->pgs) + return -ENOMEM; + + down_write(¤t->mm->mmap_sem); + npgs = get_user_pages(umem->address, umem->npgs, + gup_flags, &umem->pgs[0], NULL); + up_write(¤t->mm->mmap_sem); + + if (npgs != umem->npgs) { + if (npgs >= 0) { + umem->npgs = npgs; + err = -ENOMEM; + goto out_pin; + } + err = npgs; + goto out_pgs; + } + return 0; + +out_pin: + xdp_umem_unpin_pages(umem); +out_pgs: + kfree(umem->pgs); + umem->pgs = NULL; + return err; +} + +static int xdp_umem_account_pages(struct xdp_umem *umem) +{ + unsigned long lock_limit, new_npgs, old_npgs; + + if (capable(CAP_IPC_LOCK)) + return 0; + + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + umem->user = get_uid(current_user()); + + do { + old_npgs = atomic_long_read(&umem->user->locked_vm); + new_npgs = old_npgs + umem->npgs; + if (new_npgs > lock_limit) { + free_uid(umem->user); + umem->user = NULL; + return -ENOBUFS; + } + } while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs, + new_npgs) != old_npgs); + return 0; +} + +static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) +{ + u32 chunk_size = mr->chunk_size, headroom = mr->headroom; + u64 npgs, addr = mr->addr, size = mr->len; + unsigned int chunks, chunks_per_page; + int err, i; + + if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) { + /* Strictly speaking we could support this, if: + * - huge pages, or* + * - using an IOMMU, or + * - making sure the memory area is consecutive + * but for now, we simply say "computer says no". + */ + return -EINVAL; + } + + if (!is_power_of_2(chunk_size)) + return -EINVAL; + + if (!PAGE_ALIGNED(addr)) { + /* Memory area has to be page size aligned. For + * simplicity, this might change. + */ + return -EINVAL; + } + + if ((addr + size) < addr) + return -EINVAL; + + npgs = div_u64(size, PAGE_SIZE); + if (npgs > U32_MAX) + return -EINVAL; + + chunks = (unsigned int)div_u64(size, chunk_size); + if (chunks == 0) + return -EINVAL; + + chunks_per_page = PAGE_SIZE / chunk_size; + if (chunks < chunks_per_page || chunks % chunks_per_page) + return -EINVAL; + + headroom = ALIGN(headroom, 64); + + if (headroom >= chunk_size - XDP_PACKET_HEADROOM) + return -EINVAL; + + umem->address = (unsigned long)addr; + umem->props.chunk_mask = ~((u64)chunk_size - 1); + umem->props.size = size; + umem->headroom = headroom; + umem->chunk_size_nohr = chunk_size - headroom; + umem->npgs = (u32)npgs; + umem->pgs = NULL; + umem->user = NULL; + INIT_LIST_HEAD(&umem->xsk_list); + spin_lock_init(&umem->xsk_list_lock); + + refcount_set(&umem->users, 1); + + err = xdp_umem_account_pages(umem); + if (err) + return err; + + err = xdp_umem_pin_pages(umem); + if (err) + goto out_account; + + umem->pages = kcalloc(umem->npgs, sizeof(*umem->pages), GFP_KERNEL); + if (!umem->pages) { + err = -ENOMEM; + goto out_pin; + } + + for (i = 0; i < umem->npgs; i++) + umem->pages[i].addr = page_address(umem->pgs[i]); + + return 0; + +out_pin: + xdp_umem_unpin_pages(umem); +out_account: + xdp_umem_unaccount_pages(umem); + return err; +} + +struct xdp_umem *xdp_umem_create(struct xdp_umem_reg *mr) +{ + struct xdp_umem *umem; + int err; + + umem = kzalloc(sizeof(*umem), GFP_KERNEL); + if (!umem) + return ERR_PTR(-ENOMEM); + + err = xdp_umem_reg(umem, mr); + if (err) { + kfree(umem); + return ERR_PTR(err); + } + + return umem; +} + +bool xdp_umem_validate_queues(struct xdp_umem *umem) +{ + return umem->fq && umem->cq; +} diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h new file mode 100644 index 000000000000..a63a9fb251f5 --- /dev/null +++ b/net/xdp/xdp_umem.h @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* XDP user-space packet buffer + * Copyright(c) 2018 Intel Corporation. + */ + +#ifndef XDP_UMEM_H_ +#define XDP_UMEM_H_ + +#include + +int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev, + u16 queue_id, u16 flags); +void xdp_umem_clear_dev(struct xdp_umem *umem); +bool xdp_umem_validate_queues(struct xdp_umem *umem); +void xdp_get_umem(struct xdp_umem *umem); +void xdp_put_umem(struct xdp_umem *umem); +void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs); +void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs); +struct xdp_umem *xdp_umem_create(struct xdp_umem_reg *mr); + +#endif /* XDP_UMEM_H_ */ diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h new file mode 100644 index 000000000000..40eab10dfc49 --- /dev/null +++ b/net/xdp/xdp_umem_props.h @@ -0,0 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* XDP user-space packet buffer + * Copyright(c) 2018 Intel Corporation. + */ + +#ifndef XDP_UMEM_PROPS_H_ +#define XDP_UMEM_PROPS_H_ + +struct xdp_umem_props { + u64 chunk_mask; + u64 size; +}; + +#endif /* XDP_UMEM_PROPS_H_ */ diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c new file mode 100644 index 000000000000..11ac829532c2 --- /dev/null +++ b/net/xdp/xsk.c @@ -0,0 +1,930 @@ +// SPDX-License-Identifier: GPL-2.0 +/* XDP sockets + * + * AF_XDP sockets allows a channel between XDP programs and userspace + * applications. + * Copyright(c) 2018 Intel Corporation. + * + * Author(s): Björn Töpel + * Magnus Karlsson + */ + +#define pr_fmt(fmt) "AF_XDP: %s: " fmt, __func__ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "xsk_queue.h" +#include "xdp_umem.h" + +#define TX_BATCH_SIZE 16 + +static struct xdp_sock *xdp_sk(struct sock *sk) +{ + return (struct xdp_sock *)sk; +} + +bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs) +{ + return READ_ONCE(xs->rx) && READ_ONCE(xs->umem) && + READ_ONCE(xs->umem->fq); +} + +u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr) +{ + return xskq_peek_addr(umem->fq, addr); +} +EXPORT_SYMBOL(xsk_umem_peek_addr); + +void xsk_umem_discard_addr(struct xdp_umem *umem) +{ + xskq_discard_addr(umem->fq); +} +EXPORT_SYMBOL(xsk_umem_discard_addr); + +static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len) +{ + void *buffer; + u64 addr; + int err; + + if (!xskq_peek_addr(xs->umem->fq, &addr) || + len > xs->umem->chunk_size_nohr) { + xs->rx_dropped++; + return -ENOSPC; + } + + addr += xs->umem->headroom; + + buffer = xdp_umem_get_data(xs->umem, addr); + memcpy(buffer, xdp->data, len); + err = xskq_produce_batch_desc(xs->rx, addr, len); + if (!err) { + xskq_discard_addr(xs->umem->fq); + xdp_return_buff(xdp); + return 0; + } + + xs->rx_dropped++; + return err; +} + +static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len) +{ + int err = xskq_produce_batch_desc(xs->rx, (u64)xdp->handle, len); + + if (err) + xs->rx_dropped++; + + return err; +} + +int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) +{ + u32 len; + + if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index) + return -EINVAL; + + len = xdp->data_end - xdp->data; + + return (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) ? + __xsk_rcv_zc(xs, xdp, len) : __xsk_rcv(xs, xdp, len); +} + +void xsk_flush(struct xdp_sock *xs) +{ + xskq_produce_flush_desc(xs->rx); + xs->sk.sk_data_ready(&xs->sk); +} + +int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) +{ + u32 len = xdp->data_end - xdp->data; + void *buffer; + u64 addr; + int err; + + if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index) + return -EINVAL; + + if (!xskq_peek_addr(xs->umem->fq, &addr) || + len > xs->umem->chunk_size_nohr) { + xs->rx_dropped++; + return -ENOSPC; + } + + addr += xs->umem->headroom; + + buffer = xdp_umem_get_data(xs->umem, addr); + memcpy(buffer, xdp->data, len); + err = xskq_produce_batch_desc(xs->rx, addr, len); + if (!err) { + xskq_discard_addr(xs->umem->fq); + xsk_flush(xs); + return 0; + } + + xs->rx_dropped++; + return err; +} + +void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries) +{ + xskq_produce_flush_addr_n(umem->cq, nb_entries); +} +EXPORT_SYMBOL(xsk_umem_complete_tx); + +void xsk_umem_consume_tx_done(struct xdp_umem *umem) +{ + struct xdp_sock *xs; + + rcu_read_lock(); + list_for_each_entry_rcu(xs, &umem->xsk_list, list) { + xs->sk.sk_write_space(&xs->sk); + } + rcu_read_unlock(); +} +EXPORT_SYMBOL(xsk_umem_consume_tx_done); + +bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len) +{ + struct xdp_desc desc; + struct xdp_sock *xs; + + rcu_read_lock(); + list_for_each_entry_rcu(xs, &umem->xsk_list, list) { + if (!xskq_peek_desc(xs->tx, &desc)) + continue; + + if (xskq_produce_addr_lazy(umem->cq, desc.addr)) + goto out; + + *dma = xdp_umem_get_dma(umem, desc.addr); + *len = desc.len; + + xskq_discard_desc(xs->tx); + rcu_read_unlock(); + return true; + } + +out: + rcu_read_unlock(); + return false; +} +EXPORT_SYMBOL(xsk_umem_consume_tx); + +static int xsk_zc_xmit(struct sock *sk) +{ + struct xdp_sock *xs = xdp_sk(sk); + struct net_device *dev = xs->dev; + + return dev->netdev_ops->ndo_xsk_async_xmit(dev, xs->queue_id); +} + +static void xsk_destruct_skb(struct sk_buff *skb) +{ + u64 addr = (u64)(long)skb_shinfo(skb)->destructor_arg; + struct xdp_sock *xs = xdp_sk(skb->sk); + unsigned long flags; + + spin_lock_irqsave(&xs->tx_completion_lock, flags); + WARN_ON_ONCE(xskq_produce_addr(xs->umem->cq, addr)); + spin_unlock_irqrestore(&xs->tx_completion_lock, flags); + + sock_wfree(skb); +} + +static int xsk_generic_xmit(struct sock *sk, struct msghdr *m, + size_t total_len) +{ + u32 max_batch = TX_BATCH_SIZE; + struct xdp_sock *xs = xdp_sk(sk); + bool sent_frame = false; + struct xdp_desc desc; + struct sk_buff *skb; + int err = 0; + + mutex_lock(&xs->mutex); + + if (xs->queue_id >= xs->dev->real_num_tx_queues) + goto out; + + while (xskq_peek_desc(xs->tx, &desc)) { + char *buffer; + u64 addr; + u32 len; + + if (max_batch-- == 0) { + err = -EAGAIN; + goto out; + } + + len = desc.len; + skb = sock_alloc_send_skb(sk, len, 1, &err); + if (unlikely(!skb)) + goto out; + + skb_put(skb, len); + addr = desc.addr; + buffer = xdp_umem_get_data(xs->umem, addr); + err = skb_store_bits(skb, 0, buffer, len); + if (unlikely(err) || xskq_reserve_addr(xs->umem->cq)) { + kfree_skb(skb); + goto out; + } + + skb->dev = xs->dev; + skb->priority = sk->sk_priority; + skb->mark = sk->sk_mark; + skb_shinfo(skb)->destructor_arg = (void *)(long)addr; + skb->destructor = xsk_destruct_skb; + + err = dev_direct_xmit(skb, xs->queue_id); + xskq_discard_desc(xs->tx); + /* Ignore NET_XMIT_CN as packet might have been sent */ + if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) { + /* SKB completed but not sent */ + err = -EBUSY; + goto out; + } + + sent_frame = true; + } + +out: + if (sent_frame) + sk->sk_write_space(sk); + + mutex_unlock(&xs->mutex); + return err; +} + +static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len) +{ + bool need_wait = !(m->msg_flags & MSG_DONTWAIT); + struct sock *sk = sock->sk; + struct xdp_sock *xs = xdp_sk(sk); + + if (unlikely(!xs->dev)) + return -ENXIO; + if (unlikely(!(xs->dev->flags & IFF_UP))) + return -ENETDOWN; + if (unlikely(!xs->tx)) + return -ENOBUFS; + if (need_wait) + return -EOPNOTSUPP; + + return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len); +} + +static __poll_t xsk_poll(struct file *file, struct socket *sock, + struct poll_table_struct *wait) +{ + __poll_t mask = datagram_poll(file, sock, wait); + struct sock *sk = sock->sk; + struct xdp_sock *xs = xdp_sk(sk); + + if (xs->rx && !xskq_empty_desc(xs->rx)) + mask |= EPOLLIN | EPOLLRDNORM; + if (xs->tx && !xskq_full_desc(xs->tx)) + mask |= EPOLLOUT | EPOLLWRNORM; + + return mask; +} + +static int xsk_init_queue(u32 entries, struct xsk_queue **queue, + bool umem_queue) +{ + struct xsk_queue *q; + + if (entries == 0 || *queue || !is_power_of_2(entries)) + return -EINVAL; + + q = xskq_create(entries, umem_queue); + if (!q) + return -ENOMEM; + + /* Make sure queue is ready before it can be seen by others */ + smp_wmb(); + WRITE_ONCE(*queue, q); + return 0; +} + +static void xsk_unbind_dev(struct xdp_sock *xs) +{ + struct net_device *dev = xs->dev; + + if (!dev || xs->state != XSK_BOUND) + return; + + xs->state = XSK_UNBOUND; + + /* Wait for driver to stop using the xdp socket. */ + xdp_del_sk_umem(xs->umem, xs); + xs->dev = NULL; + synchronize_net(); + dev_put(dev); +} + +static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs, + struct xdp_sock ***map_entry) +{ + struct xsk_map *map = NULL; + struct xsk_map_node *node; + + *map_entry = NULL; + + spin_lock_bh(&xs->map_list_lock); + node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node, + node); + if (node) { + WARN_ON(xsk_map_inc(node->map)); + map = node->map; + *map_entry = node->map_entry; + } + spin_unlock_bh(&xs->map_list_lock); + return map; +} + +static void xsk_delete_from_maps(struct xdp_sock *xs) +{ + /* This function removes the current XDP socket from all the + * maps it resides in. We need to take extra care here, due to + * the two locks involved. Each map has a lock synchronizing + * updates to the entries, and each socket has a lock that + * synchronizes access to the list of maps (map_list). For + * deadlock avoidance the locks need to be taken in the order + * "map lock"->"socket map list lock". We start off by + * accessing the socket map list, and take a reference to the + * map to guarantee existence between the + * xsk_get_map_list_entry() and xsk_map_try_sock_delete() + * calls. Then we ask the map to remove the socket, which + * tries to remove the socket from the map. Note that there + * might be updates to the map between + * xsk_get_map_list_entry() and xsk_map_try_sock_delete(). + */ + struct xdp_sock **map_entry = NULL; + struct xsk_map *map; + + while ((map = xsk_get_map_list_entry(xs, &map_entry))) { + xsk_map_try_sock_delete(map, xs, map_entry); + xsk_map_put(map); + } +} + +static int xsk_release(struct socket *sock) +{ + struct sock *sk = sock->sk; + struct xdp_sock *xs = xdp_sk(sk); + struct net *net; + + if (!sk) + return 0; + + net = sock_net(sk); + + mutex_lock(&net->xdp.lock); + sk_del_node_init_rcu(sk); + mutex_unlock(&net->xdp.lock); + + local_bh_disable(); + sock_prot_inuse_add(net, sk->sk_prot, -1); + local_bh_enable(); + + xsk_delete_from_maps(xs); + xsk_unbind_dev(xs); + + xskq_destroy(xs->rx); + xskq_destroy(xs->tx); + + sock_orphan(sk); + sock->sk = NULL; + + sk_refcnt_debug_release(sk); + sock_put(sk); + + return 0; +} + +static struct socket *xsk_lookup_xsk_from_fd(int fd) +{ + struct socket *sock; + int err; + + sock = sockfd_lookup(fd, &err); + if (!sock) + return ERR_PTR(-ENOTSOCK); + + if (sock->sk->sk_family != PF_XDP) { + sockfd_put(sock); + return ERR_PTR(-ENOPROTOOPT); + } + + return sock; +} + +static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) +{ + struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr; + struct sock *sk = sock->sk; + struct xdp_sock *xs = xdp_sk(sk); + struct net_device *dev; + u32 flags, qid; + int err = 0; + + if (addr_len < sizeof(struct sockaddr_xdp)) + return -EINVAL; + if (sxdp->sxdp_family != AF_XDP) + return -EINVAL; + + mutex_lock(&xs->mutex); + if (xs->state != XSK_READY) { + err = -EBUSY; + goto out_release; + } + + dev = dev_get_by_index(sock_net(sk), sxdp->sxdp_ifindex); + if (!dev) { + err = -ENODEV; + goto out_release; + } + + if (!xs->rx && !xs->tx) { + err = -EINVAL; + goto out_unlock; + } + + qid = sxdp->sxdp_queue_id; + flags = sxdp->sxdp_flags; + + if (flags & XDP_SHARED_UMEM) { + struct xdp_sock *umem_xs; + struct socket *sock; + + if ((flags & XDP_COPY) || (flags & XDP_ZEROCOPY)) { + /* Cannot specify flags for shared sockets. */ + err = -EINVAL; + goto out_unlock; + } + + if (xs->umem) { + /* We have already our own. */ + err = -EINVAL; + goto out_unlock; + } + + sock = xsk_lookup_xsk_from_fd(sxdp->sxdp_shared_umem_fd); + if (IS_ERR(sock)) { + err = PTR_ERR(sock); + goto out_unlock; + } + + umem_xs = xdp_sk(sock->sk); + if (!umem_xs->umem) { + /* No umem to inherit. */ + err = -EBADF; + sockfd_put(sock); + goto out_unlock; + } else if (umem_xs->dev != dev || umem_xs->queue_id != qid) { + err = -EINVAL; + sockfd_put(sock); + goto out_unlock; + } + + xdp_get_umem(umem_xs->umem); + WRITE_ONCE(xs->umem, umem_xs->umem); + sockfd_put(sock); + } else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) { + err = -EINVAL; + goto out_unlock; + } else { + /* This xsk has its own umem. */ + xskq_set_umem(xs->umem->fq, &xs->umem->props); + xskq_set_umem(xs->umem->cq, &xs->umem->props); + + err = xdp_umem_assign_dev(xs->umem, dev, qid, flags); + if (err) + goto out_unlock; + } + + xs->dev = dev; + xs->zc = xs->umem->zc; + xs->queue_id = qid; + xskq_set_umem(xs->rx, &xs->umem->props); + xskq_set_umem(xs->tx, &xs->umem->props); + xdp_add_sk_umem(xs->umem, xs); + +out_unlock: + if (err) + dev_put(dev); + else + xs->state = XSK_BOUND; +out_release: + mutex_unlock(&xs->mutex); + return err; +} + +static int xsk_setsockopt(struct socket *sock, int level, int optname, + char __user *optval, unsigned int optlen) +{ + struct sock *sk = sock->sk; + struct xdp_sock *xs = xdp_sk(sk); + int err; + + if (level != SOL_XDP) + return -ENOPROTOOPT; + + switch (optname) { + case XDP_RX_RING: + case XDP_TX_RING: + { + struct xsk_queue **q; + int entries; + + if (optlen < sizeof(entries)) + return -EINVAL; + if (copy_from_user(&entries, optval, sizeof(entries))) + return -EFAULT; + + mutex_lock(&xs->mutex); + if (xs->state != XSK_READY) { + mutex_unlock(&xs->mutex); + return -EBUSY; + } + q = (optname == XDP_TX_RING) ? &xs->tx : &xs->rx; + err = xsk_init_queue(entries, q, false); + mutex_unlock(&xs->mutex); + return err; + } + case XDP_UMEM_REG: + { + struct xdp_umem_reg mr; + struct xdp_umem *umem; + + if (optlen < sizeof(mr)) + return -EINVAL; + if (copy_from_user(&mr, optval, sizeof(mr))) + return -EFAULT; + + mutex_lock(&xs->mutex); + if (xs->state != XSK_READY || xs->umem) { + mutex_unlock(&xs->mutex); + return -EBUSY; + } + + umem = xdp_umem_create(&mr); + if (IS_ERR(umem)) { + mutex_unlock(&xs->mutex); + return PTR_ERR(umem); + } + + /* Make sure umem is ready before it can be seen by others */ + smp_wmb(); + WRITE_ONCE(xs->umem, umem); + mutex_unlock(&xs->mutex); + return 0; + } + case XDP_UMEM_FILL_RING: + case XDP_UMEM_COMPLETION_RING: + { + struct xsk_queue **q; + int entries; + + if (optlen < sizeof(entries)) + return -EINVAL; + if (copy_from_user(&entries, optval, sizeof(entries))) + return -EFAULT; + + mutex_lock(&xs->mutex); + if (xs->state != XSK_READY) { + mutex_unlock(&xs->mutex); + return -EBUSY; + } + if (!xs->umem) { + mutex_unlock(&xs->mutex); + return -EINVAL; + } + + q = (optname == XDP_UMEM_FILL_RING) ? &xs->umem->fq : + &xs->umem->cq; + err = xsk_init_queue(entries, q, true); + mutex_unlock(&xs->mutex); + return err; + } + default: + break; + } + + return -ENOPROTOOPT; +} + +static int xsk_getsockopt(struct socket *sock, int level, int optname, + char __user *optval, int __user *optlen) +{ + struct sock *sk = sock->sk; + struct xdp_sock *xs = xdp_sk(sk); + int len; + + if (level != SOL_XDP) + return -ENOPROTOOPT; + + if (get_user(len, optlen)) + return -EFAULT; + if (len < 0) + return -EINVAL; + + switch (optname) { + case XDP_STATISTICS: + { + struct xdp_statistics stats; + + if (len < sizeof(stats)) + return -EINVAL; + + mutex_lock(&xs->mutex); + stats.rx_dropped = xs->rx_dropped; + stats.rx_invalid_descs = xskq_nb_invalid_descs(xs->rx); + stats.tx_invalid_descs = xskq_nb_invalid_descs(xs->tx); + mutex_unlock(&xs->mutex); + + if (copy_to_user(optval, &stats, sizeof(stats))) + return -EFAULT; + if (put_user(sizeof(stats), optlen)) + return -EFAULT; + + return 0; + } + case XDP_MMAP_OFFSETS: + { + struct xdp_mmap_offsets off; + + if (len < sizeof(off)) + return -EINVAL; + + off.rx.producer = offsetof(struct xdp_rxtx_ring, ptrs.producer); + off.rx.consumer = offsetof(struct xdp_rxtx_ring, ptrs.consumer); + off.rx.desc = offsetof(struct xdp_rxtx_ring, desc); + off.tx.producer = offsetof(struct xdp_rxtx_ring, ptrs.producer); + off.tx.consumer = offsetof(struct xdp_rxtx_ring, ptrs.consumer); + off.tx.desc = offsetof(struct xdp_rxtx_ring, desc); + + off.fr.producer = offsetof(struct xdp_umem_ring, ptrs.producer); + off.fr.consumer = offsetof(struct xdp_umem_ring, ptrs.consumer); + off.fr.desc = offsetof(struct xdp_umem_ring, desc); + off.cr.producer = offsetof(struct xdp_umem_ring, ptrs.producer); + off.cr.consumer = offsetof(struct xdp_umem_ring, ptrs.consumer); + off.cr.desc = offsetof(struct xdp_umem_ring, desc); + + len = sizeof(off); + if (copy_to_user(optval, &off, len)) + return -EFAULT; + if (put_user(len, optlen)) + return -EFAULT; + + return 0; + } + default: + break; + } + + return -EOPNOTSUPP; +} + +static int xsk_mmap(struct file *file, struct socket *sock, + struct vm_area_struct *vma) +{ + loff_t offset = (loff_t)vma->vm_pgoff << PAGE_SHIFT; + unsigned long size = vma->vm_end - vma->vm_start; + struct xdp_sock *xs = xdp_sk(sock->sk); + struct xsk_queue *q = NULL; + struct xdp_umem *umem; + unsigned long pfn; + struct page *qpg; + + if (xs->state != XSK_READY) + return -EBUSY; + + if (offset == XDP_PGOFF_RX_RING) { + q = READ_ONCE(xs->rx); + } else if (offset == XDP_PGOFF_TX_RING) { + q = READ_ONCE(xs->tx); + } else { + umem = READ_ONCE(xs->umem); + if (!umem) + return -EINVAL; + + /* Matches the smp_wmb() in XDP_UMEM_REG */ + smp_rmb(); + if (offset == XDP_UMEM_PGOFF_FILL_RING) + q = READ_ONCE(umem->fq); + else if (offset == XDP_UMEM_PGOFF_COMPLETION_RING) + q = READ_ONCE(umem->cq); + } + + if (!q) + return -EINVAL; + + /* Matches the smp_wmb() in xsk_init_queue */ + smp_rmb(); + qpg = virt_to_head_page(q->ring); + if (size > (PAGE_SIZE << compound_order(qpg))) + return -EINVAL; + + pfn = virt_to_phys(q->ring) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, + size, vma->vm_page_prot); +} + +static int xsk_notifier(struct notifier_block *this, + unsigned long msg, void *ptr) +{ + struct net_device *dev = netdev_notifier_info_to_dev(ptr); + struct net *net = dev_net(dev); + struct sock *sk; + + switch (msg) { + case NETDEV_UNREGISTER: + mutex_lock(&net->xdp.lock); + sk_for_each(sk, &net->xdp.list) { + struct xdp_sock *xs = xdp_sk(sk); + + mutex_lock(&xs->mutex); + if (xs->dev == dev) { + sk->sk_err = ENETDOWN; + if (!sock_flag(sk, SOCK_DEAD)) + sk->sk_error_report(sk); + + xsk_unbind_dev(xs); + + /* Clear device references in umem. */ + xdp_umem_clear_dev(xs->umem); + } + mutex_unlock(&xs->mutex); + } + mutex_unlock(&net->xdp.lock); + break; + } + return NOTIFY_DONE; +} + +static struct proto xsk_proto = { + .name = "XDP", + .owner = THIS_MODULE, + .obj_size = sizeof(struct xdp_sock), +}; + +static const struct proto_ops xsk_proto_ops = { + .family = PF_XDP, + .owner = THIS_MODULE, + .release = xsk_release, + .bind = xsk_bind, + .connect = sock_no_connect, + .socketpair = sock_no_socketpair, + .accept = sock_no_accept, + .getname = sock_no_getname, + .poll = xsk_poll, + .ioctl = sock_no_ioctl, + .listen = sock_no_listen, + .shutdown = sock_no_shutdown, + .setsockopt = xsk_setsockopt, + .getsockopt = xsk_getsockopt, + .sendmsg = xsk_sendmsg, + .recvmsg = sock_no_recvmsg, + .mmap = xsk_mmap, + .sendpage = sock_no_sendpage, +}; + +static void xsk_destruct(struct sock *sk) +{ + struct xdp_sock *xs = xdp_sk(sk); + + if (!sock_flag(sk, SOCK_DEAD)) + return; + + xdp_put_umem(xs->umem); + + sk_refcnt_debug_dec(sk); +} + +static int xsk_create(struct net *net, struct socket *sock, int protocol, + int kern) +{ + struct sock *sk; + struct xdp_sock *xs; + + if (!ns_capable(net->user_ns, CAP_NET_RAW)) + return -EPERM; + if (sock->type != SOCK_RAW) + return -ESOCKTNOSUPPORT; + + if (protocol) + return -EPROTONOSUPPORT; + + sock->state = SS_UNCONNECTED; + + sk = sk_alloc(net, PF_XDP, GFP_KERNEL, &xsk_proto, kern); + if (!sk) + return -ENOBUFS; + + sock->ops = &xsk_proto_ops; + + sock_init_data(sock, sk); + + sk->sk_family = PF_XDP; + + sk->sk_destruct = xsk_destruct; + sk_refcnt_debug_inc(sk); + + sock_set_flag(sk, SOCK_RCU_FREE); + + xs = xdp_sk(sk); + xs->state = XSK_READY; + mutex_init(&xs->mutex); + spin_lock_init(&xs->tx_completion_lock); + + INIT_LIST_HEAD(&xs->map_list); + spin_lock_init(&xs->map_list_lock); + + mutex_lock(&net->xdp.lock); + sk_add_node_rcu(sk, &net->xdp.list); + mutex_unlock(&net->xdp.lock); + + local_bh_disable(); + sock_prot_inuse_add(net, &xsk_proto, 1); + local_bh_enable(); + + return 0; +} + +static const struct net_proto_family xsk_family_ops = { + .family = PF_XDP, + .create = xsk_create, + .owner = THIS_MODULE, +}; + +static struct notifier_block xsk_netdev_notifier = { + .notifier_call = xsk_notifier, +}; + +static int __net_init xsk_net_init(struct net *net) +{ + mutex_init(&net->xdp.lock); + INIT_HLIST_HEAD(&net->xdp.list); + return 0; +} + +static void __net_exit xsk_net_exit(struct net *net) +{ + WARN_ON_ONCE(!hlist_empty(&net->xdp.list)); +} + +static struct pernet_operations xsk_net_ops = { + .init = xsk_net_init, + .exit = xsk_net_exit, +}; + +static int __init xsk_init(void) +{ + int err; + + err = proto_register(&xsk_proto, 0 /* no slab */); + if (err) + goto out; + + err = sock_register(&xsk_family_ops); + if (err) + goto out_proto; + + err = register_pernet_subsys(&xsk_net_ops); + if (err) + goto out_sk; + + err = register_netdevice_notifier(&xsk_netdev_notifier); + if (err) + goto out_pernet; + + return 0; + +out_pernet: + unregister_pernet_subsys(&xsk_net_ops); +out_sk: + sock_unregister(PF_XDP); +out_proto: + proto_unregister(&xsk_proto); +out: + return err; +} + +fs_initcall(xsk_init); diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c new file mode 100644 index 000000000000..f7a0f1f0756f --- /dev/null +++ b/net/xdp/xsk_queue.c @@ -0,0 +1,118 @@ +// SPDX-License-Identifier: GPL-2.0 +/* XDP user-space ring structure + * Copyright(c) 2018 Intel Corporation. + */ + +#include +#include +#include + +#include "xsk_queue.h" + +void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props) +{ + if (!q) + return; + + q->umem_props = *umem_props; +} + +static u32 xskq_umem_get_ring_size(struct xsk_queue *q) +{ + return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u64); +} + +static u32 xskq_rxtx_get_ring_size(struct xsk_queue *q) +{ + return sizeof(struct xdp_ring) + q->nentries * sizeof(struct xdp_desc); +} + +struct xsk_queue *xskq_create(u32 nentries, bool umem_queue) +{ + struct xsk_queue *q; + gfp_t gfp_flags; + size_t size; + + q = kzalloc(sizeof(*q), GFP_KERNEL); + if (!q) + return NULL; + + q->nentries = nentries; + q->ring_mask = nentries - 1; + + gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | + __GFP_COMP | __GFP_NORETRY; + size = umem_queue ? xskq_umem_get_ring_size(q) : + xskq_rxtx_get_ring_size(q); + + q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags, + get_order(size)); + if (!q->ring) { + kfree(q); + return NULL; + } + + return q; +} + +void xskq_destroy(struct xsk_queue *q) +{ + if (!q) + return; + + page_frag_free(q->ring); + kfree(q); +} + +struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries) +{ + struct xdp_umem_fq_reuse *newq; + + /* Check for overflow */ + if (nentries > (u32)roundup_pow_of_two(nentries)) + return NULL; + nentries = roundup_pow_of_two(nentries); + + newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL); + if (!newq) + return NULL; + memset(newq, 0, offsetof(typeof(*newq), handles)); + + newq->nentries = nentries; + return newq; +} +EXPORT_SYMBOL_GPL(xsk_reuseq_prepare); + +struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem, + struct xdp_umem_fq_reuse *newq) +{ + struct xdp_umem_fq_reuse *oldq = umem->fq_reuse; + + if (!oldq) { + umem->fq_reuse = newq; + return NULL; + } + + if (newq->nentries < oldq->length) + return newq; + + memcpy(newq->handles, oldq->handles, + array_size(oldq->length, sizeof(u64))); + newq->length = oldq->length; + + umem->fq_reuse = newq; + return oldq; +} +EXPORT_SYMBOL_GPL(xsk_reuseq_swap); + +void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq) +{ + kvfree(rq); +} +EXPORT_SYMBOL_GPL(xsk_reuseq_free); + +void xsk_reuseq_destroy(struct xdp_umem *umem) +{ + xsk_reuseq_free(umem->fq_reuse); + umem->fq_reuse = NULL; +} diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h new file mode 100644 index 000000000000..62cf850fdad3 --- /dev/null +++ b/net/xdp/xsk_queue.h @@ -0,0 +1,266 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* XDP user-space ring structure + * Copyright(c) 2018 Intel Corporation. + */ + +#ifndef _LINUX_XSK_QUEUE_H +#define _LINUX_XSK_QUEUE_H + +#include +#include +#include + +#define RX_BATCH_SIZE 16 +#define LAZY_UPDATE_THRESHOLD 128 + +struct xdp_ring { + u32 producer ____cacheline_aligned_in_smp; + u32 consumer ____cacheline_aligned_in_smp; +}; + +/* Used for the RX and TX queues for packets */ +struct xdp_rxtx_ring { + struct xdp_ring ptrs; + struct xdp_desc desc[0] ____cacheline_aligned_in_smp; +}; + +/* Used for the fill and completion queues for buffers */ +struct xdp_umem_ring { + struct xdp_ring ptrs; + u64 desc[0] ____cacheline_aligned_in_smp; +}; + +struct xsk_queue { + struct xdp_umem_props umem_props; + u32 ring_mask; + u32 nentries; + u32 prod_head; + u32 prod_tail; + u32 cons_head; + u32 cons_tail; + struct xdp_ring *ring; + u64 invalid_descs; +}; + +/* Common functions operating for both RXTX and umem queues */ + +static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q) +{ + return q ? q->invalid_descs : 0; +} + +static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt) +{ + u32 entries = q->prod_tail - q->cons_tail; + + if (entries == 0) { + /* Refresh the local pointer */ + q->prod_tail = READ_ONCE(q->ring->producer); + entries = q->prod_tail - q->cons_tail; + } + + return (entries > dcnt) ? dcnt : entries; +} + +static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt) +{ + u32 free_entries = q->nentries - (producer - q->cons_tail); + + if (free_entries >= dcnt) + return free_entries; + + /* Refresh the local tail pointer */ + q->cons_tail = READ_ONCE(q->ring->consumer); + return q->nentries - (producer - q->cons_tail); +} + +/* UMEM queue */ + +static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr) +{ + if (addr >= q->umem_props.size) { + q->invalid_descs++; + return false; + } + + return true; +} + +static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr) +{ + while (q->cons_tail != q->cons_head) { + struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring; + unsigned int idx = q->cons_tail & q->ring_mask; + + *addr = READ_ONCE(ring->desc[idx]) & q->umem_props.chunk_mask; + if (xskq_is_valid_addr(q, *addr)) + return addr; + + q->cons_tail++; + } + + return NULL; +} + +static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr) +{ + if (q->cons_tail == q->cons_head) { + WRITE_ONCE(q->ring->consumer, q->cons_tail); + q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE); + + /* Order consumer and data */ + smp_rmb(); + } + + return xskq_validate_addr(q, addr); +} + +static inline void xskq_discard_addr(struct xsk_queue *q) +{ + q->cons_tail++; +} + +static inline int xskq_produce_addr(struct xsk_queue *q, u64 addr) +{ + struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring; + + if (xskq_nb_free(q, q->prod_tail, 1) == 0) + return -ENOSPC; + + ring->desc[q->prod_tail++ & q->ring_mask] = addr; + + /* Order producer and data */ + smp_wmb(); + + WRITE_ONCE(q->ring->producer, q->prod_tail); + return 0; +} + +static inline int xskq_produce_addr_lazy(struct xsk_queue *q, u64 addr) +{ + struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring; + + if (xskq_nb_free(q, q->prod_head, LAZY_UPDATE_THRESHOLD) == 0) + return -ENOSPC; + + ring->desc[q->prod_head++ & q->ring_mask] = addr; + return 0; +} + +static inline void xskq_produce_flush_addr_n(struct xsk_queue *q, + u32 nb_entries) +{ + /* Order producer and data */ + smp_wmb(); + + q->prod_tail += nb_entries; + WRITE_ONCE(q->ring->producer, q->prod_tail); +} + +static inline int xskq_reserve_addr(struct xsk_queue *q) +{ + if (xskq_nb_free(q, q->prod_head, 1) == 0) + return -ENOSPC; + + q->prod_head++; + return 0; +} + +/* Rx/Tx queue */ + +static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d) +{ + if (!xskq_is_valid_addr(q, d->addr)) + return false; + + if (((d->addr + d->len) & q->umem_props.chunk_mask) != + (d->addr & q->umem_props.chunk_mask)) { + q->invalid_descs++; + return false; + } + + return true; +} + +static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q, + struct xdp_desc *desc) +{ + while (q->cons_tail != q->cons_head) { + struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring; + unsigned int idx = q->cons_tail & q->ring_mask; + + *desc = READ_ONCE(ring->desc[idx]); + if (xskq_is_valid_desc(q, desc)) + return desc; + + q->cons_tail++; + } + + return NULL; +} + +static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q, + struct xdp_desc *desc) +{ + if (q->cons_tail == q->cons_head) { + WRITE_ONCE(q->ring->consumer, q->cons_tail); + q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE); + + /* Order consumer and data */ + smp_rmb(); + } + + return xskq_validate_desc(q, desc); +} + +static inline void xskq_discard_desc(struct xsk_queue *q) +{ + q->cons_tail++; +} + +static inline int xskq_produce_batch_desc(struct xsk_queue *q, + u64 addr, u32 len) +{ + struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring; + unsigned int idx; + + if (xskq_nb_free(q, q->prod_head, 1) == 0) + return -ENOSPC; + + idx = (q->prod_head++) & q->ring_mask; + ring->desc[idx].addr = addr; + ring->desc[idx].len = len; + + return 0; +} + +static inline void xskq_produce_flush_desc(struct xsk_queue *q) +{ + /* Order producer and data */ + smp_wmb(); + + q->prod_tail = q->prod_head; + WRITE_ONCE(q->ring->producer, q->prod_tail); +} + +static inline bool xskq_full_desc(struct xsk_queue *q) +{ + /* No barriers needed since data is not accessed */ + return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer) == + q->nentries; +} + +static inline bool xskq_empty_desc(struct xsk_queue *q) +{ + /* No barriers needed since data is not accessed */ + return READ_ONCE(q->ring->consumer) == READ_ONCE(q->ring->producer); +} + +void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props); +struct xsk_queue *xskq_create(u32 nentries, bool umem_queue); +void xskq_destroy(struct xsk_queue *q_ops); + +/* Executed by the core when the entire UMEM gets freed */ +void xsk_reuseq_destroy(struct xdp_umem *umem); + +#endif /* _LINUX_XSK_QUEUE_H */ diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c index 6150442639b2..b6ab6590f4b6 100644 --- a/net/xfrm/xfrm_device.c +++ b/net/xfrm/xfrm_device.c @@ -23,107 +23,32 @@ #include #ifdef CONFIG_XFRM_OFFLOAD -struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features, bool *again) +int validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features) { int err; - unsigned long flags; struct xfrm_state *x; - struct sk_buff *skb2; - netdev_features_t esp_features = features; - struct softnet_data *sd; struct xfrm_offload *xo = xfrm_offload(skb); - if (!xo) - return skb; + if (skb_is_gso(skb)) + return 0; - if (!(features & NETIF_F_HW_ESP)) - esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK); + if (xo) { + x = skb->sp->xvec[skb->sp->len - 1]; + if (xo->flags & XFRM_GRO || x->xso.flags & XFRM_OFFLOAD_INBOUND) + return 0; - x = skb->sp->xvec[skb->sp->len - 1]; - if (xo->flags & XFRM_GRO || x->xso.flags & XFRM_OFFLOAD_INBOUND) - return skb; - - local_irq_save(flags); - sd = this_cpu_ptr(&softnet_data); - err = !skb_queue_empty(&sd->xfrm_backlog); - local_irq_restore(flags); - - if (err) { - *again = true; - return skb; - } - - if (skb_is_gso(skb)) { - struct net_device *dev = skb->dev; - if (unlikely(!x->xso.offload_handle || (x->xso.dev != dev))) { - struct sk_buff *segs; - /* Packet got rerouted, fixup features and segment it. */ - esp_features = esp_features & ~(NETIF_F_HW_ESP - | NETIF_F_GSO_ESP); - - segs = skb_gso_segment(skb, esp_features); - if (IS_ERR(segs)) { - kfree_skb(skb); - atomic_long_inc(&dev->tx_dropped); - return NULL; - } else { - consume_skb(skb); - skb = segs; - } - } - } - - if (!skb->next) { x->outer_mode->xmit(x, skb); - xo->flags |= XFRM_DEV_RESUME; - - err = x->type_offload->xmit(x, skb, esp_features); + err = x->type_offload->xmit(x, skb, features); if (err) { - if (err == -EINPROGRESS) - return NULL; - XFRM_INC_STATS(xs_net(x), LINUX_MIB_XFRMOUTSTATEPROTOERROR); - kfree_skb(skb); - return NULL; + return err; } skb_push(skb, skb->data - skb_mac_header(skb)); - return skb; } - skb2 = skb; - do { - struct sk_buff *nskb = skb2->next; - skb2->next = NULL; - - xo = xfrm_offload(skb2); - xo->flags |= XFRM_DEV_RESUME; - - x->outer_mode->xmit(x, skb2); - err = x->type_offload->xmit(x, skb2, esp_features); - if (!err) { - skb2->next = nskb; - } else if (err != -EINPROGRESS) { - XFRM_INC_STATS(xs_net(x), LINUX_MIB_XFRMOUTSTATEPROTOERROR); - skb2->next = nskb; - kfree_skb_list(skb2); - return NULL; - } else { - if (skb == skb2) - skb = nskb; - - if (!skb) - return NULL; - - goto skip_push; - } - - skb_push(skb2, skb2->data - skb_mac_header(skb2)); -skip_push: - skb2 = nskb; - } while (skb2); - return skb; + return 0; } EXPORT_SYMBOL_GPL(validate_xmit_xfrm); @@ -200,8 +125,8 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x) if (!x->type_offload || x->encap) return false; - if ((x->xso.offload_handle && (dev == dst->path->dev)) && - !dst->child->xfrm && x->type->get_mtu) { + if ((x->xso.offload_handle && (dev == xfrm_dst_path(dst)->dev)) && + !xdst->child->xfrm && x->type->get_mtu) { mtu = x->type->get_mtu(x, xdst->child_mtu_cached); if (skb->len <= mtu) @@ -220,55 +145,6 @@ ok: return true; } EXPORT_SYMBOL_GPL(xfrm_dev_offload_ok); - -void xfrm_dev_resume(struct sk_buff *skb) -{ - struct net_device *dev = skb->dev; - int ret = NETDEV_TX_BUSY; - struct netdev_queue *txq; - struct softnet_data *sd; - unsigned long flags; - - rcu_read_lock(); - txq = netdev_pick_tx(dev, skb, NULL); - HARD_TX_LOCK(dev, txq, smp_processor_id()); - - if (!netif_xmit_frozen_or_stopped(txq)) - skb = dev_hard_start_xmit(skb, dev, txq, &ret); - - HARD_TX_UNLOCK(dev, txq); - if (!dev_xmit_complete(ret)) { - local_irq_save(flags); - sd = this_cpu_ptr(&softnet_data); - skb_queue_tail(&sd->xfrm_backlog, skb); - raise_softirq_irqoff(NET_TX_SOFTIRQ); - local_irq_restore(flags); - } - - rcu_read_unlock(); -} - -EXPORT_SYMBOL_GPL(xfrm_dev_resume); -void xfrm_dev_backlog(struct softnet_data *sd) -{ - struct sk_buff_head *xfrm_backlog = &sd->xfrm_backlog; - struct sk_buff_head list; - struct sk_buff *skb; - - if (skb_queue_empty(xfrm_backlog)) - return; - - __skb_queue_head_init(&list); - spin_lock(&xfrm_backlog->lock); - skb_queue_splice_init(xfrm_backlog, &list); - spin_unlock(&xfrm_backlog->lock); - - while (!skb_queue_empty(&list)) { - skb = __skb_dequeue(&list); - xfrm_dev_resume(skb); - } -} - #endif static int xfrm_dev_register(struct net_device *dev) diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c index 00ff2a1c5e5f..c46162887b94 100644 --- a/net/xfrm/xfrm_output.c +++ b/net/xfrm/xfrm_output.c @@ -44,7 +44,7 @@ static int xfrm_skb_check_space(struct sk_buff *skb) static struct dst_entry *skb_dst_pop(struct sk_buff *skb) { - struct dst_entry *child = dst_clone(skb_dst(skb)->child); + struct dst_entry *child = dst_clone(xfrm_dst_child(skb_dst(skb))); skb_dst_drop(skb); return child; diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index e61e32918f7b..b79cd7711f4e 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -1564,8 +1564,8 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy, unsigned long now = jiffies; struct net_device *dev; struct xfrm_mode *inner_mode; - struct dst_entry *dst_prev = NULL; - struct dst_entry *dst0 = NULL; + struct xfrm_dst *xdst_prev = NULL; + struct xfrm_dst *xdst0 = NULL; int i = 0; int err; int header_len = 0; @@ -1591,13 +1591,13 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy, goto put_states; } - if (!dst_prev) - dst0 = dst1; + if (!xdst_prev) + xdst0 = xdst; else /* Ref count is taken during xfrm_alloc_dst() * No need to do dst_clone() on dst1 */ - dst_prev->child = dst1; + xfrm_dst_set_child(xdst_prev, &xdst->u.dst); if (xfrm[i]->sel.family == AF_UNSPEC) { inner_mode = xfrm_ip2inner_mode(xfrm[i], @@ -1638,8 +1638,8 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy, dst1->input = dst_discard; dst1->output = inner_mode->afinfo->output; - dst1->next = dst_prev; - dst_prev = dst1; + dst1->next = &xdst_prev->u.dst; + xdst_prev = xdst; header_len += xfrm[i]->props.header_len; if (xfrm[i]->type->flags & XFRM_TYPE_NON_FRAGMENT) @@ -1647,40 +1647,39 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy, trailer_len += xfrm[i]->props.trailer_len; } - dst_prev->child = dst; - dst0->path = dst; + xfrm_dst_set_child(xdst_prev, dst); + xdst0->path = dst; err = -ENODEV; dev = dst->dev; if (!dev) goto free_dst; - xfrm_init_path((struct xfrm_dst *)dst0, dst, nfheader_len); - xfrm_init_pmtu(dst_prev); + xfrm_init_path(xdst0, dst, nfheader_len); + xfrm_init_pmtu(&xdst_prev->u.dst); - for (dst_prev = dst0; dst_prev != dst; dst_prev = dst_prev->child) { - struct xfrm_dst *xdst = (struct xfrm_dst *)dst_prev; - - err = xfrm_fill_dst(xdst, dev, fl); + for (xdst_prev = xdst0; xdst_prev != (struct xfrm_dst *)dst; + xdst_prev = (struct xfrm_dst *) xfrm_dst_child(&xdst_prev->u.dst)) { + err = xfrm_fill_dst(xdst_prev, dev, fl); if (err) goto free_dst; - dst_prev->header_len = header_len; - dst_prev->trailer_len = trailer_len; - header_len -= xdst->u.dst.xfrm->props.header_len; - trailer_len -= xdst->u.dst.xfrm->props.trailer_len; + xdst_prev->u.dst.header_len = header_len; + xdst_prev->u.dst.trailer_len = trailer_len; + header_len -= xdst_prev->u.dst.xfrm->props.header_len; + trailer_len -= xdst_prev->u.dst.xfrm->props.trailer_len; } out: - return dst0; + return &xdst0->u.dst; put_states: for (; i < nx; i++) xfrm_state_put(xfrm[i]); free_dst: - if (dst0) - dst_release_immediate(dst0); - dst0 = ERR_PTR(err); + if (xdst0) + dst_release_immediate(&xdst0->u.dst); + xdst0 = ERR_PTR(err); goto out; } @@ -1791,8 +1790,8 @@ static void xfrm_policy_queue_process(unsigned long arg) xfrm_decode_session(skb, &fl, dst->ops->family); spin_unlock(&pq->hold_queue.lock); - dst_hold(dst->path); - dst = xfrm_lookup(net, dst->path, &fl, sk, 0); + dst_hold(xfrm_dst_path(dst)); + dst = xfrm_lookup(net, xfrm_dst_path(dst), &fl, sk, 0); if (IS_ERR(dst)) goto purge_queue; @@ -1821,8 +1820,8 @@ static void xfrm_policy_queue_process(unsigned long arg) skb = __skb_dequeue(&list); xfrm_decode_session(skb, &fl, skb_dst(skb)->ops->family); - dst_hold(skb_dst(skb)->path); - dst = xfrm_lookup(net, skb_dst(skb)->path, &fl, skb->sk, 0); + dst_hold(xfrm_dst_path(skb_dst(skb))); + dst = xfrm_lookup(net, xfrm_dst_path(skb_dst(skb)), &fl, skb->sk, 0); if (IS_ERR(dst)) { kfree_skb(skb); continue; @@ -1923,8 +1922,8 @@ static struct xfrm_dst *xfrm_create_dummy_bundle(struct net *net, dst1->output = xdst_queue_output; dst_hold(dst); - dst1->child = dst; - dst1->path = dst; + xfrm_dst_set_child(xdst, dst); + xdst->path = dst; xfrm_init_path((struct xfrm_dst *)dst1, dst, 0); @@ -2539,7 +2538,7 @@ static int stale_bundle(struct dst_entry *dst) void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev) { - while ((dst = dst->child) && dst->xfrm && dst->dev == dev) { + while ((dst = xfrm_dst_child(dst)) && dst->xfrm && dst->dev == dev) { dst->dev = dev_net(dev)->loopback_dev; dev_hold(dst->dev); dev_put(dev); @@ -2552,15 +2551,10 @@ static void xfrm_link_failure(struct sk_buff *skb) /* Impossible. Such dst must be popped before reaches point of failure. */ } -static struct dst_entry *xfrm_negative_advice(struct dst_entry *dst) +static void xfrm_negative_advice(struct sock *sk, struct dst_entry *dst) { - if (dst) { - if (dst->obsolete) { - dst_release(dst); - dst = NULL; - } - } - return dst; + if (dst->obsolete) + sk_dst_reset(sk); } static void xfrm_init_pmtu(struct dst_entry *dst) @@ -2569,7 +2563,7 @@ static void xfrm_init_pmtu(struct dst_entry *dst) struct xfrm_dst *xdst = (struct xfrm_dst *)dst; u32 pmtu, route_mtu_cached; - pmtu = dst_mtu(dst->child); + pmtu = dst_mtu(xfrm_dst_child(dst)); xdst->child_mtu_cached = pmtu; pmtu = xfrm_state_mtu(dst->xfrm, pmtu); @@ -2594,7 +2588,7 @@ static int xfrm_bundle_ok(struct xfrm_dst *first) struct xfrm_dst *last; u32 mtu; - if (!dst_check(dst->path, ((struct xfrm_dst *)dst)->path_cookie) || + if (!dst_check(xfrm_dst_path(dst), ((struct xfrm_dst *)dst)->path_cookie) || (dst->dev && !netif_running(dst->dev))) return 0; @@ -2614,7 +2608,7 @@ static int xfrm_bundle_ok(struct xfrm_dst *first) xdst->policy_genid != atomic_read(&xdst->pols[0]->genid)) return 0; - mtu = dst_mtu(dst->child); + mtu = dst_mtu(xfrm_dst_child(dst)); if (xdst->child_mtu_cached != mtu) { last = xdst; xdst->child_mtu_cached = mtu; @@ -2628,7 +2622,7 @@ static int xfrm_bundle_ok(struct xfrm_dst *first) xdst->route_mtu_cached = mtu; } - dst = dst->child; + dst = xfrm_dst_child(dst); } while (dst->xfrm); if (likely(!last)) @@ -2655,22 +2649,20 @@ static int xfrm_bundle_ok(struct xfrm_dst *first) static unsigned int xfrm_default_advmss(const struct dst_entry *dst) { - return dst_metric_advmss(dst->path); + return dst_metric_advmss(xfrm_dst_path(dst)); } static unsigned int xfrm_mtu(const struct dst_entry *dst) { unsigned int mtu = dst_metric_raw(dst, RTAX_MTU); - return mtu ? : dst_mtu(dst->path); + return mtu ? : dst_mtu(xfrm_dst_path(dst)); } static const void *xfrm_get_dst_nexthop(const struct dst_entry *dst, const void *daddr) { - const struct dst_entry *path = dst->path; - - for (; dst != path; dst = dst->child) { + while (dst->xfrm) { const struct xfrm_state *xfrm = dst->xfrm; if (xfrm->props.mode == XFRM_MODE_TRANSPORT) @@ -2679,6 +2671,8 @@ static const void *xfrm_get_dst_nexthop(const struct dst_entry *dst, daddr = xfrm->coaddr; else if (!(xfrm->type->flags & XFRM_TYPE_LOCAL_COADDR)) daddr = &xfrm->id.daddr; + + dst = xfrm_dst_child(dst); } return daddr; } @@ -2687,7 +2681,7 @@ static struct neighbour *xfrm_neigh_lookup(const struct dst_entry *dst, struct sk_buff *skb, const void *daddr) { - const struct dst_entry *path = dst->path; + const struct dst_entry *path = xfrm_dst_path(dst); if (!skb) daddr = xfrm_get_dst_nexthop(dst, daddr); @@ -2696,7 +2690,7 @@ static struct neighbour *xfrm_neigh_lookup(const struct dst_entry *dst, static void xfrm_confirm_neigh(const struct dst_entry *dst, const void *daddr) { - const struct dst_entry *path = dst->path; + const struct dst_entry *path = xfrm_dst_path(dst); daddr = xfrm_get_dst_nexthop(dst, daddr); path->ops->confirm_neigh(path, daddr); diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 3460036621e4..efb677f88b33 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -42,6 +42,7 @@ hostprogs-y += xdp_redirect hostprogs-y += xdp_redirect_map hostprogs-y += xdp_monitor hostprogs-y += syscall_tp +hostprogs-y += xdpsock # Libbpf dependencies LIBBPF := ../../tools/lib/bpf/bpf.o @@ -87,6 +88,7 @@ xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o +xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -132,6 +134,7 @@ always += xdp_redirect_kern.o always += xdp_redirect_map_kern.o always += xdp_monitor_kern.o always += syscall_tp_kern.o +always += xdpsock_kern.o HOSTCFLAGS += -I$(objtree)/usr/include HOSTCFLAGS += -I$(srctree)/tools/lib/ @@ -172,6 +175,7 @@ HOSTLOADLIBES_xdp_redirect += -lelf HOSTLOADLIBES_xdp_redirect_map += -lelf HOSTLOADLIBES_xdp_monitor += -lelf HOSTLOADLIBES_syscall_tp += -lelf +HOSTLOADLIBES_xdpsock += -lelf -pthread # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/xdp_redirect_map_user.c b/samples/bpf/xdp_redirect_map_user.c index d4d86a273fba..98c270013353 100644 --- a/samples/bpf/xdp_redirect_map_user.c +++ b/samples/bpf/xdp_redirect_map_user.c @@ -1,13 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2017 Covalent IO, Inc. http://covalent.io - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/samples/bpf/xdp_redirect_user.c b/samples/bpf/xdp_redirect_user.c index bd9fa7a55a30..3ce2a63254e9 100644 --- a/samples/bpf/xdp_redirect_user.c +++ b/samples/bpf/xdp_redirect_user.c @@ -1,13 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0-only /* Copyright (c) 2016 John Fastabend - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of version 2 of the GNU General Public - * License as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h new file mode 100644 index 000000000000..533ab81adfa1 --- /dev/null +++ b/samples/bpf/xdpsock.h @@ -0,0 +1,11 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef XDPSOCK_H_ +#define XDPSOCK_H_ + +/* Power-of-2 number of sockets */ +#define MAX_SOCKS 4 + +/* Round-robin receive */ +#define RR_LB 0 + +#endif /* XDPSOCK_H_ */ diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c new file mode 100644 index 000000000000..d8806c41362e --- /dev/null +++ b/samples/bpf/xdpsock_kern.c @@ -0,0 +1,56 @@ +// SPDX-License-Identifier: GPL-2.0 +#define KBUILD_MODNAME "foo" +#include +#include "bpf_helpers.h" + +#include "xdpsock.h" + +struct bpf_map_def SEC("maps") qidconf_map = { + .type = BPF_MAP_TYPE_ARRAY, + .key_size = sizeof(int), + .value_size = sizeof(int), + .max_entries = 1, +}; + +struct bpf_map_def SEC("maps") xsks_map = { + .type = BPF_MAP_TYPE_XSKMAP, + .key_size = sizeof(int), + .value_size = sizeof(int), + .max_entries = 4, +}; + +struct bpf_map_def SEC("maps") rr_map = { + .type = BPF_MAP_TYPE_PERCPU_ARRAY, + .key_size = sizeof(int), + .value_size = sizeof(unsigned int), + .max_entries = 1, +}; + +SEC("xdp_sock") +int xdp_sock_prog(struct xdp_md *ctx) +{ + int *qidconf, key = 0, idx; + unsigned int *rr; + + qidconf = bpf_map_lookup_elem(&qidconf_map, &key); + if (!qidconf) + return XDP_ABORTED; + + if (*qidconf != ctx->rx_queue_index) + return XDP_PASS; + +#if RR_LB /* NB! RR_LB is configured in xdpsock.h */ + rr = bpf_map_lookup_elem(&rr_map, &key); + if (!rr) + return XDP_ABORTED; + + *rr = (*rr + 1) & (MAX_SOCKS - 1); + idx = *rr; +#else + idx = 0; +#endif + + return bpf_redirect_map(&xsks_map, idx, 0); +} + +char _license[] SEC("license") = "GPL"; diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c new file mode 100644 index 000000000000..4b8a7cf3e63b --- /dev/null +++ b/samples/bpf/xdpsock_user.c @@ -0,0 +1,948 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2017 - 2018 Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "bpf_load.h" +#include "bpf_util.h" +#include "libbpf.h" + +#include "xdpsock.h" + +#ifndef SOL_XDP +#define SOL_XDP 283 +#endif + +#ifndef AF_XDP +#define AF_XDP 44 +#endif + +#ifndef PF_XDP +#define PF_XDP AF_XDP +#endif + +#define NUM_FRAMES 131072 +#define FRAME_HEADROOM 0 +#define FRAME_SIZE 2048 +#define NUM_DESCS 1024 +#define BATCH_SIZE 16 + +#define FQ_NUM_DESCS 1024 +#define CQ_NUM_DESCS 1024 + +#define DEBUG_HEXDUMP 0 + +typedef __u32 u32; + +static unsigned long prev_time; + +enum benchmark_type { + BENCH_RXDROP = 0, + BENCH_TXONLY = 1, + BENCH_L2FWD = 2, +}; + +static enum benchmark_type opt_bench = BENCH_RXDROP; +static u32 opt_xdp_flags; +static const char *opt_if = ""; +static int opt_ifindex; +static int opt_queue; +static int opt_poll; +static int opt_shared_packet_buffer; +static int opt_interval = 1; + +struct xdp_umem_uqueue { + u32 cached_prod; + u32 cached_cons; + u32 mask; + u32 size; + struct xdp_umem_ring *ring; +}; + +struct xdp_umem { + char (*frames)[FRAME_SIZE]; + struct xdp_umem_uqueue fq; + struct xdp_umem_uqueue cq; + int fd; +}; + +struct xdp_uqueue { + u32 cached_prod; + u32 cached_cons; + u32 mask; + u32 size; + struct xdp_rxtx_ring *ring; +}; + +struct xdpsock { + struct xdp_uqueue rx; + struct xdp_uqueue tx; + int sfd; + struct xdp_umem *umem; + u32 outstanding_tx; + unsigned long rx_npkts; + unsigned long tx_npkts; + unsigned long prev_rx_npkts; + unsigned long prev_tx_npkts; +}; + +#define MAX_SOCKS 4 +static int num_socks; +struct xdpsock *xsks[MAX_SOCKS]; + +static unsigned long get_nsecs(void) +{ + struct timespec ts; + + clock_gettime(CLOCK_MONOTONIC, &ts); + return ts.tv_sec * 1000000000UL + ts.tv_nsec; +} + +static void dump_stats(void); + +#define lassert(expr) \ + do { \ + if (!(expr)) { \ + fprintf(stderr, "%s:%s:%i: Assertion failed: " \ + #expr ": errno: %d/\"%s\"\n", \ + __FILE__, __func__, __LINE__, \ + errno, strerror(errno)); \ + dump_stats(); \ + exit(EXIT_FAILURE); \ + } \ + } while (0) + +#define barrier() __asm__ __volatile__("": : :"memory") +#define u_smp_rmb() barrier() +#define u_smp_wmb() barrier() +#define likely(x) __builtin_expect(!!(x), 1) +#define unlikely(x) __builtin_expect(!!(x), 0) + +static const char pkt_data[] = + "\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00" + "\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14" + "\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b" + "\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa"; + +static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb) +{ + u32 free_entries = q->size - (q->cached_prod - q->cached_cons); + + if (free_entries >= nb) + return free_entries; + + /* Refresh the local tail pointer */ + q->cached_cons = q->ring->ptrs.consumer; + + return q->size - (q->cached_prod - q->cached_cons); +} + +static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs) +{ + u32 free_entries = q->cached_cons - q->cached_prod; + + if (free_entries >= ndescs) + return free_entries; + + /* Refresh the local tail pointer */ + q->cached_cons = q->ring->ptrs.consumer + q->size; + return q->cached_cons - q->cached_prod; +} + +static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb) +{ + u32 entries = q->cached_prod - q->cached_cons; + + if (entries == 0) { + q->cached_prod = q->ring->ptrs.producer; + entries = q->cached_prod - q->cached_cons; + } + + return (entries > nb) ? nb : entries; +} + +static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs) +{ + u32 entries = q->cached_prod - q->cached_cons; + + if (entries == 0) { + q->cached_prod = q->ring->ptrs.producer; + entries = q->cached_prod - q->cached_cons; + } + + return (entries > ndescs) ? ndescs : entries; +} + +static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq, + struct xdp_desc *d, + size_t nb) +{ + u32 i; + + if (umem_nb_free(fq, nb) < nb) + return -ENOSPC; + + for (i = 0; i < nb; i++) { + u32 idx = fq->cached_prod++ & fq->mask; + + fq->ring->desc[idx] = d[i].idx; + } + + u_smp_wmb(); + + fq->ring->ptrs.producer = fq->cached_prod; + + return 0; +} + +static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d, + size_t nb) +{ + u32 i; + + if (umem_nb_free(fq, nb) < nb) + return -ENOSPC; + + for (i = 0; i < nb; i++) { + u32 idx = fq->cached_prod++ & fq->mask; + + fq->ring->desc[idx] = d[i]; + } + + u_smp_wmb(); + + fq->ring->ptrs.producer = fq->cached_prod; + + return 0; +} + +static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq, + u32 *d, size_t nb) +{ + u32 idx, i, entries = umem_nb_avail(cq, nb); + + u_smp_rmb(); + + for (i = 0; i < entries; i++) { + idx = cq->cached_cons++ & cq->mask; + d[i] = cq->ring->desc[idx]; + } + + if (entries > 0) { + u_smp_wmb(); + + cq->ring->ptrs.consumer = cq->cached_cons; + } + + return entries; +} + +static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off) +{ + lassert(idx < NUM_FRAMES); + return &xsk->umem->frames[idx][off]; +} + +static inline int xq_enq(struct xdp_uqueue *uq, + const struct xdp_desc *descs, + unsigned int ndescs) +{ + struct xdp_rxtx_ring *r = uq->ring; + unsigned int i; + + if (xq_nb_free(uq, ndescs) < ndescs) + return -ENOSPC; + + for (i = 0; i < ndescs; i++) { + u32 idx = uq->cached_prod++ & uq->mask; + + r->desc[idx].idx = descs[i].idx; + r->desc[idx].len = descs[i].len; + r->desc[idx].offset = descs[i].offset; + } + + u_smp_wmb(); + + r->ptrs.producer = uq->cached_prod; + return 0; +} + +static inline int xq_enq_tx_only(struct xdp_uqueue *uq, + __u32 idx, unsigned int ndescs) +{ + struct xdp_rxtx_ring *q = uq->ring; + unsigned int i; + + if (xq_nb_free(uq, ndescs) < ndescs) + return -ENOSPC; + + for (i = 0; i < ndescs; i++) { + u32 idx = uq->cached_prod++ & uq->mask; + + q->desc[idx].idx = idx + i; + q->desc[idx].len = sizeof(pkt_data) - 1; + q->desc[idx].offset = 0; + } + + u_smp_wmb(); + + q->ptrs.producer = uq->cached_prod; + return 0; +} + +static inline int xq_deq(struct xdp_uqueue *uq, + struct xdp_desc *descs, + int ndescs) +{ + struct xdp_rxtx_ring *r = uq->ring; + unsigned int idx; + int i, entries; + + entries = xq_nb_avail(uq, ndescs); + + u_smp_rmb(); + + for (i = 0; i < entries; i++) { + idx = uq->cached_cons++ & uq->mask; + descs[i] = r->desc[idx]; + } + + if (entries > 0) { + u_smp_wmb(); + + r->ptrs.consumer = uq->cached_cons; + } + + return entries; +} + +static void swap_mac_addresses(void *data) +{ + struct ether_header *eth = (struct ether_header *)data; + struct ether_addr *src_addr = (struct ether_addr *)ð->ether_shost; + struct ether_addr *dst_addr = (struct ether_addr *)ð->ether_dhost; + struct ether_addr tmp; + + tmp = *src_addr; + *src_addr = *dst_addr; + *dst_addr = tmp; +} + +#if DEBUG_HEXDUMP +static void hex_dump(void *pkt, size_t length, const char *prefix) +{ + int i = 0; + const unsigned char *address = (unsigned char *)pkt; + const unsigned char *line = address; + size_t line_size = 32; + unsigned char c; + + printf("length = %zu\n", length); + printf("%s | ", prefix); + while (length-- > 0) { + printf("%02X ", *address++); + if (!(++i % line_size) || (length == 0 && i % line_size)) { + if (length == 0) { + while (i++ % line_size) + printf("__ "); + } + printf(" | "); /* right close */ + while (line < address) { + c = *line++; + printf("%c", (c < 33 || c == 255) ? 0x2E : c); + } + printf("\n"); + if (length > 0) + printf("%s | ", prefix); + } + } + printf("\n"); +} +#endif + +static size_t gen_eth_frame(char *frame) +{ + memcpy(frame, pkt_data, sizeof(pkt_data) - 1); + return sizeof(pkt_data) - 1; +} + +static struct xdp_umem *xdp_umem_configure(int sfd) +{ + int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS; + struct xdp_umem_reg mr; + struct xdp_umem *umem; + void *bufs; + + umem = calloc(1, sizeof(*umem)); + lassert(umem); + + lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */ + NUM_FRAMES * FRAME_SIZE) == 0); + + mr.addr = (__u64)bufs; + mr.len = NUM_FRAMES * FRAME_SIZE; + mr.frame_size = FRAME_SIZE; + mr.frame_headroom = FRAME_HEADROOM; + + lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0); + lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size, + sizeof(int)) == 0); + lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size, + sizeof(int)) == 0); + + umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) + + FQ_NUM_DESCS * sizeof(u32), + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, sfd, + XDP_UMEM_PGOFF_FILL_RING); + lassert(umem->fq.ring != MAP_FAILED); + + umem->fq.mask = FQ_NUM_DESCS - 1; + umem->fq.size = FQ_NUM_DESCS; + + umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) + + CQ_NUM_DESCS * sizeof(u32), + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, sfd, + XDP_UMEM_PGOFF_COMPLETION_RING); + lassert(umem->cq.ring != MAP_FAILED); + + umem->cq.mask = CQ_NUM_DESCS - 1; + umem->cq.size = CQ_NUM_DESCS; + + umem->frames = (char (*)[FRAME_SIZE])bufs; + umem->fd = sfd; + + if (opt_bench == BENCH_TXONLY) { + int i; + + for (i = 0; i < NUM_FRAMES; i++) + (void)gen_eth_frame(&umem->frames[i][0]); + } + + return umem; +} + +static struct xdpsock *xsk_configure(struct xdp_umem *umem) +{ + struct sockaddr_xdp sxdp = {}; + int sfd, ndescs = NUM_DESCS; + struct xdpsock *xsk; + bool shared = true; + u32 i; + + sfd = socket(PF_XDP, SOCK_RAW, 0); + lassert(sfd >= 0); + + xsk = calloc(1, sizeof(*xsk)); + lassert(xsk); + + xsk->sfd = sfd; + xsk->outstanding_tx = 0; + + if (!umem) { + shared = false; + xsk->umem = xdp_umem_configure(sfd); + } else { + xsk->umem = umem; + } + + lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING, + &ndescs, sizeof(int)) == 0); + lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING, + &ndescs, sizeof(int)) == 0); + + /* Rx */ + xsk->rx.ring = mmap(NULL, + sizeof(struct xdp_ring) + + NUM_DESCS * sizeof(struct xdp_desc), + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, sfd, + XDP_PGOFF_RX_RING); + lassert(xsk->rx.ring != MAP_FAILED); + + if (!shared) { + for (i = 0; i < NUM_DESCS / 2; i++) + lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1) + == 0); + } + + /* Tx */ + xsk->tx.ring = mmap(NULL, + sizeof(struct xdp_ring) + + NUM_DESCS * sizeof(struct xdp_desc), + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, sfd, + XDP_PGOFF_TX_RING); + lassert(xsk->tx.ring != MAP_FAILED); + + xsk->rx.mask = NUM_DESCS - 1; + xsk->rx.size = NUM_DESCS; + + xsk->tx.mask = NUM_DESCS - 1; + xsk->tx.size = NUM_DESCS; + + sxdp.sxdp_family = PF_XDP; + sxdp.sxdp_ifindex = opt_ifindex; + sxdp.sxdp_queue_id = opt_queue; + if (shared) { + sxdp.sxdp_flags = XDP_SHARED_UMEM; + sxdp.sxdp_shared_umem_fd = umem->fd; + } + + lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0); + + return xsk; +} + +static void print_benchmark(bool running) +{ + const char *bench_str = "INVALID"; + + if (opt_bench == BENCH_RXDROP) + bench_str = "rxdrop"; + else if (opt_bench == BENCH_TXONLY) + bench_str = "txonly"; + else if (opt_bench == BENCH_L2FWD) + bench_str = "l2fwd"; + + printf("%s:%d %s ", opt_if, opt_queue, bench_str); + if (opt_xdp_flags & XDP_FLAGS_SKB_MODE) + printf("xdp-skb "); + else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE) + printf("xdp-drv "); + else + printf(" "); + + if (opt_poll) + printf("poll() "); + + if (running) { + printf("running..."); + fflush(stdout); + } +} + +static void dump_stats(void) +{ + unsigned long now = get_nsecs(); + long dt = now - prev_time; + int i; + + prev_time = now; + + for (i = 0; i < num_socks; i++) { + char *fmt = "%-15s %'-11.0f %'-11lu\n"; + double rx_pps, tx_pps; + + rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) * + 1000000000. / dt; + tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) * + 1000000000. / dt; + + printf("\n sock%d@", i); + print_benchmark(false); + printf("\n"); + + printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts", + dt / 1000000000.); + printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts); + printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts); + + xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts; + xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts; + } +} + +static void *poller(void *arg) +{ + (void)arg; + for (;;) { + sleep(opt_interval); + dump_stats(); + } + + return NULL; +} + +static void int_exit(int sig) +{ + (void)sig; + dump_stats(); + bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags); + exit(EXIT_SUCCESS); +} + +static struct option long_options[] = { + {"rxdrop", no_argument, 0, 'r'}, + {"txonly", no_argument, 0, 't'}, + {"l2fwd", no_argument, 0, 'l'}, + {"interface", required_argument, 0, 'i'}, + {"queue", required_argument, 0, 'q'}, + {"poll", no_argument, 0, 'p'}, + {"shared-buffer", no_argument, 0, 's'}, + {"xdp-skb", no_argument, 0, 'S'}, + {"xdp-native", no_argument, 0, 'N'}, + {"interval", required_argument, 0, 'n'}, + {0, 0, 0, 0} +}; + +static void usage(const char *prog) +{ + const char *str = + " Usage: %s [OPTIONS]\n" + " Options:\n" + " -r, --rxdrop Discard all incoming packets (default)\n" + " -t, --txonly Only send packets\n" + " -l, --l2fwd MAC swap L2 forwarding\n" + " -i, --interface=n Run on interface n\n" + " -q, --queue=n Use queue n (default 0)\n" + " -p, --poll Use poll syscall\n" + " -s, --shared-buffer Use shared packet buffer\n" + " -S, --xdp-skb=n Use XDP skb-mod\n" + " -N, --xdp-native=n Enfore XDP native mode\n" + " -n, --interval=n Specify statistics update interval (default 1 sec).\n" + "\n"; + fprintf(stderr, str, prog); + exit(EXIT_FAILURE); +} + +static void parse_command_line(int argc, char **argv) +{ + int option_index, c; + + opterr = 0; + + for (;;) { + c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options, + &option_index); + if (c == -1) + break; + + switch (c) { + case 'r': + opt_bench = BENCH_RXDROP; + break; + case 't': + opt_bench = BENCH_TXONLY; + break; + case 'l': + opt_bench = BENCH_L2FWD; + break; + case 'i': + opt_if = optarg; + break; + case 'q': + opt_queue = atoi(optarg); + break; + case 's': + opt_shared_packet_buffer = 1; + break; + case 'p': + opt_poll = 1; + break; + case 'S': + opt_xdp_flags |= XDP_FLAGS_SKB_MODE; + break; + case 'N': + opt_xdp_flags |= XDP_FLAGS_DRV_MODE; + break; + case 'n': + opt_interval = atoi(optarg); + break; + default: + usage(basename(argv[0])); + } + } + + opt_ifindex = if_nametoindex(opt_if); + if (!opt_ifindex) { + fprintf(stderr, "ERROR: interface \"%s\" does not exist\n", + opt_if); + usage(basename(argv[0])); + } +} + +static void kick_tx(int fd) +{ + int ret; + + ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0); + if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN) + return; + lassert(0); +} + +static inline void complete_tx_l2fwd(struct xdpsock *xsk) +{ + u32 descs[BATCH_SIZE]; + unsigned int rcvd; + size_t ndescs; + + if (!xsk->outstanding_tx) + return; + + kick_tx(xsk->sfd); + ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE : + xsk->outstanding_tx; + + /* re-add completed Tx buffers */ + rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs); + if (rcvd > 0) { + umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd); + xsk->outstanding_tx -= rcvd; + xsk->tx_npkts += rcvd; + } +} + +static inline void complete_tx_only(struct xdpsock *xsk) +{ + u32 descs[BATCH_SIZE]; + unsigned int rcvd; + + if (!xsk->outstanding_tx) + return; + + kick_tx(xsk->sfd); + + rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE); + if (rcvd > 0) { + xsk->outstanding_tx -= rcvd; + xsk->tx_npkts += rcvd; + } +} + +static void rx_drop(struct xdpsock *xsk) +{ + struct xdp_desc descs[BATCH_SIZE]; + unsigned int rcvd, i; + + rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE); + if (!rcvd) + return; + + for (i = 0; i < rcvd; i++) { + u32 idx = descs[i].idx; + + lassert(idx < NUM_FRAMES); +#if DEBUG_HEXDUMP + char *pkt; + char buf[32]; + + pkt = xq_get_data(xsk, idx, descs[i].offset); + sprintf(buf, "idx=%d", idx); + hex_dump(pkt, descs[i].len, buf); +#endif + } + + xsk->rx_npkts += rcvd; + + umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd); +} + +static void rx_drop_all(void) +{ + struct pollfd fds[MAX_SOCKS + 1]; + int i, ret, timeout, nfds = 1; + + memset(fds, 0, sizeof(fds)); + + for (i = 0; i < num_socks; i++) { + fds[i].fd = xsks[i]->sfd; + fds[i].events = POLLIN; + timeout = 1000; /* 1sn */ + } + + for (;;) { + if (opt_poll) { + ret = poll(fds, nfds, timeout); + if (ret <= 0) + continue; + } + + for (i = 0; i < num_socks; i++) + rx_drop(xsks[i]); + } +} + +static void tx_only(struct xdpsock *xsk) +{ + int timeout, ret, nfds = 1; + struct pollfd fds[nfds + 1]; + unsigned int idx = 0; + + memset(fds, 0, sizeof(fds)); + fds[0].fd = xsk->sfd; + fds[0].events = POLLOUT; + timeout = 1000; /* 1sn */ + + for (;;) { + if (opt_poll) { + ret = poll(fds, nfds, timeout); + if (ret <= 0) + continue; + + if (fds[0].fd != xsk->sfd || + !(fds[0].revents & POLLOUT)) + continue; + } + + if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) { + lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0); + + xsk->outstanding_tx += BATCH_SIZE; + idx += BATCH_SIZE; + idx %= NUM_FRAMES; + } + + complete_tx_only(xsk); + } +} + +static void l2fwd(struct xdpsock *xsk) +{ + for (;;) { + struct xdp_desc descs[BATCH_SIZE]; + unsigned int rcvd, i; + int ret; + + for (;;) { + complete_tx_l2fwd(xsk); + + rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE); + if (rcvd > 0) + break; + } + + for (i = 0; i < rcvd; i++) { + char *pkt = xq_get_data(xsk, descs[i].idx, + descs[i].offset); + + swap_mac_addresses(pkt); +#if DEBUG_HEXDUMP + char buf[32]; + u32 idx = descs[i].idx; + + sprintf(buf, "idx=%d", idx); + hex_dump(pkt, descs[i].len, buf); +#endif + } + + xsk->rx_npkts += rcvd; + + ret = xq_enq(&xsk->tx, descs, rcvd); + lassert(ret == 0); + xsk->outstanding_tx += rcvd; + } +} + +int main(int argc, char **argv) +{ + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; + char xdp_filename[256]; + int i, ret, key = 0; + pthread_t pt; + + parse_command_line(argc, argv); + + if (setrlimit(RLIMIT_MEMLOCK, &r)) { + fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n", + strerror(errno)); + exit(EXIT_FAILURE); + } + + snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]); + + if (load_bpf_file(xdp_filename)) { + fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf); + exit(EXIT_FAILURE); + } + + if (!prog_fd[0]) { + fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n", + strerror(errno)); + exit(EXIT_FAILURE); + } + + if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) { + fprintf(stderr, "ERROR: link set xdp fd failed\n"); + exit(EXIT_FAILURE); + } + + ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0); + if (ret) { + fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n"); + exit(EXIT_FAILURE); + } + + /* Create sockets... */ + xsks[num_socks++] = xsk_configure(NULL); + +#if RR_LB + for (i = 0; i < MAX_SOCKS - 1; i++) + xsks[num_socks++] = xsk_configure(xsks[0]->umem); +#endif + + /* ...and insert them into the map. */ + for (i = 0; i < num_socks; i++) { + key = i; + ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0); + if (ret) { + fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i); + exit(EXIT_FAILURE); + } + } + + signal(SIGINT, int_exit); + signal(SIGTERM, int_exit); + signal(SIGABRT, int_exit); + + setlocale(LC_ALL, ""); + + ret = pthread_create(&pt, NULL, poller, NULL); + lassert(ret == 0); + + prev_time = get_nsecs(); + + if (opt_bench == BENCH_RXDROP) + rx_drop_all(); + else if (opt_bench == BENCH_TXONLY) + tx_only(xsks[0]); + else + l2fwd(xsks[0]); + + return 0; +} diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py new file mode 100755 index 000000000000..30ba0fee36e4 --- /dev/null +++ b/scripts/bpf_helpers_doc.py @@ -0,0 +1,421 @@ +#!/usr/bin/python3 +# SPDX-License-Identifier: GPL-2.0-only +# +# Copyright (C) 2018 Netronome Systems, Inc. + +# In case user attempts to run with Python 2. +from __future__ import print_function + +import argparse +import re +import sys, os + +class NoHelperFound(BaseException): + pass + +class ParsingError(BaseException): + def __init__(self, line='', reader=None): + if reader: + BaseException.__init__(self, + 'Error at file offset %d, parsing line: %s' % + (reader.tell(), line)) + else: + BaseException.__init__(self, 'Error parsing line: %s' % line) + +class Helper(object): + """ + An object representing the description of an eBPF helper function. + @proto: function prototype of the helper function + @desc: textual description of the helper function + @ret: description of the return value of the helper function + """ + def __init__(self, proto='', desc='', ret=''): + self.proto = proto + self.desc = desc + self.ret = ret + + def proto_break_down(self): + """ + Break down helper function protocol into smaller chunks: return type, + name, distincts arguments. + """ + arg_re = re.compile('^((const )?(struct )?(\w+|...))( (\**)(\w+))?$') + res = {} + proto_re = re.compile('^(.+) (\**)(\w+)\(((([^,]+)(, )?){1,5})\)$') + + capture = proto_re.match(self.proto) + res['ret_type'] = capture.group(1) + res['ret_star'] = capture.group(2) + res['name'] = capture.group(3) + res['args'] = [] + + args = capture.group(4).split(', ') + for a in args: + capture = arg_re.match(a) + res['args'].append({ + 'type' : capture.group(1), + 'star' : capture.group(6), + 'name' : capture.group(7) + }) + + return res + +class HeaderParser(object): + """ + An object used to parse a file in order to extract the documentation of a + list of eBPF helper functions. All the helpers that can be retrieved are + stored as Helper object, in the self.helpers() array. + @filename: name of file to parse, usually include/uapi/linux/bpf.h in the + kernel tree + """ + def __init__(self, filename): + self.reader = open(filename, 'r') + self.line = '' + self.helpers = [] + + def parse_helper(self): + proto = self.parse_proto() + desc = self.parse_desc() + ret = self.parse_ret() + return Helper(proto=proto, desc=desc, ret=ret) + + def parse_proto(self): + # Argument can be of shape: + # - "void" + # - "type name" + # - "type *name" + # - Same as above, with "const" and/or "struct" in front of type + # - "..." (undefined number of arguments, for bpf_trace_printk()) + # There is at least one term ("void"), and at most five arguments. + p = re.compile('^ \* ((.+) \**\w+\((((const )?(struct )?(\w+|\.\.\.)( \**\w+)?)(, )?){1,5}\))$') + capture = p.match(self.line) + if not capture: + raise NoHelperFound + self.line = self.reader.readline() + return capture.group(1) + + def parse_desc(self): + p = re.compile('^ \* \tDescription$') + capture = p.match(self.line) + if not capture: + # Helper can have empty description and we might be parsing another + # attribute: return but do not consume. + return '' + # Description can be several lines, some of them possibly empty, and it + # stops when another subsection title is met. + desc = '' + while True: + self.line = self.reader.readline() + if self.line == ' *\n': + desc += '\n' + else: + p = re.compile('^ \* \t\t(.*)') + capture = p.match(self.line) + if capture: + desc += capture.group(1) + '\n' + else: + break + return desc + + def parse_ret(self): + p = re.compile('^ \* \tReturn$') + capture = p.match(self.line) + if not capture: + # Helper can have empty retval and we might be parsing another + # attribute: return but do not consume. + return '' + # Return value description can be several lines, some of them possibly + # empty, and it stops when another subsection title is met. + ret = '' + while True: + self.line = self.reader.readline() + if self.line == ' *\n': + ret += '\n' + else: + p = re.compile('^ \* \t\t(.*)') + capture = p.match(self.line) + if capture: + ret += capture.group(1) + '\n' + else: + break + return ret + + def run(self): + # Advance to start of helper function descriptions. + offset = self.reader.read().find('* Start of BPF helper function descriptions:') + if offset == -1: + raise Exception('Could not find start of eBPF helper descriptions list') + self.reader.seek(offset) + self.reader.readline() + self.reader.readline() + self.line = self.reader.readline() + + while True: + try: + helper = self.parse_helper() + self.helpers.append(helper) + except NoHelperFound: + break + + self.reader.close() + print('Parsed description of %d helper function(s)' % len(self.helpers), + file=sys.stderr) + +############################################################################### + +class Printer(object): + """ + A generic class for printers. Printers should be created with an array of + Helper objects, and implement a way to print them in the desired fashion. + @helpers: array of Helper objects to print to standard output + """ + def __init__(self, helpers): + self.helpers = helpers + + def print_header(self): + pass + + def print_footer(self): + pass + + def print_one(self, helper): + pass + + def print_all(self): + self.print_header() + for helper in self.helpers: + self.print_one(helper) + self.print_footer() + +class PrinterRST(Printer): + """ + A printer for dumping collected information about helpers as a ReStructured + Text page compatible with the rst2man program, which can be used to + generate a manual page for the helpers. + @helpers: array of Helper objects to print to standard output + """ + def print_header(self): + header = '''\ +.. Copyright (C) All BPF authors and contributors from 2014 to present. +.. See git log include/uapi/linux/bpf.h in kernel tree for details. +.. +.. %%%LICENSE_START(VERBATIM) +.. Permission is granted to make and distribute verbatim copies of this +.. manual provided the copyright notice and this permission notice are +.. preserved on all copies. +.. +.. Permission is granted to copy and distribute modified versions of this +.. manual under the conditions for verbatim copying, provided that the +.. entire resulting derived work is distributed under the terms of a +.. permission notice identical to this one. +.. +.. Since the Linux kernel and libraries are constantly changing, this +.. manual page may be incorrect or out-of-date. The author(s) assume no +.. responsibility for errors or omissions, or for damages resulting from +.. the use of the information contained herein. The author(s) may not +.. have taken the same level of care in the production of this manual, +.. which is licensed free of charge, as they might when working +.. professionally. +.. +.. Formatted or processed versions of this manual, if unaccompanied by +.. the source, must acknowledge the copyright and authors of this work. +.. %%%LICENSE_END +.. +.. Please do not edit this file. It was generated from the documentation +.. located in file include/uapi/linux/bpf.h of the Linux kernel sources +.. (helpers description), and from scripts/bpf_helpers_doc.py in the same +.. repository (header and footer). + +=========== +BPF-HELPERS +=========== +------------------------------------------------------------------------------- +list of eBPF helper functions +------------------------------------------------------------------------------- + +:Manual section: 7 + +DESCRIPTION +=========== + +The extended Berkeley Packet Filter (eBPF) subsystem consists in programs +written in a pseudo-assembly language, then attached to one of the several +kernel hooks and run in reaction of specific events. This framework differs +from the older, "classic" BPF (or "cBPF") in several aspects, one of them being +the ability to call special functions (or "helpers") from within a program. +These functions are restricted to a white-list of helpers defined in the +kernel. + +These helpers are used by eBPF programs to interact with the system, or with +the context in which they work. For instance, they can be used to print +debugging messages, to get the time since the system was booted, to interact +with eBPF maps, or to manipulate network packets. Since there are several eBPF +program types, and that they do not run in the same context, each program type +can only call a subset of those helpers. + +Due to eBPF conventions, a helper can not have more than five arguments. + +Internally, eBPF programs call directly into the compiled helper functions +without requiring any foreign-function interface. As a result, calling helpers +introduces no overhead, thus offering excellent performance. + +This document is an attempt to list and document the helpers available to eBPF +developers. They are sorted by chronological order (the oldest helpers in the +kernel at the top). + +HELPERS +======= +''' + print(header) + + def print_footer(self): + footer = ''' +EXAMPLES +======== + +Example usage for most of the eBPF helpers listed in this manual page are +available within the Linux kernel sources, at the following locations: + +* *samples/bpf/* +* *tools/testing/selftests/bpf/* + +LICENSE +======= + +eBPF programs can have an associated license, passed along with the bytecode +instructions to the kernel when the programs are loaded. The format for that +string is identical to the one in use for kernel modules (Dual licenses, such +as "Dual BSD/GPL", may be used). Some helper functions are only accessible to +programs that are compatible with the GNU Privacy License (GPL). + +In order to use such helpers, the eBPF program must be loaded with the correct +license string passed (via **attr**) to the **bpf**\ () system call, and this +generally translates into the C source code of the program containing a line +similar to the following: + +:: + + char ____license[] __attribute__((section("license"), used)) = "GPL"; + +IMPLEMENTATION +============== + +This manual page is an effort to document the existing eBPF helper functions. +But as of this writing, the BPF sub-system is under heavy development. New eBPF +program or map types are added, along with new helper functions. Some helpers +are occasionally made available for additional program types. So in spite of +the efforts of the community, this page might not be up-to-date. If you want to +check by yourself what helper functions exist in your kernel, or what types of +programs they can support, here are some files among the kernel tree that you +may be interested in: + +* *include/uapi/linux/bpf.h* is the main BPF header. It contains the full list + of all helper functions, as well as many other BPF definitions including most + of the flags, structs or constants used by the helpers. +* *net/core/filter.c* contains the definition of most network-related helper + functions, and the list of program types from which they can be used. +* *kernel/trace/bpf_trace.c* is the equivalent for most tracing program-related + helpers. +* *kernel/bpf/verifier.c* contains the functions used to check that valid types + of eBPF maps are used with a given helper function. +* *kernel/bpf/* directory contains other files in which additional helpers are + defined (for cgroups, sockmaps, etc.). + +Compatibility between helper functions and program types can generally be found +in the files where helper functions are defined. Look for the **struct +bpf_func_proto** objects and for functions returning them: these functions +contain a list of helpers that a given program type can call. Note that the +**default:** label of the **switch ... case** used to filter helpers can call +other functions, themselves allowing access to additional helpers. The +requirement for GPL license is also in those **struct bpf_func_proto**. + +Compatibility between helper functions and map types can be found in the +**check_map_func_compatibility**\ () function in file *kernel/bpf/verifier.c*. + +Helper functions that invalidate the checks on **data** and **data_end** +pointers for network processing are listed in function +**bpf_helper_changes_pkt_data**\ () in file *net/core/filter.c*. + +SEE ALSO +======== + +**bpf**\ (2), +**cgroups**\ (7), +**ip**\ (8), +**perf_event_open**\ (2), +**sendmsg**\ (2), +**socket**\ (7), +**tc-bpf**\ (8)''' + print(footer) + + def print_proto(self, helper): + """ + Format function protocol with bold and italics markers. This makes RST + file less readable, but gives nice results in the manual page. + """ + proto = helper.proto_break_down() + + print('**%s %s%s(' % (proto['ret_type'], + proto['ret_star'].replace('*', '\\*'), + proto['name']), + end='') + + comma = '' + for a in proto['args']: + one_arg = '{}{}'.format(comma, a['type']) + if a['name']: + if a['star']: + one_arg += ' {}**\ '.format(a['star'].replace('*', '\\*')) + else: + one_arg += '** ' + one_arg += '*{}*\\ **'.format(a['name']) + comma = ', ' + print(one_arg, end='') + + print(')**') + + def print_one(self, helper): + self.print_proto(helper) + + if (helper.desc): + print('\tDescription') + # Do not strip all newline characters: formatted code at the end of + # a section must be followed by a blank line. + for line in re.sub('\n$', '', helper.desc, count=1).split('\n'): + print('{}{}'.format('\t\t' if line else '', line)) + + if (helper.ret): + print('\tReturn') + for line in helper.ret.rstrip().split('\n'): + print('{}{}'.format('\t\t' if line else '', line)) + + print('') + +############################################################################### + +# If script is launched from scripts/ from kernel tree and can access +# ../include/uapi/linux/bpf.h, use it as a default name for the file to parse, +# otherwise the --filename argument will be required from the command line. +script = os.path.abspath(sys.argv[0]) +linuxRoot = os.path.dirname(os.path.dirname(script)) +bpfh = os.path.join(linuxRoot, 'include/uapi/linux/bpf.h') + +argParser = argparse.ArgumentParser(description=""" +Parse eBPF header file and generate documentation for eBPF helper functions. +The RST-formatted output produced can be turned into a manual page with the +rst2man utility. +""") +if (os.path.isfile(bpfh)): + argParser.add_argument('--filename', help='path to include/uapi/linux/bpf.h', + default=bpfh) +else: + argParser.add_argument('--filename', help='path to include/uapi/linux/bpf.h') +args = argParser.parse_args() + +# Parse file. +headerParser = HeaderParser(args.filename) +headerParser.run() + +# Print formatted output to standard output. +printer = PrinterRST(headerParser.helpers) +printer.print_all() diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh index c98e0393978c..fe62d2fc881b 100755 --- a/scripts/link-vmlinux.sh +++ b/scripts/link-vmlinux.sh @@ -40,7 +40,7 @@ set -e info() { if [ "${quiet}" != "silent_" ]; then - printf " %-7s %s\n" ${1} ${2} + printf " %-7s %s\n" "${1}" "${2}" fi } @@ -142,8 +142,8 @@ recordmcount() } # Link of vmlinux -# ${1} - optional extra .o files -# ${2} - output file +# ${1} - output file +# ${@:2} - optional extra .o files vmlinux_link() { local lds="${objtree}/${KBUILD_LDS}" @@ -165,17 +165,17 @@ vmlinux_link() --start-group \ ${KBUILD_VMLINUX_LIBS} \ --end-group \ - ${1}" + ${@:2}" else objects="${KBUILD_VMLINUX_INIT} \ --start-group \ ${KBUILD_VMLINUX_MAIN} \ ${KBUILD_VMLINUX_LIBS} \ --end-group \ - ${1}" + ${@:2}" fi - ${ld} ${ldflags} -o ${2} -T ${lds} ${objects} + ${ld} ${ldflags} -o ${1} -T ${lds} ${objects} else if [ -n "${CONFIG_THIN_ARCHIVES}" ]; then objects="-Wl,--whole-archive \ @@ -184,17 +184,17 @@ vmlinux_link() -Wl,--start-group \ ${KBUILD_VMLINUX_LIBS} \ -Wl,--end-group \ - ${1}" + ${@:2}" else objects="${KBUILD_VMLINUX_INIT} \ -Wl,--start-group \ ${KBUILD_VMLINUX_MAIN} \ ${KBUILD_VMLINUX_LIBS} \ -Wl,--end-group \ - ${1}" + ${@:2}" fi - ${CC} ${CFLAGS_vmlinux} -o ${2} \ + ${CC} ${CFLAGS_vmlinux} -o ${1} \ -Wl,-T,${lds} \ ${objects} \ -lutil -lrt -lpthread @@ -202,6 +202,40 @@ vmlinux_link() fi } +# generate .BTF typeinfo from DWARF debuginfo +# ${1} - vmlinux image +# ${2} - file to dump raw BTF data into +gen_btf() +{ + local pahole_ver + + if ! [ -x "$(command -v ${PAHOLE})" ]; then + info "BTF" "${1}: pahole (${PAHOLE}) is not available" + return 1 + fi + + pahole_ver=$(${PAHOLE} --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/') + if [ "${pahole_ver}" -lt "113" ]; then + info "BTF" "${1}: pahole version $(${PAHOLE} --version) is too old, need at least v1.13" + return 1 + fi + + info "BTF" ${2} + vmlinux_link ${1} + LLVM_OBJCOPY=${OBJCOPY} ${PAHOLE} -J ${1} + + # Create ${2} which contains just .BTF section but no symbols. Add + # SHF_ALLOC because .BTF will be part of the vmlinux image. --strip-all + # deletes all symbols including __start_BTF and __stop_BTF, which will + # be redefined in the linker script. Add 2>/dev/null to suppress GNU + # objcopy warnings: "empty loadable segment detected at ..." + ${OBJCOPY} --only-section=.BTF --set-section-flags .BTF=alloc,readonly \ + --strip-all ${1} ${2} 2>/dev/null + # Change e_type to ET_REL so that it can be used to link final vmlinux. + # Unlike GNU ld, lld does not allow an ET_EXEC input. + printf '\1' | dd of=${2} conv=notrunc bs=1 seek=16 status=none +} + # Create ${2} .o file with all symbols from the ${1} object file kallsyms() { @@ -273,6 +307,7 @@ sortextable() # Delete output files in case of error cleanup() { + rm -f .btf.* rm -f .old_version rm -f .tmp_System.map rm -f .tmp_kallsyms* @@ -366,6 +401,13 @@ if [ ! -z ${RTIC_MPGEN+x} ]; then KBUILD_VMLINUX_LIBS+=$RTIC_MP_O fi +btf_vmlinux_bin_o="" +if [ -n "${CONFIG_DEBUG_INFO_BTF}" ]; then + if gen_btf .tmp_vmlinux.btf .btf.vmlinux.bin.o ; then + btf_vmlinux_bin_o=.btf.vmlinux.bin.o + fi +fi + kallsymso="" kallsyms_vmlinux="" if [ -n "${CONFIG_KALLSYMS}" ]; then @@ -397,11 +439,11 @@ if [ -n "${CONFIG_KALLSYMS}" ]; then kallsyms_vmlinux=.tmp_vmlinux2 # step 1 - vmlinux_link "" .tmp_vmlinux1 + vmlinux_link .tmp_vmlinux1 ${btf_vmlinux_bin_o} kallsyms .tmp_vmlinux1 .tmp_kallsyms1.o # step 2 - vmlinux_link .tmp_kallsyms1.o .tmp_vmlinux2 + vmlinux_link .tmp_vmlinux2 .tmp_kallsyms1.o ${btf_vmlinux_bin_o} kallsyms .tmp_vmlinux2 .tmp_kallsyms2.o # step 3 @@ -412,8 +454,7 @@ if [ -n "${CONFIG_KALLSYMS}" ]; then kallsymso=.tmp_kallsyms3.o kallsyms_vmlinux=.tmp_vmlinux3 - vmlinux_link .tmp_kallsyms2.o .tmp_vmlinux3 - + vmlinux_link .tmp_vmlinux3 .tmp_kallsyms2.o ${btf_vmlinux_bin_o} kallsyms .tmp_vmlinux3 .tmp_kallsyms3.o fi fi @@ -435,7 +476,7 @@ if [ ! -z ${RTIC_MP_O} ]; then fi info LD vmlinux -vmlinux_link "${kallsymso}" vmlinux +vmlinux_link vmlinux "${kallsymso}" "${btf_vmlinux_bin_o}" if [ -n "${CONFIG_BUILDTIME_EXTABLE_SORT}" ]; then info SORTEX vmlinux diff --git a/scripts/tags.sh b/scripts/tags.sh index 086341e10aa8..4936366e9bd6 100755 --- a/scripts/tags.sh +++ b/scripts/tags.sh @@ -195,9 +195,9 @@ regex_c=( '/\label = &tracer->label; aad(sa)->peer = tracee; aad(sa)->request = 0; - aad(sa)->error = aa_capable(&tracer->label, CAP_SYS_PTRACE, 1); + aad(sa)->error = aa_capable(&tracer->label, CAP_SYS_PTRACE, + CAP_OPT_NONE); return aa_audit(AUDIT_APPARMOR_AUTO, tracer, sa, audit_ptrace_cb); } diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c index b9dcf7ec95a0..ccdc3d7b34ec 100644 --- a/security/apparmor/lsm.c +++ b/security/apparmor/lsm.c @@ -167,14 +167,14 @@ static int apparmor_capget(struct task_struct *target, kernel_cap_t *effective, } static int apparmor_capable(const struct cred *cred, struct user_namespace *ns, - int cap, int audit) + int cap, unsigned int opts) { struct aa_label *label; int error = 0; label = aa_get_newest_cred_label(cred); if (!unconfined(label)) - error = aa_capable(label, cap, audit); + error = aa_capable(label, cap, opts); aa_put_label(label); return error; diff --git a/security/apparmor/resource.c b/security/apparmor/resource.c index d8bc842594ed..e0268424e11c 100644 --- a/security/apparmor/resource.c +++ b/security/apparmor/resource.c @@ -124,7 +124,7 @@ int aa_task_setrlimit(struct aa_label *label, struct task_struct *task, */ if (label != peer && - !aa_capable(label, CAP_SYS_RESOURCE, SECURITY_CAP_NOAUDIT)) + aa_capable(label, CAP_SYS_RESOURCE, CAP_OPT_NOAUDIT) != 0) error = fn_for_each(label, profile, audit_resource(profile, resource, new_rlim->rlim_max, peer, diff --git a/security/commoncap.c b/security/commoncap.c index 776efd3c0b8c..b0626bbf3e0b 100644 --- a/security/commoncap.c +++ b/security/commoncap.c @@ -54,7 +54,7 @@ static void warn_setuid_and_fcaps_mixed(const char *fname) } /** - * __cap_capable - Determine whether a task has a particular effective capability + * cap_capable - Determine whether a task has a particular effective capability * @cred: The credentials to use * @ns: The user namespace in which we need the capability * @cap: The capability to check for @@ -68,8 +68,8 @@ static void warn_setuid_and_fcaps_mixed(const char *fname) * cap_has_capability() returns 0 when a task has a capability, but the * kernel's capable() and has_capability() returns 1 for this case. */ -int __cap_capable(const struct cred *cred, struct user_namespace *targ_ns, - int cap, int audit) +int cap_capable(const struct cred *cred, struct user_namespace *targ_ns, + int cap, unsigned int opts) { struct user_namespace *ns = targ_ns; @@ -113,11 +113,6 @@ int __cap_capable(const struct cred *cred, struct user_namespace *targ_ns, /* We never get here */ } -int cap_capable(const struct cred *cred, struct user_namespace *targ_ns, - int cap, int audit) -{ - return __cap_capable(cred, targ_ns, cap, audit); -} /** * cap_settime - Determine whether the current process may set the system clock * @ts: The time to set @@ -235,12 +230,11 @@ int cap_capget(struct task_struct *target, kernel_cap_t *effective, */ static inline int cap_inh_is_capped(void) { - /* they are so limited unless the current task has the CAP_SETPCAP * capability */ if (cap_capable(current_cred(), current_cred()->user_ns, - CAP_SETPCAP, SECURITY_CAP_AUDIT) == 0) + CAP_SETPCAP, CAP_OPT_NONE) == 0) return 0; return 1; } @@ -1179,8 +1173,9 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3, || ((old->securebits & SECURE_ALL_LOCKS & ~arg2)) /*[2]*/ || (arg2 & ~(SECURE_ALL_LOCKS | SECURE_ALL_BITS)) /*[3]*/ || (cap_capable(current_cred(), - current_cred()->user_ns, CAP_SETPCAP, - SECURITY_CAP_AUDIT) != 0) /*[4]*/ + current_cred()->user_ns, + CAP_SETPCAP, + CAP_OPT_NONE) != 0) /*[4]*/ /* * [1] no changing of bits that are locked * [2] no unlocking of locks @@ -1275,9 +1270,10 @@ int cap_vm_enough_memory(struct mm_struct *mm, long pages) { int cap_sys_admin = 0; - if (cap_capable(current_cred(), &init_user_ns, CAP_SYS_ADMIN, - SECURITY_CAP_NOAUDIT) == 0) + if (cap_capable(current_cred(), &init_user_ns, + CAP_SYS_ADMIN, CAP_OPT_NOAUDIT) == 0) cap_sys_admin = 1; + return cap_sys_admin; } @@ -1296,7 +1292,7 @@ int cap_mmap_addr(unsigned long addr) if (addr < dac_mmap_min_addr) { ret = cap_capable(current_cred(), &init_user_ns, CAP_SYS_RAWIO, - SECURITY_CAP_AUDIT); + CAP_OPT_NONE); /* set PF_SUPERPRIV if it turns out we allow the low mmap */ if (ret == 0) current->flags |= PF_SUPERPRIV; diff --git a/security/security.c b/security/security.c index 960e25cc4983..4166afebbdb0 100644 --- a/security/security.c +++ b/security/security.c @@ -285,16 +285,12 @@ int security_capset(struct cred *new, const struct cred *old, effective, inheritable, permitted); } -int security_capable(const struct cred *cred, struct user_namespace *ns, - int cap) +int security_capable(const struct cred *cred, + struct user_namespace *ns, + int cap, + unsigned int opts) { - return call_int_hook(capable, 0, cred, ns, cap, SECURITY_CAP_AUDIT); -} - -int security_capable_noaudit(const struct cred *cred, struct user_namespace *ns, - int cap) -{ - return call_int_hook(capable, 0, cred, ns, cap, SECURITY_CAP_NOAUDIT); + return call_int_hook(capable, 0, cred, ns, cap, opts); } int security_quotactl(int cmds, int type, int id, struct super_block *sb) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 27551e16b75a..c526194d920e 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -1618,7 +1618,9 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc return SECCLASS_QIPCRTR_SOCKET; case PF_SMC: return SECCLASS_SMC_SOCKET; -#if PF_MAX > 44 + case PF_XDP: + return SECCLASS_XDP_SOCKET; +#if PF_MAX > 45 #error New address family defined, please update this function. #endif } @@ -1907,7 +1909,7 @@ static inline u32 signal_to_av(int sig) /* Check whether a task is allowed to use a capability. */ static int cred_has_capability(const struct cred *cred, - int cap, int audit, bool initns) + int cap, unsigned int opts, bool initns) { struct common_audit_data ad; struct av_decision avd; @@ -1935,7 +1937,7 @@ static int cred_has_capability(const struct cred *cred, rc = avc_has_perm_noaudit(&selinux_state, sid, sid, sclass, av, 0, &avd); - if (audit == SECURITY_CAP_AUDIT) { + if (!(opts & CAP_OPT_NOAUDIT)) { int rc2 = avc_audit(&selinux_state, sid, sid, sclass, av, &avd, rc, &ad, 0); if (rc2) @@ -2449,9 +2451,9 @@ static int selinux_capset(struct cred *new, const struct cred *old, */ static int selinux_capable(const struct cred *cred, struct user_namespace *ns, - int cap, int audit) + int cap, unsigned int opts) { - return cred_has_capability(cred, cap, audit, ns == &init_user_ns); + return cred_has_capability(cred, cap, opts, ns == &init_user_ns); } static int selinux_quotactl(int cmds, int type, int id, struct super_block *sb) @@ -2525,7 +2527,7 @@ static int selinux_vm_enough_memory(struct mm_struct *mm, long pages) int rc, cap_sys_admin = 0; rc = cred_has_capability(current_cred(), CAP_SYS_ADMIN, - SECURITY_CAP_NOAUDIT, true); + CAP_OPT_NOAUDIT, true); if (rc == 0) cap_sys_admin = 1; @@ -3411,11 +3413,11 @@ static int selinux_inode_setotherxattr(struct dentry *dentry, const char *name) static bool has_cap_mac_admin(bool audit) { const struct cred *cred = current_cred(); - int cap_audit = audit ? SECURITY_CAP_AUDIT : SECURITY_CAP_NOAUDIT; + unsigned int opts = audit ? CAP_OPT_NONE : CAP_OPT_NOAUDIT; - if (cap_capable(cred, &init_user_ns, CAP_MAC_ADMIN, cap_audit)) + if (cap_capable(cred, &init_user_ns, CAP_MAC_ADMIN, opts)) return false; - if (cred_has_capability(cred, CAP_MAC_ADMIN, cap_audit, true)) + if (cred_has_capability(cred, CAP_MAC_ADMIN, opts, true)) return false; return true; } @@ -3805,7 +3807,7 @@ static int selinux_file_ioctl(struct file *file, unsigned int cmd, case KDSKBENT: case KDSKBSENT: error = cred_has_capability(cred, CAP_SYS_TTY_CONFIG, - SECURITY_CAP_AUDIT, true); + CAP_OPT_NONE, true); break; /* default case assumes that the command will go diff --git a/security/selinux/ibpkey.c b/security/selinux/ibpkey.c index cb05ae28ce00..5887bff50560 100644 --- a/security/selinux/ibpkey.c +++ b/security/selinux/ibpkey.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Pkey table * @@ -11,21 +12,10 @@ * Paul Moore * (see security/selinux/netif.c and security/selinux/netport.c for more * information) - * */ /* * (c) Mellanox Technologies, 2016 - * - * This program is free software: you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * */ #include diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h index b52c7e1d7a84..2eea88e95b29 100644 --- a/security/selinux/include/classmap.h +++ b/security/selinux/include/classmap.h @@ -242,11 +242,13 @@ struct security_class_mapping secclass_map[] = { { "manage_subnet", NULL } }, { "bpf", {"map_create", "map_read", "map_write", "prog_load", "prog_run"} }, + { "xdp_socket", + { COMMON_SOCK_PERMS, NULL } }, { "perf_event", {"open", "cpu", "kernel", "tracepoint", "read", "write"} }, { NULL } }; -#if PF_MAX > 44 +#if PF_MAX > 45 #error New address family defined, please update secclass_map. #endif diff --git a/security/selinux/include/ibpkey.h b/security/selinux/include/ibpkey.h index e3c2b12ac22a..a684bf86933d 100644 --- a/security/selinux/include/ibpkey.h +++ b/security/selinux/include/ibpkey.h @@ -1,24 +1,14 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * pkey table * * SELinux must keep a mapping of pkeys to labels/SIDs. This * mapping is maintained as part of the normal policy but a fast cache is * needed to reduce the lookup overhead. - * */ /* * (c) Mellanox Technologies, 2016 - * - * This program is free software: you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * */ #ifndef _SELINUX_IB_PKEY_H diff --git a/security/selinux/include/netnode.h b/security/selinux/include/netnode.h index 937668dd3024..e3f784a85840 100644 --- a/security/selinux/include/netnode.h +++ b/security/selinux/include/netnode.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Network node table * @@ -7,21 +8,10 @@ * a per-packet basis. * * Author: Paul Moore - * */ /* * (c) Copyright Hewlett-Packard Development Company, L.P., 2007 - * - * This program is free software: you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * */ #ifndef _SELINUX_NETNODE_H diff --git a/security/selinux/include/netport.h b/security/selinux/include/netport.h index d1ce896b2cb0..31bc16e29cd1 100644 --- a/security/selinux/include/netport.h +++ b/security/selinux/include/netport.h @@ -1,3 +1,4 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Network port table * @@ -6,21 +7,10 @@ * needed to reduce the lookup overhead. * * Author: Paul Moore - * */ /* * (c) Copyright Hewlett-Packard Development Company, L.P., 2008 - * - * This program is free software: you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * */ #ifndef _SELINUX_NETPORT_H diff --git a/security/selinux/netnode.c b/security/selinux/netnode.c index 08ee81bfcc1c..8335595f7050 100644 --- a/security/selinux/netnode.c +++ b/security/selinux/netnode.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Network node table * @@ -11,21 +12,10 @@ * This code is heavily based on the "netif" concept originally developed by * James Morris * (see security/selinux/netif.c for more information) - * */ /* * (c) Copyright Hewlett-Packard Development Company, L.P., 2007 - * - * This program is free software: you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * */ #include diff --git a/security/selinux/netport.c b/security/selinux/netport.c index 4e8e4458b55f..24c2bce74248 100644 --- a/security/selinux/netport.c +++ b/security/selinux/netport.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Network port table * @@ -10,21 +11,10 @@ * This code is heavily based on the "netif" concept originally developed by * James Morris * (see security/selinux/netif.c for more information) - * */ /* * (c) Copyright Hewlett-Packard Development Company, L.P., 2008 - * - * This program is free software: you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * */ #include diff --git a/security/selinux/xfrm.c b/security/selinux/xfrm.c index 8768e6b0226f..9e803d2a687a 100644 --- a/security/selinux/xfrm.c +++ b/security/selinux/xfrm.c @@ -459,7 +459,7 @@ int selinux_xfrm_postroute_last(u32 sk_sid, struct sk_buff *skb, if (dst) { struct dst_entry *iter; - for (iter = dst; iter != NULL; iter = iter->child) { + for (iter = dst; iter != NULL; iter = xfrm_dst_child(iter)) { struct xfrm_state *x = iter->xfrm; if (x && selinux_authorizable_xfrm(x)) diff --git a/security/smack/smack.h b/security/smack/smack.h index 6a71fc7831ab..f7db791fb566 100644 --- a/security/smack/smack.h +++ b/security/smack/smack.h @@ -321,6 +321,7 @@ struct smack_known *smk_import_entry(const char *, int); void smk_insert_entry(struct smack_known *skp); struct smack_known *smk_find_entry(const char *); bool smack_privileged(int cap); +bool smack_privileged_cred(int cap, const struct cred *cred); void smk_destroy_label_list(struct list_head *list); /* diff --git a/security/smack/smack_access.c b/security/smack/smack_access.c index c8e82d6a12b5..07d23b4f76f3 100644 --- a/security/smack/smack_access.c +++ b/security/smack/smack_access.c @@ -622,26 +622,24 @@ struct smack_known *smack_from_secid(const u32 secid) LIST_HEAD(smack_onlycap_list); DEFINE_MUTEX(smack_onlycap_lock); -/* +/** + * smack_privileged_cred - are all privilege requirements met by cred + * @cap: The requested capability + * @cred: the credential to use + * * Is the task privileged and allowed to be privileged * by the onlycap rule. * * Returns true if the task is allowed to be privileged, false if it's not. */ -bool smack_privileged(int cap) +bool smack_privileged_cred(int cap, const struct cred *cred) { - struct smack_known *skp = smk_of_current(); + struct task_smack *tsp = cred->security; + struct smack_known *skp = tsp->smk_task; struct smack_known_list_elem *sklep; int rc; - /* - * All kernel tasks are privileged - */ - if (unlikely(current->flags & PF_KTHREAD)) - return true; - - rc = cap_capable(current_cred(), &init_user_ns, cap, - SECURITY_CAP_AUDIT); + rc = cap_capable(cred, &init_user_ns, cap, CAP_OPT_NONE); if (rc) return false; @@ -661,3 +659,23 @@ bool smack_privileged(int cap) return false; } + +/** + * smack_privileged - are all privilege requirements met + * @cap: The requested capability + * + * Is the task privileged and allowed to be privileged + * by the onlycap rule. + * + * Returns true if the task is allowed to be privileged, false if it's not. + */ +bool smack_privileged(int cap) +{ + /* + * All kernel tasks are privileged + */ + if (unlikely(current->flags & PF_KTHREAD)) + return true; + + return smack_privileged_cred(cap, current_cred()); +} diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c index 86d66d17f641..33095c32fa19 100644 --- a/security/smack/smack_lsm.c +++ b/security/smack/smack_lsm.c @@ -4395,6 +4395,10 @@ static int smack_key_permission(key_ref_t key_ref, */ if (tkp == NULL) return -EACCES; + + if (smack_privileged_cred(CAP_MAC_OVERRIDE, cred)) + return 0; + #ifdef CONFIG_AUDIT smk_ad_init(&ad, __func__, LSM_AUDIT_DATA_KEY); ad.a.u.key_struct.key = keyp->serial; diff --git a/sound/pci/asihpi/hpioctl.c b/sound/pci/asihpi/hpioctl.c index 81f240472943..3aa075460f8b 100644 --- a/sound/pci/asihpi/hpioctl.c +++ b/sound/pci/asihpi/hpioctl.c @@ -1,17 +1,10 @@ +// SPDX-License-Identifier: GPL-2.0-only /******************************************************************************* AudioScience HPI driver Common Linux HPI ioctl and module probe/remove functions Copyright (C) 1997-2014 AudioScience Inc. - This program is free software; you can redistribute it and/or modify - it under the terms of version 2 of the GNU General Public License as - published by the Free Software Foundation; - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. *******************************************************************************/ #define SOURCEFILE_NAME "hpioctl.c" diff --git a/tools/include/linux/filter.h b/tools/include/linux/filter.h index c5e512da8d8a..6beceae2dfc2 100644 --- a/tools/include/linux/filter.h +++ b/tools/include/linux/filter.h @@ -199,6 +199,16 @@ .off = OFF, \ .imm = 0 }) +/* Like BPF_JMP_REG, but with 32-bit wide operands for comparison. */ + +#define BPF_JMP32_REG(OP, DST, SRC, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP32 | BPF_OP(OP) | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = OFF, \ + .imm = 0 }) + /* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */ #define BPF_JMP_IMM(OP, DST, IMM, OFF) \ @@ -209,6 +219,16 @@ .off = OFF, \ .imm = IMM }) +/* Like BPF_JMP_IMM, but with 32-bit wide operands for comparison. */ + +#define BPF_JMP32_IMM(OP, DST, IMM, OFF) \ + ((struct bpf_insn) { \ + .code = BPF_JMP32 | BPF_OP(OP) | BPF_K, \ + .dst_reg = DST, \ + .src_reg = 0, \ + .off = OFF, \ + .imm = IMM }) + /* Unconditional jumps, goto pc + off16 */ #define BPF_JMP_A(OFF) \ diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 63ed136b872e..63f11d1b77a4 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -14,6 +14,7 @@ /* Extended instruction set based on top of classic BPF */ /* instruction classes */ +#define BPF_JMP32 0x06 /* jmp mode in word width */ #define BPF_ALU64 0x07 /* alu mode in double word width */ /* ld/ldx fields */ @@ -115,6 +116,14 @@ enum bpf_map_type { BPF_MAP_TYPE_CPUMAP, }; +/* Note that tracing related programs such as + * BPF_PROG_TYPE_{KPROBE,TRACEPOINT,PERF_EVENT,RAW_TRACEPOINT} + * are not subject to a stable API since kernel internal data + * structures can change from release to release and may + * therefore break existing tracing BPF programs. Tracing BPF + * programs correspond to /a/ specific kernel which is to be + * analyzed, and not /a/ specific kernel /and/ all future ones. + */ enum bpf_prog_type { BPF_PROG_TYPE_UNSPEC, BPF_PROG_TYPE_SOCKET_FILTER, @@ -131,7 +140,6 @@ enum bpf_prog_type { BPF_PROG_TYPE_LWT_XMIT, BPF_PROG_TYPE_SOCK_OPS, BPF_PROG_TYPE_SK_SKB, - BPF_PROG_TYPE_CGROUP_DEVICE, }; enum bpf_attach_type { @@ -141,7 +149,6 @@ enum bpf_attach_type { BPF_CGROUP_SOCK_OPS, BPF_SK_SKB_STREAM_PARSER, BPF_SK_SKB_STREAM_VERDICT, - BPF_CGROUP_DEVICE, __MAX_BPF_ATTACH_TYPE }; @@ -212,7 +219,7 @@ union bpf_attr { __u32 log_level; /* verbosity level of verifier */ __u32 log_size; /* size of user buffer */ __aligned_u64 log_buf; /* user supplied buffer */ - __u32 kern_version; /* checked when prog_type=kprobe */ + __u32 kern_version; /* not used */ __u32 prog_flags; }; @@ -799,7 +806,6 @@ struct __sk_buff { __u32 local_ip6[4]; /* Stored in network byte order */ __u32 remote_port; /* Stored in network byte order */ __u32 local_port; /* stored in host byte order */ - __u64 tstamp; }; struct bpf_tunnel_key { @@ -886,9 +892,6 @@ struct bpf_map_info { __u32 value_size; __u32 max_entries; __u32 map_flags; - __u32 ifindex; - __u64 netns_dev; - __u64 netns_ino; } __attribute__((aligned(8))); /* User bpf_sock_ops struct to access socket values and specify request ops @@ -943,17 +946,4 @@ enum { #define TCP_BPF_IW 1001 /* Set TCP initial congestion window */ #define TCP_BPF_SNDCWND_CLAMP 1002 /* Set sndcwnd_clamp */ -#define BPF_DEVCG_ACC_MKNOD (1ULL << 0) -#define BPF_DEVCG_ACC_READ (1ULL << 1) -#define BPF_DEVCG_ACC_WRITE (1ULL << 2) - -#define BPF_DEVCG_DEV_BLOCK (1ULL << 0) -#define BPF_DEVCG_DEV_CHAR (1ULL << 1) - -struct bpf_cgroup_dev_ctx { - __u32 access_type; /* (access << 16) | type */ - __u32 major; - __u32 minor; -}; - #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/tools/objtool/check.c b/tools/objtool/check.c index e93c061654a7..7e5b62bd4b31 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -30,6 +30,8 @@ #define FAKE_JUMP_OFFSET -1 +#define C_JUMP_TABLE_SECTION ".rodata..c_jump_table" + struct alternative { struct list_head list; struct instruction *insn; @@ -950,9 +952,15 @@ static struct rela *find_switch_table(struct objtool_file *file, /* * Make sure the .rodata address isn't associated with a - * symbol. gcc jump tables are anonymous data. + * symbol. GCC jump tables are anonymous data. + * + * Also support C jump tables which are in the same format as + * switch jump tables. For objtool to recognize them, they + * need to be placed in the C_JUMP_TABLE_SECTION section. They + * have symbols associated with them. */ - if (find_symbol_containing(rodata_sec, table_offset)) + if (find_symbol_containing(rodata_sec, table_offset) && + strcmp(rodata_sec->name, C_JUMP_TABLE_SECTION)) continue; rodata_rela = find_rela_by_dest(rodata_sec, table_offset); @@ -1191,13 +1199,18 @@ static void mark_rodata(struct objtool_file *file) bool found = false; /* - * This searches for the .rodata section or multiple .rodata.func_name - * sections if -fdata-sections is being used. The .str.1.1 and .str.1.8 - * rodata sections are ignored as they don't contain jump tables. + * Search for the following rodata sections, each of which can + * potentially contain jump tables: + * + * - .rodata: can contain GCC switch tables + * - .rodata.: same, if -fdata-sections is being used + * - .rodata..c_jump_table: contains C annotated jump tables + * + * .rodata.str1.* sections are ignored; they don't contain jump tables. */ for_each_sec(file, sec) { - if (!strncmp(sec->name, ".rodata", 7) && - !strstr(sec->name, ".str1.")) { + if ((!strncmp(sec->name, ".rodata", 7) && !strstr(sec->name, ".str1.")) || + !strcmp(sec->name, C_JUMP_TABLE_SECTION)) { sec->rodata = true; found = true; } diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c index f2a00b0698a3..a77aa7b1ca8c 100644 --- a/tools/testing/nvdimm/test/iomap.c +++ b/tools/testing/nvdimm/test/iomap.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include #include diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c index 3ad0b3a3317b..fe88ad7fa82d 100644 --- a/tools/testing/nvdimm/test/nfit.c +++ b/tools/testing/nvdimm/test/nfit.c @@ -1,14 +1,6 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include diff --git a/tools/testing/nvdimm/test/nfit_test.h b/tools/testing/nvdimm/test/nfit_test.h index d3d63dd5ed38..d1d488a652b6 100644 --- a/tools/testing/nvdimm/test/nfit_test.h +++ b/tools/testing/nvdimm/test/nfit_test.h @@ -1,14 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright(c) 2013-2015 Intel Corporation. All rights reserved. - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #ifndef __NFIT_TEST_H__ #define __NFIT_TEST_H__ diff --git a/tools/testing/selftests/bpf/prog_tests/map_init.c b/tools/testing/selftests/bpf/prog_tests/map_init.c new file mode 100644 index 000000000000..14a31109dd0e --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/map_init.c @@ -0,0 +1,214 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2020 Tessares SA */ + +#include +#include "test_map_init.skel.h" + +#define TEST_VALUE 0x1234 +#define FILL_VALUE 0xdeadbeef + +static int nr_cpus; +static int duration; + +typedef unsigned long long map_key_t; +typedef unsigned long long map_value_t; +typedef struct { + map_value_t v; /* padding */ +} __bpf_percpu_val_align pcpu_map_value_t; + + +static int map_populate(int map_fd, int num) +{ + pcpu_map_value_t value[nr_cpus]; + int i, err; + map_key_t key; + + for (i = 0; i < nr_cpus; i++) + bpf_percpu(value, i) = FILL_VALUE; + + for (key = 1; key <= num; key++) { + err = bpf_map_update_elem(map_fd, &key, value, BPF_NOEXIST); + if (!ASSERT_OK(err, "bpf_map_update_elem")) + return -1; + } + + return 0; +} + +static struct test_map_init *setup(enum bpf_map_type map_type, int map_sz, + int *map_fd, int populate) +{ + struct test_map_init *skel; + int err; + + skel = test_map_init__open(); + if (!ASSERT_OK_PTR(skel, "skel_open")) + return NULL; + + err = bpf_map__set_type(skel->maps.hashmap1, map_type); + if (!ASSERT_OK(err, "bpf_map__set_type")) + goto error; + + err = bpf_map__set_max_entries(skel->maps.hashmap1, map_sz); + if (!ASSERT_OK(err, "bpf_map__set_max_entries")) + goto error; + + err = test_map_init__load(skel); + if (!ASSERT_OK(err, "skel_load")) + goto error; + + *map_fd = bpf_map__fd(skel->maps.hashmap1); + if (CHECK(*map_fd < 0, "bpf_map__fd", "failed\n")) + goto error; + + err = map_populate(*map_fd, populate); + if (!ASSERT_OK(err, "map_populate")) + goto error_map; + + return skel; + +error_map: + close(*map_fd); +error: + test_map_init__destroy(skel); + return NULL; +} + +/* executes bpf program that updates map with key, value */ +static int prog_run_insert_elem(struct test_map_init *skel, map_key_t key, + map_value_t value) +{ + struct test_map_init__bss *bss; + + bss = skel->bss; + + bss->inKey = key; + bss->inValue = value; + bss->inPid = getpid(); + + if (!ASSERT_OK(test_map_init__attach(skel), "skel_attach")) + return -1; + + /* Let tracepoint trigger */ + syscall(__NR_getpgid); + + test_map_init__detach(skel); + + return 0; +} + +static int check_values_one_cpu(pcpu_map_value_t *value, map_value_t expected) +{ + int i, nzCnt = 0; + map_value_t val; + + for (i = 0; i < nr_cpus; i++) { + val = bpf_percpu(value, i); + if (val) { + if (CHECK(val != expected, "map value", + "unexpected for cpu %d: 0x%llx\n", i, val)) + return -1; + nzCnt++; + } + } + + if (CHECK(nzCnt != 1, "map value", "set for %d CPUs instead of 1!\n", + nzCnt)) + return -1; + + return 0; +} + +/* Add key=1 elem with values set for all CPUs + * Delete elem key=1 + * Run bpf prog that inserts new key=1 elem with value=0x1234 + * (bpf prog can only set value for current CPU) + * Lookup Key=1 and check value is as expected for all CPUs: + * value set by bpf prog for one CPU, 0 for all others + */ +static void test_pcpu_map_init(void) +{ + pcpu_map_value_t value[nr_cpus]; + struct test_map_init *skel; + int map_fd, err; + map_key_t key; + + /* max 1 elem in map so insertion is forced to reuse freed entry */ + skel = setup(BPF_MAP_TYPE_PERCPU_HASH, 1, &map_fd, 1); + if (!ASSERT_OK_PTR(skel, "prog_setup")) + return; + + /* delete element so the entry can be re-used*/ + key = 1; + err = bpf_map_delete_elem(map_fd, &key); + if (!ASSERT_OK(err, "bpf_map_delete_elem")) + goto cleanup; + + /* run bpf prog that inserts new elem, re-using the slot just freed */ + err = prog_run_insert_elem(skel, key, TEST_VALUE); + if (!ASSERT_OK(err, "prog_run_insert_elem")) + goto cleanup; + + /* check that key=1 was re-created by bpf prog */ + err = bpf_map_lookup_elem(map_fd, &key, value); + if (!ASSERT_OK(err, "bpf_map_lookup_elem")) + goto cleanup; + + /* and has expected values */ + check_values_one_cpu(value, TEST_VALUE); + +cleanup: + test_map_init__destroy(skel); +} + +/* Add key=1 and key=2 elems with values set for all CPUs + * Run bpf prog that inserts new key=3 elem + * (only for current cpu; other cpus should have initial value = 0) + * Lookup Key=1 and check value is as expected for all CPUs + */ +static void test_pcpu_lru_map_init(void) +{ + pcpu_map_value_t value[nr_cpus]; + struct test_map_init *skel; + int map_fd, err; + map_key_t key; + + /* Set up LRU map with 2 elements, values filled for all CPUs. + * With these 2 elements, the LRU map is full + */ + skel = setup(BPF_MAP_TYPE_LRU_PERCPU_HASH, 2, &map_fd, 2); + if (!ASSERT_OK_PTR(skel, "prog_setup")) + return; + + /* run bpf prog that inserts new key=3 element, re-using LRU slot */ + key = 3; + err = prog_run_insert_elem(skel, key, TEST_VALUE); + if (!ASSERT_OK(err, "prog_run_insert_elem")) + goto cleanup; + + /* check that key=3 replaced one of earlier elements */ + err = bpf_map_lookup_elem(map_fd, &key, value); + if (!ASSERT_OK(err, "bpf_map_lookup_elem")) + goto cleanup; + + /* and has expected values */ + check_values_one_cpu(value, TEST_VALUE); + +cleanup: + test_map_init__destroy(skel); +} + +void test_map_init(void) +{ + nr_cpus = bpf_num_possible_cpus(); + if (nr_cpus <= 1) { + printf("%s:SKIP: >1 cpu needed for this test\n", __func__); + test__skip(); + return; + } + + if (test__start_subtest("pcpu_map_init")) + test_pcpu_map_init(); + if (test__start_subtest("pcpu_lru_map_init")) + test_pcpu_lru_map_init(); +} diff --git a/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c new file mode 100644 index 000000000000..9807336a3016 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c @@ -0,0 +1,42 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include + +void test_raw_tp_writable_reject_nbd_invalid(void) +{ + __u32 duration = 0; + char error[4096]; + int bpf_fd = -1, tp_fd = -1; + + const struct bpf_insn program[] = { + /* r6 is our tp buffer */ + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0), + /* one byte beyond the end of the nbd_request struct */ + BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_6, + sizeof(struct nbd_request)), + BPF_EXIT_INSN(), + }; + + struct bpf_load_program_attr load_attr = { + .prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, + .license = "GPL v2", + .insns = program, + .insns_cnt = sizeof(program) / sizeof(struct bpf_insn), + .log_level = 2, + }; + + bpf_fd = bpf_load_program_xattr(&load_attr, error, sizeof(error)); + if (CHECK(bpf_fd < 0, "bpf_raw_tracepoint_writable load", + "failed: %d errno %d\n", bpf_fd, errno)) + return; + + tp_fd = bpf_raw_tracepoint_open("nbd_send_request", bpf_fd); + if (CHECK(tp_fd >= 0, "bpf_raw_tracepoint_writable open", + "erroneously succeeded\n")) + goto out_bpffd; + + close(tp_fd); +out_bpffd: + close(bpf_fd); +} diff --git a/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c new file mode 100644 index 000000000000..5c45424cac5f --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c @@ -0,0 +1,80 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include + +void test_raw_tp_writable_test_run(void) +{ + __u32 duration = 0; + char error[4096]; + + const struct bpf_insn trace_program[] = { + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0), + BPF_MOV64_IMM(BPF_REG_0, 42), + BPF_STX_MEM(BPF_W, BPF_REG_6, BPF_REG_0, 0), + BPF_EXIT_INSN(), + }; + + struct bpf_load_program_attr load_attr = { + .prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, + .license = "GPL v2", + .insns = trace_program, + .insns_cnt = sizeof(trace_program) / sizeof(struct bpf_insn), + .log_level = 2, + }; + + int bpf_fd = bpf_load_program_xattr(&load_attr, error, sizeof(error)); + if (CHECK(bpf_fd < 0, "bpf_raw_tracepoint_writable loaded", + "failed: %d errno %d\n", bpf_fd, errno)) + return; + + const struct bpf_insn skb_program[] = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }; + + struct bpf_load_program_attr skb_load_attr = { + .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, + .license = "GPL v2", + .insns = skb_program, + .insns_cnt = sizeof(skb_program) / sizeof(struct bpf_insn), + }; + + int filter_fd = + bpf_load_program_xattr(&skb_load_attr, error, sizeof(error)); + if (CHECK(filter_fd < 0, "test_program_loaded", "failed: %d errno %d\n", + filter_fd, errno)) + goto out_bpffd; + + int tp_fd = bpf_raw_tracepoint_open("bpf_test_finish", bpf_fd); + if (CHECK(tp_fd < 0, "bpf_raw_tracepoint_writable opened", + "failed: %d errno %d\n", tp_fd, errno)) + goto out_filterfd; + + char test_skb[128] = { + 0, + }; + + __u32 prog_ret; + int err = bpf_prog_test_run(filter_fd, 1, test_skb, sizeof(test_skb), 0, + 0, &prog_ret, 0); + CHECK(err != 42, "test_run", + "tracepoint did not modify return value\n"); + CHECK(prog_ret != 0, "test_run_ret", + "socket_filter did not return 0\n"); + + close(tp_fd); + + err = bpf_prog_test_run(filter_fd, 1, test_skb, sizeof(test_skb), 0, 0, + &prog_ret, 0); + CHECK(err != 0, "test_run_notrace", + "test_run failed with %d errno %d\n", err, errno); + CHECK(prog_ret != 0, "test_run_ret_notrace", + "socket_filter did not return 0\n"); + +out_filterfd: + close(filter_fd); +out_bpffd: + close(bpf_fd); +} diff --git a/tools/testing/selftests/bpf/test_lpm_map.c b/tools/testing/selftests/bpf/test_lpm_map.c new file mode 100644 index 000000000000..f93a333cbf2c --- /dev/null +++ b/tools/testing/selftests/bpf/test_lpm_map.c @@ -0,0 +1,359 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Randomized tests for eBPF longest-prefix-match maps + * + * This program runs randomized tests against the lpm-bpf-map. It implements a + * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked + * lists. The implementation should be pretty straightforward. + * + * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies + * the trie-based bpf-map implementation behaves the same way as tlpm. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include "bpf_util.h" + +struct tlpm_node { + struct tlpm_node *next; + size_t n_bits; + uint8_t key[]; +}; + +static struct tlpm_node *tlpm_add(struct tlpm_node *list, + const uint8_t *key, + size_t n_bits) +{ + struct tlpm_node *node; + size_t n; + + /* add new entry with @key/@n_bits to @list and return new head */ + + n = (n_bits + 7) / 8; + node = malloc(sizeof(*node) + n); + assert(node); + + node->next = list; + node->n_bits = n_bits; + memcpy(node->key, key, n); + + return node; +} + +static void tlpm_clear(struct tlpm_node *list) +{ + struct tlpm_node *node; + + /* free all entries in @list */ + + while ((node = list)) { + list = list->next; + free(node); + } +} + +static struct tlpm_node *tlpm_match(struct tlpm_node *list, + const uint8_t *key, + size_t n_bits) +{ + struct tlpm_node *best = NULL; + size_t i; + + /* Perform longest prefix-match on @key/@n_bits. That is, iterate all + * entries and match each prefix against @key. Remember the "best" + * entry we find (i.e., the longest prefix that matches) and return it + * to the caller when done. + */ + + for ( ; list; list = list->next) { + for (i = 0; i < n_bits && i < list->n_bits; ++i) { + if ((key[i / 8] & (1 << (7 - i % 8))) != + (list->key[i / 8] & (1 << (7 - i % 8)))) + break; + } + + if (i >= list->n_bits) { + if (!best || i > best->n_bits) + best = list; + } + } + + return best; +} + +static void test_lpm_basic(void) +{ + struct tlpm_node *list = NULL, *t1, *t2; + + /* very basic, static tests to verify tlpm works as expected */ + + assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8)); + + t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8); + assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8)); + assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16)); + assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0x00 }, 16)); + assert(!tlpm_match(list, (uint8_t[]){ 0x7f }, 8)); + assert(!tlpm_match(list, (uint8_t[]){ 0xfe }, 8)); + assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 7)); + + t2 = list = tlpm_add(list, (uint8_t[]){ 0xff, 0xff }, 16); + assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8)); + assert(t2 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16)); + assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 15)); + assert(!tlpm_match(list, (uint8_t[]){ 0x7f, 0xff }, 16)); + + tlpm_clear(list); +} + +static void test_lpm_order(void) +{ + struct tlpm_node *t1, *t2, *l1 = NULL, *l2 = NULL; + size_t i, j; + + /* Verify the tlpm implementation works correctly regardless of the + * order of entries. Insert a random set of entries into @l1, and copy + * the same data in reverse order into @l2. Then verify a lookup of + * random keys will yield the same result in both sets. + */ + + for (i = 0; i < (1 << 12); ++i) + l1 = tlpm_add(l1, (uint8_t[]){ + rand() % 0xff, + rand() % 0xff, + }, rand() % 16 + 1); + + for (t1 = l1; t1; t1 = t1->next) + l2 = tlpm_add(l2, t1->key, t1->n_bits); + + for (i = 0; i < (1 << 8); ++i) { + uint8_t key[] = { rand() % 0xff, rand() % 0xff }; + + t1 = tlpm_match(l1, key, 16); + t2 = tlpm_match(l2, key, 16); + + assert(!t1 == !t2); + if (t1) { + assert(t1->n_bits == t2->n_bits); + for (j = 0; j < t1->n_bits; ++j) + assert((t1->key[j / 8] & (1 << (7 - j % 8))) == + (t2->key[j / 8] & (1 << (7 - j % 8)))); + } + } + + tlpm_clear(l1); + tlpm_clear(l2); +} + +static void test_lpm_map(int keysize) +{ + size_t i, j, n_matches, n_nodes, n_lookups; + struct tlpm_node *t, *list = NULL; + struct bpf_lpm_trie_key *key; + uint8_t *data, *value; + int r, map; + + /* Compare behavior of tlpm vs. bpf-lpm. Create a randomized set of + * prefixes and insert it into both tlpm and bpf-lpm. Then run some + * randomized lookups and verify both maps return the same result. + */ + + n_matches = 0; + n_nodes = 1 << 8; + n_lookups = 1 << 16; + + data = alloca(keysize); + memset(data, 0, keysize); + + value = alloca(keysize + 1); + memset(value, 0, keysize + 1); + + key = alloca(sizeof(*key) + keysize); + memset(key, 0, sizeof(*key) + keysize); + + map = bpf_create_map(BPF_MAP_TYPE_LPM_TRIE, + sizeof(*key) + keysize, + keysize + 1, + 4096, + BPF_F_NO_PREALLOC); + assert(map >= 0); + + for (i = 0; i < n_nodes; ++i) { + for (j = 0; j < keysize; ++j) + value[j] = rand() & 0xff; + value[keysize] = rand() % (8 * keysize + 1); + + list = tlpm_add(list, value, value[keysize]); + + key->prefixlen = value[keysize]; + memcpy(key->data, value, keysize); + r = bpf_map_update_elem(map, key, value, 0); + assert(!r); + } + + for (i = 0; i < n_lookups; ++i) { + for (j = 0; j < keysize; ++j) + data[j] = rand() & 0xff; + + t = tlpm_match(list, data, 8 * keysize); + + key->prefixlen = 8 * keysize; + memcpy(key->data, data, keysize); + r = bpf_map_lookup_elem(map, key, value); + assert(!r || errno == ENOENT); + assert(!t == !!r); + + if (t) { + ++n_matches; + assert(t->n_bits == value[keysize]); + for (j = 0; j < t->n_bits; ++j) + assert((t->key[j / 8] & (1 << (7 - j % 8))) == + (value[j / 8] & (1 << (7 - j % 8)))); + } + } + + close(map); + tlpm_clear(list); + + /* With 255 random nodes in the map, we are pretty likely to match + * something on every lookup. For statistics, use this: + * + * printf(" nodes: %zu\n" + * "lookups: %zu\n" + * "matches: %zu\n", n_nodes, n_lookups, n_matches); + */ +} + +/* Test the implementation with some 'real world' examples */ + +static void test_lpm_ipaddr(void) +{ + struct bpf_lpm_trie_key *key_ipv4; + struct bpf_lpm_trie_key *key_ipv6; + size_t key_size_ipv4; + size_t key_size_ipv6; + int map_fd_ipv4; + int map_fd_ipv6; + __u64 value; + + key_size_ipv4 = sizeof(*key_ipv4) + sizeof(__u32); + key_size_ipv6 = sizeof(*key_ipv6) + sizeof(__u32) * 4; + key_ipv4 = alloca(key_size_ipv4); + key_ipv6 = alloca(key_size_ipv6); + + map_fd_ipv4 = bpf_create_map(BPF_MAP_TYPE_LPM_TRIE, + key_size_ipv4, sizeof(value), + 100, BPF_F_NO_PREALLOC); + assert(map_fd_ipv4 >= 0); + + map_fd_ipv6 = bpf_create_map(BPF_MAP_TYPE_LPM_TRIE, + key_size_ipv6, sizeof(value), + 100, BPF_F_NO_PREALLOC); + assert(map_fd_ipv6 >= 0); + + /* Fill data some IPv4 and IPv6 address ranges */ + value = 1; + key_ipv4->prefixlen = 16; + inet_pton(AF_INET, "192.168.0.0", key_ipv4->data); + assert(bpf_map_update_elem(map_fd_ipv4, key_ipv4, &value, 0) == 0); + + value = 2; + key_ipv4->prefixlen = 24; + inet_pton(AF_INET, "192.168.0.0", key_ipv4->data); + assert(bpf_map_update_elem(map_fd_ipv4, key_ipv4, &value, 0) == 0); + + value = 3; + key_ipv4->prefixlen = 24; + inet_pton(AF_INET, "192.168.128.0", key_ipv4->data); + assert(bpf_map_update_elem(map_fd_ipv4, key_ipv4, &value, 0) == 0); + + value = 5; + key_ipv4->prefixlen = 24; + inet_pton(AF_INET, "192.168.1.0", key_ipv4->data); + assert(bpf_map_update_elem(map_fd_ipv4, key_ipv4, &value, 0) == 0); + + value = 4; + key_ipv4->prefixlen = 23; + inet_pton(AF_INET, "192.168.0.0", key_ipv4->data); + assert(bpf_map_update_elem(map_fd_ipv4, key_ipv4, &value, 0) == 0); + + value = 0xdeadbeef; + key_ipv6->prefixlen = 64; + inet_pton(AF_INET6, "2a00:1450:4001:814::200e", key_ipv6->data); + assert(bpf_map_update_elem(map_fd_ipv6, key_ipv6, &value, 0) == 0); + + /* Set tprefixlen to maximum for lookups */ + key_ipv4->prefixlen = 32; + key_ipv6->prefixlen = 128; + + /* Test some lookups that should come back with a value */ + inet_pton(AF_INET, "192.168.128.23", key_ipv4->data); + assert(bpf_map_lookup_elem(map_fd_ipv4, key_ipv4, &value) == 0); + assert(value == 3); + + inet_pton(AF_INET, "192.168.0.1", key_ipv4->data); + assert(bpf_map_lookup_elem(map_fd_ipv4, key_ipv4, &value) == 0); + assert(value == 2); + + inet_pton(AF_INET6, "2a00:1450:4001:814::", key_ipv6->data); + assert(bpf_map_lookup_elem(map_fd_ipv6, key_ipv6, &value) == 0); + assert(value == 0xdeadbeef); + + inet_pton(AF_INET6, "2a00:1450:4001:814::1", key_ipv6->data); + assert(bpf_map_lookup_elem(map_fd_ipv6, key_ipv6, &value) == 0); + assert(value == 0xdeadbeef); + + /* Test some lookups that should not match any entry */ + inet_pton(AF_INET, "10.0.0.1", key_ipv4->data); + assert(bpf_map_lookup_elem(map_fd_ipv4, key_ipv4, &value) == -1 && + errno == ENOENT); + + inet_pton(AF_INET, "11.11.11.11", key_ipv4->data); + assert(bpf_map_lookup_elem(map_fd_ipv4, key_ipv4, &value) == -1 && + errno == ENOENT); + + inet_pton(AF_INET6, "2a00:ffff::", key_ipv6->data); + assert(bpf_map_lookup_elem(map_fd_ipv6, key_ipv6, &value) == -1 && + errno == ENOENT); + + close(map_fd_ipv4); + close(map_fd_ipv6); +} + +int main(void) +{ + struct rlimit limit = { RLIM_INFINITY, RLIM_INFINITY }; + int i, ret; + + /* we want predictable, pseudo random tests */ + srand(0xf00ba1); + + /* allow unlimited locked memory */ + ret = setrlimit(RLIMIT_MEMLOCK, &limit); + if (ret < 0) + perror("Unable to lift memlock rlimit"); + + test_lpm_basic(); + test_lpm_order(); + + /* Test with 8, 16, 24, 32, ... 128 bit prefix length */ + for (i = 1; i <= 16; ++i) + test_lpm_map(i); + + test_lpm_ipaddr(); + + printf("test_lpm: OK\n"); + return 0; +} diff --git a/tools/testing/selftests/bpf/test_map_init.c b/tools/testing/selftests/bpf/test_map_init.c new file mode 100644 index 000000000000..c89d28ead673 --- /dev/null +++ b/tools/testing/selftests/bpf/test_map_init.c @@ -0,0 +1,33 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Tessares SA */ + +#include "vmlinux.h" +#include + +__u64 inKey = 0; +__u64 inValue = 0; +__u32 inPid = 0; + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_HASH); + __uint(max_entries, 2); + __type(key, __u64); + __type(value, __u64); +} hashmap1 SEC(".maps"); + + +SEC("tp/syscalls/sys_enter_getpgid") +int sysenter_getpgid(const void *ctx) +{ + /* Just do it for once, when called from our own test prog. This + * ensures the map value is only updated for a single CPU. + */ + int cur_pid = bpf_get_current_pid_tgid() >> 32; + + if (cur_pid == inPid) + bpf_map_update_elem(&hashmap1, &inKey, &inValue, BPF_NOEXIST); + + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 6fee1aad274d..0440948d4579 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -1805,10 +1805,6 @@ static struct bpf_test tests[] = { offsetof(struct __sk_buff, tc_index)), BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, offsetof(struct __sk_buff, cb[3])), - BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, - offsetof(struct __sk_buff, tstamp)), - BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, - offsetof(struct __sk_buff, tstamp)), BPF_EXIT_INSN(), }, .errstr_unpriv = "", @@ -2161,6 +2157,19 @@ static struct bpf_test tests[] = { .result_unpriv = REJECT, .result = ACCEPT, }, + { + "alu32: mov u32 const", + .insns = { + BPF_MOV32_IMM(BPF_REG_7, 0), + BPF_ALU32_IMM(BPF_AND, BPF_REG_7, 1), + BPF_MOV32_REG(BPF_REG_0, BPF_REG_7), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_7, 0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .retval = 0, + }, { "unpriv: partial copy of pointer", .insns = { @@ -3003,7 +3012,7 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END", + .errstr = "R3 pointer arithmetic on pkt_end", .result = REJECT, .prog_type = BPF_PROG_TYPE_SCHED_CLS, }, @@ -3994,31 +4003,6 @@ static struct bpf_test tests[] = { .result = REJECT, .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS, }, - { - "write tstamp from CGROUP_SKB", - .insns = { - BPF_MOV64_IMM(BPF_REG_0, 0), - BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, - offsetof(struct __sk_buff, tstamp)), - BPF_MOV64_IMM(BPF_REG_0, 0), - BPF_EXIT_INSN(), - }, - .result = ACCEPT, - .result_unpriv = REJECT, - .errstr_unpriv = "invalid bpf_context access off=152 size=8", - .prog_type = BPF_PROG_TYPE_CGROUP_SKB, - }, - { - "read tstamp from CGROUP_SKB", - .insns = { - BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, - offsetof(struct __sk_buff, tstamp)), - BPF_MOV64_IMM(BPF_REG_0, 0), - BPF_EXIT_INSN(), - }, - .result = ACCEPT, - .prog_type = BPF_PROG_TYPE_CGROUP_SKB, - }, { "multiple registers share map_lookup_elem result", .insns = { @@ -4056,7 +4040,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .fixup_map1 = { 4 }, - .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL", + .errstr = "R4 pointer arithmetic on map_value_or_null", .result = REJECT, .prog_type = BPF_PROG_TYPE_SCHED_CLS }, @@ -4077,7 +4061,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .fixup_map1 = { 4 }, - .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL", + .errstr = "R4 pointer arithmetic on map_value_or_null", .result = REJECT, .prog_type = BPF_PROG_TYPE_SCHED_CLS }, @@ -4098,7 +4082,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .fixup_map1 = { 4 }, - .errstr = "R4 pointer arithmetic on PTR_TO_MAP_VALUE_OR_NULL", + .errstr = "R4 pointer arithmetic on map_value_or_null", .result = REJECT, .prog_type = BPF_PROG_TYPE_SCHED_CLS }, @@ -5975,7 +5959,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .fixup_map_in_map = { 3 }, - .errstr = "R1 pointer arithmetic on CONST_PTR_TO_MAP prohibited", + .errstr = "R1 pointer arithmetic on map_ptr prohibited", .result = REJECT, }, { @@ -7363,7 +7347,7 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END", + .errstr = "R3 pointer arithmetic on pkt_end", .result = REJECT, .prog_type = BPF_PROG_TYPE_XDP, }, @@ -7382,7 +7366,7 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, - .errstr = "R3 pointer arithmetic on PTR_TO_PACKET_END", + .errstr = "R3 pointer arithmetic on pkt_end", .result = REJECT, .prog_type = BPF_PROG_TYPE_XDP, }, @@ -8075,6 +8059,127 @@ static struct bpf_test tests[] = { .result = REJECT, .errstr = "variable ctx access var_off=(0x0; 0x4)", }, + { + "check deducing bounds from const, 1", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 1, 0), + BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "R0 tried to subtract pointer from scalar", + }, + { + "check deducing bounds from const, 2", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 1), + BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 1, 1), + BPF_EXIT_INSN(), + BPF_JMP_IMM(BPF_JSLE, BPF_REG_0, 1, 1), + BPF_EXIT_INSN(), + BPF_ALU64_REG(BPF_SUB, BPF_REG_1, BPF_REG_0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + }, + { + "check deducing bounds from const, 3", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_IMM(BPF_JSLE, BPF_REG_0, 0, 0), + BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "R0 tried to subtract pointer from scalar", + }, + { + "check deducing bounds from const, 4", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_IMM(BPF_JSLE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_ALU64_REG(BPF_SUB, BPF_REG_1, BPF_REG_0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + }, + { + "check deducing bounds from const, 5", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 1), + BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "R0 tried to subtract pointer from scalar", + }, + { + "check deducing bounds from const, 6", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "R0 tried to subtract pointer from scalar", + }, + { + "check deducing bounds from const, 7", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, ~0), + BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 0), + BPF_ALU64_REG(BPF_SUB, BPF_REG_1, BPF_REG_0), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct __sk_buff, mark)), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "dereference of modified ctx ptr", + }, + { + "check deducing bounds from const, 8", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, ~0), + BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 1), + BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_0), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, + offsetof(struct __sk_buff, mark)), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "dereference of modified ctx ptr", + }, + { + "check deducing bounds from const, 9", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_IMM(BPF_JSGE, BPF_REG_0, 0, 0), + BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "R0 tried to subtract pointer from scalar", + }, + { + "check deducing bounds from const, 10", + .insns = { + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_JMP_IMM(BPF_JSLE, BPF_REG_0, 0, 0), + /* Marks reg as unknown. */ + BPF_ALU64_IMM(BPF_NEG, BPF_REG_0, 0), + BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_1), + BPF_EXIT_INSN(), + }, + .result = REJECT, + .errstr = "math between ctx pointer and register with unbounded min value is not allowed", + }, { "bpf_exit with invalid return code. test1", .insns = { diff --git a/tools/testing/selftests/bpf/verifier/raw_tp_writable.c b/tools/testing/selftests/bpf/verifier/raw_tp_writable.c new file mode 100644 index 000000000000..95b5d70a1dc1 --- /dev/null +++ b/tools/testing/selftests/bpf/verifier/raw_tp_writable.c @@ -0,0 +1,34 @@ +{ + "raw_tracepoint_writable: reject variable offset", + .insns = { + /* r6 is our tp buffer */ + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0), + + BPF_LD_MAP_FD(BPF_REG_1, 0), + /* move the key (== 0) to r10-8 */ + BPF_MOV32_IMM(BPF_REG_0, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0, 0), + /* lookup in the map */ + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, + BPF_FUNC_map_lookup_elem), + + /* exit clean if null */ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + + /* shift the buffer pointer to a variable location */ + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0), + BPF_ALU64_REG(BPF_ADD, BPF_REG_6, BPF_REG_0), + /* clobber whatever's there */ + BPF_MOV64_IMM(BPF_REG_7, 4242), + BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_7, 0), + + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_8b = { 1, }, + .prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, + .errstr = "R6 invalid variable buffer offset: off=0, var_off=(0x0; 0xffffffff)", +}, diff --git a/tools/testing/selftests/timers/freq-step.c b/tools/testing/selftests/timers/freq-step.c index 14a2b77fd012..8cd10662ffba 100644 --- a/tools/testing/selftests/timers/freq-step.c +++ b/tools/testing/selftests/timers/freq-step.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: GPL-2.0-only /* * This test checks the response of the system clock to frequency * steps made with adjtimex(). The frequency error and stability of @@ -6,15 +7,6 @@ * values from the second interval exceed specified limits. * * Copyright (C) Miroslav Lichvar 2017 - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of version 2 of the GNU General Public License as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, but - * WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. */ #include