bka
16373 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
30f286f182 |
UPSTREAM: bpf: Fix L4 csum update on IPv6 in CHECKSUM_COMPLETE
commit ead7f9b8de65632ef8060b84b0c55049a33cfea1 upstream.
In Cilium, we use bpf_csum_diff + bpf_l4_csum_replace to, among other
things, update the L4 checksum after reverse SNATing IPv6 packets. That
use case is however not currently supported and leads to invalid
skb->csum values in some cases. This patch adds support for IPv6 address
changes in bpf_l4_csum_update via a new flag.
When calling bpf_l4_csum_replace in Cilium, it ends up calling
inet_proto_csum_replace_by_diff:
1: void inet_proto_csum_replace_by_diff(__sum16 *sum, struct sk_buff *skb,
2: __wsum diff, bool pseudohdr)
3: {
4: if (skb->ip_summed != CHECKSUM_PARTIAL) {
5: csum_replace_by_diff(sum, diff);
6: if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
7: skb->csum = ~csum_sub(diff, skb->csum);
8: } else if (pseudohdr) {
9: *sum = ~csum_fold(csum_add(diff, csum_unfold(*sum)));
10: }
11: }
The bug happens when we're in the CHECKSUM_COMPLETE state. We've just
updated one of the IPv6 addresses. The helper now updates the L4 header
checksum on line 5. Next, it updates skb->csum on line 7. It shouldn't.
For an IPv6 packet, the updates of the IPv6 address and of the L4
checksum will cancel each other. The checksums are set such that
computing a checksum over the packet including its checksum will result
in a sum of 0. So the same is true here when we update the L4 checksum
on line 5. We'll update it as to cancel the previous IPv6 address
update. Hence skb->csum should remain untouched in this case.
The same bug doesn't affect IPv4 packets because, in that case, three
fields are updated: the IPv4 address, the IP checksum, and the L4
checksum. The change to the IPv4 address and one of the checksums still
cancel each other in skb->csum, but we're left with one checksum update
and should therefore update skb->csum accordingly. That's exactly what
inet_proto_csum_replace_by_diff does.
This special case for IPv6 L4 checksums is also described atop
inet_proto_csum_replace16, the function we should be using in this case.
This patch introduces a new bpf_l4_csum_replace flag, BPF_F_IPV6,
to indicate that we're updating the L4 checksum of an IPv6 packet. When
the flag is set, inet_proto_csum_replace_by_diff will skip the
skb->csum update.
Fixes:
|
||
|
|
bc96dd8739 |
UPSTREAM: bpf: Add PROG_TEST_RUN support for sk_lookup programs
commit 7c32e8f8bc33a5f4b113a630857e46634e3e143b upstream. Allow to pass sk_lookup programs to PROG_TEST_RUN. User space provides the full bpf_sk_lookup struct as context. Since the context includes a socket pointer that can't be exposed to user space we define that PROG_TEST_RUN returns the cookie of the selected socket or zero in place of the socket pointer. We don't support testing programs that select a reuseport socket, since this would mean running another (unrelated) BPF program from the sk_lookup test handler. Change-Id: I7af748e3f11804e4e1ad0c532685f0c3dfaf4816 Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210303101816.36774-3-lmb@cloudflare.com Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
172dec8a20 |
BACKPORT: locking/lockdep: Remove unused @nested argument from lock_release()
Since the following commit:
b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release")
@nested is no longer used in lock_release(), so remove it from all
lock_release() calls and friends.
Change-Id: Ic64f4522a848a31195e546ff531d59201e6e8deb
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: alexander.levin@microsoft.com
Cc: daniel@iogearbox.net
Cc: davem@davemloft.net
Cc: dri-devel@lists.freedesktop.org
Cc: duyuyang@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: intel-gfx@lists.freedesktop.org
Cc: jack@suse.com
Cc: jlbec@evilplan.or
Cc: joonas.lahtinen@linux.intel.com
Cc: joseph.qi@linux.alibaba.com
Cc: jslaby@suse.com
Cc: juri.lelli@redhat.com
Cc: maarten.lankhorst@linux.intel.com
Cc: mark@fasheh.com
Cc: mhocko@kernel.org
Cc: mripard@kernel.org
Cc: ocfs2-devel@oss.oracle.com
Cc: rodrigo.vivi@intel.com
Cc: sean@poorly.run
Cc: st@kernel.org
Cc: tj@kernel.org
Cc: tytso@mit.edu
Cc: vdavydov.dev@gmail.com
Cc: vincent.guittot@linaro.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
c6aa1292ca |
Merge branch 'linux-4.19.y-cip' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip into android-4.19.y-mediatek
* 'linux-4.19.y-cip' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip: CIP: Bump version suffix to -cip124 after merge from cip/linux-4.19.y-st tree Update localversion-st, tree is up-to-date with 5.4.298. f2fs: fix to do sanity check on ino and xnid squashfs: fix memory leak in squashfs_fill_super pNFS: Handle RPC size limit for layoutcommits wifi: iwlwifi: fw: Fix possible memory leak in iwl_fw_dbg_collect usb: core: usb_submit_urb: downgrade type check udf: Verify partition map count f2fs: fix to avoid panic in f2fs_evict_inode usb: hub: Fix flushing and scheduling of delayed work that tunes runtime pm Revert "drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS" net: usb: qmi_wwan: add Telit Cinterion LE910C4-WWX new compositions HID: hid-ntrig: fix unable to handle page fault in ntrig_report_version() HID: asus: fix UAF via HID_CLAIMED_INPUT validation efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare sctp: initialize more fields in sctp_v6_from_sk() net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts net/mlx5e: Set local Xoff after FW update net: dlink: fix multicast stats being counted incorrectly atm: atmtcp: Prevent arbitrary write in atmtcp_recv_control(). net/atm: remove the atmdev_ops {get, set}sockopt methods Bluetooth: hci_event: Detect if HCI_EV_NUM_COMP_PKTS is unbalanced powerpc/kvm: Fix ifdef to remove build warning net: ipv4: fix regression in local-broadcast routes vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put() scsi: core: sysfs: Correct sysfs attributes access rights ftrace: Fix potential warning in trace_printk_seq during ftrace_dump alloc_fdtable(): change calling conventions. ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limit ipv6: sr: validate HMAC algorithm ID in seg6_hmac_info_add ALSA: usb-audio: Fix size validation in convert_chmap_v3() scsi: qla4xxx: Prevent a potential error pointer dereference usb: xhci: Fix slot_id resource race conflict nfs: fix UAF in direct writes NFS: Fix up commit deadlocks Bluetooth: fix use-after-free in device_for_each_child() selftests: forwarding: tc_actions.sh: add matchall mirror test codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog() sch_qfq: make qfq_qlen_notify() idempotent sch_hfsc: make hfsc_qlen_notify() idempotent sch_drr: make drr_qlen_notify() idempotent btrfs: populate otime when logging an inode item media: venus: hfi: explicitly release IRQ during teardown f2fs: fix to avoid out-of-boundary access in dnode page media: venus: protect against spurious interrupts during probe media: venus: vdec: Clamp param smaller than 1fps and bigger than 240. drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS media: rainshadow-cec: fix TOCTOU race condition in rain_interrupt() media: v4l2-ctrls: Don't reset handler's error in v4l2_ctrl_handler_free() ata: Fix SATA_MOBILE_LPM_POLICY description in Kconfig usb: musb: omap2430: fix device leak at unbind NFS: Fix the setting of capabilities when automounting a new filesystem NFS: Fix up handling of outstanding layoutcommit in nfs_update_inode() NFSv4: Fix nfs4_bitmap_copy_adjust() usb: typec: fusb302: cache PD RX state cdc-acm: fix race between initial clearing halt and open USB: cdc-acm: do not log successful probe on later errors nfsd: handle get_client_locked() failure in nfsd4_setclientid_confirm() tracing: Add down_write(trace_event_sem) when adding trace event usb: hub: Don't try to recover devices lost during warm reset. usb: hub: avoid warm port reset during USB3 disconnect x86/mce/amd: Add default names for MCA banks and blocks iio: hid-sensor-prox: Fix incorrect OFFSET calculation mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n mm/zsmalloc.c: convert to use kmem_cache_zalloc in cache_alloc_zspage() net: usbnet: Fix the wrong netif_carrier_on() call net: usbnet: Avoid potential RCU stall on LINK_CHANGE event PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports ACPI: processor: idle: Check acpi_fetch_acpi_dev() return value kbuild: Add KBUILD_CPPFLAGS to as-option invocation kbuild: add $(CLANG_FLAGS) to KBUILD_CPPFLAGS kbuild: Add CLANG_FLAGS to as-instr mips: Include KBUILD_CPPFLAGS in CHECKFLAGS invocation kbuild: Update assembler calls to use proper flags and language target ARM: 9448/1: Use an absolute path to unified.h in KBUILD_AFLAGS usb: dwc3: Ignore late xferNotReady event to prevent halt timeout USB: storage: Ignore driver CD mode for Realtek multi-mode Wi-Fi dongles usb: storage: realtek_cr: Use correct byte order for bcs->Residue USB: storage: Add unusual-devs entry for Novatek NTK96550-based camera usb: quirks: Add DELAY_INIT quick for another SanDisk 3.2Gen1 Flash Drive iio: proximity: isl29501: fix buffered read on big-endian systems ftrace: Also allocate and copy hash for reading of filter files fpga: zynq_fpga: Fix the wrong usage of dma_map_sgtable() fs/buffer: fix use-after-free when call bh_read() helper drm/amd/display: Fix fractional fb divider in set_pixel_clock_v3 media: venus: Add a check for packet size after reading from shared memory media: ov2659: Fix memory leaks in ov2659_probe() media: usbtv: Lock resolution while streaming media: gspca: Add bounds checking to firmware parser jbd2: prevent softlockup in jbd2_log_do_checkpoint() PCI: endpoint: Fix configfs group removal on driver teardown PCI: endpoint: Fix configfs group list head handling mtd: rawnand: fsmc: Add missing check after DMA map wifi: brcmsmac: Remove const from tbl_ptr parameter in wlc_lcnphy_common_read_table() zynq_fpga: use sgtable-based scatterlist wrappers ata: libata-scsi: Fix ata_to_sense_error() status handling ext4: fix reserved gdt blocks handling in fsmap ext4: fix fsmap end of range reporting with bigalloc ext4: check fast symlink for ea_inode correctly Revert "vgacon: Add check for vc_origin address range in vgacon_scroll()" vt: defkeymap: Map keycodes above 127 to K_HOLE usb: gadget: udc: renesas_usb3: fix device leak at unbind usb: atm: cxacru: Merge cxacru_upload_firmware() into cxacru_heavy_init() m68k: Fix lost column on framebuffer debug console serial: 8250: fix panic due to PSLVERR media: uvcvideo: Do not mark valid metadata as invalid media: uvcvideo: Fix 1-byte out-of-bounds read in uvc_parse_format() btrfs: fix log tree replay failure due to file with 0 links and extents thunderbolt: Fix copy+paste error in match_service_id() misc: rtsx: usb: Ensure mmc child device is active when card is present scsi: lpfc: Remove redundant assignment to avoid memory leak rtc: ds1307: remove clear of oscillator stop flag (OSF) in probe pNFS: Fix uninited ptr deref in block/scsi layout pNFS: Fix disk addr range check in block/scsi layout pNFS: Fix stripe mapping in block/scsi layout ipmi: Fix strcpy source and destination the same kconfig: lxdialog: fix 'space' to (de)select options kconfig: gconf: fix potential memory leak in renderer_edited() kconfig: gconf: avoid hardcoding model2 in on_treeview2_cursor_changed() scsi: aacraid: Stop using PCI_IRQ_AFFINITY scsi: Fix sas_user_scan() to handle wildcard and multi-channel scans kconfig: nconf: Ensure null termination where strncpy is used kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c PCI: pnv_php: Work around switches with broken presence detection media: uvcvideo: Fix bandwidth issue for Alcor camera media: dvb-frontends: w7090p: fix null-ptr-deref in w7090p_tuner_write_serpar and w7090p_tuner_read_serpar media: dvb-frontends: dib7090p: fix null-ptr-deref in dib7090p_rw_on_apb() media: usb: hdpvr: disable zero-length read messages media: tc358743: Increase FIFO trigger level to 374 media: tc358743: Return an appropriate colorspace from tc358743_set_fmt media: tc358743: Check I2C succeeded during probe pinctrl: stm32: Manage irq affinity settings scsi: mpt3sas: Correctly handle ATA device errors RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask() MIPS: Don't crash in stack_top() for tasks without ABI or vDSO jfs: upper bound check of tree index in dbAllocAG jfs: Regular file corruption check jfs: truncate good inode pages when hard link is 0 scsi: bfa: Double-free fix MIPS: vpe-mt: add missing prototypes for vpe_{alloc,start,stop,free} watchdog: dw_wdt: Fix default timeout fs/orangefs: use snprintf() instead of sprintf() scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr vhost: fail early when __vhost_add_used() fails uapi: in6: restore visibility of most IPv6 socket options net: ncsi: Fix buffer overflow in fetching version id net: dsa: b53: fix b53_imp_vlan_setup for BCM5325 net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs wifi: iwlegacy: Check rate_idx range after addition netmem: fix skb_frag_address_safe with unreadable skbs wifi: rtlwifi: fix possible skb memory leak in `_rtl_pci_rx_interrupt()`. wifi: iwlwifi: dvm: fix potential overflow in rs_fill_link_cmd() net: fec: allow disable coalescing (powerpc/512) Fix possible `dma_unmap_single()` on uninitialized pointer s390/stp: Remove udelay from stp_sync_clock() wifi: iwlwifi: mvm: fix scan request validation net: thunderx: Fix format-truncation warning in bgx_acpi_match_id() net: ipv4: fix incorrect MTU in broadcast routes wifi: cfg80211: Fix interface type validation et131x: Add missing check after DMA map be2net: Use correct byte order and format string for TCP seq and ack_seq s390/time: Use monotonic clock in get_cycles() wifi: cfg80211: reject HTC bit for management frames ktest.pl: Prevent recursion of default variable options ASoC: codecs: rt5640: Retry DEVICE_ID verification ALSA: usb-audio: Avoid precedence issues in mixer_quirks macros ALSA: hda/ca0132: Fix buffer overflow in add_tuning_control platform/x86: thinkpad_acpi: Handle KCOV __init vs inline mismatches pm: cpupower: Fix the snapshot-order of tsc,mperf, clock in mperf_stop() ALSA: intel8x0: Fix incorrect codec index usage in mixer for ICH4 ASoC: hdac_hdmi: Rate limit logging on connection and disconnection mmc: rtsx_usb_sdmmc: Fix error-path in sd_set_power_mode() ACPI: processor: fix acpi_object initialization PM: sleep: console: Fix the black screen issue thermal: sysfs: Return ENODATA instead of EAGAIN for reads selftests: tracing: Use mutex_unlock for testing glob filter ARM: tegra: Use I/O memcpy to write to IRAM gpio: tps65912: check the return value of regmap_update_bits() ASoC: soc-dapm: set bias_level if snd_soc_dapm_set_bias_level() was successed cpufreq: Exit governor when failed to start old governor usb: xhci: Avoid showing errors during surprise removal usb: xhci: Set avg_trb_len = 8 for EP0 during Address Device Command usb: xhci: Avoid showing warnings for dying controller selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t usb: xhci: print xhci->xhc_state when queue_command failed securityfs: don't pin dentries twice, once is enough... hfs: fix not erasing deleted b-tree node issue drbd: add missing kref_get in handle_write_conflicts arm64: Handle KCOV __init vs inline mismatches hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file() hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc() hfsplus: fix slab-out-of-bounds in hfsplus_bnode_read() hfs: fix slab-out-of-bounds in hfs_bnode_read() sctp: linearize cloned gso packets in sctp_rcv netfilter: ctnetlink: fix refcount leak on table dump udp: also consider secpath when evaluating ipsec use for checksumming fs: Prevent file descriptor table allocations exceeding INT_MAX sunvdc: Balance device refcount in vdc_port_mpgroup_check NFSD: detect mismatch of file handle and delegation stateid in OPEN op net: dpaa: fix device leak when querying time stamp info net: gianfar: fix device leak when querying time stamp info netlink: avoid infinite retry looping in netlink_unicast() ALSA: usb-audio: Validate UAC3 cluster segment descriptors ALSA: usb-audio: Validate UAC3 power domain descriptors, too usb: gadget : fix use-after-free in composite_dev_cleanup() MIPS: mm: tlb-r4k: Uniquify TLB entries on init USB: serial: option: add Foxconn T99W709 vsock: Do not allow binding to VMADDR_PORT_ANY net/packet: fix a race in packet_set_ring() and packet_notifier() perf/core: Prevent VMA split of buffer mappings perf/core: Exit early on perf_mmap() fail perf/core: Don't leak AUX buffer refcount on allocation failure pptp: fix pptp_xmit() error path smb: client: let recv_done() cleanup before notifying the callers. benet: fix BUG when creating VFs ipv6: reject malicious packets in ipv6_gso_segment() pptp: ensure minimal skb length in pptp_xmit() netpoll: prevent hanging NAPI when netcons gets enabled NFS: Fix filehandle bounds checking in nfs_fh_to_dentry() pci/hotplug/pnv-php: Wrap warnings in macro pci/hotplug/pnv-php: Improve error msg on power state change failure usb: chipidea: udc: fix sleeping function called from invalid context f2fs: fix to avoid out-of-boundary access in devs.path f2fs: fix to avoid UAF in f2fs_sync_inode_meta() rtc: pcf8563: fix incorrect maximum clock rate handling rtc: hym8563: fix incorrect maximum clock rate handling rtc: ds1307: fix incorrect maximum clock rate handling mtd: rawnand: atmel: set pmecc data setup time mtd: rawnand: atmel: Fix dma_mapping_error() address jfs: fix metapage reference count leak in dbAllocCtl fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref crypto: qat - fix seq_file position update in adf_ring_next() dmaengine: nbpfaxi: Add missing check after DMA map dmaengine: mv_xor: Fix missing check after DMA map and missing unmap fs/orangefs: Allow 2 more characters in do_c_string() crypto: img-hash - Fix dma_unmap_sg() nents value scsi: isci: Fix dma_unmap_sg() nents value scsi: mvsas: Fix dma_unmap_sg() nents value scsi: ibmvscsi_tgt: Fix dma_unmap_sg() nents value perf tests bp_account: Fix leaked file descriptor crypto: ccp - Fix crash when rebind ccp device for ccp.ko pinctrl: sunxi: Fix memory leak on krealloc failure power: supply: max14577: Handle NULL pdata when CONFIG_OF is not set clk: davinci: Add NULL check in davinci_lpsc_clk_register() mtd: fix possible integer overflow in erase_xfer() crypto: marvell/cesa - Fix engine load inaccuracy PCI: rockchip-host: Fix "Unexpected Completion" log message vrf: Drop existing dst reference in vrf_ip6_input_dst netfilter: xt_nfacct: don't assume acct name is null-terminated can: kvaser_usb: Assign netdev.dev_port based on device channel index wifi: brcmfmac: fix P2P discovery failure in P2P peer due to missing P2P IE Reapply "wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()" mwl8k: Add missing check after DMA map wifi: rtl8xxxu: Fix RX skb size for aggregation disabled net/sched: Restrict conditions for adding duplicating netems to qdisc tree arch: powerpc: defconfig: Drop obsolete CONFIG_NET_CLS_TCINDEX netfilter: nf_tables: adjust lockdep assertions handling drm/amd/pm/powerplay/hwmgr/smu_helper: fix order of mask and value m68k: Don't unregister boot console needlessly tcp: fix tcp_ofo_queue() to avoid including too much DUP SACK range iwlwifi: Add missing check for alloc_ordered_workqueue wifi: iwlwifi: Fix memory leak in iwl_mvm_init() wifi: rtl818x: Kill URBs before clearing tx status queue caif: reduce stack size, again staging: nvec: Fix incorrect null termination of battery manufacturer samples: mei: Fix building on musl libc usb: early: xhci-dbc: Fix early_ioremap leak Revert "vmci: Prevent the dispatching of uninitialized payloads" pps: fix poll support vmci: Prevent the dispatching of uninitialized payloads staging: fbtft: fix potential memory leak in fbtft_framebuffer_alloc() ARM: dts: vfxxx: Correctly use two tuples for timer address ASoC: ops: dynamically allocate struct snd_ctl_elem_value hfsplus: remove mutex_lock check in hfsplus_free_extents ASoC: Intel: fix SND_SOC_SOF dependencies ethernet: intel: fix building with large NR_CPUS usb: phy: mxs: disconnect line when USB charger is attached usb: chipidea: udc: protect usb interrupt enable usb: chipidea: udc: add new API ci_hdrc_gadget_connect comedi: comedi_test: Fix possible deletion of uninitialized timers nilfs2: reject invalid file types when reading inodes i2c: qup: jump out of the loop in case of timeout net/sched: sch_qfq: Avoid triggering might_sleep in atomic context in qfq_delete_class net: appletalk: Fix use-after-free in AARP proxy probe net: appletalk: fix kerneldoc warnings RDMA/core: Rate limit GID cache warning messages usb: hub: fix detection of high tier USB3 devices behind suspended hubs net_sched: sch_sfq: reject invalid perturb period net_sched: sch_sfq: move the limit validation net_sched: sch_sfq: use a temporary work area for validating configuration net_sched: sch_sfq: don't allow 1 packet limit net_sched: sch_sfq: handle bigger packets net_sched: sch_sfq: annotate data-races around q->perturb_period power: supply: bq24190_charger: Fix runtime PM imbalance on error xhci: Disable stream for xHC controller with XHCI_BROKEN_STREAMS virtio-net: ensure the received length does not exceed allocated size usb: dwc3: qcom: Don't leave BCR asserted usb: musb: fix gadget state on disconnect net/sched: Return NULL when htb_lookup_leaf encounters an empty rbtree net: vlan: fix VLAN 0 refcount imbalance of toggling filtering during runtime Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout Bluetooth: SMP: If an unallowed command is received consider it a failure Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb() usb: net: sierra: check for no status endpoint net/sched: sch_qfq: Fix race condition on qfq_aggregate net: emaclite: Fix missing pointer increment in aligned_read() comedi: Fix use of uninitialized data in insn_rw_emulate_bits() comedi: Fix some signed shift left operations comedi: das6402: Fix bit shift out of bounds comedi: das16m1: Fix bit shift out of bounds comedi: aio_iiro_16: Fix bit shift out of bounds comedi: pcl812: Fix bit shift out of bounds iio: adc: max1363: Reorder mode_list[] entries iio: adc: max1363: Fix MAX1363_4X_CHANS/MAX1363_8X_CHANS[] soc: aspeed: lpc-snoop: Don't disable channels that aren't enabled soc: aspeed: lpc-snoop: Cleanup resources in stack-order mmc: sdhci-pci: Quirk for broken command queuing on Intel GLK-based Positivo models memstick: core: Zero initialize id_reg in h_memstick_read_dev_id() isofs: Verify inode mode when loading from disk dmaengine: nbpfaxi: Fix memory corruption in probe() af_packet: fix soft lockup issue caused by tpacket_snd() af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd() phonet/pep: Move call to pn_skb_get_dst_sockaddr() earlier in pep_sock_accept() HID: core: do not bypass hid_hw_raw_request HID: core: ensure __hid_request reserves the report ID as the first byte HID: core: ensure the allocated report buffer can contain the reserved report ID pch_uart: Fix dma_sync_sg_for_device() nents value Input: xpad - set correct controller type for Acer NGR200 i2c: stm32: fix the device used for the DMA map usb: gadget: configfs: Fix OOB read on empty string write USB: serial: ftdi_sio: add support for NDI EMGUIDE GEMINI USB: serial: option: add Foxconn T99W640 USB: serial: option: add Telit Cinterion FE910C04 (ECM) composition dma-mapping: add generic helpers for mapping sgtable objects usb: renesas_usbhs: Flush the notify_hotplug_work gpio: rcar: Use raw_spinlock to protect register access Change-Id: Ia6b8b00918487999c648f298d3550afc7eaaae03 Signed-off-by: bengris32 <bengris32@protonmail.ch> |
||
|
|
63eb2e4a5e |
Merge branch 'linux-4.19.y-st' into linux-4.19.y-cip
Brings the tree up-to-date with 5.4.298. Signed-off-by: Ulrich Hecht <uli@kernel.org> |
||
|
|
dffaeb7d90 |
selftests: forwarding: tc_actions.sh: add matchall mirror test
[ Upstream commit 075c8aa79d541ea08c67a2e6d955f6457e98c21c ] Add test for matchall classifier with mirred egress mirror action. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: ca22da2fbd69 ("act_mirred: use the backlog for nested calls to mirred ingress") Signed-off-by: Shubham Kulkarni <skulkarni@mvista.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ulrich Hecht <uli@kernel.org> |
||
|
|
f6201d924d |
UPSTREAM: bpf: Allow reads from uninit stack
commit 6715df8d5d24655b9fd368e904028112b54c7de1 upstream.
This commits updates the following functions to allow reads from
uninitialized stack locations when env->allow_uninit_stack option is
enabled:
- check_stack_read_fixed_off()
- check_stack_range_initialized(), called from:
- check_stack_read_var_off()
- check_helper_mem_access()
Such change allows to relax logic in stacksafe() to treat STACK_MISC
and STACK_INVALID in a same way and make the following stack slot
configurations equivalent:
| Cached state | Current state |
| stack slot | stack slot |
|------------------+------------------|
| STACK_INVALID or | STACK_INVALID or |
| STACK_MISC | STACK_SPILL or |
| | STACK_MISC or |
| | STACK_ZERO or |
| | STACK_DYNPTR |
This leads to significant verification speed gains (see below).
The idea was suggested by Andrii Nakryiko [1] and initial patch was
created by Alexei Starovoitov [2].
Currently the env->allow_uninit_stack is allowed for programs loaded
by users with CAP_PERFMON or CAP_SYS_ADMIN capabilities.
A number of test cases from verifier/*.c were expecting uninitialized
stack access to be an error. These test cases were updated to execute
in unprivileged mode (thus preserving the tests).
The test progs/test_global_func10.c expected "invalid indirect read
from stack" error message because of the access to uninitialized
memory region. This error is no longer possible in privileged mode.
The test is updated to provoke an error "invalid indirect access to
stack" because of access to invalid stack address (such error is not
verified by progs/test_global_func*.c series of tests).
The following tests had to be removed because these can't be made
unprivileged:
- verifier/sock.c:
- "sk_storage_get(map, skb->sk, &stack_value, 1): partially init
stack_value"
BPF_PROG_TYPE_SCHED_CLS programs are not executed in unprivileged mode.
- verifier/var_off.c:
- "indirect variable-offset stack access, max_off+size > max_initialized"
- "indirect variable-offset stack access, uninitialized"
These tests verify that access to uninitialized stack values is
detected when stack offset is not a constant. However, variable
stack access is prohibited in unprivileged mode, thus these tests
are no longer valid.
* * *
Here is veristat log comparing this patch with current master on a
set of selftest binaries listed in tools/testing/selftests/bpf/veristat.cfg
and cilium BPF binaries (see [3]):
$ ./veristat -e file,prog,states -C -f 'states_pct<-30' master.log current.log
File Program States (A) States (B) States (DIFF)
-------------------------- -------------------------- ---------- ---------- ----------------
bpf_host.o tail_handle_ipv6_from_host 349 244 -105 (-30.09%)
bpf_host.o tail_handle_nat_fwd_ipv4 1320 895 -425 (-32.20%)
bpf_lxc.o tail_handle_nat_fwd_ipv4 1320 895 -425 (-32.20%)
bpf_sock.o cil_sock4_connect 70 48 -22 (-31.43%)
bpf_sock.o cil_sock4_sendmsg 68 46 -22 (-32.35%)
bpf_xdp.o tail_handle_nat_fwd_ipv4 1554 803 -751 (-48.33%)
bpf_xdp.o tail_lb_ipv4 6457 2473 -3984 (-61.70%)
bpf_xdp.o tail_lb_ipv6 7249 3908 -3341 (-46.09%)
pyperf600_bpf_loop.bpf.o on_event 287 145 -142 (-49.48%)
strobemeta.bpf.o on_event 15915 4772 -11143 (-70.02%)
strobemeta_nounroll2.bpf.o on_event 17087 3820 -13267 (-77.64%)
xdp_synproxy_kern.bpf.o syncookie_tc 21271 6635 -14636 (-68.81%)
xdp_synproxy_kern.bpf.o syncookie_xdp 23122 6024 -17098 (-73.95%)
-------------------------- -------------------------- ---------- ---------- ----------------
Note: I limited selection by states_pct<-30%.
Inspection of differences in pyperf600_bpf_loop behavior shows that
the following patch for the test removes almost all differences:
- a/tools/testing/selftests/bpf/progs/pyperf.h
+ b/tools/testing/selftests/bpf/progs/pyperf.h
@ -266,8 +266,8 @ int __on_event(struct bpf_raw_tracepoint_args *ctx)
}
if (event->pthread_match || !pidData->use_tls) {
- void* frame_ptr;
- FrameData frame;
+ void* frame_ptr = 0;
+ FrameData frame = {};
Symbol sym = {};
int cur_cpu = bpf_get_smp_processor_id();
W/o this patch the difference comes from the following pattern
(for different variables):
static bool get_frame_data(... FrameData *frame ...)
{
...
bpf_probe_read_user(&frame->f_code, ...);
if (!frame->f_code)
return false;
...
bpf_probe_read_user(&frame->co_name, ...);
if (frame->co_name)
...;
}
int __on_event(struct bpf_raw_tracepoint_args *ctx)
{
FrameData frame;
...
get_frame_data(... &frame ...) // indirectly via a bpf_loop & callback
...
}
SEC("raw_tracepoint/kfree_skb")
int on_event(struct bpf_raw_tracepoint_args* ctx)
{
...
ret |= __on_event(ctx);
ret |= __on_event(ctx);
...
}
With regards to value `frame->co_name` the following is important:
- Because of the conditional `if (!frame->f_code)` each call to
__on_event() produces two states, one with `frame->co_name` marked
as STACK_MISC, another with it as is (and marked STACK_INVALID on a
first call).
- The call to bpf_probe_read_user() does not mark stack slots
corresponding to `&frame->co_name` as REG_LIVE_WRITTEN but it marks
these slots as BPF_MISC, this happens because of the following loop
in the check_helper_call():
for (i = 0; i < meta.access_size; i++) {
err = check_mem_access(env, insn_idx, meta.regno, i, BPF_B,
BPF_WRITE, -1, false);
if (err)
return err;
}
Note the size of the write, it is a one byte write for each byte
touched by a helper. The BPF_B write does not lead to write marks
for the target stack slot.
- Which means that w/o this patch when second __on_event() call is
verified `if (frame->co_name)` will propagate read marks first to a
stack slot with STACK_MISC marks and second to a stack slot with
STACK_INVALID marks and these states would be considered different.
[1] https://lore.kernel.org/bpf/CAEf4BzY3e+ZuC6HUa8dCiUovQRg2SzEk7M-dSkqNZyn=xEmnPA@mail.gmail.com/
[2] https://lore.kernel.org/bpf/CAADnVQKs2i1iuZ5SUGuJtxWVfGYR9kDgYKhq3rNV+kBLQCu7rA@mail.gmail.com/
[3] git@github.com:anakryiko/cilium.git
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Co-developed-by: Alexei Starovoitov <ast@kernel.org>
Change-Id: I9a3bae72fddb8a00f861a6928162049407eae7a2
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230219200427.606541-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
4029fa4d85 |
UPSTREAM: bpf: Add crosstask check to __bpf_get_stack
[ Upstream commit b8e3a87a627b575896e448021e5c2f8a3bc19931 ]
Currently get_perf_callchain only supports user stack walking for
the current task. Passing the correct *crosstask* param will return
0 frames if the task passed to __bpf_get_stack isn't the current
one instead of a single incorrect frame/address. This change
passes the correct *crosstask* param but also does a preemptive
check in __bpf_get_stack if the task is current and returns
-EOPNOTSUPP if it is not.
This issue was found using bpf_get_task_stack inside a BPF
iterator ("iter/task"), which iterates over all tasks.
bpf_get_task_stack works fine for fetching kernel stacks
but because get_perf_callchain relies on the caller to know
if the requested *task* is the current one (via *crosstask*)
it was failing in a confusing way.
It might be possible to get user stacks for all tasks utilizing
something like access_process_vm but that requires the bpf
program calling bpf_get_task_stack to be sleepable and would
therefore be a breaking change.
Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()")
Change-Id: Iaaa9eecc1e9f9c4c7b4953e169725d1aa1808bc3
Signed-off-by: Jordan Rome <jordalgo@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
d0c2b52274 |
UPSTREAM: bpf: Clarify error expectations from bpf_clone_redirect
[ Upstream commit 7cb779a6867fea00b4209bcf6de2f178a743247d ]
Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
packets") exposed the fact that bpf_clone_redirect is capable of
returning raw NET_XMIT_XXX return codes.
This is in the conflict with its UAPI doc which says the following:
"0 on success, or a negative error in case of failure."
Update the UAPI to reflect the fact that bpf_clone_redirect can
return positive error numbers, but don't explicitly define
their meaning.
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Change-Id: I5b81b2b05cb8369feceae99f9fddaaf6a0121dd1
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230911194731.286342-1-sdf@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
08fbe6d9cd |
UPSTREAM: bpf: Fix comment for helper bpf_current_task_under_cgroup()
commit 58617014405ad5c9f94f464444f4972dabb71ca7 upstream.
Fix the descriptions of the return values of helper bpf_current_task_under_cgroup().
Fixes:
|
||
|
|
55b9335242 |
UPSTREAM: bpf: Fix a typo of reuseport map in bpf.h.
[ Upstream commit f170acda7ffaf0473d06e1e17c12cd9fd63904f5 ]
Fix s/BPF_MAP_TYPE_REUSEPORT_ARRAY/BPF_MAP_TYPE_REUSEPORT_SOCKARRAY/ typo
in bpf.h.
Fixes:
|
||
|
|
b567486e24 |
UPSTREAM: bpf: Fix enum names for bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers
Remove bpf_ prefix, which causes these helpers to be reported in verifier
dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Lets
fix it as long as it is still possible before UAPI freezes on these helpers.
Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
Change-Id: I67e5dde44023de422ec093935ce20647466db4b8
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
715e3660af |
UPSTREAM: bpf: Relax return code check for subprograms
Currently verifier enforces return code checks for subprograms in the same manner as it does for program entry points. This prevents returning arbitrary scalar values from subprograms. Scalar type of returned values is checked by btf_prepare_func_args() and hence it should be safe to allow only scalars for now. Relax return code checks for subprograms and allow any correct scalar values. Fixes: 51c39bb1d5d10 (bpf: Introduce function-by-function verification) Change-Id: Id0508fc1675912d14ba1c7ed3d3c2c1bd7b65b22 Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20201113171756.90594-1-me@ubique.spb.ru |
||
|
|
bafeb90f63 |
UPSTREAM: bpf: Zero-fill re-used per-cpu map element
Zero-fill element values for all other cpus than current, just as
when not using prealloc. This is the only way the bpf program can
ensure known initial values for all cpus ('onallcpus' cannot be
set when coming from the bpf program).
The scenario is: bpf program inserts some elements in a per-cpu
map, then deletes some (or userspace does). When later adding
new elements using bpf_map_update_elem(), the bpf program can
only set the value of the new elements for the current cpu.
When prealloc is enabled, previously deleted elements are re-used.
Without the fix, values for other cpus remain whatever they were
when the re-used entry was previously freed.
A selftest is added to validate correct operation in above
scenario as well as in case of LRU per-cpu map element re-use.
Fixes:
|
||
|
|
5ccc906068 |
UPSTREAM: bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop
Based on the discussion in [0], update the bpf_redirect_neigh() helper to accept an optional parameter specifying the nexthop information. This makes it possible to combine bpf_fib_lookup() and bpf_redirect_neigh() without incurring a duplicate FIB lookup - since the FIB lookup helper will return the nexthop information even if no neighbour is present, this can simply be passed on to bpf_redirect_neigh() if bpf_fib_lookup() returns BPF_FIB_LKUP_RET_NO_NEIGH. Thus fix & extend it before helper API is frozen. [0] https://lore.kernel.org/bpf/393e17fc-d187-3a8d-2f0d-a627c7c63fca@iogearbox.net/ Change-Id: I2c24b13263ccc6452023c6f76e635ab2114ce142 Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/bpf/160322915615.32199.1187570224032024535.stgit@toke.dk |
||
|
|
00b2b0959f |
UPSTREAM: bpf: Allow for map-in-map with dynamic inner array map entries
Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
("bpf: Relax max_entries check for most of the inner map types") added support
for dynamic inner max elements for most map-in-map types. Exceptions were maps
like array or prog array where the map_gen_lookup() callback uses the maps'
max_entries field as a constant when emitting instructions.
We recently implemented Maglev consistent hashing into Cilium's load balancer
which uses map-in-map with an outer map being hash and inner being array holding
the Maglev backend table for each service. This has been designed this way in
order to reduce overall memory consumption given the outer hash map allows to
avoid preallocating a large, flat memory area for all services. Also, the
number of service mappings is not always known a-priori.
The use case for dynamic inner array map entries is to further reduce memory
overhead, for example, some services might just have a small number of back
ends while others could have a large number. Right now the Maglev backend table
for small and large number of backends would need to have the same inner array
map entries which adds a lot of unneeded overhead.
Dynamic inner array map entries can be realized by avoiding the inlined code
generation for their lookup. The lookup will still be efficient since it will
be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
inline code generation and relaxes array_map_meta_equal() check to ignore both
maps' max_entries. This also still allows to have faster lookups for map-in-map
when BPF_F_INNER_MAP is not specified and hence dynamic max_entries not needed.
Example code generation where inner map is dynamic sized array:
# bpftool p d x i 125
int handle__sys_enter(void * ctx):
; int handle__sys_enter(void *ctx)
0: (b4) w1 = 0
; int key = 0;
1: (63) *(u32 *)(r10 -4) = r1
2: (bf) r2 = r10
;
3: (07) r2 += -4
; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
4: (18) r1 = map[id:468]
6: (07) r1 += 272
7: (61) r0 = *(u32 *)(r2 +0)
8: (35) if r0 >= 0x3 goto pc+5
9: (67) r0 <<= 3
10: (0f) r0 += r1
11: (79) r0 = *(u64 *)(r0 +0)
12: (15) if r0 == 0x0 goto pc+1
13: (05) goto pc+1
14: (b7) r0 = 0
15: (b4) w6 = -1
; if (!inner_map)
16: (15) if r0 == 0x0 goto pc+6
17: (bf) r2 = r10
;
18: (07) r2 += -4
; val = bpf_map_lookup_elem(inner_map, &key);
19: (bf) r1 = r0 | No inlining but instead
20: (85) call array_map_lookup_elem#149280 | call to array_map_lookup_elem()
; return val ? *val : -1; | for inner array lookup.
21: (15) if r0 == 0x0 goto pc+1
; return val ? *val : -1;
22: (61) r6 = *(u32 *)(r0 +0)
; }
23: (bc) w0 = w6
24: (95) exit
Change-Id: Ie3c14796539b261e3fb9433aa9ef46fe116294c4
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net
|
||
|
|
dbb8fde83f |
UPSTREAM: bpf: Add redirect_peer helper
Add an efficient ingress to ingress netns switch that can be used out of tc BPF
programs in order to redirect traffic from host ns ingress into a container
veth device ingress without having to go via CPU backlog queue [0]. For local
containers this can also be utilized and path via CPU backlog queue only needs
to be taken once, not twice. On a high level this borrows from ipvlan which does
similar switch in __netif_receive_skb_core() and then iterates via another_round.
This helps to reduce latency for mentioned use cases.
Pod to remote pod with redirect(), TCP_RR [1]:
# percpu_netperf 10.217.1.33
RT_LATENCY: 122.450 (per CPU: 122.666 122.401 122.333 122.401 )
MEAN_LATENCY: 121.210 (per CPU: 121.100 121.260 121.320 121.160 )
STDDEV_LATENCY: 120.040 (per CPU: 119.420 119.910 125.460 115.370 )
MIN_LATENCY: 46.500 (per CPU: 47.000 47.000 47.000 45.000 )
P50_LATENCY: 118.500 (per CPU: 118.000 119.000 118.000 119.000 )
P90_LATENCY: 127.500 (per CPU: 127.000 128.000 127.000 128.000 )
P99_LATENCY: 130.750 (per CPU: 131.000 131.000 129.000 132.000 )
TRANSACTION_RATE: 32666.400 (per CPU: 8152.200 8169.842 8174.439 8169.897 )
Pod to remote pod with redirect_peer(), TCP_RR:
# percpu_netperf 10.217.1.33
RT_LATENCY: 44.449 (per CPU: 43.767 43.127 45.279 45.622 )
MEAN_LATENCY: 45.065 (per CPU: 44.030 45.530 45.190 45.510 )
STDDEV_LATENCY: 84.823 (per CPU: 66.770 97.290 84.380 90.850 )
MIN_LATENCY: 33.500 (per CPU: 33.000 33.000 34.000 34.000 )
P50_LATENCY: 43.250 (per CPU: 43.000 43.000 43.000 44.000 )
P90_LATENCY: 46.750 (per CPU: 46.000 47.000 47.000 47.000 )
P99_LATENCY: 52.750 (per CPU: 51.000 54.000 53.000 53.000 )
TRANSACTION_RATE: 90039.500 (per CPU: 22848.186 23187.089 22085.077 21919.130 )
[0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
[1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf
Change-Id: I17d75ffbb776ea4e36326b8fdd04b71441a1982b
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
|
||
|
|
bba8615766 |
UPSTREAM: bpf: Improve bpf_redirect_neigh helper description
Follow-up to address David's feedback that we should better describe internals of the bpf_redirect_neigh() helper. Suggested-by: David Ahern <dsahern@gmail.com> Change-Id: I846bfce6bbcbce8bc561501be8f86cfba4810ca5 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: David Ahern <dsahern@gmail.com> Link: https://lore.kernel.org/bpf/20201010234006.7075-2-daniel@iogearbox.net |
||
|
|
c8fbfe0610 |
UPSTREAM: bpf: Propagate scalar ranges through register assignments.
The llvm register allocator may use two different registers representing the same virtual register. In such case the following pattern can be observed: 1047: (bf) r9 = r6 1048: (a5) if r6 < 0x1000 goto pc+1 1050: ... 1051: (a5) if r9 < 0x2 goto pc+66 1052: ... 1053: (bf) r2 = r9 /* r2 needs to have upper and lower bounds */ This is normal behavior of greedy register allocator. The slides 137+ explain why regalloc introduces such register copy: http://llvm.org/devmtg/2018-04/slides/Yatsina-LLVM%20Greedy%20Register%20Allocator.pdf There is no way to tell llvm 'not to do this'. Hence the verifier has to recognize such patterns. In order to track this information without backtracking allocate ID for scalars in a similar way as it's done for find_good_pkt_pointers(). When the verifier encounters r9 = r6 assignment it will assign the same ID to both registers. Later if either register range is narrowed via conditional jump propagate the register state into the other register. Clear register ID in adjust_reg_min_max_vals() for any alu instruction. The register ID is ignored for scalars in regsafe() and doesn't affect state pruning. mark_reg_unknown() clears the ID. It's used to process call, endian and other instructions. Hence ID is explicitly cleared only in adjust_reg_min_max_vals() and in 32-bit mov. Change-Id: I20ee3cfd0be8c1ff5a8d1607bcce914ba92b7804 Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20201009011240.48506-2-alexei.starovoitov@gmail.com |
||
|
|
ead75c1fa0 |
UPSTREAM: bpf: Add tcp_notsent_lowat bpf setsockopt
Adding support for TCP_NOTSENT_LOWAT sockoption (https://lwn.net/Articles/560082/) in tcp bpf programs. Change-Id: I8edbc4662cbf7e31026c073ab628f8d28f859bef Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201009070325.226855-1-tehnerd@tehnerd.com |
||
|
|
a897a2775d |
UPSTREAM: bpf: Fix typo in uapi/linux/bpf.h
Reported-by: Samanta Navarro <ferivoz@riseup.net> Change-Id: I2b201793f7c385f8581dcff502186e518705a420 Signed-off-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201007055717.7319-1-jwilk@jwilk.net |
||
|
|
57ae737d3c |
UPSTREAM: bpf: Introducte bpf_this_cpu_ptr()
Add bpf_this_cpu_ptr() to help access percpu var on this cpu. This helper always returns a valid pointer, therefore no need to check returned value for NULL. Also note that all programs run with preemption disabled, which means that the returned pointer is stable during all the execution of the program. Change-Id: Idd23453d28e221430cc067c6b6f812eab0ac738d Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-6-haoluo@google.com |
||
|
|
dca7b0be14 |
UPSTREAM: bpf: Introduce bpf_per_cpu_ptr()
Add bpf_per_cpu_ptr() to help bpf programs access percpu vars. bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the kernel except that it may return NULL. This happens when the cpu parameter is out of range. So the caller must check the returned value. Change-Id: I254209a7070abc34458020e4ee8ee73256e31886 Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-5-haoluo@google.com |
||
|
|
fae103ddcc |
UPSTREAM: bpf: Introduce pseudo_btf_id
Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a ksym so that further dereferences on the ksym can use the BTF info to validate accesses. Internally, when seeing a pseudo_btf_id ld insn, the verifier reads the btf_id stored in the insn[0]'s imm field and marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND, which is encoded in btf_vminux by pahole. If the VAR is not of a struct type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID and the mem_size is resolved to the size of the VAR's type. >From the VAR btf_id, the verifier can also read the address of the ksym's corresponding kernel var from kallsyms and use that to fill dst_reg. Therefore, the proper functionality of pseudo_btf_id depends on (1) kallsyms and (2) the encoding of kernel global VARs in pahole, which should be available since pahole v1.18. Change-Id: I8e81d1d9c2ed4c669e5f68fcf4246b9c1c705bb6 Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com |
||
|
|
d45b6b182b |
UPSTREAM: bpf: Introduce BPF_F_PRESERVE_ELEMS for perf event array
Currently, perf event in perf event array is removed from the array when the map fd used to add the event is closed. This behavior makes it difficult to the share perf events with perf event array. Introduce perf event map that keeps the perf event open with a new flag BPF_F_PRESERVE_ELEMS. With this flag set, perf events in the array are not removed when the original map fd is closed. Instead, the perf event will stay in the map until 1) it is explicitly removed from the array; or 2) the array is freed. Change-Id: I7c0aaa5ae7431439acaaaaf7b5b689834b2313db Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200930224927.1936644-2-songliubraving@fb.com |
||
|
|
f256187033 |
UPSTREAM: bpf: Add redirect_neigh helper as redirect drop-in
Add a redirect_neigh() helper as redirect() drop-in replacement
for the xmit side. Main idea for the helper is to be very similar
in semantics to the latter just that the skb gets injected into
the neighboring subsystem in order to let the stack do the work
it knows best anyway to populate the L2 addresses of the packet
and then hand over to dev_queue_xmit() as redirect() does.
This solves two bigger items: i) skbs don't need to go up to the
stack on the host facing veth ingress side for traffic egressing
the container to achieve the same for populating L2 which also
has the huge advantage that ii) the skb->sk won't get orphaned in
ip_rcv_core() when entering the IP routing layer on the host stack.
Given that skb->sk neither gets orphaned when crossing the netns
as per
|
||
|
|
2008a2e63f |
UPSTREAM: bpf: Add classid helper only based on skb->sk
Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid
from v2 hooks"), add a helper to retrieve cgroup v1 classid solely
based on the skb->sk, so it can be used as key as part of BPF map
lookups out of tc from host ns, in particular given the skb->sk is
retained these days when crossing net ns thanks to
|
||
|
|
8894754770 |
UPSTREAM: bpf: Support attaching freplace programs to multiple attach points
This enables support for attaching freplace programs to multiple attach points. It does this by amending the UAPI for bpf_link_Create with a target btf ID that can be used to supply the new attachment point along with the target program fd. The target must be compatible with the target that was supplied at program load time. The implementation reuses the checks that were factored out of check_attach_btf_id() to ensure compatibility between the BTF types of the old and new attachment. If these match, a new bpf_tracing_link will be created for the new attach target, allowing multiple attachments to co-exist simultaneously. The code could theoretically support multiple-attach of other types of tracing programs as well, but since I don't have a use case for any of those, there is no API support for doing so. Change-Id: Ifeca634ba0c7ae9b1c5bc3c11a9d2b83fb0b5d23 Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/160138355169.48470.17165680973640685368.stgit@toke.dk |
||
|
|
c27d926940 |
UPSTREAM: bpf: Add bpf_seq_printf_btf helper
A helper is added to allow seq file writing of kernel data
structures using vmlinux BTF. Its signature is
long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr,
u32 btf_ptr_size, u64 flags);
Flags and struct btf_ptr definitions/use are identical to the
bpf_snprintf_btf helper, and the helper returns 0 on success
or a negative error value.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Change-Id: Ief0f9b8b9d9ed5f725159d71d1f3eb26f28c27c1
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-8-git-send-email-alan.maguire@oracle.com
|
||
|
|
af684854b7 |
UPSTREAM: bpf: Add bpf_snprintf_btf helper
A helper is added to support tracing kernel type information in BPF
using the BPF Type Format (BTF). Its signature is
long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
u32 btf_ptr_size, u64 flags);
struct btf_ptr * specifies
- a pointer to the data to be traced
- the BTF id of the type of data pointed to
- a flags field is provided for future use; these flags
are not to be confused with the BTF_F_* flags
below that control how the btf_ptr is displayed; the
flags member of the struct btf_ptr may be used to
disambiguate types in kernel versus module BTF, etc;
the main distinction is the flags relate to the type
and information needed in identifying it; not how it
is displayed.
For example a BPF program with a struct sk_buff *skb
could do the following:
static struct btf_ptr b = { };
b.ptr = skb;
b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0, 0);
Default output looks like this:
(struct sk_buff){
.transport_header = (__u16)65535,
.mac_header = (__u16)65535,
.end = (sk_buff_data_t)192,
.head = (unsigned char *)0x000000007524fd8b,
.data = (unsigned char *)0x000000007524fd8b,
.truesize = (unsigned int)768,
.users = (refcount_t){
.refs = (atomic_t){
.counter = (int)1,
},
},
}
Flags modifying display are as follows:
- BTF_F_COMPACT: no formatting around type information
- BTF_F_NONAME: no struct/union member names/types
- BTF_F_PTR_RAW: show raw (unobfuscated) pointer values;
equivalent to %px.
- BTF_F_ZERO: show zero-valued struct/union members;
they are not displayed by default
Change-Id: I77f2c2a0d41aee2f4f10e3288a36de475fd2cb46
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com
|
||
|
|
e50eb80d29 |
UPSTREAM: bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint
Add .test_run for raw_tracepoint. Also, introduce a new feature that runs the target program on a specific CPU. This is achieved by a new flag in bpf_attr.test, BPF_F_TEST_RUN_ON_CPU. When this flag is set, the program is triggered on cpu with id bpf_attr.test.cpu. This feature is needed for BPF programs that handle perf_event and other percpu resources, as the program can access these resource locally. Change-Id: Id8f992caff30d7b65df8195f8934bcb2d8b658cb Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200925205432.1777-2-songliubraving@fb.com |
||
|
|
dc04cfd580 |
UPSTREAM: bpf: Change bpf_sk_assign to accept ARG_PTR_TO_BTF_ID_SOCK_COMMON
This patch changes the bpf_sk_assign() to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer returned by the bpf_skc_to_*() helpers also. The bpf_sk_lookup_assign() is taking ARG_PTR_TO_SOCKET_"OR_NULL". Meaning it specifically takes a literal NULL. ARG_PTR_TO_BTF_ID_SOCK_COMMON does not allow a literal NULL, so another ARG type is required for this purpose and another follow-up patch can be used if there is such need. Change-Id: Ia2747b017398c8dcc4454118dba03e71bdeba1dc Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200925000415.3857374-1-kafai@fb.com |
||
|
|
d7b625294e |
UPSTREAM: bpf: Change bpf_tcp_*_syncookie to accept ARG_PTR_TO_BTF_ID_SOCK_COMMON
This patch changes the bpf_tcp_*_syncookie() to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer returned by the bpf_skc_to_*() helpers also. Change-Id: I4f7712acada428dab29b94f0cb9bbbb19d9eaea2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Lorenz Bauer <lmb@cloudflare.com> Link: https://lore.kernel.org/bpf/20200925000409.3856725-1-kafai@fb.com |
||
|
|
36304c0284 |
UPSTREAM: bpf: Change bpf_sk_storage_*() to accept ARG_PTR_TO_BTF_ID_SOCK_COMMON
This patch changes the bpf_sk_storage_*() to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will work with the pointer returned by the bpf_skc_to_*() helpers also. A micro benchmark has been done on a "cgroup_skb/egress" bpf program which does a bpf_sk_storage_get(). It was driven by netperf doing a 4096 connected UDP_STREAM test with 64bytes packet. The stats from "kernel.bpf_stats_enabled" shows no meaningful difference. The sk_storage_get_btf_proto, sk_storage_delete_btf_proto, btf_sk_storage_get_proto, and btf_sk_storage_delete_proto are no longer needed, so they are removed. Change-Id: Ifef1303f11952714d181a3c369e154a70a864fd0 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Lorenz Bauer <lmb@cloudflare.com> Link: https://lore.kernel.org/bpf/20200925000402.3856307-1-kafai@fb.com |
||
|
|
01c44a76c3 |
UPSTREAM: bpf: Change bpf_sk_release and bpf_sk_*cgroup_id to accept ARG_PTR_TO_BTF_ID_SOCK_COMMON
The previous patch allows the networking bpf prog to use the
bpf_skc_to_*() helpers to get a PTR_TO_BTF_ID socket pointer,
e.g. "struct tcp_sock *". It allows the bpf prog to read all the
fields of the tcp_sock.
This patch changes the bpf_sk_release() and bpf_sk_*cgroup_id()
to take ARG_PTR_TO_BTF_ID_SOCK_COMMON such that they will
work with the pointer returned by the bpf_skc_to_*() helpers
also. For example, the following will work:
sk = bpf_skc_lookup_tcp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0);
if (!sk)
return;
tp = bpf_skc_to_tcp_sock(sk);
if (!tp) {
bpf_sk_release(sk);
return;
}
lsndtime = tp->lsndtime;
/* Pass tp to bpf_sk_release() will also work */
bpf_sk_release(tp);
Since PTR_TO_BTF_ID could be NULL, the helper taking
ARG_PTR_TO_BTF_ID_SOCK_COMMON has to check for NULL at runtime.
A btf_id of "struct sock" may not always mean a fullsock. Regardless
the helper's running context may get a non-fullsock or not,
considering fullsock check/handling is pretty cheap, it is better to
keep the same verifier expectation on helper that takes ARG_PTR_TO_BTF_ID*
will be able to handle the minisock situation. In the bpf_sk_*cgroup_id()
case, it will try to get a fullsock by using sk_to_full_sk() as its
skb variant bpf_sk"b"_*cgroup_id() has already been doing.
bpf_sk_release can already handle minisock, so nothing special has to
be done.
Change-Id: Ia9823f535d2a69b6ed4c16f599faa3f8c3ac3a42
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200925000356.3856047-1-kafai@fb.com
|
||
|
|
38367f1d94 |
UPSTREAM: bpf: Add BPF_PROG_BIND_MAP syscall
This syscall binds a map to a program. Returns success if the map is already bound to the program. Change-Id: I3899904e13ff5cddb35d41ffa9abe67d29a55ca9 Signed-off-by: YiFei Zhu <zhuyifei@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Cc: YiFei Zhu <zhuyifei1999@gmail.com> Link: https://lore.kernel.org/bpf/20200915234543.3220146-3-sdf@google.com |
||
|
|
9d956ac760 |
UPSTREAM: bpf: Fix comment for helper bpf_current_task_under_cgroup()
This should be "current" not "skb".
Fixes:
|
||
|
|
69e5cce407 |
UPSTREAM: bpf: Add bpf_copy_from_user() helper.
Sleepable BPF programs can now use copy_from_user() to access user memory. Change-Id: Ifaf2da741d14dc56b121163d7c486e5c15d1d5a4 Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: KP Singh <kpsingh@google.com> Link: https://lore.kernel.org/bpf/20200827220114.69225-4-alexei.starovoitov@gmail.com |
||
|
|
5fae5cfdcb |
UPSTREAM: bpf: Introduce sleepable BPF programs
Introduce sleepable BPF programs that can request such property for themselves via BPF_F_SLEEPABLE flag at program load time. In such case they will be able to use helpers like bpf_copy_from_user() that might sleep. At present only fentry/fexit/fmod_ret and lsm programs can request to be sleepable and only when they are attached to kernel functions that are known to allow sleeping. The non-sleepable programs are relying on implicit rcu_read_lock() and migrate_disable() to protect life time of programs, maps that they use and per-cpu kernel structures used to pass info between bpf programs and the kernel. The sleepable programs cannot be enclosed into rcu_read_lock(). migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs should not be enclosed in migrate_disable() as well. Therefore rcu_read_lock_trace is used to protect the life time of sleepable progs. There are many networking and tracing program types. In many cases the 'struct bpf_prog *' pointer itself is rcu protected within some other kernel data structure and the kernel code is using rcu_dereference() to load that program pointer and call BPF_PROG_RUN() on it. All these cases are not touched. Instead sleepable bpf programs are allowed with bpf trampoline only. The program pointers are hard-coded into generated assembly of bpf trampoline and synchronize_rcu_tasks_trace() is used to protect the life time of the program. The same trampoline can hold both sleepable and non-sleepable progs. When rcu_read_lock_trace is held it means that some sleepable bpf program is running from bpf trampoline. Those programs can use bpf arrays and preallocated hash/lru maps. These map types are waiting on programs to complete via synchronize_rcu_tasks_trace(); Updates to trampoline now has to do synchronize_rcu_tasks_trace() and synchronize_rcu_tasks() to wait for sleepable progs to finish and for trampoline assembly to finish. This is the first step of introducing sleepable progs. Eventually dynamically allocated hash maps can be allowed and networking program types can become sleepable too. Change-Id: I281471010348f21de4e0832dfc0265dfe85b4a0d Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: KP Singh <kpsingh@google.com> Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com |
||
|
|
8e06f1cbab |
UPSTREAM: bpf: Make bpf_link_info.iter similar to bpf_iter_link_info
bpf_link_info.iter is used by link_query to return bpf_iter_link_info
to user space. Fields may be different, e.g., map_fd vs. map_id, so
we cannot reuse the exact structure. But make them similar, e.g.,
struct bpf_link_info {
/* common fields */
union {
struct { ... } raw_tracepoint;
struct { ... } tracing;
...
struct {
/* common fields for iter */
union {
struct {
__u32 map_id;
} map;
/* other structs for other targets */
};
};
};
};
so the structure is extensible the same way as bpf_iter_link_info.
Fixes: 6b0a249a301e ("bpf: Implement link_query for bpf iterators")
Change-Id: I5bb1481995ec2fa33ce79676b0cfb7594850bbd8
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200828051922.758950-1-yhs@fb.com
|
||
|
|
b1b48c9c9e |
UPSTREAM: bpf: Add d_path helper
Adding d_path helper function that returns full path for given 'struct path' object, which needs to be the kernel BTF 'path' object. The path is returned in buffer provided 'buf' of size 'sz' and is zero terminated. bpf_d_path(&file->f_path, buf, size); The helper calls directly d_path function, so there's only limited set of function it can be called from. Adding just very modest set for the start. Updating also bpf.h tools uapi header and adding 'path' to bpf_helpers_doc.py script. Change-Id: If390ec6189a537b730b9ae595d374cbfa83f6f5b Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: KP Singh <kpsingh@google.com> Link: https://lore.kernel.org/bpf/20200825192124.710397-11-jolsa@kernel.org |
||
|
|
12833c5a0e |
UPSTREAM: bpf: Allow local storage to be used from LSM programs
Adds support for both bpf_{sk, inode}_storage_{get, delete} to be used
in LSM programs. These helpers are not used for tracing programs
(currently) as their usage is tied to the life-cycle of the object and
should only be used where the owning object won't be freed (when the
owning object is passed as an argument to the LSM hook). Thus, they
are safer to use in LSM hooks than tracing. Usage of local storage in
tracing programs will probably follow a per function based whitelist
approach.
Since the UAPI helper signature for bpf_sk_storage expect a bpf_sock,
it, leads to a compilation warning for LSM programs, it's also updated
to accept a void * pointer instead.
Change-Id: Ib0afe90a58aa586852d48d13c77bc91cbada9bd1
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-7-kpsingh@chromium.org
|
||
|
|
16dad6b655 |
BACKPORT: bpf: Implement bpf_local_storage for inodes
Similar to bpf_local_storage for sockets, add local storage for inodes. The life-cycle of storage is managed with the life-cycle of the inode. i.e. the storage is destroyed along with the owning inode. The BPF LSM allocates an __rcu pointer to the bpf_local_storage in the security blob which are now stackable and can co-exist with other LSMs. Change-Id: I078273c02202a6ee3837865b6d5b307989eb9bed Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200825182919.1118197-6-kpsingh@chromium.org |
||
|
|
4427d399f9 |
UPSTREAM: bpf: Generalize bpf_sk_storage
Refactor the functionality in bpf_sk_storage.c so that concept of storage linked to kernel objects can be extended to other objects like inode, task_struct etc. Each new local storage will still be a separate map and provide its own set of helpers. This allows for future object specific extensions and still share a lot of the underlying implementation. This includes the changes suggested by Martin in: https://lore.kernel.org/bpf/20200725013047.4006241-1-kafai@fb.com/ adding new map operations to support bpf_local_storage maps: * storages for different kernel objects to optionally have different memory charging strategy (map_local_storage_charge, map_local_storage_uncharge) * Functionality to extract the storage pointer from a pointer to the owning object (map_owner_storage_ptr) Co-developed-by: Martin KaFai Lau <kafai@fb.com> Change-Id: I02da579d09e497018432efed3a876e384d0cd561 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200825182919.1118197-4-kpsingh@chromium.org |
||
|
|
aca7b52d88 |
UPSTREAM: tcp: bpf: Optionally store mac header in TCP_SAVE_SYN
This patch is adapted from Eric's patch in an earlier discussion [1]. The TCP_SAVE_SYN currently only stores the network header and tcp header. This patch allows it to optionally store the mac header also if the setsockopt's optval is 2. It requires one more bit for the "save_syn" bit field in tcp_sock. This patch achieves this by moving the syn_smc bit next to the is_mptcp. The syn_smc is currently used with the TCP experimental option. Since syn_smc is only used when CONFIG_SMC is enabled, this patch also puts the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did with "IS_ENABLED(CONFIG_MPTCP)". The mac_hdrlen is also stored in the "struct saved_syn" to allow a quick offset from the bpf prog if it chooses to start getting from the network header or the tcp header. [1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/ Suggested-by: Eric Dumazet <edumazet@google.com> Change-Id: Id12b23200ccba9017f1792fda9c8b6aec2998d38 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com |
||
|
|
b9c5fff21b |
UPSTREAM: bpf: tcp: Allow bpf prog to write and parse TCP header option
[ Note: The TCP changes here is mainly to implement the bpf
pieces into the bpf_skops_*() functions introduced
in the earlier patches. ]
The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF. It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.
The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance. Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.
For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].
This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options. It currently supports most of
the TCP packet except RST.
Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt(). The helper will ensure there is no duplicated
option in the header.
By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog. Different bpf-prog can write its
own option kind. It could also allow the bpf-prog to support a
recently standardized option on an older kernel.
Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header option
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
A few words on the PARSE CB flags. When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.
The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option. There are
details comment on these new cb flags in the UAPI bpf.h.
sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options. They are read only.
The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.
Please refer to the comment in UAPI bpf.h. It has details
on what skb_data contains under different sock_ops->op.
3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.
* Passive side
When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog. The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb. The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later. Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).
The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN). The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
is not very useful here since the listen sk will be shared
by many concurrent connection requests.
Extending bpf_sk_storage support to request_sock will add weight
to the minisock and it is not necessary better than storing the
whole ~100 bytes SYN pkt. ]
When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback. At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.
The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario. More on this later.
There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.
The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
- (a) the just received syn (available when the bpf prog is writing SYNACK)
and it is the only way to get SYN during syncookie mode.
or
- (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
existing CB).
The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide this details.
Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.
* Fastopen
Fastopen should work the same as the regular non fastopen case.
This is a test in a later patch.
* Syncookie
For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK. The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.
* Active side
The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB. The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().
* Turn off header CB flags after 3WHS
If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.
[1]: draft-wang-tcpm-low-latency-opt-00
https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00
Change-Id: If5c8cad06573cc8f9712fb0b0761b10df24d180d
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
|
||
|
|
d6dada003e |
BACKPORT: bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt()
The bpf prog needs to parse the SYN header to learn what options have been sent by the peer's bpf-prog before writing its options into SYNACK. This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack(). This syn_skb will eventually be made available (as read-only) to the bpf prog. This will be the only SYN packet available to the bpf prog during syncookie. For other regular cases, the bpf prog can also use the saved_syn. When writing options, the bpf prog will first be called to tell the kernel its required number of bytes. It is done by the new bpf_skops_hdr_opt_len(). The bpf prog will only be called when the new BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags. When the bpf prog returns, the kernel will know how many bytes are needed and then update the "*remaining" arg accordingly. 4 byte alignment will be included in the "*remaining" before this function returns. The 4 byte aligned number of bytes will also be stored into the opts->bpf_opt_len. "bpf_opt_len" is a newly added member to the struct tcp_out_options. Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the header options. The bpf prog is only called if it has reserved spaces before (opts->bpf_opt_len > 0). The bpf prog is the last one getting a chance to reserve header space and writing the header option. These two functions are half implemented to highlight the changes in TCP stack. The actual codes preparing the bpf running context and invoking the bpf prog will be added in the later patch with other necessary bpf pieces. Change-Id: If81bbf6ba9cd4a9ee363a447e780e9b0833ee92a Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com |
||
|
|
159db3a757 |
UPSTREAM: bpf: tcp: Add bpf_skops_parse_hdr()
The patch adds a function bpf_skops_parse_hdr(). It will call the bpf prog to parse the TCP header received at a tcp_sock that has at least reached the ESTABLISHED state. For the packets received during the 3WHS (SYN, SYNACK and ACK), the received skb will be available to the bpf prog during the callback in bpf_skops_established() introduced in the previous patch and in the bpf_skops_write_hdr_opt() that will be added in the next patch. Calling bpf prog to parse header is controlled by two new flags in tp->bpf_sock_ops_cb_flags: BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG and BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG. When BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set, the bpf prog will only be called when there is unknown option in the TCP header. When BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG is set, the bpf prog will be called on all received TCP header. This function is half implemented to highlight the changes in TCP stack. The actual codes preparing the bpf running context and invoking the bpf prog will be added in the later patch with other necessary bpf pieces. Change-Id: Ia70614c4388ce1eab849e426140f7c66bf68e784 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20200820190046.2885054-1-kafai@fb.com |
||
|
|
2b8051bd27 |
UPSTREAM: tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt
This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog to set the min rto of a connection. It could be used together with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX). A later selftest patch will communicate the max delay ack in a bpf tcp header option and then the receiving side can use bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto. Change-Id: I32e313838af527d12b7e5f16036481bcaefa1e7b Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com |
||
|
|
055cbced58 |
UPSTREAM: tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt
This change is mostly from an internal patch and adapts it from sysctl config to the bpf_setsockopt setup. The bpf_prog can set the max delay ack by using bpf_setsockopt(TCP_BPF_DELACK_MAX). This max delay ack can be communicated to its peer through bpf header option. The receiving peer can then use this max delay ack and set a potentially lower rto by using bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced in the next patch. Another later selftest patch will also use it like the above to show how to write and parse bpf tcp header option. Change-Id: I18bc0cb9551922db5056cb817b9690c40c5983e2 Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com |