4336 Commits

Author SHA1 Message Date
Tetsuo Handa
884c45b53e kernel: Initialize cpumask before parsing
KMSAN complains that new_value at cpumask_parse_user() from
write_irq_affinity() from irq_affinity_proc_write() is uninitialized.

  [  148.133411][ T5509] =====================================================
  [  148.135383][ T5509] BUG: KMSAN: uninit-value in find_next_bit+0x325/0x340
  [  148.137819][ T5509]
  [  148.138448][ T5509] Local variable ----new_value.i@irq_affinity_proc_write created at:
  [  148.140768][ T5509]  irq_affinity_proc_write+0xc3/0x3d0
  [  148.142298][ T5509]  irq_affinity_proc_write+0xc3/0x3d0
  [  148.143823][ T5509] =====================================================

Since bitmap_parse() from cpumask_parse_user() calls find_next_bit(),
any alloc_cpumask_var() + cpumask_parse_user() sequence has possibility
that find_next_bit() accesses uninitialized cpu mask variable. Fix this
problem by replacing alloc_cpumask_var() with zalloc_cpumask_var().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210401055823.3929-1-penguin-kernel@I-love.SAKURA.ne.jp
2025-12-08 00:52:06 +00:00
LinkBoi00
bba5960695 {kernel, net}: Disable useless kernel modules
Signed-off-by: LinkBoi00 <linkdevel@protonmail.com>
2025-10-18 10:51:00 +00:00
Artem Savkov
e337088063 UPSTREAM: ftrace: Return the first found result in lookup_rec()
It appears that ip ranges can overlap so. In that case lookup_rec()
returns whatever results it got last even if it found nothing in last
searched page.

This breaks an obscure livepatch late module patching usecase:
  - load livepatch
  - load the patched module
  - unload livepatch
  - try to load livepatch again

To fix this return from lookup_rec() as soon as it found the record
containing searched-for ip. This used to be this way prior lookup_rec()
introduction.

Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com

Cc: stable@vger.kernel.org
Fixes: 7e16f581a817 ("ftrace: Separate out functionality from ftrace_location_range()")
Change-Id: Ibfd941aa40df53bce30b7973d58c3665a4a4a8d8
Signed-off-by: Artem Savkov <asavkov@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 15:02:48 +01:00
Steven Rostedt (VMware)
84f8dcd11a UPSTREAM: ftrace: Separate out functionality from ftrace_location_range()
Create a new function called lookup_rec() from the functionality of
ftrace_location_range(). The difference between lookup_rec() is that it
returns the record that it finds, where as ftrace_location_range() returns
only if it found a match or not.

The lookup_rec() is static, and can be used for new functionality where
ftrace needs to find a record of a specific address.

Change-Id: I7e5a80df3f1486b889d5fa533728794f79afa24a
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 15:02:45 +01:00
Michel Lespinasse
b8ba613d40 BACKPORT: mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Change-Id: If729000ea8cedab7079ccc1350db26ed71f0df92
Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-10-12 14:53:46 +01:00
Peter Zijlstra
ec55a92169 UPSTREAM: x86/ibt,ftrace: Search for __fentry__ location
commit aebfd12521d9c7d0b502cf6d06314cfbcdccfe3b upstream.

Currently a lot of ftrace code assumes __fentry__ is at sym+0. However
with Intel IBT enabled the first instruction of a function will most
likely be ENDBR.

Change ftrace_location() to not only return the __fentry__ location
when called for the __fentry__ location, but also when called for the
sym+0 location.

Then audit/update all callsites of this function to consistently use
these new semantics.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Change-Id: I72966b96df528f86121b6f6c866b56bf09a4227f
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lore.kernel.org/r/20220308154318.227581603@infradead.org
Stable-dep-of: e60b613df8b6 ("ftrace: Fix possible use-after-free issue in ftrace_location()")
[Shivani: Modified to apply on v5.10.y]
Signed-off-by: Shivani Agarwal <shivani.agarwal@broadcom.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-12 13:55:27 +01:00
Steven Rostedt (VMware)
d69a112c03 UPSTREAM: ftrace: Add a helper function to modify_ftrace_direct() to allow arch optimization
If a direct ftrace callback is at a location that does not have any other
ftrace helpers attached to it, it is possible to simply just change the
text to call the new caller (if the architecture supports it). But this
requires special architecture code. Currently, modify_ftrace_direct() uses a
trick to add a stub ftrace callback to the location forcing it to call the
ftrace iterator. Then it can change the direct helper to call the new
function in C, and then remove the stub. Removing the stub will have the
location now call the new location that the direct helper is using.

The new helper function does the registering the stub trick, but is a weak
function, allowing an architecture to override it to do something a bit more
direct.

Link: https://lore.kernel.org/r/20191115215125.mbqv7taqnx376yed@ast-mbp.dhcp.thefacebook.com

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Change-Id: I24ee1bebc80aa17ee382063b3cd7d58ea6126508
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 13:55:26 +01:00
Steven Rostedt (VMware)
28f9678c46 UPSTREAM: ftrace: Add ftrace_find_direct_func()
As function_graph tracer modifies the return address to insert a trampoline
to trace the return of a function, it must be aware of a direct caller, as
when it gets called, the function's return address may not be at on the
stack where it expects. It may have to see if that return address points to
the a direct caller and adjust if it is.

Change-Id: I80bd23932c426ec3b2db76538dacd283b972db0a
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 13:55:26 +01:00
Steven Rostedt (VMware)
0ad1df7e64 UPSTREAM: ftrace: Add helper find_direct_entry() to consolidate code
Both unregister_ftrace_direct() and modify_ftrace_direct() needs to
normalize the ip passed in to match the rec->ip, as it is acceptable to have
the ip on the ftrace call site but not the start. There are also common
validity checks with the record found by the ip, these should be done for
both unregister_ftrace_direct() and modify_ftrace_direct().

Change-Id: Ib74ac2ae2f0c9d6c261b409bd87ce9f908e7c8da
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2025-10-12 13:55:26 +01:00
bengris32
c6aa1292ca Merge branch 'linux-4.19.y-cip' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip into android-4.19.y-mediatek
* 'linux-4.19.y-cip' of https://git.kernel.org/pub/scm/linux/kernel/git/cip/linux-cip:
  CIP: Bump version suffix to -cip124 after merge from cip/linux-4.19.y-st tree
  Update localversion-st, tree is up-to-date with 5.4.298.
  f2fs: fix to do sanity check on ino and xnid
  squashfs: fix memory leak in squashfs_fill_super
  pNFS: Handle RPC size limit for layoutcommits
  wifi: iwlwifi: fw: Fix possible memory leak in iwl_fw_dbg_collect
  usb: core: usb_submit_urb: downgrade type check
  udf: Verify partition map count
  f2fs: fix to avoid panic in f2fs_evict_inode
  usb: hub: Fix flushing and scheduling of delayed work that tunes runtime pm
  Revert "drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS"
  net: usb: qmi_wwan: add Telit Cinterion LE910C4-WWX new compositions
  HID: hid-ntrig: fix unable to handle page fault in ntrig_report_version()
  HID: asus: fix UAF via HID_CLAIMED_INPUT validation
  efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
  sctp: initialize more fields in sctp_v6_from_sk()
  net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts
  net/mlx5e: Set local Xoff after FW update
  net: dlink: fix multicast stats being counted incorrectly
  atm: atmtcp: Prevent arbitrary write in atmtcp_recv_control().
  net/atm: remove the atmdev_ops {get, set}sockopt methods
  Bluetooth: hci_event: Detect if HCI_EV_NUM_COMP_PKTS is unbalanced
  powerpc/kvm: Fix ifdef to remove build warning
  net: ipv4: fix regression in local-broadcast routes
  vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put()
  scsi: core: sysfs: Correct sysfs attributes access rights
  ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
  alloc_fdtable(): change calling conventions.
  ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation
  net/sched: Make cake_enqueue return NET_XMIT_CN when past buffer_limit
  ipv6: sr: validate HMAC algorithm ID in seg6_hmac_info_add
  ALSA: usb-audio: Fix size validation in convert_chmap_v3()
  scsi: qla4xxx: Prevent a potential error pointer dereference
  usb: xhci: Fix slot_id resource race conflict
  nfs: fix UAF in direct writes
  NFS: Fix up commit deadlocks
  Bluetooth: fix use-after-free in device_for_each_child()
  selftests: forwarding: tc_actions.sh: add matchall mirror test
  codel: remove sch->q.qlen check before qdisc_tree_reduce_backlog()
  sch_qfq: make qfq_qlen_notify() idempotent
  sch_hfsc: make hfsc_qlen_notify() idempotent
  sch_drr: make drr_qlen_notify() idempotent
  btrfs: populate otime when logging an inode item
  media: venus: hfi: explicitly release IRQ during teardown
  f2fs: fix to avoid out-of-boundary access in dnode page
  media: venus: protect against spurious interrupts during probe
  media: venus: vdec: Clamp param smaller than 1fps and bigger than 240.
  drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS
  media: rainshadow-cec: fix TOCTOU race condition in rain_interrupt()
  media: v4l2-ctrls: Don't reset handler's error in v4l2_ctrl_handler_free()
  ata: Fix SATA_MOBILE_LPM_POLICY description in Kconfig
  usb: musb: omap2430: fix device leak at unbind
  NFS: Fix the setting of capabilities when automounting a new filesystem
  NFS: Fix up handling of outstanding layoutcommit in nfs_update_inode()
  NFSv4: Fix nfs4_bitmap_copy_adjust()
  usb: typec: fusb302: cache PD RX state
  cdc-acm: fix race between initial clearing halt and open
  USB: cdc-acm: do not log successful probe on later errors
  nfsd: handle get_client_locked() failure in nfsd4_setclientid_confirm()
  tracing: Add down_write(trace_event_sem) when adding trace event
  usb: hub: Don't try to recover devices lost during warm reset.
  usb: hub: avoid warm port reset during USB3 disconnect
  x86/mce/amd: Add default names for MCA banks and blocks
  iio: hid-sensor-prox: Fix incorrect OFFSET calculation
  mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n
  mm/zsmalloc.c: convert to use kmem_cache_zalloc in cache_alloc_zspage()
  net: usbnet: Fix the wrong netif_carrier_on() call
  net: usbnet: Avoid potential RCU stall on LINK_CHANGE event
  PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
  ACPI: processor: idle: Check acpi_fetch_acpi_dev() return value
  kbuild: Add KBUILD_CPPFLAGS to as-option invocation
  kbuild: add $(CLANG_FLAGS) to KBUILD_CPPFLAGS
  kbuild: Add CLANG_FLAGS to as-instr
  mips: Include KBUILD_CPPFLAGS in CHECKFLAGS invocation
  kbuild: Update assembler calls to use proper flags and language target
  ARM: 9448/1: Use an absolute path to unified.h in KBUILD_AFLAGS
  usb: dwc3: Ignore late xferNotReady event to prevent halt timeout
  USB: storage: Ignore driver CD mode for Realtek multi-mode Wi-Fi dongles
  usb: storage: realtek_cr: Use correct byte order for bcs->Residue
  USB: storage: Add unusual-devs entry for Novatek NTK96550-based camera
  usb: quirks: Add DELAY_INIT quick for another SanDisk 3.2Gen1 Flash Drive
  iio: proximity: isl29501: fix buffered read on big-endian systems
  ftrace: Also allocate and copy hash for reading of filter files
  fpga: zynq_fpga: Fix the wrong usage of dma_map_sgtable()
  fs/buffer: fix use-after-free when call bh_read() helper
  drm/amd/display: Fix fractional fb divider in set_pixel_clock_v3
  media: venus: Add a check for packet size after reading from shared memory
  media: ov2659: Fix memory leaks in ov2659_probe()
  media: usbtv: Lock resolution while streaming
  media: gspca: Add bounds checking to firmware parser
  jbd2: prevent softlockup in jbd2_log_do_checkpoint()
  PCI: endpoint: Fix configfs group removal on driver teardown
  PCI: endpoint: Fix configfs group list head handling
  mtd: rawnand: fsmc: Add missing check after DMA map
  wifi: brcmsmac: Remove const from tbl_ptr parameter in wlc_lcnphy_common_read_table()
  zynq_fpga: use sgtable-based scatterlist wrappers
  ata: libata-scsi: Fix ata_to_sense_error() status handling
  ext4: fix reserved gdt blocks handling in fsmap
  ext4: fix fsmap end of range reporting with bigalloc
  ext4: check fast symlink for ea_inode correctly
  Revert "vgacon: Add check for vc_origin address range in vgacon_scroll()"
  vt: defkeymap: Map keycodes above 127 to K_HOLE
  usb: gadget: udc: renesas_usb3: fix device leak at unbind
  usb: atm: cxacru: Merge cxacru_upload_firmware() into cxacru_heavy_init()
  m68k: Fix lost column on framebuffer debug console
  serial: 8250: fix panic due to PSLVERR
  media: uvcvideo: Do not mark valid metadata as invalid
  media: uvcvideo: Fix 1-byte out-of-bounds read in uvc_parse_format()
  btrfs: fix log tree replay failure due to file with 0 links and extents
  thunderbolt: Fix copy+paste error in match_service_id()
  misc: rtsx: usb: Ensure mmc child device is active when card is present
  scsi: lpfc: Remove redundant assignment to avoid memory leak
  rtc: ds1307: remove clear of oscillator stop flag (OSF) in probe
  pNFS: Fix uninited ptr deref in block/scsi layout
  pNFS: Fix disk addr range check in block/scsi layout
  pNFS: Fix stripe mapping in block/scsi layout
  ipmi: Fix strcpy source and destination the same
  kconfig: lxdialog: fix 'space' to (de)select options
  kconfig: gconf: fix potential memory leak in renderer_edited()
  kconfig: gconf: avoid hardcoding model2 in on_treeview2_cursor_changed()
  scsi: aacraid: Stop using PCI_IRQ_AFFINITY
  scsi: Fix sas_user_scan() to handle wildcard and multi-channel scans
  kconfig: nconf: Ensure null termination where strncpy is used
  kconfig: lxdialog: replace strcpy() with strncpy() in inputbox.c
  PCI: pnv_php: Work around switches with broken presence detection
  media: uvcvideo: Fix bandwidth issue for Alcor camera
  media: dvb-frontends: w7090p: fix null-ptr-deref in w7090p_tuner_write_serpar and w7090p_tuner_read_serpar
  media: dvb-frontends: dib7090p: fix null-ptr-deref in dib7090p_rw_on_apb()
  media: usb: hdpvr: disable zero-length read messages
  media: tc358743: Increase FIFO trigger level to 374
  media: tc358743: Return an appropriate colorspace from tc358743_set_fmt
  media: tc358743: Check I2C succeeded during probe
  pinctrl: stm32: Manage irq affinity settings
  scsi: mpt3sas: Correctly handle ATA device errors
  RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask()
  MIPS: Don't crash in stack_top() for tasks without ABI or vDSO
  jfs: upper bound check of tree index in dbAllocAG
  jfs: Regular file corruption check
  jfs: truncate good inode pages when hard link is 0
  scsi: bfa: Double-free fix
  MIPS: vpe-mt: add missing prototypes for vpe_{alloc,start,stop,free}
  watchdog: dw_wdt: Fix default timeout
  fs/orangefs: use snprintf() instead of sprintf()
  scsi: libiscsi: Initialize iscsi_conn->dd_data only if memory is allocated
  ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr
  vhost: fail early when __vhost_add_used() fails
  uapi: in6: restore visibility of most IPv6 socket options
  net: ncsi: Fix buffer overflow in fetching version id
  net: dsa: b53: fix b53_imp_vlan_setup for BCM5325
  net: vlan: Replace BUG() with WARN_ON_ONCE() in vlan_dev_* stubs
  wifi: iwlegacy: Check rate_idx range after addition
  netmem: fix skb_frag_address_safe with unreadable skbs
  wifi: rtlwifi: fix possible skb memory leak in `_rtl_pci_rx_interrupt()`.
  wifi: iwlwifi: dvm: fix potential overflow in rs_fill_link_cmd()
  net: fec: allow disable coalescing
  (powerpc/512) Fix possible `dma_unmap_single()` on uninitialized pointer
  s390/stp: Remove udelay from stp_sync_clock()
  wifi: iwlwifi: mvm: fix scan request validation
  net: thunderx: Fix format-truncation warning in bgx_acpi_match_id()
  net: ipv4: fix incorrect MTU in broadcast routes
  wifi: cfg80211: Fix interface type validation
  et131x: Add missing check after DMA map
  be2net: Use correct byte order and format string for TCP seq and ack_seq
  s390/time: Use monotonic clock in get_cycles()
  wifi: cfg80211: reject HTC bit for management frames
  ktest.pl: Prevent recursion of default variable options
  ASoC: codecs: rt5640: Retry DEVICE_ID verification
  ALSA: usb-audio: Avoid precedence issues in mixer_quirks macros
  ALSA: hda/ca0132: Fix buffer overflow in add_tuning_control
  platform/x86: thinkpad_acpi: Handle KCOV __init vs inline mismatches
  pm: cpupower: Fix the snapshot-order of tsc,mperf, clock in mperf_stop()
  ALSA: intel8x0: Fix incorrect codec index usage in mixer for ICH4
  ASoC: hdac_hdmi: Rate limit logging on connection and disconnection
  mmc: rtsx_usb_sdmmc: Fix error-path in sd_set_power_mode()
  ACPI: processor: fix acpi_object initialization
  PM: sleep: console: Fix the black screen issue
  thermal: sysfs: Return ENODATA instead of EAGAIN for reads
  selftests: tracing: Use mutex_unlock for testing glob filter
  ARM: tegra: Use I/O memcpy to write to IRAM
  gpio: tps65912: check the return value of regmap_update_bits()
  ASoC: soc-dapm: set bias_level if snd_soc_dapm_set_bias_level() was successed
  cpufreq: Exit governor when failed to start old governor
  usb: xhci: Avoid showing errors during surprise removal
  usb: xhci: Set avg_trb_len = 8 for EP0 during Address Device Command
  usb: xhci: Avoid showing warnings for dying controller
  selftests/futex: Define SYS_futex on 32-bit architectures with 64-bit time_t
  usb: xhci: print xhci->xhc_state when queue_command failed
  securityfs: don't pin dentries twice, once is enough...
  hfs: fix not erasing deleted b-tree node issue
  drbd: add missing kref_get in handle_write_conflicts
  arm64: Handle KCOV __init vs inline mismatches
  hfsplus: don't use BUG_ON() in hfsplus_create_attributes_file()
  hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc()
  hfsplus: fix slab-out-of-bounds in hfsplus_bnode_read()
  hfs: fix slab-out-of-bounds in hfs_bnode_read()
  sctp: linearize cloned gso packets in sctp_rcv
  netfilter: ctnetlink: fix refcount leak on table dump
  udp: also consider secpath when evaluating ipsec use for checksumming
  fs: Prevent file descriptor table allocations exceeding INT_MAX
  sunvdc: Balance device refcount in vdc_port_mpgroup_check
  NFSD: detect mismatch of file handle and delegation stateid in OPEN op
  net: dpaa: fix device leak when querying time stamp info
  net: gianfar: fix device leak when querying time stamp info
  netlink: avoid infinite retry looping in netlink_unicast()
  ALSA: usb-audio: Validate UAC3 cluster segment descriptors
  ALSA: usb-audio: Validate UAC3 power domain descriptors, too
  usb: gadget : fix use-after-free in composite_dev_cleanup()
  MIPS: mm: tlb-r4k: Uniquify TLB entries on init
  USB: serial: option: add Foxconn T99W709
  vsock: Do not allow binding to VMADDR_PORT_ANY
  net/packet: fix a race in packet_set_ring() and packet_notifier()
  perf/core: Prevent VMA split of buffer mappings
  perf/core: Exit early on perf_mmap() fail
  perf/core: Don't leak AUX buffer refcount on allocation failure
  pptp: fix pptp_xmit() error path
  smb: client: let recv_done() cleanup before notifying the callers.
  benet: fix BUG when creating VFs
  ipv6: reject malicious packets in ipv6_gso_segment()
  pptp: ensure minimal skb length in pptp_xmit()
  netpoll: prevent hanging NAPI when netcons gets enabled
  NFS: Fix filehandle bounds checking in nfs_fh_to_dentry()
  pci/hotplug/pnv-php: Wrap warnings in macro
  pci/hotplug/pnv-php: Improve error msg on power state change failure
  usb: chipidea: udc: fix sleeping function called from invalid context
  f2fs: fix to avoid out-of-boundary access in devs.path
  f2fs: fix to avoid UAF in f2fs_sync_inode_meta()
  rtc: pcf8563: fix incorrect maximum clock rate handling
  rtc: hym8563: fix incorrect maximum clock rate handling
  rtc: ds1307: fix incorrect maximum clock rate handling
  mtd: rawnand: atmel: set pmecc data setup time
  mtd: rawnand: atmel: Fix dma_mapping_error() address
  jfs: fix metapage reference count leak in dbAllocCtl
  fbdev: imxfb: Check fb_add_videomode to prevent null-ptr-deref
  crypto: qat - fix seq_file position update in adf_ring_next()
  dmaengine: nbpfaxi: Add missing check after DMA map
  dmaengine: mv_xor: Fix missing check after DMA map and missing unmap
  fs/orangefs: Allow 2 more characters in do_c_string()
  crypto: img-hash - Fix dma_unmap_sg() nents value
  scsi: isci: Fix dma_unmap_sg() nents value
  scsi: mvsas: Fix dma_unmap_sg() nents value
  scsi: ibmvscsi_tgt: Fix dma_unmap_sg() nents value
  perf tests bp_account: Fix leaked file descriptor
  crypto: ccp - Fix crash when rebind ccp device for ccp.ko
  pinctrl: sunxi: Fix memory leak on krealloc failure
  power: supply: max14577: Handle NULL pdata when CONFIG_OF is not set
  clk: davinci: Add NULL check in davinci_lpsc_clk_register()
  mtd: fix possible integer overflow in erase_xfer()
  crypto: marvell/cesa - Fix engine load inaccuracy
  PCI: rockchip-host: Fix "Unexpected Completion" log message
  vrf: Drop existing dst reference in vrf_ip6_input_dst
  netfilter: xt_nfacct: don't assume acct name is null-terminated
  can: kvaser_usb: Assign netdev.dev_port based on device channel index
  wifi: brcmfmac: fix P2P discovery failure in P2P peer due to missing P2P IE
  Reapply "wifi: mac80211: Update skb's control block key in ieee80211_tx_dequeue()"
  mwl8k: Add missing check after DMA map
  wifi: rtl8xxxu: Fix RX skb size for aggregation disabled
  net/sched: Restrict conditions for adding duplicating netems to qdisc tree
  arch: powerpc: defconfig: Drop obsolete CONFIG_NET_CLS_TCINDEX
  netfilter: nf_tables: adjust lockdep assertions handling
  drm/amd/pm/powerplay/hwmgr/smu_helper: fix order of mask and value
  m68k: Don't unregister boot console needlessly
  tcp: fix tcp_ofo_queue() to avoid including too much DUP SACK range
  iwlwifi: Add missing check for alloc_ordered_workqueue
  wifi: iwlwifi: Fix memory leak in iwl_mvm_init()
  wifi: rtl818x: Kill URBs before clearing tx status queue
  caif: reduce stack size, again
  staging: nvec: Fix incorrect null termination of battery manufacturer
  samples: mei: Fix building on musl libc
  usb: early: xhci-dbc: Fix early_ioremap leak
  Revert "vmci: Prevent the dispatching of uninitialized payloads"
  pps: fix poll support
  vmci: Prevent the dispatching of uninitialized payloads
  staging: fbtft: fix potential memory leak in fbtft_framebuffer_alloc()
  ARM: dts: vfxxx: Correctly use two tuples for timer address
  ASoC: ops: dynamically allocate struct snd_ctl_elem_value
  hfsplus: remove mutex_lock check in hfsplus_free_extents
  ASoC: Intel: fix SND_SOC_SOF dependencies
  ethernet: intel: fix building with large NR_CPUS
  usb: phy: mxs: disconnect line when USB charger is attached
  usb: chipidea: udc: protect usb interrupt enable
  usb: chipidea: udc: add new API ci_hdrc_gadget_connect
  comedi: comedi_test: Fix possible deletion of uninitialized timers
  nilfs2: reject invalid file types when reading inodes
  i2c: qup: jump out of the loop in case of timeout
  net/sched: sch_qfq: Avoid triggering might_sleep in atomic context in qfq_delete_class
  net: appletalk: Fix use-after-free in AARP proxy probe
  net: appletalk: fix kerneldoc warnings
  RDMA/core: Rate limit GID cache warning messages
  usb: hub: fix detection of high tier USB3 devices behind suspended hubs
  net_sched: sch_sfq: reject invalid perturb period
  net_sched: sch_sfq: move the limit validation
  net_sched: sch_sfq: use a temporary work area for validating configuration
  net_sched: sch_sfq: don't allow 1 packet limit
  net_sched: sch_sfq: handle bigger packets
  net_sched: sch_sfq: annotate data-races around q->perturb_period
  power: supply: bq24190_charger: Fix runtime PM imbalance on error
  xhci: Disable stream for xHC controller with XHCI_BROKEN_STREAMS
  virtio-net: ensure the received length does not exceed allocated size
  usb: dwc3: qcom: Don't leave BCR asserted
  usb: musb: fix gadget state on disconnect
  net/sched: Return NULL when htb_lookup_leaf encounters an empty rbtree
  net: vlan: fix VLAN 0 refcount imbalance of toggling filtering during runtime
  Bluetooth: L2CAP: Fix attempting to adjust outgoing MTU
  Bluetooth: SMP: Fix using HCI_ERROR_REMOTE_USER_TERM on timeout
  Bluetooth: SMP: If an unallowed command is received consider it a failure
  Bluetooth: Fix null-ptr-deref in l2cap_sock_resume_cb()
  usb: net: sierra: check for no status endpoint
  net/sched: sch_qfq: Fix race condition on qfq_aggregate
  net: emaclite: Fix missing pointer increment in aligned_read()
  comedi: Fix use of uninitialized data in insn_rw_emulate_bits()
  comedi: Fix some signed shift left operations
  comedi: das6402: Fix bit shift out of bounds
  comedi: das16m1: Fix bit shift out of bounds
  comedi: aio_iiro_16: Fix bit shift out of bounds
  comedi: pcl812: Fix bit shift out of bounds
  iio: adc: max1363: Reorder mode_list[] entries
  iio: adc: max1363: Fix MAX1363_4X_CHANS/MAX1363_8X_CHANS[]
  soc: aspeed: lpc-snoop: Don't disable channels that aren't enabled
  soc: aspeed: lpc-snoop: Cleanup resources in stack-order
  mmc: sdhci-pci: Quirk for broken command queuing on Intel GLK-based Positivo models
  memstick: core: Zero initialize id_reg in h_memstick_read_dev_id()
  isofs: Verify inode mode when loading from disk
  dmaengine: nbpfaxi: Fix memory corruption in probe()
  af_packet: fix soft lockup issue caused by tpacket_snd()
  af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd()
  phonet/pep: Move call to pn_skb_get_dst_sockaddr() earlier in pep_sock_accept()
  HID: core: do not bypass hid_hw_raw_request
  HID: core: ensure __hid_request reserves the report ID as the first byte
  HID: core: ensure the allocated report buffer can contain the reserved report ID
  pch_uart: Fix dma_sync_sg_for_device() nents value
  Input: xpad - set correct controller type for Acer NGR200
  i2c: stm32: fix the device used for the DMA map
  usb: gadget: configfs: Fix OOB read on empty string write
  USB: serial: ftdi_sio: add support for NDI EMGUIDE GEMINI
  USB: serial: option: add Foxconn T99W640
  USB: serial: option: add Telit Cinterion FE910C04 (ECM) composition
  dma-mapping: add generic helpers for mapping sgtable objects
  usb: renesas_usbhs: Flush the notify_hotplug_work
  gpio: rcar: Use raw_spinlock to protect register access

Change-Id: Ia6b8b00918487999c648f298d3550afc7eaaae03
Signed-off-by: bengris32 <bengris32@protonmail.ch>
2025-10-12 13:39:56 +01:00
Tengda Wu
cdb89bab3e ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
[ Upstream commit 4013aef2ced9b756a410f50d12df9ebe6a883e4a ]

When calling ftrace_dump_one() concurrently with reading trace_pipe,
a WARN_ON_ONCE() in trace_printk_seq() can be triggered due to a race
condition.

The issue occurs because:

CPU0 (ftrace_dump)                              CPU1 (reader)
echo z > /proc/sysrq-trigger

!trace_empty(&iter)
trace_iterator_reset(&iter) <- len = size = 0
                                                cat /sys/kernel/tracing/trace_pipe
trace_find_next_entry_inc(&iter)
  __find_next_entry
    ring_buffer_empty_cpu <- all empty
  return NULL

trace_printk_seq(&iter.seq)
  WARN_ON_ONCE(s->seq.len >= s->seq.size)

In the context between trace_empty() and trace_find_next_entry_inc()
during ftrace_dump, the ring buffer data was consumed by other readers.
This caused trace_find_next_entry_inc to return NULL, failing to populate
`iter.seq`. At this point, due to the prior trace_iterator_reset, both
`iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal,
the WARN_ON_ONCE condition is triggered.

Move the trace_printk_seq() into the if block that checks to make sure the
return value of trace_find_next_entry_inc() is non-NULL in
ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before
subsequent operations.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Link: https://lore.kernel.org/20250822033343.3000289-1-wutengda@huaweicloud.com
Fixes: d769041f86 ("ring_buffer: implement new locking")
Signed-off-by: Tengda Wu <wutengda@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Ulrich Hecht <uli@kernel.org>
2025-09-22 10:17:52 +02:00
Daniel Borkmann
0fd96bafc4 UPSTREAM: bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
[ Upstream commit ff40e51043af63715ab413995ff46996ecf9583f ]

Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:

  1) The audit events that are triggered due to calls to security_locked_down()
     can OOM kill a machine, see below details [0].

  2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
     when trying to wake up kauditd, for example, when using trace_sched_switch()
     tracepoint, see details in [1]. Triggering this was not via some hypothetical
     corner case, but with existing tools like runqlat & runqslower from bcc, for
     example, which make use of this tracepoint. Rough call sequence goes like:

     rq_lock(rq) -> -------------------------+
       trace_sched_switch() ->               |
         bpf_prog_xyz() ->                   +-> deadlock
           selinux_lockdown() ->             |
             audit_log_end() ->              |
               wake_up_interruptible() ->    |
                 try_to_wake_up() ->         |
                   rq_lock(rq) --------------+

What's worse is that the intention of 59438b46471a to further restrict lockdown
settings for specific applications in respect to the global lockdown policy is
completely broken for BPF. The SELinux policy rule for the current lockdown check
looks something like this:

  allow <who> <who> : lockdown { <reason> };

However, this doesn't match with the 'current' task where the security_locked_down()
is executed, example: httpd does a syscall. There is a tracing program attached
to the syscall which triggers a BPF program to run, which ends up doing a
bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
the permission check against 'current', that is, httpd in this example. httpd
has literally zero relation to this tracing program, and it would be nonsensical
having to write an SELinux policy rule against httpd to let the tracing helper
pass. The policy in this case needs to be against the entity that is installing
the BPF program. For example, if bpftrace would generate a histogram of syscall
counts by user space application:

  bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

bpftrace would then go and generate a BPF program from this internally. One way
of doing it [for the sake of the example] could be to call bpf_get_current_task()
helper and then access current->comm via one of bpf_probe_read_kernel{,_str}()
helpers. So the program itself has nothing to do with httpd or any other random
app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
check. The allow/deny policy belongs in the context of bpftrace: meaning, you
want to grant bpftrace access to use these helpers, but other tracers on the
system like my_random_tracer _not_.

Therefore fix all three issues at the same time by taking a completely different
approach for the security_locked_down() hook, that is, move the check into the
program verification phase where we actually retrieve the BPF func proto. This
also reliably gets the task (current) that is trying to install the BPF tracing
program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
we're moving this out of the BPF helper's fast-path which can be called several
millions of times per second.

The check is then also in line with other security_locked_down() hooks in the
system where the enforcement is performed at open/load time, for example,
open_kcore() for /proc/kcore access or module_sig_check() for module signatures
just to pick few random ones. What's out of scope in the fix as well as in
other security_locked_down() hook locations /outside/ of BPF subsystem is that
if the lockdown policy changes on the fly there is no retrospective action.
This requires a different discussion, potentially complex infrastructure, and
it's also not clear whether this can be solved generically. Either way, it is
out of scope for a suitable stable fix which this one is targeting. Note that
the breakage is specifically on 59438b46471a where it started to rely on 'current'
as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf42 ("bpf:
Restrict bpf when kernel lockdown is in confidentiality mode").

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:

  I starting seeing this with F-34. When I run a container that is traced with
  BPF to record the syscalls it is doing, auditd is flooded with messages like:

  type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
    for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
      scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
        tclass=lockdown permissive=0

  This seems to be leading to auditd running out of space in the backlog buffer
  and eventually OOMs the machine.

  [...]
  auditd running at 99% CPU presumably processing all the messages, eventually I get:
  Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
  Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
  Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
  Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
  Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
  Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
  [...]

[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
    Serhei Makarov says:

  Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
  bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
  is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
  testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
  ppc64le. Example stack trace:

  [...]
  [  730.868702] stack backtrace:
  [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
  [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  [  730.873278] Call Trace:
  [  730.873770]  dump_stack+0x7f/0xa1
  [  730.874433]  check_noncircular+0xdf/0x100
  [  730.875232]  __lock_acquire+0x1202/0x1e10
  [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
  [  730.876844]  lock_acquire+0xc2/0x3a0
  [  730.877551]  ? __wake_up_common_lock+0x52/0x90
  [  730.878434]  ? lock_acquire+0xc2/0x3a0
  [  730.879186]  ? lock_is_held_type+0xa7/0x120
  [  730.880044]  ? skb_queue_tail+0x1b/0x50
  [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
  [  730.881656]  ? __wake_up_common_lock+0x52/0x90
  [  730.882532]  __wake_up_common_lock+0x52/0x90
  [  730.883375]  audit_log_end+0x5b/0x100
  [  730.884104]  slow_avc_audit+0x69/0x90
  [  730.884836]  avc_has_perm+0x8b/0xb0
  [  730.885532]  selinux_lockdown+0xa5/0xd0
  [  730.886297]  security_locked_down+0x20/0x40
  [  730.887133]  bpf_probe_read_compat+0x66/0xd0
  [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
  [  730.888917]  trace_call_bpf+0xe9/0x240
  [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
  [  730.890579]  perf_trace_sched_switch+0x142/0x180
  [  730.891485]  ? __schedule+0x6d8/0xb20
  [  730.892209]  __schedule+0x6d8/0xb20
  [  730.892899]  schedule+0x5b/0xc0
  [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
  [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
  [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [...]

Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Reported-by: Jakub Hrozek <jhrozek@redhat.com>
Reported-by: Serhei Makarov <smakarov@redhat.com>
Reported-by: Jiri Olsa <jolsa@redhat.com>
Change-Id: Ie9ec178238bbe62a9e284d52eecff1ee569b307d
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Frank Eigler <fche@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-20 03:24:09 +01:00
David Howells
d7d4a195fe BACKPORT: bpf: Restrict bpf when kernel lockdown is in confidentiality mode
bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
private keys in kernel memory to be leaked. Disable them if the kernel
has been locked down in confidentiality mode.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Change-Id: Ie8ed3a29a9fb45408405d77af223e6d358ff0890
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
cc: netdev@vger.kernel.org
cc: Chun-Yi Lee <jlee@suse.com>
cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: James Morris <jmorris@namei.org>
2025-09-20 03:24:09 +01:00
Christoph Hellwig
d153c10080 BACKPORT: bpf: rework the compat kernel probe handling
Instead of using the dangerous probe_kernel_read and strncpy_from_unsafe
helpers, rework the compat probes to check if an address is a kernel or
userspace one, and then use the low-level kernel or user probe helper
shared by the proper kernel and user probe helpers.  This slightly
changes behavior as the compat probe on a user address doesn't check
the lockdown flags, just as the pure user probes do.

Change-Id: I511935db99f11ec3a98e5630bcca2604c302a0d5
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-14-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-09-20 03:22:49 +01:00
Daniel Borkmann
110320305a UPSTREAM: bpf: Restrict bpf_probe_read{, str}() only to archs where they work
Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs
with overlapping address ranges, we should really take the next step to
disable them from BPF use there.

To generally fix the situation, we've recently added new helper variants
bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str().
For details on them, see 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel}
and probe_read_{user,kernel}_str helpers").

Given bpf_probe_read{,str}() have been around for ~5 years by now, there
are plenty of users at least on x86 still relying on them today, so we
cannot remove them entirely w/o breaking the BPF tracing ecosystem.

However, their use should be restricted to archs with non-overlapping
address ranges where they are working in their current form. Therefore,
move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and
have x86, arm64, arm select it (other archs supporting it can follow-up
on it as well).

For the remaining archs, they can workaround easily by relying on the
feature probe from bpftool which spills out defines that can be used out
of BPF C code to implement the drop-in replacement for old/new kernels
via: bpftool feature probe macro

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: I2190b9fabdca2cf7ddd16ad4fe36e95bc832963e
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
2025-09-20 03:22:49 +01:00
Jiri Olsa
f086dd9f28 UPSTREAM: selftests/bpf: Fix stat probe in d_path test
Some kernels builds might inline vfs_getattr call within fstat
syscall code path, so fentry/vfs_getattr trampoline is not called.

Add security_inode_getattr to allowlist and switch the d_path test stat
trampoline to security_inode_getattr.

Keeping dentry_open and filp_close, because they are in their own
files, so unlikely to be inlined, but in case they are, adding
security_file_open.

Adding flags that indicate trampolines were called and failing
the test if any of them got missed, so it's easier to identify
the issue next time.

Fixes: e4d1af4b16f8 ("selftests/bpf: Add test for d_path helper")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Change-Id: Ia09e525895e5a1c31a582d8edc795b502e3b37ce
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200918112338.2618444-1-jolsa@kernel.org
2025-09-20 03:22:49 +01:00
Peter Zijlstra
23fd84c088 UPSTREAM: module: Fix up module_notifier return values
While auditing all module notifiers I noticed a whole bunch of fail
wrt the return value. Notifiers have a 'special' return semantics.

As is; NOTIFY_DONE vs NOTIFY_OK is a bit vague; but
notifier_from_errno(0) results in NOTIFY_OK and NOTIFY_DONE has a
comment that says "Don't care".

From this I've used NOTIFY_DONE when the function completely ignores
the callback and notifier_to_error() isn't used.

Change-Id: Ibdbda4246e56b75e8641f831fb6fb6ac0b319dcc
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Robert Richter <rric@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20200818135804.385360407@infradead.org
2025-09-20 03:22:49 +01:00
Christoph Hellwig
e3a2d58a95 BACKPORT: maccess: rename strncpy_from_unsafe_user to strncpy_from_user_nofault
This matches the naming of strncpy_from_user, and also makes it more
clear what the function is supposed to do.

Change-Id: If76207724081e94b4450d7311c3385e494e29e6e
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200521152301.2587579-7-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-09-20 03:22:48 +01:00
Alexey Budankov
4454d3c7ab UPSTREAM: trace/bpf_trace: Open access for CAP_PERFMON privileged process
Open access to bpf_trace monitoring for CAP_PERFMON privileged process.
Providing the access under CAP_PERFMON capability singly, without the
rest of CAP_SYS_ADMIN credentials, excludes chances to misuse the
credentials and makes operation more secure.

CAP_PERFMON implements the principle of least privilege for performance
monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
principle of least privilege: A security design principle that states
that a process or program be granted only those privileges (e.g.,
capabilities) necessary to accomplish its legitimate function, and only
for the time that such privileges are actually required)

For backward compatibility reasons access to bpf_trace monitoring
remains open for CAP_SYS_ADMIN privileged processes but CAP_SYS_ADMIN
usage for secure bpf_trace monitoring is discouraged with respect to
CAP_PERFMON capability.

Change-Id: I93d344c8988a2fd9b4cb70a8419c27ff6731b661
Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Igor Lubashev <ilubashe@akamai.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: intel-gfx@lists.freedesktop.org
Cc: linux-doc@vger.kernel.org
Cc: linux-man@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Link: http://lore.kernel.org/lkml/c0a0ae47-8b6e-ff3e-416b-3cd1faaf71c0@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-09-20 03:22:48 +01:00
Thomas Gleixner
8293073d8a UPSTREAM: bpf/trace: Remove EXPORT from trace_call_bpf()
All callers are built in. No point to export this.

Change-Id: I54e120d6c978002d3e5181b4a2f410451c1009c5
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-20 03:22:48 +01:00
Thomas Gleixner
765cc19b80 UPSTREAM: bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run()
__bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.

This is redundant because __bpf_trace_run() is invoked from a trace point
via __DO_TRACE() which already disables preemption _before_ invoking any of
the functions which are attached to a trace point.

Remove it and add a cant_sleep() check.

Change-Id: Ic6a63f2bc723602d7e02bde3ec10424a63a07fcc
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145642.847220186@linutronix.de
2025-09-20 03:22:48 +01:00
Andrii Nakryiko
b567486e24 UPSTREAM: bpf: Fix enum names for bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers
Remove bpf_ prefix, which causes these helpers to be reported in verifier
dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Lets
fix it as long as it is still possible before UAPI freezes on these helpers.

Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
Change-Id: I67e5dde44023de422ec093935ce20647466db4b8
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-09-20 03:22:42 +01:00
Hao Luo
57ae737d3c UPSTREAM: bpf: Introducte bpf_this_cpu_ptr()
Add bpf_this_cpu_ptr() to help access percpu var on this cpu. This
helper always returns a valid pointer, therefore no need to check
returned value for NULL. Also note that all programs run with
preemption disabled, which means that the returned pointer is stable
during all the execution of the program.

Change-Id: Idd23453d28e221430cc067c6b6f812eab0ac738d
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-6-haoluo@google.com
2025-09-20 03:22:32 +01:00
Hao Luo
dca7b0be14 UPSTREAM: bpf: Introduce bpf_per_cpu_ptr()
Add bpf_per_cpu_ptr() to help bpf programs access percpu vars.
bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the kernel
except that it may return NULL. This happens when the cpu parameter is
out of range. So the caller must check the returned value.

Change-Id: I254209a7070abc34458020e4ee8ee73256e31886
Signed-off-by: Hao Luo <haoluo@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200929235049.2533242-5-haoluo@google.com
2025-09-20 03:22:32 +01:00
Alan Maguire
c27d926940 UPSTREAM: bpf: Add bpf_seq_printf_btf helper
A helper is added to allow seq file writing of kernel data
structures using vmlinux BTF.  Its signature is

long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr,
                        u32 btf_ptr_size, u64 flags);

Flags and struct btf_ptr definitions/use are identical to the
bpf_snprintf_btf helper, and the helper returns 0 on success
or a negative error value.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Change-Id: Ief0f9b8b9d9ed5f725159d71d1f3eb26f28c27c1
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-8-git-send-email-alan.maguire@oracle.com
2025-09-20 03:22:27 +01:00
Alan Maguire
af684854b7 UPSTREAM: bpf: Add bpf_snprintf_btf helper
A helper is added to support tracing kernel type information in BPF
using the BPF Type Format (BTF).  Its signature is

long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
		      u32 btf_ptr_size, u64 flags);

struct btf_ptr * specifies

- a pointer to the data to be traced
- the BTF id of the type of data pointed to
- a flags field is provided for future use; these flags
  are not to be confused with the BTF_F_* flags
  below that control how the btf_ptr is displayed; the
  flags member of the struct btf_ptr may be used to
  disambiguate types in kernel versus module BTF, etc;
  the main distinction is the flags relate to the type
  and information needed in identifying it; not how it
  is displayed.

For example a BPF program with a struct sk_buff *skb
could do the following:

	static struct btf_ptr b = { };

	b.ptr = skb;
	b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
	bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0, 0);

Default output looks like this:

(struct sk_buff){
 .transport_header = (__u16)65535,
 .mac_header = (__u16)65535,
 .end = (sk_buff_data_t)192,
 .head = (unsigned char *)0x000000007524fd8b,
 .data = (unsigned char *)0x000000007524fd8b,
 .truesize = (unsigned int)768,
 .users = (refcount_t){
  .refs = (atomic_t){
   .counter = (int)1,
  },
 },
}

Flags modifying display are as follows:

- BTF_F_COMPACT:	no formatting around type information
- BTF_F_NONAME:		no struct/union member names/types
- BTF_F_PTR_RAW:	show raw (unobfuscated) pointer values;
			equivalent to %px.
- BTF_F_ZERO:		show zero-valued struct/union members;
			they are not displayed by default

Change-Id: I77f2c2a0d41aee2f4f10e3288a36de475fd2cb46
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com
2025-09-20 03:22:27 +01:00
Song Liu
e50eb80d29 UPSTREAM: bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint
Add .test_run for raw_tracepoint. Also, introduce a new feature that runs
the target program on a specific CPU. This is achieved by a new flag in
bpf_attr.test, BPF_F_TEST_RUN_ON_CPU. When this flag is set, the program
is triggered on cpu with id bpf_attr.test.cpu. This feature is needed for
BPF programs that handle perf_event and other percpu resources, as the
program can access these resource locally.

Change-Id: Id8f992caff30d7b65df8195f8934bcb2d8b658cb
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200925205432.1777-2-songliubraving@fb.com
2025-09-20 03:22:25 +01:00
Lorenz Bauer
23594ec12c UPSTREAM: bpf: Allow specifying a BTF ID per argument in function protos
Function prototypes using ARG_PTR_TO_BTF_ID currently use two ways to signal
which BTF IDs are acceptable. First, bpf_func_proto.btf_id is an array of
IDs, one for each argument. This array is only accessed up to the highest
numbered argument that uses ARG_PTR_TO_BTF_ID and may therefore be less than
five arguments long. It usually points at a BTF_ID_LIST. Second, check_btf_id
is a function pointer that is called by the verifier if present. It gets the
actual BTF ID of the register, and the argument number we're currently checking.
It turns out that the only user check_arg_btf_id ignores the argument, and is
simply used to check whether the BTF ID has a struct sock_common at it's start.

Replace both of these mechanisms with an explicit BTF ID for each argument
in a function proto. Thanks to btf_struct_ids_match this is very flexible:
check_arg_btf_id can be replaced by requiring struct sock_common.

Change-Id: I04d5adc4574380e9a67e7a688b98611158bb7102
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200921121227.255763-5-lmb@cloudflare.com
2025-09-20 03:22:21 +01:00
Alexei Starovoitov
69e5cce407 UPSTREAM: bpf: Add bpf_copy_from_user() helper.
Sleepable BPF programs can now use copy_from_user() to access user memory.

Change-Id: Ifaf2da741d14dc56b121163d7c486e5c15d1d5a4
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200827220114.69225-4-alexei.starovoitov@gmail.com
2025-09-20 03:22:16 +01:00
Jiri Olsa
b1b48c9c9e UPSTREAM: bpf: Add d_path helper
Adding d_path helper function that returns full path for
given 'struct path' object, which needs to be the kernel
BTF 'path' object. The path is returned in buffer provided
'buf' of size 'sz' and is zero terminated.

  bpf_d_path(&file->f_path, buf, size);

The helper calls directly d_path function, so there's only
limited set of function it can be called from. Adding just
very modest set for the start.

Updating also bpf.h tools uapi header and adding 'path' to
bpf_helpers_doc.py script.

Change-Id: If390ec6189a537b730b9ae595d374cbfa83f6f5b
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200825192124.710397-11-jolsa@kernel.org
2025-09-20 03:22:15 +01:00
Song Liu
0b6366c629 UPSTREAM: bpf: Separate bpf_get_[stack|stackid] for perf events BPF
Calling get_perf_callchain() on perf_events from PEBS entries may cause
unwinder errors. To fix this issue, the callchain is fetched early. Such
perf_events are marked with __PERF_SAMPLE_CALLCHAIN_EARLY.

Similarly, calling bpf_get_[stack|stackid] on perf_events from PEBS may
also cause unwinder errors. To fix this, add separate version of these
two helpers, bpf_get_[stack|stackid]_pe. These two hepers use callchain in
bpf_perf_event_data_kern->data->callchain.

Change-Id: I4f2cc8be3c9c25add04ea9aff270bbc9277a1ff7
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723180648.1429892-2-songliubraving@fb.com
2025-09-20 03:21:52 +01:00
Jiri Olsa
6f7eed12cf UPSTREAM: bpf: Resolve BTF IDs in vmlinux image
Using BTF_ID_LIST macro to define lists for several helpers
using BTF arguments.

And running resolve_btfids on vmlinux elf object during linking,
so the .BTF_ids section gets the IDs resolved.

Change-Id: I4defbf7ee9cd47c47766a71a66820d24380e3061
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200711215329.41165-5-jolsa@kernel.org
2025-09-20 03:21:47 +01:00
Song Liu
585c3265dc UPSTREAM: bpf: Introduce helper bpf_get_task_stack()
Introduce helper bpf_get_task_stack(), which dumps stack trace of given
task. This is different to bpf_get_stack(), which gets stack track of
current task. One potential use case of bpf_get_task_stack() is to call
it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.

bpf_get_task_stack() uses stack_trace_save_tsk() instead of
get_perf_callchain() for kernel stack. The benefit of this choice is that
stack_trace_save_tsk() doesn't require changes in arch/. The downside of
using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
stack trace to unsigned long array. For 32-bit systems, we need to
translate it to u64 array.

Change-Id: Ief4d6331116f2df922da24cd4cff87b023e861e7
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200630062846.664389-3-songliubraving@fb.com
2025-09-20 03:21:46 +01:00
Yonghong Song
1b0aa60d3d UPSTREAM: bpf: Allow tracing programs to use bpf_jiffies64() helper
/proc/net/tcp{4,6} uses jiffies for various computations.
Let us add bpf_jiffies64() helper to tracing program
so bpf_iter and other programs can use it.

Change-Id: I5c4da588b49f70567d0725293b0e08c7df2c572a
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230808.3988073-1-yhs@fb.com
2025-09-20 03:21:45 +01:00
Yonghong Song
e89c278184 UPSTREAM: bpf: Add bpf_skc_to_udp6_sock() helper
The helper is used in tracing programs to cast a socket
pointer to a udp6_sock pointer.
The return value could be NULL if the casting is illegal.

Change-Id: Ic8ac98123cd4f949a0649f3723d3e0a9da071ae2
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200623230815.3988481-1-yhs@fb.com
2025-09-20 03:21:43 +01:00
Yonghong Song
82998aa67b UPSTREAM: bpf: Add bpf_skc_to_{tcp, tcp_timewait, tcp_request}_sock() helpers
Three more helpers are added to cast a sock_common pointer to
an tcp_sock, tcp_timewait_sock or a tcp_request_sock for
tracing programs.

Change-Id: Id0ca4c0315d17d9f672e5c4fd3a0712877b45d03
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230811.3988277-1-yhs@fb.com
2025-09-20 03:21:43 +01:00
Yonghong Song
6cc292bbce UPSTREAM: bpf: Add bpf_skc_to_tcp6_sock() helper
The helper is used in tracing programs to cast a socket
pointer to a tcp6_sock pointer.
The return value could be NULL if the casting is illegal.

A new helper return type RET_PTR_TO_BTF_ID_OR_NULL is added
so the verifier is able to deduce proper return types for the helper.

Different from the previous BTF_ID based helpers,
the bpf_skc_to_tcp6_sock() argument can be several possible
btf_ids. More specifically, all possible socket data structures
with sock_common appearing in the first in the memory layout.
This patch only added socket types related to tcp and udp.

All possible argument btf_id and return value btf_id
for helper bpf_skc_to_tcp6_sock() are pre-calculcated and
cached. In the future, it is even possible to precompute
these btf_id's at kernel build time.

Change-Id: Ie8449335dbb1f9be85b2d3bcc2d6cfc8e53600cb
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230809.3988195-1-yhs@fb.com
2025-09-20 03:21:43 +01:00
Jiri Olsa
0e869f06a0 UPSTREAM: bpf: Use tracing helpers for lsm programs
Currenty lsm uses bpf_tracing_func_proto helpers which do
not include stack trace or perf event output. It's useful
to have those for bpftrace lsm support [1].

Using tracing_prog_func_proto helpers for lsm programs.

[1] https://github.com/iovisor/bpftrace/pull/1347

Change-Id: I644580ffdfbdf3f6b2c870f3a8338e0711046efa
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20200531154255.896551-1-jolsa@kernel.org
2025-09-20 03:21:36 +01:00
Andrii Nakryiko
2c00c8c647 UPSTREAM: bpf: Implement BPF ring buffer and verifier support for it
This commit adds a new MPSC ring buffer implementation into BPF ecosystem,
which allows multiple CPUs to submit data to a single shared ring buffer. On
the consumption side, only single consumer is assumed.

Motivation
----------
There are two distinctive motivators for this work, which are not satisfied by
existing perf buffer, which prompted creation of a new ring buffer
implementation.
  - more efficient memory utilization by sharing ring buffer across CPUs;
  - preserving ordering of events that happen sequentially in time, even
  across multiple CPUs (e.g., fork/exec/exit events for a task).

These two problems are independent, but perf buffer fails to satisfy both.
Both are a result of a choice to have per-CPU perf ring buffer.  Both can be
also solved by having an MPSC implementation of ring buffer. The ordering
problem could technically be solved for perf buffer with some in-kernel
counting, but given the first one requires an MPSC buffer, the same solution
would solve the second problem automatically.

Semantics and APIs
------------------
Single ring buffer is presented to BPF programs as an instance of BPF map of
type BPF_MAP_TYPE_RINGBUF. Two other alternatives considered, but ultimately
rejected.

One way would be to, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, make
BPF_MAP_TYPE_RINGBUF could represent an array of ring buffers, but not enforce
"same CPU only" rule. This would be more familiar interface compatible with
existing perf buffer use in BPF, but would fail if application needed more
advanced logic to lookup ring buffer by arbitrary key. HASH_OF_MAPS addresses
this with current approach. Additionally, given the performance of BPF
ringbuf, many use cases would just opt into a simple single ring buffer shared
among all CPUs, for which current approach would be an overkill.

Another approach could introduce a new concept, alongside BPF map, to
represent generic "container" object, which doesn't necessarily have key/value
interface with lookup/update/delete operations. This approach would add a lot
of extra infrastructure that has to be built for observability and verifier
support. It would also add another concept that BPF developers would have to
familiarize themselves with, new syntax in libbpf, etc. But then would really
provide no additional benefits over the approach of using a map.
BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but so
doesn't few other map types (e.g., queue and stack; array doesn't support
delete, etc).

The approach chosen has an advantage of re-using existing BPF map
infrastructure (introspection APIs in kernel, libbpf support, etc), being
familiar concept (no need to teach users a new type of object in BPF program),
and utilizing existing tooling (bpftool). For common scenario of using
a single ring buffer for all CPUs, it's as simple and straightforward, as
would be with a dedicated "container" object. On the other hand, by being
a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to
implement a wide variety of topologies, from one ring buffer for each CPU
(e.g., as a replacement for perf buffer use cases), to a complicated
application hashing/sharding of ring buffers (e.g., having a small pool of
ring buffers with hashed task's tgid being a look up key to preserve order,
but reduce contention).

Key and value sizes are enforced to be zero. max_entries is used to specify
the size of ring buffer and has to be a power of 2 value.

There are a bunch of similarities between perf buffer
(BPF_MAP_TYPE_PERF_EVENT_ARRAY) and new BPF ring buffer semantics:
  - variable-length records;
  - if there is no more space left in ring buffer, reservation fails, no
    blocking;
  - memory-mappable data area for user-space applications for ease of
    consumption and high performance;
  - epoll notifications for new incoming data;
  - but still the ability to do busy polling for new data to achieve the
    lowest latency, if necessary.

BPF ringbuf provides two sets of APIs to BPF programs:
  - bpf_ringbuf_output() allows to *copy* data from one place to a ring
    buffer, similarly to bpf_perf_event_output();
  - bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs
    split the whole process into two steps. First, a fixed amount of space is
    reserved. If successful, a pointer to a data inside ring buffer data area
    is returned, which BPF programs can use similarly to a data inside
    array/hash maps. Once ready, this piece of memory is either committed or
    discarded. Discard is similar to commit, but makes consumer ignore the
    record.

bpf_ringbuf_output() has disadvantage of incurring extra memory copy, because
record has to be prepared in some other place first. But it allows to submit
records of the length that's not known to verifier beforehand. It also closely
matches bpf_perf_event_output(), so will simplify migration significantly.

bpf_ringbuf_reserve() avoids the extra copy of memory by providing a memory
pointer directly to ring buffer memory. In a lot of cases records are larger
than BPF stack space allows, so many programs have use extra per-CPU array as
a temporary heap for preparing sample. bpf_ringbuf_reserve() avoid this needs
completely. But in exchange, it only allows a known constant size of memory to
be reserved, such that verifier can verify that BPF program can't access
memory outside its reserved record space. bpf_ringbuf_output(), while slightly
slower due to extra memory copy, covers some use cases that are not suitable
for bpf_ringbuf_reserve().

The difference between commit and discard is very small. Discard just marks
a record as discarded, and such records are supposed to be ignored by consumer
code. Discard is useful for some advanced use-cases, such as ensuring
all-or-nothing multi-record submission, or emulating temporary malloc()/free()
within single BPF program invocation.

Each reserved record is tracked by verifier through existing
reference-tracking logic, similar to socket ref-tracking. It is thus
impossible to reserve a record, but forget to submit (or discard) it.

bpf_ringbuf_query() helper allows to query various properties of ring buffer.
Currently 4 are supported:
  - BPF_RB_AVAIL_DATA returns amount of unconsumed data in ring buffer;
  - BPF_RB_RING_SIZE returns the size of ring buffer;
  - BPF_RB_CONS_POS/BPF_RB_PROD_POS returns current logical possition of
    consumer/producer, respectively.
Returned values are momentarily snapshots of ring buffer state and could be
off by the time helper returns, so this should be used only for
debugging/reporting reasons or for implementing various heuristics, that take
into account highly-changeable nature of some of those characteristics.

One such heuristic might involve more fine-grained control over poll/epoll
notifications about new data availability in ring buffer. Together with
BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags for output/commit/discard helpers,
it allows BPF program a high degree of control and, e.g., more efficient
batched notifications. Default self-balancing strategy, though, should be
adequate for most applications and will work reliable and efficiently already.

Design and implementation
-------------------------
This reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
means that if BPF program was interruped by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that due to using a spinlock during
reservation, in NMI context, bpf_ringbuf_reserve() might fail to get a lock,
in which case reservation will fail even if ring buffer is not full.

The ring buffer itself internally is implemented as a power-of-2 sized
circular buffer, with two logical and ever-increasing counters (which might
wrap around on 32-bit architectures, that's not a problem):
  - consumer counter shows up to which logical position consumer consumed the
    data;
  - producer counter denotes amount of data reserved by all producers.

Each time a record is reserved, producer that "owns" the record will
successfully advance producer counter. At that point, data is still not yet
ready to be consumed, though. Each record has 8 byte header, which contains
the length of reserved record, as well as two extra bits: busy bit to denote
that record is still being worked on, and discard bit, which might be set at
commit time if record is discarded. In the latter case, consumer is supposed
to skip the record and move on to the next one. Record header also encodes
record's relative offset from the beginning of ring buffer data area (in
pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only
the pointer to the record itself, without requiring also the pointer to ring
buffer itself. Ring buffer memory location will be restored from record
metadata header. This significantly simplifies verifier, as well as improving
API usability.

Producer counter increments are serialized under spinlock, so there is
a strict ordering between reservations. Commits, on the other hand, are
completely lockless and independent. All records become available to consumer
in the order of reservations, but only after all previous records where
already committed. It is thus possible for slow producers to temporarily hold
off submitted records, that were reserved later.

Reservation/commit/consumer protocol is verified by litmus tests in
Documentation/litmus-test/bpf-rb.

One interesting implementation bit, that significantly simplifies (and thus
speeds up as well) implementation of both producers and consumers is how data
area is mapped twice contiguously back-to-back in the virtual memory. This
allows to not take any special measures for samples that have to wrap around
at the end of the circular buffer data area, because the next page after the
last data page would be first data page again, and thus the sample will still
appear completely contiguous in virtual memory. See comment and a simple ASCII
diagram showing this visually in bpf_ringbuf_area_alloc().

Another feature that distinguishes BPF ringbuf from perf ring buffer is
a self-pacing notifications of new data being availability.
bpf_ringbuf_commit() implementation will send a notification of new record
being available after commit only if consumer has already caught up right up
to the record being committed. If not, consumer still has to catch up and thus
will see new data anyways without needing an extra poll notification.
Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c) show that
this allows to achieve a very high throughput without having to resort to
tricks like "notify only every Nth sample", which are necessary with perf
buffer. For extreme cases, when BPF program wants more manual control of
notifications, commit/discard/output helpers accept BPF_RB_NO_WAKEUP and
BPF_RB_FORCE_WAKEUP flags, which give full control over notifications of data
availability, but require extra caution and diligence in using this API.

Comparison to alternatives
--------------------------
Before considering implementing BPF ring buffer from scratch existing
alternatives in kernel were evaluated, but didn't seem to meet the needs. They
largely fell into few categores:
  - per-CPU buffers (perf, ftrace, etc), which don't satisfy two motivations
    outlined above (ordering and memory consumption);
  - linked list-based implementations; while some were multi-producer designs,
    consuming these from user-space would be very complicated and most
    probably not performant; memory-mapping contiguous piece of memory is
    simpler and more performant for user-space consumers;
  - io_uring is SPSC, but also requires fixed-sized elements. Naively turning
    SPSC queue into MPSC w/ lock would have subpar performance compared to
    locked reserve + lockless commit, as with BPF ring buffer. Fixed sized
    elements would be too limiting for BPF programs, given existing BPF
    programs heavily rely on variable-sized perf buffer already;
  - specialized implementations (like a new printk ring buffer, [0]) with lots
    of printk-specific limitations and implications, that didn't seem to fit
    well for intended use with BPF programs.

  [0] https://lwn.net/Articles/779550/

Change-Id: Idf07d0cc7f8c23be724fbe0cda5bd8254cafe8ae
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200529075424.3139988-2-andriin@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-20 03:21:33 +01:00
John Fastabend
71e8ec889e UPSTREAM: bpf: Extend bpf_base_func_proto helpers with probe_* and *current_task*
Often it is useful when applying policy to know something about the
task. If the administrator has CAP_SYS_ADMIN rights then they can
use kprobe + networking hook and link the two programs together to
accomplish this. However, this is a bit clunky and also means we have
to call both the network program and kprobe program when we could just
use a single program and avoid passing metadata through sk_msg/skb->cb,
socket, maps, etc.

To accomplish this add probe_* helpers to bpf_base_func_proto programs
guarded by a perfmon_capable() check. New supported helpers are the
following,

 BPF_FUNC_get_current_task
 BPF_FUNC_probe_read_user
 BPF_FUNC_probe_read_kernel
 BPF_FUNC_probe_read_user_str
 BPF_FUNC_probe_read_kernel_str

Change-Id: I1108328ac2b17a367d554861aa7f20f994147ce9
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/159033905529.12355.4368381069655254932.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-09-20 03:21:32 +01:00
Alexei Starovoitov
40901f7dd4 BACKPORT: bpf: Implement CAP_BPF
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();

The first three currently equivalent to perfmon_capable(), since leaking kernel
pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.

'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.

That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.

Change-Id: I6ecbb9e9ecc587e04fb35b743f0d5b2ab165cb68
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
2025-09-20 03:21:30 +01:00
Yonghong Song
e5f3003727 UPSTREAM: bpf: Add bpf_seq_printf and bpf_seq_write helpers
Two helpers bpf_seq_printf and bpf_seq_write, are added for
writing data to the seq_file buffer.

bpf_seq_printf supports common format string flag/width/type
fields so at least I can get identical results for
netlink and ipv6_route targets.

For bpf_seq_printf and bpf_seq_write, return value -EOVERFLOW
specifically indicates a write failure due to overflow, which
means the object will be repeated in the next bpf invocation
if object collection stays the same. Note that if the object
collection is changed, depending how collection traversal is
done, even if the object still in the collection, it may not
be visited.

For bpf_seq_printf, format %s, %p{i,I}{4,6} needs to
read kernel memory. Reading kernel memory may fail in
the following two cases:
  - invalid kernel address, or
  - valid kernel address but requiring a major fault
If reading kernel memory failed, the %s string will be
an empty string and %p{i,I}{4,6} will be all 0.
Not returning error to bpf program is consistent with
what bpf_trace_printk() does for now.

bpf_seq_printf may return -EBUSY meaning that internal percpu
buffer for memory copy of strings or other pointees is
not available. Bpf program can return 1 to indicate it
wants the same object to be repeated. Right now, this should not
happen on no-RT kernels since migrate_disable(), which guards
bpf prog call, calls preempt_disable().

Change-Id: If3f9578a12dd29a90e69a6daa23dd67841b7bf9a
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175914.2476661-1-yhs@fb.com
2025-09-20 03:21:25 +01:00
Christoph Hellwig
e13d6f5008 BACKPORT: sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Change-Id: Ic71fd778e4cea58adc51d634d9e53c1f9f90cdf2
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-20 03:21:20 +01:00
KP Singh
da90ad3130 UPSTREAM: bpf: Introduce BPF_PROG_TYPE_LSM
Introduce types and configs for bpf programs that can be attached to
LSM hooks. The programs can be enabled by the config option
CONFIG_BPF_LSM.

Change-Id: I9949a4e8c5a3ab07dc05089a5267306d6e23087a
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Reviewed-by: Florent Revest <revest@google.com>
Reviewed-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org
2025-09-20 03:21:15 +01:00
Tejun Heo
54d67894d7 BACKPORT: cgroup: use cgrp->kn->id as the cgroup ID
cgroup ID is currently allocated using a dedicated per-hierarchy idr
and used internally and exposed through tracepoints and bpf.  This is
confusing because there are tracepoints and other interfaces which use
the cgroupfs ino as IDs.

The preceding changes made kn->id exposed as ino as 64bit ino on
supported archs or ino+gen (low 32bits as ino, high gen).  There's no
reason for cgroup to use different IDs.  The kernfs IDs are unique and
userland can easily discover them and map them back to paths using
standard file operations.

This patch replaces cgroup IDs with kernfs IDs.

* cgroup_id() is added and all cgroup ID users are converted to use it.

* kernfs_node creation is moved to earlier during cgroup init so that
  cgroup_id() is available during init.

* While at it, s/cgroup/cgrp/ in psi helpers for consistency.

* Fallback ID value is changed to 1 to be consistent with root cgroup
  ID.

Change-Id: Iab2ee05e4e75671c3ee6799e7e2b3358394fec66
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
2025-09-20 03:21:14 +01:00
Eelco Chaudron
ac757bc45d UPSTREAM: bpf: Add bpf_xdp_output() helper
Introduce new helper that reuses existing xdp perf_event output
implementation, but can be called from raw_tracepoint programs
that receive 'struct xdp_buff *' as a tracepoint argument.

Change-Id: I564df26793051dc61c1f4d22cb7381063ad4dac1
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial
2025-09-20 03:21:09 +01:00
Carlos Neira
ccdca6ebce UPSTREAM: bpf: Added new helper bpf_get_ns_current_pid_tgid
New bpf helper bpf_get_ns_current_pid_tgid,
This helper will return pid and tgid from current task
which namespace matches dev_t and inode number provided,
this will allows us to instrument a process inside a container.

Change-Id: I5787829c729f08c9c0b5901784799b2e673dca6d
Signed-off-by: Carlos Neira <cneirabustos@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200304204157.58695-3-cneirabustos@gmail.com
2025-09-20 03:21:09 +01:00
Song Liu
5b80d48597 UPSTREAM: bpf: Allow bpf_perf_event_read_value in all BPF programs
bpf_perf_event_read_value() is NMI safe. Enable it for all BPF programs.
This can be used in fentry/fexit to profile BPF program and individual
kernel function with hardware counters.

Change-Id: I93045d95244dd1f7dcf70f95a3c2d342f9118e57
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200214234146.2910011-1-songliubraving@fb.com
2025-09-20 03:21:09 +01:00
KP Singh
f586061e85 UPSTREAM: bpf: Add test ops for BPF_PROG_TYPE_TRACING
The current fexit and fentry tests rely on a different program to
exercise the functions they attach to. Instead of doing this, implement
the test operations for tracing which will also be used for
BPF_MODIFY_RETURN in a subsequent patch.

Also, clean up the fexit test to use the generated skeleton.

Change-Id: I265cdc182a22c5f89cefac2c8fe53d56cf9b8fa9
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-7-kpsingh@chromium.org
2025-09-20 03:21:06 +01:00
Daniel Xu
4a56a11e44 UPSTREAM: bpf: Add bpf_read_branch_records() helper
Branch records are a CPU feature that can be configured to record
certain branches that are taken during code execution. This data is
particularly interesting for profile guided optimizations. perf has had
branch record support for a while but the data collection can be a bit
coarse grained.

We (Facebook) have seen in experiments that associating metadata with
branch records can improve results (after postprocessing). We generally
use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
support for branch records is useful.

Aside from this particular use case, having branch data available to bpf
progs can be useful to get stack traces out of userspace applications
that omit frame pointers.

Change-Id: Ieac36e94eac3c6ab77071de524739286c5819e5c
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200218030432.4600-2-dxu@dxuuu.xyz
2025-09-20 03:21:01 +01:00