commit e1fbbd073137a9d63279f6bf363151a938347640 upstream.
Keno Fischer reported that when a binray loaded via ld-linux-x the
prctl(PR_SET_MM_MAP) doesn't allow to setup brk value because it lays
before mm:end_data.
For example a test program shows
| # ~/t
|
| start_code 401000
| end_code 401a15
| start_stack 7ffce4577dd0
| start_data 403e10
| end_data 40408c
| start_brk b5b000
| sbrk(0) b5b000
and when executed via ld-linux
| # /lib64/ld-linux-x86-64.so.2 ~/t
|
| start_code 7fc25b0a4000
| end_code 7fc25b0c4524
| start_stack 7fffcc6b2400
| start_data 7fc25b0ce4c0
| end_data 7fc25b0cff98
| start_brk 55555710c000
| sbrk(0) 55555710c000
This of course prevent criu from restoring such programs. Looking into
how kernel operates with brk/start_brk inside brk() syscall I don't see
any problem if we allow to setup brk/start_brk without checking for
end_data. Even if someone pass some weird address here on a purpose then
the worst possible result will be an unexpected unmapping of existing vma
(own vma, since prctl works with the callers memory) but test for
RLIMIT_DATA is still valid and a user won't be able to gain more memory in
case of expanding VMAs via new values shipped with prctl call.
Link: https://lkml.kernel.org/r/20210121221207.GB2174@grain
Fixes: bbdc6076d2 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Keno Fischer <keno@juliacomputing.com>
Acked-by: Andrey Vagin <avagin@gmail.com>
Tested-by: Andrey Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Changes in 5.4.69
kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()
scsi: lpfc: Fix pt2pt discovery on SLI3 HBAs
scsi: mpt3sas: Free diag buffer without any status check
selinux: allow labeling before policy is loaded
media: mc-device.c: fix memleak in media_device_register_entity
drm/amd/display: Do not double-buffer DTO adjustments
drm/amdkfd: Fix race in gfx10 context restore handler
dma-fence: Serialise signal enabling (dma_fence_enable_sw_signaling)
scsi: qla2xxx: Add error handling for PLOGI ELS passthrough
ath10k: fix array out-of-bounds access
ath10k: fix memory leak for tpc_stats_final
PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes
mm: fix double page fault on arm64 if PTE_AF is cleared
scsi: aacraid: fix illegal IO beyond last LBA
m68k: q40: Fix info-leak in rtc_ioctl
xfs: fix inode fork extent count overflow
gma/gma500: fix a memory disclosure bug due to uninitialized bytes
ASoC: kirkwood: fix IRQ error handling
soundwire: intel/cadence: fix startup sequence
media: smiapp: Fix error handling at NVM reading
drm/amd/display: Free gamma after calculating legacy transfer function
xfs: properly serialise fallocate against AIO+DIO
leds: mlxreg: Fix possible buffer overflow
dm table: do not allow request-based DM to stack on partitions
PM / devfreq: tegra30: Fix integer overflow on CPU's freq max out
scsi: fnic: fix use after free
scsi: lpfc: Fix kernel crash at lpfc_nvme_info_show during remote port bounce
powerpc/64s: Always disable branch profiling for prom_init.o
net: silence data-races on sk_backlog.tail
dax: Fix alloc_dax_region() compile warning
iomap: Fix overflow in iomap_page_mkwrite
f2fs: avoid kernel panic on corruption test
clk/ti/adpll: allocate room for terminating null
drm/amdgpu/powerplay: fix AVFS handling with custom powerplay table
ice: Fix to change Rx/Tx ring descriptor size via ethtool with DCBx
mtd: cfi_cmdset_0002: don't free cfi->cfiq in error path of cfi_amdstd_setup()
mfd: mfd-core: Protect against NULL call-back function pointer
drm/amdgpu/powerplay/smu7: fix AVFS handling with custom powerplay table
tpm_crb: fix fTPM on AMD Zen+ CPUs
tracing: Verify if trace array exists before destroying it.
tracing: Adding NULL checks for trace_array descriptor pointer
bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
dmaengine: mediatek: hsdma_probe: fixed a memory leak when devm_request_irq fails
x86/kdump: Always reserve the low 1M when the crashkernel option is specified
RDMA/qedr: Fix potential use after free
RDMA/i40iw: Fix potential use after free
PCI: Avoid double hpmemsize MMIO window assignment
fix dget_parent() fastpath race
xfs: fix attr leaf header freemap.size underflow
RDMA/iw_cgxb4: Fix an error handling path in 'c4iw_connect()'
ubi: Fix producing anchor PEBs
mmc: core: Fix size overflow for mmc partitions
gfs2: clean up iopen glock mess in gfs2_create_inode
scsi: pm80xx: Cleanup command when a reset times out
mt76: do not use devm API for led classdev
mt76: add missing locking around ampdu action
debugfs: Fix !DEBUG_FS debugfs_create_automount
SUNRPC: Capture completion of all RPC tasks
CIFS: Use common error handling code in smb2_ioctl_query_info()
CIFS: Properly process SMB3 lease breaks
f2fs: stop GC when the victim becomes fully valid
ASoC: max98090: remove msleep in PLL unlocked workaround
xtensa: fix system_call interaction with ptrace
s390: avoid misusing CALL_ON_STACK for task stack setup
xfs: fix realtime file data space leak
drm/amdgpu: fix calltrace during kmd unload(v3)
arm64: insn: consistently handle exit text
selftests/bpf: De-flake test_tcpbpf
kernel/notifier.c: intercept duplicate registrations to avoid infinite loops
kernel/sys.c: avoid copying possible padding bytes in copy_to_user
KVM: arm/arm64: vgic: Fix potential double free dist->spis in __kvm_vgic_destroy()
module: Remove accidental change of module_enable_x()
xfs: fix log reservation overflows when allocating large rt extents
ALSA: hda: enable regmap internal locking
tipc: fix link overflow issue at socket shutdown
vcc_seq_next should increase position index
neigh_stat_seq_next() should increase position index
rt_cpu_seq_next should increase position index
ipv6_route_seq_next should increase position index
drm/mcde: Handle pending vblank while disabling display
seqlock: Require WRITE_ONCE surrounding raw_seqcount_barrier
drm/scheduler: Avoid accessing freed bad job.
media: ti-vpe: cal: Restrict DMA to avoid memory corruption
opp: Replace list_kref with a local counter
scsi: qla2xxx: Fix stuck session in GNL
scsi: lpfc: Fix incomplete NVME discovery when target
sctp: move trace_sctp_probe_path into sctp_outq_sack
ACPI: EC: Reference count query handlers under lock
scsi: ufs: Make ufshcd_add_command_trace() easier to read
scsi: ufs: Fix a race condition in the tracing code
drm/amd/display: Initialize DSC PPS variables to 0
i2c: tegra: Prevent interrupt triggering after transfer timeout
btrfs: tree-checker: Check leaf chunk item size
dmaengine: zynqmp_dma: fix burst length configuration
s390/cpum_sf: Use kzalloc and minor changes
nfsd: Fix a soft lockup race in nfsd_file_mark_find_or_create()
powerpc/eeh: Only dump stack once if an MMIO loop is detected
Bluetooth: btrtl: Use kvmalloc for FW allocations
tracing: Set kernel_stack's caller size properly
ARM: 8948/1: Prevent OOB access in stacktrace
ar5523: Add USB ID of SMCWUSBT-G2 wireless adapter
ceph: ensure we have a new cap before continuing in fill_inode
selftests/ftrace: fix glob selftest
tools/power/x86/intel_pstate_tracer: changes for python 3 compatibility
Bluetooth: Fix refcount use-after-free issue
mm/swapfile.c: swap_next should increase position index
mm: pagewalk: fix termination condition in walk_pte_range()
Bluetooth: prefetch channel before killing sock
KVM: fix overflow of zero page refcount with ksm running
ALSA: hda: Clear RIRB status before reading WP
skbuff: fix a data race in skb_queue_len()
nfsd: Fix a perf warning
drm/amd/display: fix workaround for incorrect double buffer register for DLG ADL and TTU
audit: CONFIG_CHANGE don't log internal bookkeeping as an event
selinux: sel_avc_get_stat_idx should increase position index
scsi: lpfc: Fix RQ buffer leakage when no IOCBs available
scsi: lpfc: Fix release of hwq to clear the eq relationship
scsi: lpfc: Fix coverity errors in fmdi attribute handling
drm/omap: fix possible object reference leak
locking/lockdep: Decrement IRQ context counters when removing lock chain
clk: stratix10: use do_div() for 64-bit calculation
crypto: chelsio - This fixes the kernel panic which occurs during a libkcapi test
mt76: clear skb pointers from rx aggregation reorder buffer during cleanup
mt76: fix handling full tx queues in mt76_dma_tx_queue_skb_raw
ALSA: usb-audio: Don't create a mixer element with bogus volume range
perf test: Fix test trace+probe_vfs_getname.sh on s390
RDMA/rxe: Fix configuration of atomic queue pair attributes
KVM: x86: fix incorrect comparison in trace event
KVM: nVMX: Hold KVM's srcu lock when syncing vmcs12->shadow
dmaengine: stm32-mdma: use vchan_terminate_vdesc() in .terminate_all
media: staging/imx: Missing assignment in imx_media_capture_device_register()
x86/pkeys: Add check for pkey "overflow"
bpf: Remove recursion prevention from rcu free callback
dmaengine: stm32-dma: use vchan_terminate_vdesc() in .terminate_all
dmaengine: tegra-apb: Prevent race conditions on channel's freeing
soundwire: bus: disable pm_runtime in sdw_slave_delete
drm/amd/display: dal_ddc_i2c_payloads_create can fail causing panic
drm/omap: dss: Cleanup DSS ports on initialisation failure
iavf: use tc_cls_can_offload_and_chain0() instead of chain check
firmware: arm_sdei: Use cpus_read_lock() to avoid races with cpuhp
random: fix data races at timer_rand_state
bus: hisi_lpc: Fixup IO ports addresses to avoid use-after-free in host removal
ASoC: SOF: ipc: check ipc return value before data copy
media: go7007: Fix URB type for interrupt handling
Bluetooth: guard against controllers sending zero'd events
timekeeping: Prevent 32bit truncation in scale64_check_overflow()
powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()
drm/amd/display: fix image corruption with ODM 2:1 DSC 2 slice
ext4: fix a data race at inode->i_disksize
perf jevents: Fix leak of mapfile memory
mm: avoid data corruption on CoW fault into PFN-mapped VMA
drm/amdgpu: increase atombios cmd timeout
ARM: OMAP2+: Handle errors for cpu_pm
drm/amd/display: Stop if retimer is not available
clk: imx: Fix division by zero warning on pfdv2
cpu-topology: Fix the potential data corruption
s390/irq: replace setup_irq() by request_irq()
perf cs-etm: Swap packets for instruction samples
perf cs-etm: Correct synthesizing instruction samples
ath10k: use kzalloc to read for ath10k_sdio_hif_diag_read
scsi: aacraid: Disabling TM path and only processing IOP reset
Bluetooth: L2CAP: handle l2cap config request during open state
media: tda10071: fix unsigned sign extension overflow
tty: sifive: Finish transmission before changing the clock
xfs: don't ever return a stale pointer from __xfs_dir3_free_read
xfs: mark dir corrupt when lookup-by-hash fails
ext4: mark block bitmap corrupted when found instead of BUGON
tpm: ibmvtpm: Wait for buffer to be set before proceeding
rtc: sa1100: fix possible race condition
rtc: ds1374: fix possible race condition
nfsd: Don't add locks to closed or closing open stateids
RDMA/cm: Remove a race freeing timewait_info
intel_th: Disallow multi mode on devices where it's broken
KVM: PPC: Book3S HV: Treat TM-related invalid form instructions on P9 like the valid ones
drm/msm: fix leaks if initialization fails
drm/msm/a5xx: Always set an OPP supported hardware value
tracing: Use address-of operator on section symbols
thermal: rcar_thermal: Handle probe error gracefully
KVM: LAPIC: Mark hrtimer for period or oneshot mode to expire in hard interrupt context
perf parse-events: Fix 3 use after frees found with clang ASAN
btrfs: do not init a reloc root if we aren't relocating
btrfs: free the reloc_control in a consistent way
r8169: improve RTL8168b FIFO overflow workaround
serial: 8250_port: Don't service RX FIFO if throttled
serial: 8250_omap: Fix sleeping function called from invalid context during probe
serial: 8250: 8250_omap: Terminate DMA before pushing data on RX timeout
perf cpumap: Fix snprintf overflow check
net: axienet: Convert DMA error handler to a work queue
net: axienet: Propagate failure of DMA descriptor setup
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_work_fn
tools: gpio-hammer: Avoid potential overflow in main
exec: Add exec_update_mutex to replace cred_guard_mutex
exec: Fix a deadlock in strace
selftests/ptrace: add test cases for dead-locks
kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
proc: Use new infrastructure to fix deadlocks in execve
proc: io_accounting: Use new infrastructure to fix deadlocks in execve
perf: Use new infrastructure to fix deadlocks in execve
nvme-multipath: do not reset on unknown status
nvme: Fix ctrl use-after-free during sysfs deletion
nvme: Fix controller creation races with teardown flow
brcmfmac: Fix double freeing in the fmac usb data path
xfs: prohibit fs freezing when using empty transactions
RDMA/rxe: Set sys_image_guid to be aligned with HW IB devices
IB/iser: Always check sig MR before putting it to the free pool
scsi: hpsa: correct race condition in offload enabled
SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
svcrdma: Fix leak of transport addresses
netfilter: nf_tables: silence a RCU-list warning in nft_table_lookup()
PCI: Use ioremap(), not phys_to_virt() for platform ROM
ubifs: ubifs_jnl_write_inode: Fix a memory leak bug
ubifs: ubifs_add_orphan: Fix a memory leak bug
ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len
ALSA: usb-audio: Fix case when USB MIDI interface has more than one extra endpoint descriptor
PCI: pciehp: Fix MSI interrupt race
NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
drm/amdgpu/vcn2.0: stall DPG when WPTR/RPTR reset
powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.
mm/kmemleak.c: use address-of operator on section symbols
mm/filemap.c: clear page error before actual read
mm/swapfile: fix data races in try_to_unuse()
mm/vmscan.c: fix data races using kswapd_classzone_idx
SUNRPC: Don't start a timer on an already queued rpc task
nvmet-rdma: fix double free of rdma queue
workqueue: Remove the warning in wq_worker_sleeping()
drm/amdgpu/sriov add amdgpu_amdkfd_pre_reset in gpu reset
mm/mmap.c: initialize align_offset explicitly for vm_unmapped_area
ALSA: hda: Skip controller resume if not needed
scsi: qedi: Fix termination timeouts in session logout
serial: uartps: Wait for tx_empty in console setup
btrfs: fix setting last_trans for reloc roots
KVM: Remove CREATE_IRQCHIP/SET_PIT2 race
perf stat: Force error in fallback on :k events
bdev: Reduce time holding bd_mutex in sync in blkdev_close()
drivers: char: tlclk.c: Avoid data race between init and interrupt handler
KVM: arm64: vgic-v3: Retire all pending LPIs on vcpu destroy
KVM: arm64: vgic-its: Fix memory leak on the error path of vgic_add_lpi()
net: openvswitch: use u64 for meter bucket
scsi: aacraid: Fix error handling paths in aac_probe_one()
staging:r8188eu: avoid skb_clone for amsdu to msdu conversion
sparc64: vcc: Fix error return code in vcc_probe()
arm64: cpufeature: Relax checks for AArch32 support at EL[0-2]
sched/fair: Eliminate bandwidth race between throttling and distribution
dpaa2-eth: fix error return code in setup_dpni()
dt-bindings: sound: wm8994: Correct required supplies based on actual implementaion
devlink: Fix reporter's recovery condition
atm: fix a memory leak of vcc->user_back
media: venus: vdec: Init registered list unconditionally
perf mem2node: Avoid double free related to realloc
mm/slub: fix incorrect interpretation of s->offset
i2c: tegra: Restore pinmux on system resume
power: supply: max17040: Correct voltage reading
phy: samsung: s5pv210-usb2: Add delay after reset
Bluetooth: Handle Inquiry Cancel error after Inquiry Complete
USB: EHCI: ehci-mv: fix error handling in mv_ehci_probe()
KVM: x86: handle wrap around 32-bit address space
tipc: fix memory leak in service subscripting
tty: serial: samsung: Correct clock selection logic
ALSA: hda: Fix potential race in unsol event handler
drm/exynos: dsi: Remove bridge node reference in error handling path in probe function
ipmi:bt-bmc: Fix error handling and status check
powerpc/traps: Make unrecoverable NMIs die instead of panic
svcrdma: Fix backchannel return code
fuse: don't check refcount after stealing page
fuse: update attr_version counter on fuse_notify_inval_inode()
USB: EHCI: ehci-mv: fix less than zero comparison of an unsigned int
coresight: etm4x: Fix use-after-free of per-cpu etm drvdata
arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work
scsi: cxlflash: Fix error return code in cxlflash_probe()
arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
drm/amdkfd: fix restore worker race condition
e1000: Do not perform reset in reset_task if we are already down
drm/nouveau/debugfs: fix runtime pm imbalance on error
drm/nouveau: fix runtime pm imbalance on error
drm/nouveau/dispnv50: fix runtime pm imbalance on error
printk: handle blank console arguments passed in.
usb: dwc3: Increase timeout for CmdAct cleared by device controller
btrfs: don't force read-only after error in drop snapshot
btrfs: fix double __endio_write_update_ordered in direct I/O
gpio: rcar: Fix runtime PM imbalance on error
vfio/pci: fix memory leaks of eventfd ctx
KVM: PPC: Book3S HV: Close race with page faults around memslot flushes
perf evsel: Fix 2 memory leaks
perf trace: Fix the selection for architectures to generate the errno name tables
perf stat: Fix duration_time value for higher intervals
perf util: Fix memory leak of prefix_if_not_in
perf metricgroup: Free metric_events on error
perf kcore_copy: Fix module map when there are no modules loaded
PCI: tegra194: Fix runtime PM imbalance on error
ASoC: img-i2s-out: Fix runtime PM imbalance on error
wlcore: fix runtime pm imbalance in wl1271_tx_work
wlcore: fix runtime pm imbalance in wlcore_regdomain_config
mtd: rawnand: gpmi: Fix runtime PM imbalance on error
mtd: rawnand: omap_elm: Fix runtime PM imbalance on error
PCI: tegra: Fix runtime PM imbalance on error
ceph: fix potential race in ceph_check_caps
mm/swap_state: fix a data race in swapin_nr_pages
mm: memcontrol: fix stat-corrupting race in charge moving
rapidio: avoid data race between file operation callbacks and mport_cdev_add().
mtd: parser: cmdline: Support MTD names containing one or more colons
x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline
NFS: nfs_xdr_status should record the procedure name
vfio/pci: Clear error and request eventfd ctx after releasing
cifs: Fix double add page to memcg when cifs_readpages
nvme: fix possible deadlock when I/O is blocked
mac80211: skip mpath lookup also for control port tx
scsi: libfc: Handling of extra kref
scsi: libfc: Skip additional kref updating work event
selftests/x86/syscall_nt: Clear weird flags after each test
vfio/pci: fix racy on error and request eventfd ctx
btrfs: qgroup: fix data leak caused by race between writeback and truncate
perf tests: Fix test 68 zstd compression for s390
scsi: qla2xxx: Retry PLOGI on FC-NVMe PRLI failure
ubi: fastmap: Free unused fastmap anchor peb during detach
mt76: fix LED link time failure
opp: Increase parsed_static_opps in _of_add_opp_table_v1()
perf parse-events: Use strcmp() to compare the PMU name
ALSA: hda: Always use jackpoll helper for jack update after resume
ALSA: hda: Workaround for spurious wakeups on some Intel platforms
net: openvswitch: use div_u64() for 64-by-32 divisions
nvme: explicitly update mpath disk capacity on revalidation
device_cgroup: Fix RCU list debugging warning
ASoC: pcm3168a: ignore 0 Hz settings
ASoC: wm8994: Skip setting of the WM8994_MICBIAS register for WM1811
ASoC: wm8994: Ensure the device is resumed in wm89xx_mic_detect functions
ASoC: Intel: bytcr_rt5640: Add quirk for MPMAN Converter9 2-in-1
RISC-V: Take text_mutex in ftrace_init_nop()
i2c: aspeed: Mask IRQ status to relevant bits
s390/init: add missing __init annotations
lockdep: fix order in trace_hardirqs_off_caller()
EDAC/ghes: Check whether the driver is on the safe list correctly
drm/amdkfd: fix a memory leak issue
drm/amd/display: update nv1x stutter latencies
drm/amdgpu/dc: Require primary plane to be enabled whenever the CRTC is
i2c: core: Call i2c_acpi_install_space_handler() before i2c_acpi_register_devices()
objtool: Fix noreturn detection for ignored functions
ieee802154: fix one possible memleak in ca8210_dev_com_init
ieee802154/adf7242: check status of adf7242_read_reg
clocksource/drivers/h8300_timer8: Fix wrong return value in h8300_8timer_init()
mwifiex: Increase AES key storage size to 256 bits
batman-adv: bla: fix type misuse for backbone_gw hash indexing
atm: eni: fix the missed pci_disable_device() for eni_init_one()
batman-adv: mcast/TT: fix wrongly dropped or rerouted packets
netfilter: conntrack: nf_conncount_init is failing with IPv6 disabled
mac802154: tx: fix use-after-free
bpf: Fix clobbering of r2 in bpf_gen_ld_abs
drm/vc4/vc4_hdmi: fill ASoC card owner
net: qed: Disable aRFS for NPAR and 100G
net: qede: Disable aRFS for NPAR and 100G
net: qed: RDMA personality shouldn't fail VF load
drm/sun4i: sun8i-csc: Secondary CSC register correction
batman-adv: Add missing include for in_interrupt()
nvme-tcp: fix kconfig dependency warning when !CRYPTO
batman-adv: mcast: fix duplicate mcast packets in BLA backbone from LAN
batman-adv: mcast: fix duplicate mcast packets in BLA backbone from mesh
batman-adv: mcast: fix duplicate mcast packets from BLA backbone to mesh
bpf: Fix a rcu warning for bpffs map pretty-print
lib80211: fix unmet direct dependendices config warning when !CRYPTO
ALSA: asihpi: fix iounmap in error handler
regmap: fix page selection for noinc reads
regmap: fix page selection for noinc writes
MIPS: Add the missing 'CPU_1074K' into __get_cpu_type()
regulator: axp20x: fix LDO2/4 description
KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
KVM: SVM: Add a dedicated INVD intercept routine
mm: validate pmd after splitting
arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback
x86/ioapic: Unbreak check_timer()
scsi: lpfc: Fix initial FLOGI failure due to BBSCN not supported
ALSA: usb-audio: Add delay quirk for H570e USB headsets
ALSA: hda/realtek - Couldn't detect Mic if booting with headset plugged
ALSA: hda/realtek: Enable front panel headset LED on Lenovo ThinkStation P520
lib/string.c: implement stpcpy
tracing: fix double free
s390/dasd: Fix zero write for FBA devices
kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()
kprobes: tracing/kprobes: Fix to kill kprobes on initmem after boot
btrfs: fix overflow when copying corrupt csums for a message
dmabuf: fix NULL pointer dereference in dma_buf_release()
mm, THP, swap: fix allocating cluster for swapfile by mistake
mm/gup: fix gup_fast with dynamic page table folding
s390/zcrypt: Fix ZCRYPT_PERDEV_REQCNT ioctl
KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch
dm: fix bio splitting and its bio completion order for regular IO
kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACE
ata: define AC_ERR_OK
ata: make qc_prep return ata_completion_errors
ata: sata_mv, avoid trigerrable BUG_ON
Linux 5.4.69
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2a26b4f6fd89b641fa80e339ee72089da51a1415
This merges Linus's tree as of commit b41dae061b ("Merge tag
'xfs-5.4-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux")
into android-mainline.
This "early" merge makes it easier to test and handle merge conflicts
instead of having to wait until the "end" of the merge window and handle
all 10000+ commits at once.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6bebf55e5e2353f814e3c87f5033607b1ae5d812
Pull core timer updates from Thomas Gleixner:
"Timers and timekeeping updates:
- A large overhaul of the posix CPU timer code which is a preparation
for moving the CPU timer expiry out into task work so it can be
properly accounted on the task/process.
An update to the bogus permission checks will come later during the
merge window as feedback was not complete before heading of for
travel.
- Switch the timerqueue code to use cached rbtrees and get rid of the
homebrewn caching of the leftmost node.
- Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
single function
- Implement the separation of hrtimers to be forced to expire in hard
interrupt context even when PREEMPT_RT is enabled and mark the
affected timers accordingly.
- Implement a mechanism for hrtimers and the timer wheel to protect
RT against priority inversion and live lock issues when a (hr)timer
which should be canceled is currently executing the callback.
Instead of infinitely spinning, the task which tries to cancel the
timer blocks on a per cpu base expiry lock which is held and
released by the (hr)timer expiry code.
- Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
resulting in faster access to timekeeping functions.
- Updates to various clocksource/clockevent drivers and their device
tree bindings.
- The usual small improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
posix-cpu-timers: Fix permission check regression
posix-cpu-timers: Always clear head pointer on dequeue
hrtimer: Add a missing bracket and hide `migration_base' on !SMP
posix-cpu-timers: Make expiry_active check actually work correctly
posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
tick: Mark sched_timer to expire in hard interrupt context
hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
posix-cpu-timers: Utilize timerqueue for storage
posix-cpu-timers: Move state tracking to struct posix_cputimers
posix-cpu-timers: Deduplicate rlimit handling
posix-cpu-timers: Remove pointless comparisons
posix-cpu-timers: Get rid of 64bit divisions
posix-cpu-timers: Consolidate timer expiry further
posix-cpu-timers: Get rid of zero checks
rlimit: Rewrite non-sensical RLIMIT_CPU comment
posix-cpu-timers: Respect INFINITY for hard RTTIME limit
posix-cpu-timers: Switch thread group sampling to array
posix-cpu-timers: Restructure expiry array
posix-cpu-timers: Remove cputime_expires
...
Pull x86 cpu-feature updates from Ingo Molnar:
- Rework the Intel model names symbols/macros, which were decades of
ad-hoc extensions and added random noise. It's now a coherent, easy
to follow nomenclature.
- Add new Intel CPU model IDs:
- "Tiger Lake" desktop and mobile models
- "Elkhart Lake" model ID
- and the "Lightning Mountain" variant of Airmont, plus support code
- Add the new AVX512_VP2INTERSECT instruction to cpufeatures
- Remove Intel MPX user-visible APIs and the self-tests, because the
toolchain (gcc) is not supporting it going forward. This is the
first, lowest-risk phase of MPX removal.
- Remove X86_FEATURE_MFENCE_RDTSC
- Various smaller cleanups and fixes
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
x86/cpu: Update init data for new Airmont CPU model
x86/cpu: Add new Airmont variant to Intel family
x86/cpu: Add Elkhart Lake to Intel family
x86/cpu: Add Tiger Lake to Intel family
x86: Correct misc typos
x86/intel: Add common OPTDIFFs
x86/intel: Aggregate microserver naming
x86/intel: Aggregate big core graphics naming
x86/intel: Aggregate big core mobile naming
x86/intel: Aggregate big core client naming
x86/cpufeature: Explain the macro duplication
x86/ftrace: Remove mcount() declaration
x86/PCI: Remove superfluous returns from void functions
x86/msr-index: Move AMD MSRs where they belong
x86/cpu: Use constant definitions for CPU models
lib: Remove redundant ftrace flag removal
x86/crash: Remove unnecessary comparison
x86/bitops: Use __builtin_constant_p() directly instead of IS_IMMEDIATE()
x86: Remove X86_FEATURE_MFENCE_RDTSC
x86/mpx: Remove MPX APIs
...
Deactivation of the expiry cache is done by setting all clock caches to
0. That requires to have a check for zero in all places which update the
expiry cache:
if (cache == 0 || new < cache)
cache = new;
Use U64_MAX as the deactivated value, which allows to remove the zero
checks when updating the cache and reduces it to the obvious check:
if (new < cache)
cache = new;
This also removes the weird workaround in do_prlimit() which was required
to convert a RLIMIT_CPU value of 0 (immediate expiry) to 1 because handing
in 0 to the posix CPU timer code would have effectively disarmed it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.275086128@linutronix.de
The comment above the function which arms RLIMIT_CPU in the posix CPU timer
code makes no sense at all. It claims that the kernel does not return an
error code when it rejected the attempt to set RLIMIT_CPU. That's clearly
bogus as the code does an error check and the rlimit is only set and
activated when the permission checks are ok. In case of a rejection an
appropriate error code is returned.
This is a historical and outdated comment which got dragged along even when
the rlimit handling code was rewritten.
Replace it with an explanation why the setup function is not called when
the rlimit value is RLIM_INFINITY and how the 'disarming' is handled.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lkml.kernel.org/r/20190821192922.185511287@linutronix.de
Require that arg{3,4,5} of the PR_{SET,GET}_TAGGED_ADDR_CTRL prctl and
arg2 of the PR_GET_TAGGED_ADDR_CTRL prctl() are zero rather than ignored
for future extensions.
Acked-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
It is not desirable to relax the ABI to allow tagged user addresses into
the kernel indiscriminately. This patch introduces a prctl() interface
for enabling or disabling the tagged ABI with a global sysctl control
for preventing applications from enabling the relaxed ABI (meant for
testing user-space prctl() return error checking without reconfiguring
the kernel). The ABI properties are inherited by threads of the same
application and fork()'ed children but cleared on execve(). A Kconfig
option allows the overall disabling of the relaxed ABI.
The PR_SET_TAGGED_ADDR_CTRL will be expanded in the future to handle
MTE-specific settings like imprecise vs precise exceptions.
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
MPX is being removed from the kernel due to a lack of support in the
toolchain going forward (gcc).
The first step is to remove the userspace-visible ABIs so that applications
will stop using it. The most visible one are the enable/disable prctl()s.
Remove them first.
This is the most minimal and least invasive change needed to ensure that
apps stop using MPX with new kernels.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190705175321.DB42F0AD@viggo.jf.intel.com
The commit a3b609ef9f ("proc read mm's {arg,env}_{start,end} with mmap
semaphore taken.") added synchronization of reading argument/environment
boundaries under mmap_sem. Later commit 88aa7cc688 ("mm: introduce
arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
the coarse use of mmap_sem in similar situations. But there still
remained two places that (mis)use mmap_sem.
get_cmdline should also use arg_lock instead of mmap_sem when it reads the
boundaries.
The second place that should use arg_lock is in prctl_set_mm. By
protecting the boundaries fields with the arg_lock, we can downgrade
mmap_sem to reader lock (analogous to what we already do in
prctl_set_mm_map).
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com
Fixes: 88aa7cc688 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Co-developed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While validating new map we require the @start_data to be strictly less
than @end_data, which is fine for regular applications (this is why this
nit didn't trigger for that long). These members are set from executable
loaders such as elf handers, still it is pretty valid to have a loadable
data section with zero size in file, in such case the start_data is equal
to end_data once kernel loader finishes.
As a result when we're trying to restore such programs the procedure fails
and the kernel returns -EINVAL. From the image dump of a program:
| "mm_start_code": "0x400000",
| "mm_end_code": "0x8f5fb4",
| "mm_start_data": "0xf1bfb0",
| "mm_end_data": "0xf1bfb0",
Thus we need to change validate_prctl_map from strictly less to less or
equal operator use.
Link: http://lkml.kernel.org/r/20190408143554.GY1421@uranus.lan
Fixes: f606b77f1a ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Andrey Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Userspace processes often have multiple allocators that each do
anonymous mmaps to get memory. When examining memory usage of
individual processes or systems as a whole, it is useful to be
able to break down the various heaps that were allocated by
each layer and examine their size, RSS, and physical memory
usage.
This patch adds a user pointer to the shared union in
vm_area_struct that points to a null terminated string inside
the user process containing a name for the vma. vmas that
point to the same address will be merged, but vmas that
point to equivalent strings at different addresses will
not be merged.
Userspace can set the name for a region of memory by calling
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
Setting the name to NULL clears it.
The names of named anonymous vmas are shown in /proc/pid/maps
as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
that is only present for named vmas. If the userspace pointer
is no longer valid all or part of the name will be replaced
with "<fault>".
The idea to store a userspace pointer to reduce the complexity
within mm (at the expense of the complexity of reading
/proc/pid/mem) came from Dave Hansen. This results in no
runtime overhead in the mm subsystem other than comparing
the anon_name pointers when considering vma merging. The pointer
is stored in a union with fieds that are only used on file-backed
mappings, so it does not increase memory usage.
Includes fix from Jed Davis <jld@mozilla.com> for typo in
prctl_set_vma_anon_name, which could attempt to set the name
across two vmas at the same time due to a typo, which might
corrupt the vma list. Fix it to use tmp instead of end to limit
the name setting to a single vma at a time.
Bug: 120441514
Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
[AmitP: Fix get_user_pages_remote() call to align with upstream commit
5b56d49fc3 ("mm: add locked parameter to get_user_pages_remote()")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Merge more updates from Andrew Morton:
- some of the rest of MM
- various misc things
- dynamic-debug updates
- checkpatch
- some epoll speedups
- autofs
- rapidio
- lib/, lib/lzo/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
samples/mic/mpssd/mpssd.h: remove duplicate header
kernel/fork.c: remove duplicated include
include/linux/relay.h: fix percpu annotation in struct rchan
arch/nios2/mm/fault.c: remove duplicate include
unicore32: stop printing the virtual memory layout
MAINTAINERS: fix GTA02 entry and mark as orphan
mm: create the new vm_fault_t type
arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
arch: simplify several early memory allocations
openrisc: simplify pte_alloc_one_kernel()
sh: prefer memblock APIs returning virtual address
microblaze: prefer memblock API returning virtual address
powerpc: prefer memblock APIs returning virtual address
lib/lzo: separate lzo-rle from lzo
lib/lzo: implement run-length encoding
lib/lzo: fast 8-byte copy on arm64
lib/lzo: 64-bit CTZ on arm64
lib/lzo: tidy-up ifdefs
ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
ipc: annotate implicit fall through
...
This change ensures that the set*uid family of syscalls in kernel/sys.c
(setreuid, setuid, setresuid, setfsuid) all call ns_capable_common with
the CAP_OPT_INSETID flag, so capability checks in the security_capable
hook can know whether they are being called from within a set*uid
syscall. This change is a no-op by itself, but is needed for the
proposed SafeSetID LSM.
Signed-off-by: Micah Morton <mortonm@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.morris@microsoft.com>
UNAME26 is a mechanism to report Linux's version as 2.6.x, for
compatibility with old/broken software. Due to the way it is
implemented, it would have to be updated after 5.0, to keep the
resulting versions unique. Linus Torvalds argued:
"Do we actually need this?
I'd rather let it bitrot, and just let it return random versions. It
will just start again at 2.4.60, won't it?
Anybody who uses UNAME26 for a 5.x kernel might as well think it's
still 4.x. The user space is so old that it can't possibly care about
differences between 4.x and 5.x, can it?
The only thing that matters is that it shows "2.4.<largeenough>",
which it will do regardless"
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.
It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access. But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.
A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model. And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.
This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
There were a couple of notable cases:
- csky still had the old "verify_area()" name as an alias.
- the iter_iov code had magical hardcoded knowledge of the actual
values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
really used it)
- microblaze used the type argument for a debug printout
but other than those oddities this should be a total no-op patch.
I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something. Any missed conversion should be trivially fixable, though.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an arm64-specific prctl to allow a thread to reinitialize its
pointer authentication keys to random values. This can be useful when
exec() is not used for starting new processes, to ensure that different
processes still have different keys.
Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Pull namespace fixes from Eric Biederman:
"This is a set of four fairly obvious bug fixes:
- a switch from d_find_alias to d_find_any_alias because the xattr
code perversely takes a dentry
- two mutex vs copy_to_user fixes from Jann Horn
- a fix to use a sanitized size not the size userspace passed in from
Christian Brauner"
* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
getxattr: use correct xattr length
sys: don't hold uts_sem while accessing userspace memory
userns: move user access out of the mutex
cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
Holding uts_sem as a writer while accessing userspace memory allows a
namespace admin to stall all processes that attempt to take uts_sem.
Instead, move data through stack buffers and don't access userspace memory
while uts_sem is held.
Cc: stable@vger.kernel.org
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.
And, the mmap_sem contention may cause unexpected issue like below:
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
proc_pid_cmdline_read+0xd9/0x4e0
__vfs_read+0x37/0x150
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xc5
Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.
So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.
This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs acquire mmap_sem. The mmap_sem
scalability issue will be solved in the future.
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adjust arch_prctl_get/set_spec_ctrl() to operate on tasks other than
current.
This is needed both for /proc/$pid/status queries and for seccomp (since
thread-syncing can trigger seccomp in non-current threads).
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add two new prctls to control aspects of speculation related vulnerabilites
and their mitigations to provide finer grained control over performance
impacting mitigations.
PR_GET_SPECULATION_CTRL returns the state of the speculation misfeature
which is selected with arg2 of prctl(2). The return value uses bit 0-2 with
the following meaning:
Bit Define Description
0 PR_SPEC_PRCTL Mitigation can be controlled per task by
PR_SET_SPECULATION_CTRL
1 PR_SPEC_ENABLE The speculation feature is enabled, mitigation is
disabled
2 PR_SPEC_DISABLE The speculation feature is disabled, mitigation is
enabled
If all bits are 0 the CPU is not affected by the speculation misfeature.
If PR_SPEC_PRCTL is set, then the per task control of the mitigation is
available. If not set, prctl(PR_SET_SPECULATION_CTRL) for the speculation
misfeature will fail.
PR_SET_SPECULATION_CTRL allows to control the speculation misfeature, which
is selected by arg2 of prctl(2) per task. arg3 is used to hand in the
control value, i.e. either PR_SPEC_ENABLE or PR_SPEC_DISABLE.
The common return values are:
EINVAL prctl is not implemented by the architecture or the unused prctl()
arguments are not 0
ENODEV arg2 is selecting a not supported speculation misfeature
PR_SET_SPECULATION_CTRL has these additional return values:
ERANGE arg3 is incorrect, i.e. it's not either PR_SPEC_ENABLE or PR_SPEC_DISABLE
ENXIO prctl control of the selected speculation misfeature is disabled
The first supported controlable speculation misfeature is
PR_SPEC_STORE_BYPASS. Add the define so this can be shared between
architectures.
Based on an initial patch from Tim Chen and mostly rewritten.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Using this helper allows us to avoid the in-kernel call to the
sys_setsid() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_setsid().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using these helpers allows us to avoid the in-kernel calls to these
syscalls: sys_setregid(), sys_setgid(), sys_setreuid(), sys_setuid(),
sys_setresuid(), sys_setresgid(), sys_setfsuid(), and sys_setfsgid().
The ksys_ prefix denotes that these function are meant as a drop-in
replacement for the syscall. In particular, they use the same calling
convention.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
The patch remains without practical effect since both macros carry
identical values. Still, it might become a problem in the future if
(for whatever reason) the default overflow uid and gid differ. The
DEFAULT_FS_OVERFLOWGID macro was previously unused.
Signed-off-by: Wolffhardt Schwabe <wolffhardt.schwabe@fau.de>
Signed-off-by: Anatoliy Cherepantsev <anatoliy.cherepantsev@fau.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Pull arm64 updates from Will Deacon:
"The big highlight is support for the Scalable Vector Extension (SVE)
which required extensive ABI work to ensure we don't break existing
applications by blowing away their signal stack with the rather large
new vector context (<= 2 kbit per vector register). There's further
work to be done optimising things like exception return, but the ABI
is solid now.
Much of the line count comes from some new PMU drivers we have, but
they're pretty self-contained and I suspect we'll have more of them in
future.
Plenty of acronym soup here:
- initial support for the Scalable Vector Extension (SVE)
- improved handling for SError interrupts (required to handle RAS
events)
- enable GCC support for 128-bit integer types
- remove kernel text addresses from backtraces and register dumps
- use of WFE to implement long delay()s
- ACPI IORT updates from Lorenzo Pieralisi
- perf PMU driver for the Statistical Profiling Extension (SPE)
- perf PMU driver for Hisilicon's system PMUs
- misc cleanups and non-critical fixes"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (97 commits)
arm64: Make ARMV8_DEPRECATED depend on SYSCTL
arm64: Implement __lshrti3 library function
arm64: support __int128 on gcc 5+
arm64/sve: Add documentation
arm64/sve: Detect SVE and activate runtime support
arm64/sve: KVM: Hide SVE from CPU features exposed to guests
arm64/sve: KVM: Treat guest SVE use as undefined instruction execution
arm64/sve: KVM: Prevent guests from using SVE
arm64/sve: Add sysctl to set the default vector length for new processes
arm64/sve: Add prctl controls for userspace vector length management
arm64/sve: ptrace and ELF coredump support
arm64/sve: Preserve SVE registers around EFI runtime service calls
arm64/sve: Preserve SVE registers around kernel-mode NEON use
arm64/sve: Probe SVE capabilities and usable vector lengths
arm64: cpufeature: Move sys_caps_initialised declarations
arm64/sve: Backend logic for setting the vector length
arm64/sve: Signal handling support
arm64/sve: Support vector length resetting for new processes
arm64/sve: Core task context handling
arm64/sve: Low-level CPU setup
...
This patch adds two arm64-specific prctls, to permit userspace to
control its vector length:
* PR_SVE_SET_VL: set the thread's SVE vector length and vector
length inheritance mode.
* PR_SVE_GET_VL: get the same information.
Although these prctls resemble instruction set features in the SVE
architecture, they provide additional control: the vector length
inheritance mode is Linux-specific and nothing to do with the
architecture, and the architecture does not permit EL0 to set its
own vector length directly. Both can be used in portable tools
without requiring the use of SVE instructions.
Signed-off-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alex Bennée <alex.bennee@linaro.org>
[will: Fixed up prctl constants to avoid clash with PDEATHSIG]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
During checkpointing and restore of userspace tasks
we bumped into the situation, that it's not possible
to restore the tasks, which user namespace does not
have uid 0 or gid 0 mapped.
People create user namespace mappings like they want,
and there is no a limitation on obligatory uid and gid
"must be mapped". So, if there is no uid 0 or gid 0
in the mapping, it's impossible to restore mm->exe_file
of the processes belonging to this user namespace.
Also, there is no a workaround. It's impossible
to create a temporary uid/gid mapping, because
only one write to /proc/[pid]/uid_map and gid_map
is allowed during a namespace lifetime.
If there is an entry, then no more mapings can't be
written. If there isn't an entry, we can't write
there too, otherwise user task won't be able
to do that in the future.
The patch changes the check, and looks for CAP_SYS_ADMIN
instead of zero uid and gid. This allows to restore
a task independently of its user namespace mappings.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Serge Hallyn <serge@hallyn.com>
CC: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Michal Hocko <mhocko@suse.com>
CC: Andrei Vagin <avagin@openvz.org>
CC: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Stanislav Kinsburskiy <skinsbursky@virtuozzo.com>
CC: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
existing mapping because it only updated mm->def_flags which is a
template for new mappings.
The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
flag set. This can be quite surprising for all those applications which
do not do prctl(); fork() & exec() and want to control their own THP
behavior.
Another usecase when the immediate semantic of the prctl might be useful
is a combination of pre- and post-copy migration of containers with
CRIU. In this case CRIU populates a part of a memory region with data
that was saved during the pre-copy stage. Afterwards, the region is
registered with userfaultfd and CRIU expects to get page faults for the
parts of the region that were not yet populated. However, khugepaged
collapses the pages and the expected page faults do not occur.
In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
temporary mechanism for enabling/disabling THP process wide.
Implementation wise, a new MMF_DISABLE_THP flag is added. This flag is
tested when decision whether to use huge pages is taken either during
page fault of at the time of THP collapse.
It should be noted, that the new implementation makes PR_SET_THP_DISABLE
master override to any per-VMA setting, which was not the case
previously.
Fixes: a0715cc226 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull misc compat stuff updates from Al Viro:
"This part is basically untangling various compat stuff. Compat
syscalls moved to their native counterparts, getting rid of quite a
bit of double-copying and/or set_fs() uses. A lot of field-by-field
copyin/copyout killed off.
- kernel/compat.c is much closer to containing just the
copyin/copyout of compat structs. Not all compat syscalls are gone
from it yet, but it's getting there.
- ipc/compat_mq.c killed off completely.
- block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
drivers/block/floppy.c where they belong. Yes, there are several
drivers that implement some of the same ioctls. Some are m68k and
one is 32bit-only pmac. drivers/block/floppy.c is the only one in
that bunch that can be built on biarch"
* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mqueue: move compat syscalls to native ones
usbdevfs: get rid of field-by-field copyin
compat_hdio_ioctl: get rid of set_fs()
take floppy compat ioctls to sodding floppy.c
ipmi: get rid of field-by-field __get_user()
ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
rt_sigtimedwait(): move compat to native
select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
put_compat_rusage(): switch to copy_to_user()
sigpending(): move compat to native
getrlimit()/setrlimit(): move compat to native
times(2): move compat to native
compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
do_sigaltstack(): lift copying to/from userland into callers
take compat_sys_old_getrlimit() to native syscall
trim __ARCH_WANT_SYS_OLD_GETRLIMIT
New helpers: kernel_waitid() and kernel_wait4(). sys_waitid(),
sys_wait4() and their compat variants switched to those. Copying
struct rusage to userland is left to syscall itself. For
compat_sys_wait4() that eliminates the use of set_fs() completely.
For compat_sys_waitid() it's still needed (for siginfo handling);
that will change shortly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull namespace updates from Eric Biederman:
"This is a set of small fixes that were mostly stumbled over during
more significant development. This proc fix and the fix to
posix-timers are the most significant of the lot.
There is a lot of good development going on but unfortunately it
didn't quite make the merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
proc: Fix unbalanced hard link numbers
signal: Make kill_proc_info static
rlimit: Properly call security_task_setrlimit
signal: Remove unused definition of sig_user_definied
ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET
ipc: Remove unused declaration of recompute_msgmni
posix-timers: Correct sanity check in posix_cpu_nsleep
sysctl: Remove dead register_sysctl_root