udc
113 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
204a155489 | Merge branch 'android-4.14-stable' of https://android.googlesource.com/kernel/common into eleven | ||
|
|
5f0185cd37 |
kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync()
commit 5fa54346caf67b4b1b10b1f390316ae466da4d53 upstream.
The system might hang with the following backtrace:
schedule+0x80/0x100
schedule_timeout+0x48/0x138
wait_for_common+0xa4/0x134
wait_for_completion+0x1c/0x2c
kthread_flush_work+0x114/0x1cc
kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
kthread_cancel_delayed_work_sync+0x18/0x2c
xxxx_pm_notify+0xb0/0xd8
blocking_notifier_call_chain_robust+0x80/0x194
pm_notifier_call_chain_robust+0x28/0x4c
suspend_prepare+0x40/0x260
enter_state+0x80/0x3f4
pm_suspend+0x60/0xdc
state_store+0x108/0x144
kobj_attr_store+0x38/0x88
sysfs_kf_write+0x64/0xc0
kernfs_fop_write_iter+0x108/0x1d0
vfs_write+0x2f4/0x368
ksys_write+0x7c/0xec
It is caused by the following race between kthread_mod_delayed_work()
and kthread_cancel_delayed_work_sync():
CPU0 CPU1
Context: Thread A Context: Thread B
kthread_mod_delayed_work()
spin_lock()
__kthread_cancel_work()
spin_unlock()
del_timer_sync()
kthread_cancel_delayed_work_sync()
spin_lock()
__kthread_cancel_work()
spin_unlock()
del_timer_sync()
spin_lock()
work->canceling++
spin_unlock
spin_lock()
queue_delayed_work()
// dwork is put into the worker->delayed_work_list
spin_unlock()
kthread_flush_work()
// flush_work is put at the tail of the dwork
wait_for_completion()
Context: IRQ
kthread_delayed_work_timer_fn()
spin_lock()
list_del_init(&work->node);
spin_unlock()
BANG: flush_work is not longer linked and will never get proceed.
The problem is that kthread_mod_delayed_work() checks work->canceling
flag before canceling the timer.
A simple solution is to (re)check work->canceling after
__kthread_cancel_work(). But then it is not clear what should be
returned when __kthread_cancel_work() removed the work from the queue
(list) and it can't queue it again with the new @delay.
The return value might be used for reference counting. The caller has
to know whether a new work has been queued or an existing one was
replaced.
The proper solution is that kthread_mod_delayed_work() will remove the
work from the queue (list) _only_ when work->canceling is not set. The
flag must be checked after the timer is stopped and the remaining
operations can be done under worker->lock.
Note that kthread_mod_delayed_work() could remove the timer and then
bail out. It is fine. The other canceling caller needs to cancel the
timer as well. The important thing is that the queue (list)
manipulation is done atomically under worker->lock.
Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com
Fixes:
|
||
|
|
5f7c8a41b8 |
kthread_worker: split code for canceling the delayed work timer
commit 34b3d5344719d14fd2185b2d9459b3abcb8cf9d8 upstream. Patch series "kthread_worker: Fix race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync()". This patchset fixes the race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync() including proper return value handling. This patch (of 2): Simple code refactoring as a preparation step for fixing a race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(). It does not modify the existing behavior. Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: <jenhaochen@google.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
7a65cfc279 | Merge branch 'android-4.14-stable' of https://android.googlesource.com/kernel/common into HEAD | ||
|
|
a9b2e4a94e |
kthread: Extract KTHREAD_IS_PER_CPU
[ Upstream commit ac687e6e8c26181a33270efd1a2e2241377924b0 ] There is a need to distinguish geniune per-cpu kthreads from kthreads that happen to have a single CPU affinity. Geniune per-cpu kthreads are kthreads that are CPU affine for correctness, these will obviously have PF_KTHREAD set, but must also have PF_NO_SETAFFINITY set, lest userspace modify their affinity and ruins things. However, these two things are not sufficient, PF_NO_SETAFFINITY is also set on other tasks that have their affinities controlled through other means, like for instance workqueues. Therefore another bit is needed; it turns out kthread_create_per_cpu() already has such a bit: KTHREAD_IS_PER_CPU, which is used to make kthread_park()/kthread_unpark() work correctly. Expose this flag and remove the implicit setting of it from kthread_create_on_cpu(); the io_uring usage of it seems dubious at best. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
0fafd5a141 |
cpu-boost: Rework scheduling setup
This patch minimizes the latency of an input boost by creating a dedicated kworker and scheduling it with the realtime policy "SCHED_FIFO". We use "MAX_RT_PRIO - 2" to have the input boost task preempt userspace RT tasks if needed while being preempted by kernel internal RT tasks which are scheduled at "MAX_RT_PRIO - 1" or higher. Also since now the cpu-boost wq would only handle the work to disable a running input boost, which isn't latency critical, we break up the cpu-boost wq and move the work to disable a running input boost to the shared regular wq. Signed-off-by: Alex Naidis <alex.naidis@linux.com> Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com> Signed-off-by: UtsavisGreat <utsavbalar1231@gmail.com> |
||
|
|
efdfd74e39 | Merge branch 'android-4.14-stable' of https://android.googlesource.com/kernel/common into eleven | ||
|
|
b72ae987f4 |
kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled
commit 6993d0fdbee0eb38bfac350aa016f65ad11ed3b1 upstream.
There is a small race window when a delayed work is being canceled and
the work still might be queued from the timer_fn:
CPU0 CPU1
kthread_cancel_delayed_work_sync()
__kthread_cancel_work_sync()
__kthread_cancel_work()
work->canceling++;
kthread_delayed_work_timer_fn()
kthread_insert_work();
BUG: kthread_insert_work() should not get called when work->canceling is
set.
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201014083030.16895-1-qiang.zhang@windriver.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
b2c8463039 |
Merge android-4.14-p.61 (b7e55e8) into msm-4.14
* remotes/origin/tmp-b7e55e8:
Linux 4.14.61
scsi: sg: fix minor memory leak in error path
drm/vc4: Reset ->{x, y}_scaling[1] when dealing with uniplanar formats
crypto: padlock-aes - Fix Nano workaround data corruption
RDMA/uverbs: Expand primary and alt AV port checks
iwlwifi: add more card IDs for 9000 series
userfaultfd: remove uffd flags from vma->vm_flags if UFFD_EVENT_FORK fails
audit: fix potential null dereference 'context->module.name'
kvm: x86: vmx: fix vpid leak
x86/entry/64: Remove %ebx handling from error_entry/exit
x86/apic: Future-proof the TSC_DEADLINE quirk for SKX
virtio_balloon: fix another race between migration and ballooning
net: socket: fix potential spectre v1 gadget in socketcall
can: ems_usb: Fix memory leak on ems_usb_disconnect()
squashfs: more metadata hardenings
squashfs: more metadata hardening
net/mlx5e: E-Switch, Initialize eswitch only if eswitch manager
rxrpc: Fix user call ID check in rxrpc_service_prealloc_one
net: stmmac: Fix WoL for PCI-based setups
netlink: Fix spectre v1 gadget in netlink_create()
net: dsa: Do not suspend/resume closed slave_dev
ipv4: frags: handle possible skb truesize change
inet: frag: enforce memory limits earlier
bonding: avoid lockdep confusion in bond_get_stats()
Linux 4.14.60
tcp: add one more quick ack after after ECN events
tcp: refactor tcp_ecn_check_ce to remove sk type cast
tcp: do not aggressively quick ack after ECN events
tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
tcp: do not force quickack when receiving out-of-order packets
netlink: Don't shift with UB on nlk->ngroups
netlink: Do not subscribe to non-existent groups
xen-netfront: wait xenbus state change when load module manually
tcp_bbr: fix bw probing to raise in-flight data for very small BDPs
NET: stmmac: align DMA stuff to largest cache line length
net: mdio-mux: bcm-iproc: fix wrong getter and setter pair
net: lan78xx: fix rx handling before first packet is send
net: fix amd-xgbe flow-control issue
net: ena: Fix use of uninitialized DMA address bits field
ipv4: remove BUG_ON() from fib_compute_spec_dst
net: dsa: qca8k: Allow overwriting CPU port setting
net: dsa: qca8k: Add QCA8334 binding documentation
net: dsa: qca8k: Enable RXMAC when bringing up a port
net: dsa: qca8k: Force CPU port to its highest bandwidth
RDMA/uverbs: Protect from attempts to create flows on unsupported QP
usb: gadget: udc: renesas_usb3: should remove debugfs
ovl: Sync upper dirty data when syncing overlayfs
PCI: xgene: Remove leftover pci_scan_child_bus() call
PCI: pciehp: Assume NoCompl+ for Thunderbolt ports
ext4: fix check to prevent initializing reserved inodes
ext4: check for allocation block validity with block group locked
ext4: fix inline data updates with checksums enabled
squashfs: be more careful about metadata corruption
random: mix rdrand with entropy sent in from userspace
block: reset bi_iter.bi_done after splitting bio
blkdev: __blkdev_direct_IO_simple: fix leak in error case
block: bio_iov_iter_get_pages: fix size of last iovec
drm/dp/mst: Fix off-by-one typo when dump payload table
drm/atomic-helper: Drop plane->fb references only for drm_atomic_helper_shutdown()
drm: Add DP PSR2 sink enable bit
ASoC: topology: Add missing clock gating parameter when parsing hw_configs
ASoC: topology: Fix bclk and fsync inversion in set_link_hw_format()
media: si470x: fix __be16 annotations
media: atomisp: compat32: fix __user annotations
scsi: cxlflash: Avoid clobbering context control register value
scsi: cxlflash: Synchronize reset and remove ops
scsi: megaraid_sas: Increase timeout by 1 sec for non-RAID fastpath IOs
scsi: scsi_dh: replace too broad "TP9" string with the exact models
regulator: Don't return or expect -errno from of_map_mode()
media: omap3isp: fix unbalanced dma_iommu_mapping
crypto: authenc - don't leak pointers to authenc keys
crypto: authencesn - don't leak pointers to authenc keys
usb: hub: Don't wait for connect state at resume for powered-off ports
microblaze: Fix simpleImage format generation
soc: imx: gpcv2: Do not pass static memory as platform data
serial: core: Make sure compiler barfs for 16-byte earlycon names
staging: lustre: ldlm: free resource when ldlm_lock_create() fails.
staging: lustre: llite: correct removexattr detection
staging: vchiq_core: Fix missing semaphore release in error case
audit: allow not equal op for audit by executable
rsi: fix nommu_map_sg overflow kernel panic
rsi: Fix 'invalid vdd' warning in mmc
ipconfig: Correctly initialise ic_nameservers
drm/gma500: fix psb_intel_lvds_mode_valid()'s return type
igb: Fix queue selection on MAC filters on i210
arm64: defconfig: Enable Rockchip io-domain driver
nvme: lightnvm: add granby support
memory: tegra: Apply interrupts mask per SoC
memory: tegra: Do not handle spurious interrupts
delayacct: Use raw_spinlocks
stop_machine: Use raw spinlocks
backlight: pwm_bl: Don't use GPIOF_* with gpiod_get_direction
dt-bindings: net: meson-dwmac: new compatible name for AXG SoC
net: hns3: Fixes the out of bounds access in hclge_map_tqp
spi: meson-spicc: Fix error handling in meson_spicc_probe()
dt-bindings: pinctrl: meson: add support for the Meson8m2 SoC
mmc: pwrseq: Use kmalloc_array instead of stack VLA
mmc: dw_mmc: update actual clock for mmc debugfs
ALSA: hda/ca0132: fix build failure when a local macro is defined
drm/atomic: Handling the case when setting old crtc for plane
media: siano: get rid of __le32/__le16 cast warnings
f2fs: avoid fsync() failure caused by EAGAIN in writepage()
bpf: fix references to free_bpf_prog_info() in comments
thermal: exynos: fix setting rising_threshold for Exynos5433
staging: lustre: o2iblnd: Fix FastReg map/unmap for MLX5
staging: lustre: o2iblnd: fix race at kiblnd_connect_peer
scsi: qedf: Set the UNLOADING flag when removing a vport
scsi: hisi_sas: config ATA de-reset as an constrained command for v3 hw
scsi: megaraid: silence a static checker bug
scsi: 3w-xxxx: fix a missing-check bug
scsi: 3w-9xxx: fix a missing-check bug
bnxt_en: Check unsupported speeds in bnxt_update_link() on PF only.
perf: fix invalid bit in diagnostic entry
s390/cpum_sf: Add data entry sizes to sampling trailer entry
brcmfmac: Add support for bcm43364 wireless chipset
mtd: rawnand: fsl_ifc: fix FSL NAND driver to read all ONFI parameter pages
media: saa7164: Fix driver name in debug output
media: media-device: fix ioctl function types
ACPI / LPSS: Only call pwm_add_table() for Bay Trail PWM if PMIC HRV is 2
libata: Fix command retry decision
media: rcar_jpu: Add missing clk_disable_unprepare() on error in jpu_open()
net: phy: phylink: Release link GPIO
dma-iommu: Fix compilation when !CONFIG_IOMMU_DMA
tty: Fix data race in tty_insert_flip_string_fixed_flag
i40e: free the skb after clearing the bitlock
nvmem: properly handle returned value nvmem_reg_read
ARM: dts: sh73a0: Add missing interrupt-affinity to PMU node
ARM: dts: emev2: Add missing interrupt-affinity to PMU node
ARM: dts: stih407-pinctrl: Fix complain about IRQ_TYPE_NONE usage
EDAC, altera: Fix ARM64 build warning
HID: i2c-hid: check if device is there before really probing
powerpc/embedded6xx/hlwd-pic: Prevent interrupts from being handled by Starlet
drm/amdgpu: Remove VRAM from shared bo domains.
drm/radeon: fix mode_valid's return type
arm64: dts: renesas: salvator-common: use audio-graph-card for Sound
HID: hid-plantronics: Re-resend Update to map button for PTT products
arm64: cmpwait: Clear event register before arming exclusive monitor
media: atomisp: ov2680: don't declare unused vars
ALSA: usb-audio: Apply rate limit to warning messages in URB complete callback
net: ethernet: ti: cpsw-phy-sel: check bus_find_device() ret value
media: smiapp: fix timeout checking in smiapp_read_nvm
ixgbevf: fix MAC address changes through ixgbevf_set_mac()
md: fix NULL dereference of mddev->pers in remove_and_add_spares()
md/raid1: add error handling of read error from FailFast device
regulator: pfuze100: add .is_enable() for pfuze100_swb_regulator_ops
ALSA: emu10k1: Rate-limit error messages about page errors
rtc: tps65910: fix possible race condition
rtc: vr41xx: fix possible race condition
rtc: tps6586x: fix possible race condition
Bluetooth: btusb: add ID for LiteOn 04ca:301a
drm/nouveau/fifo/gk104-: poll for runlist update completion
scsi: zfcp: assert that the ERP lock is held when tracing a recovery trigger
scsi: ufs: fix exception event handling
scsi: ufs: ufshcd: fix possible unclocked register access
fscrypt: use unbound workqueue for decryption
net: hns3: Fix the missing client list node initialization
spi: Add missing pm_runtime_put_noidle() after failed get
drivers/perf: arm-ccn: don't log to dmesg in event_init
ima: based on policy verify firmware signatures (pre-allocated buffer)
mwifiex: correct histogram data with appropriate index
net: dsa: qca8k: Add support for QCA8334 switch
PCI: pciehp: Request control of native hotplug only if supported
bpf: powerpc64: pad function address loads with NOPs
pinctrl: at91-pio4: add missing of_node_put
powerpc/8xx: fix invalid register expression in head_8xx.S
spi: sh-msiof: Fix setting SIRMDR1.SYNCAC to match SITMDR1.SYNCAC
powerpc: Add __printf verification to prom_printf
powerpc/powermac: Mark variable x as unused
powerpc/powermac: Add missing prototype for note_bootable_part()
powerpc/chrp/time: Make some functions static, add missing header include
powerpc/32: Add a missing include header
ath: Add regulatory mapping for Bahamas
ath: Add regulatory mapping for Bermuda
ath: Add regulatory mapping for Serbia
ath: Add regulatory mapping for Tanzania
ath: Add regulatory mapping for Uganda
ath: Add regulatory mapping for APL2_FCCA
ath: Add regulatory mapping for APL13_WORLD
ath: Add regulatory mapping for ETSI8_WORLD
ath: Add regulatory mapping for FCC3_ETSIC
nvme-pci: Fix AER reset handling
nvme-rdma: stop admin queue before freeing it
PCI: Prevent sysfs disable of device while driver is attached
PM / wakeup: Make s2idle_lock a RAW_SPINLOCK
x86/microcode: Make the late update update_lock a raw lock for RT
btrfs: qgroup: Finish rescan when hit the last leaf of extent tree
btrfs: add barriers to btrfs_sync_log before log_commit_wait wakeups
Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()
Btrfs: don't return ino to ino cache if inode item removal fails
media: videobuf2-core: don't call memop 'finish' when queueing
media: tw686x: Fix incorrect vb2_mem_ops GFP flags
net: hns3: Fixes the init of the VALID BD info in the descriptor
wlcore: sdio: check for valid platform device data before suspend
mwifiex: handle race during mwifiex_usb_disconnect
mfd: cros_ec: Fail early if we cannot identify the EC
ASoC: dpcm: fix BE dai not hw_free and shutdown
Bluetooth: btusb: Add a new Realtek 8723DE ID 2ff8:b011
Bluetooth: hci_qca: Fix "Sleep inside atomic section" warning
iwlwifi: pcie: fix race in Rx buffer allocator
btrfs: balance dirty metadata pages in btrfs_finish_ordered_io
PCI: Fix devm_pci_alloc_host_bridge() memory leak
selftests: intel_pstate: return Kselftest Skip code for skipped tests
selftests: memfd: return Kselftest Skip code for skipped tests
selftests/intel_pstate: Improve test, minor fixes
perf/x86/intel/uncore: Correct fixed counter index check for NHM
perf/x86/intel/uncore: Correct fixed counter index check in generic code
usbip: dynamically allocate idev by nports found in sysfs
usbip: usbip_detach: Fix memory, udev context and udev leak
block, bfq: remove wrong lock in bfq_requests_merged
f2fs: fix race in between GC and atomic open
f2fs: fix to detect failure of dquot_initialize
f2fs: Fix deadlock in shutdown ioctl
f2fs: fix to wait page writeback during revoking atomic write
f2fs: fix to don't trigger writeback during recovery
f2fs: fix error path of move_data_page
disable loading f2fs module on PAGE_SIZE > 4KB
pnfs: Don't release the sequence slot until we've processed layoutget on open
netfilter: nf_tables: check msg_type before nft_trans_set(trans)
lightnvm: pblk: warn in case of corrupted write buffer
RDMA/mad: Convert BUG_ONs to error flows
powerpc/64s: Fix compiler store ordering to SLB shadow area
hvc_opal: don't set tb_ticks_per_usec in udbg_init_opal_common()
powerpc/eeh: Fix use-after-release of EEH driver
powerpc/64s: Add barrier_nospec
powerpc/lib: Adjust .balign inside string functions for PPC32
infiniband: fix a possible use-after-free bug
e1000e: Ignore TSYNCRXCTL when getting I219 clock attributes
ceph: fix alignment of rasize
bpf, arm32: fix inconsistent naming about emit_a32_lsr_{r64,i64}
printk: drop in_nmi check from printk_safe_flush_on_panic()
watchdog: da9063: Fix updating timeout value
irqchip/ls-scfg-msi: Map MSIs in the iommu
netfilter: ipset: List timing out entries with "timeout 1" instead of zero
netfilter: ipset: forbid family for hash:mac sets
perf tools: Fix pmu events parsing rule
rtc: ensure rtc_set_alarm fails when alarms are not supported
mm/slub.c: add __printf verification to slab_err()
mm: vmalloc: avoid racy handling of debugobjects in vunmap
mm: /proc/pid/pagemap: hide swap entries from unprivileged users
kernel/hung_task.c: show all hung tasks before panic
vfio/type1: Fix task tracking for QEMU vCPU hotplug
vfio/mdev: Check globally for duplicate devices
vfio: platform: Fix reset module leak in error path
nfsd: fix potential use-after-free in nfsd4_decode_getdeviceinfo
NFSv4.1: Fix the client behaviour on NFS4ERR_SEQ_FALSE_RETRY
ALSA: fm801: add error handling for snd_ctl_add
ALSA: emu10k1: add error handling for snd_ctl_add
skip LAYOUTRETURN if layout is invalid
hv_netvsc: fix network namespace issues with VF support
xen/netfront: raise max number of slots in xennet_get_responses()
kcov: ensure irq code sees a valid area
mlxsw: spectrum_switchdev: Fix port_vlan refcounting
arm64: fix vmemmap BUILD_BUG_ON() triggering on !vmemmap setups
tracing: Quiet gcc warning about maybe unused link variable
tracing/kprobes: Fix trace_probe flags on enable_trace_kprobe() failure
kthread, tracing: Don't expose half-written comm when creating kthreads
tracing: Fix possible double free in event_enable_trigger_func()
tracing: Fix double free of event_trigger_data
delayacct: fix crash in delayacct_blkio_end() after delayacct init failure
kvm, mm: account shadow page tables to kmemcg
Input: elan_i2c - add another ACPI ID for Lenovo Ideapad 330-15AST
Input: i8042 - add Lenovo LaVie Z to the i8042 reset list
Input: elan_i2c - add ACPI ID for lenovo ideapad 330
spi: spi-s3c64xx: Fix system resume support
drivers/infiniband/ulp/srpt/ib_srpt.c: fix build with gcc-4.4.4
IB/srpt: Fix an out-of-bounds stack access in srpt_zerolength_write()
drivers/infiniband/core/verbs.c: fix build with gcc-4.4.4
RDMA/core: Avoid that ib_drain_qp() triggers an out-of-bounds stack access
i2c: core: decrease reference count of device node in i2c_unregister_device
fork: unconditionally clear stack on fork
Linux 4.14.59
turn off -Wattribute-alias
can: m_can.c: fix setup of CCCR register: clear CCCR NISO bit before checking can.ctrlmode
can: peak_canfd: fix firmware < v3.3.0: limit allocation to 32-bit DMA addr only
can: xilinx_can: fix RX overflow interrupt not being enabled
can: xilinx_can: fix incorrect clear of non-processed interrupts
can: xilinx_can: keep only 1-2 frames in TX FIFO to fix TX accounting
can: xilinx_can: fix device dropping off bus on RX overrun
can: xilinx_can: fix recovery from error states not being propagated
can: xilinx_can: fix power management handling
can: xilinx_can: fix RX loop if RXNEMP is asserted without RXOK
driver core: Partially revert "driver core: correct device's shutdown order"
usb: gadget: f_fs: Only return delayed status when len is 0
usb: dwc2: Fix DMA alignment to start at allocated boundary
usb: core: handle hub C_PORT_OVER_CURRENT condition
usb: cdc_acm: Add quirk for Castles VEGA3000
staging: speakup: fix wraparound in uaccess length check
tcp: add tcp_ooo_try_coalesce() helper
tcp: call tcp_drop() from tcp_data_queue_ofo()
tcp: detect malicious patterns in tcp_collapse_ofo_queue()
tcp: avoid collapses in tcp_prune_queue() if possible
tcp: free batches of packets in tcp_prune_ofo_queue()
tcp: do not delay ACK in DCTCP upon CE status change
tcp: do not cancel delay-AcK on DCTCP special ACK
tcp: helpers to send special DCTCP ack
tcp: fix dctcp delayed ACK schedule
vxlan: fix default fdb entry netlink notify ordering during netdev create
vxlan: make netlink notify in vxlan_fdb_destroy optional
vxlan: add new fdb alloc and create helpers
rtnetlink: add rtnl_link_state check in rtnl_configure_link
sock: fix sg page frag coalescing in sk_alloc_sg
net: phy: consider PHY_IGNORE_INTERRUPT in phy_start_aneg_priv
multicast: do not restore deleted record source filter mode to new one
net/ipv6: Fix linklocal to global address with VRF
net/mlx5e: Fix quota counting in aRFS expire flow
net/mlx5e: Don't allow aRFS for encapsulated packets
net/mlx5: Adjust clock overflow work period
net: skb_segment() should not return NULL
net/mlx4_core: Save the qpn from the input modifier in RST2INIT wrapper
ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull
ip: hash fragments consistently
bonding: set default miimon value for non-arp modes if not set
drm/nouveau: Set DRIVER_ATOMIC cap earlier to fix debugfs
drm/nouveau/drm/nouveau: Fix runtime PM leak in nv50_disp_atomic_commit()
KVM: PPC: Check if IOMMU page is contained in the pinned physical page
xen/PVH: Set up GS segment for stack canary
MIPS: Fix off-by-one in pci_resource_to_user()
MIPS: ath79: fix register address in ath79_ddr_wb_flush()
Revert "cifs: Fix slab-out-of-bounds in send_set_info() on SMB2 ACE setting"
ANDROID: verity: really fix android-verity Kconfig
tcp: add tcp_ooo_try_coalesce() helper
tcp: call tcp_drop() from tcp_data_queue_ofo()
tcp: detect malicious patterns in tcp_collapse_ofo_queue()
tcp: avoid collapses in tcp_prune_queue() if possible
tcp: free batches of packets in tcp_prune_ofo_queue()
x86_64_cuttlefish_defconfig: Enable android-verity
x86_64_cuttlefish_defconfig: enable verity cert
ANDROID: android-verity: Fix broken parameter handling.
ANDROID: android-verity: Make it work with newer kernels
ANDROID: android-verity: Add API to verify signature with builtin keys.
ANDROID: verity: fix android-verity Kconfig dependencies
Linux 4.14.58
xhci: Fix perceived dead host due to runtime suspend race with event handler
powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle)
cxl_getfile(): fix double-iput() on alloc_file() failures
alpha: fix osf_wait4() breakage
net: usb: asix: replace mii_nway_restart in resume path
ipv6: make DAD fail with enhanced DAD when nonce length differs
net: systemport: Fix CRC forwarding check for SYSTEMPORT Lite
net/mlx4_en: Don't reuse RX page when XDP is set
hv_netvsc: Fix napi reschedule while receive completion is busy
tg3: Add higher cpu clock for 5762.
qmi_wwan: add support for Quectel EG91
ptp: fix missing break in switch
net: phy: fix flag masking in __set_phy_supported
net/ipv4: Set oif in fib_compute_spec_dst
skbuff: Unconditionally copy pfmemalloc in __skb_clone()
net: Don't copy pfmemalloc flag in __copy_skb_header()
net: diag: Don't double-free TCP_NEW_SYN_RECV sockets in tcp_abort
lib/rhashtable: consider param->min_size when setting initial table size
ipv6: ila: select CONFIG_DST_CACHE
ipv6: fix useless rol32 call on hash
ipv4: Return EINVAL when ping_group_range sysctl doesn't map to user ns
gen_stats: Fix netlink stats dumping in the presence of padding
drm/nouveau: Avoid looping through fake MST connectors
drm/nouveau: Use drm_connector_list_iter_* for iterating connectors
drm/i915: Fix hotplug irq ack on i965/g4x
stop_machine: Disable preemption when waking two stopper threads
vfio/spapr: Use IOMMU pageshift rather than pagesize
vfio/pci: Fix potential Spectre v1
cpufreq: intel_pstate: Register when ACPI PCCH is present
mm/huge_memory.c: fix data loss when splitting a file pmd
mm: memcg: fix use after free in mem_cgroup_iter()
ARC: mm: allow mprotect to make stack mappings executable
ARC: configs: Remove CONFIG_INITRAMFS_SOURCE from defconfigs
ARC: Fix CONFIG_SWAP
ARCv2: [plat-hsdk]: Save accl reg pair by default
ALSA: hda: add mute led support for HP ProBook 455 G5
ALSA: hda/realtek - Add Panasonic CF-SZ6 headset jack quirk
ALSA: rawmidi: Change resized buffers atomically
fat: fix memory allocation failure handling of match_strdup()
x86/MCE: Remove min interval polling limitation
x86/events/intel/ds: Fix bts_interrupt_threshold alignment
x86/apm: Don't access __preempt_count with zeroed fs
KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.
scsi: sd_zbc: Fix variable type and bogus comment
ANDROID: uid_sys_stats: Replace tasklist lock with RCU in uid_cputime_show
Linux 4.14.57
string: drop __must_check from strscpy() and restore strscpy() usages in cgroup
arm64: KVM: Add ARCH_WORKAROUND_2 discovery through ARCH_FEATURES_FUNC_ID
arm64: KVM: Handle guest's ARCH_WORKAROUND_2 requests
arm64: KVM: Add ARCH_WORKAROUND_2 support for guests
arm64: KVM: Add HYP per-cpu accessors
arm64: ssbd: Add prctl interface for per-thread mitigation
arm64: ssbd: Introduce thread flag to control userspace mitigation
arm64: ssbd: Restore mitigation status on CPU resume
arm64: ssbd: Skip apply_ssbd if not using dynamic mitigation
arm64: ssbd: Add global mitigation state accessor
arm64: Add 'ssbd' command-line option
arm64: Add ARCH_WORKAROUND_2 probing
arm64: Add per-cpu infrastructure to call ARCH_WORKAROUND_2
arm64: Call ARCH_WORKAROUND_2 on transitions between EL0 and EL1
arm/arm64: smccc: Add SMCCC-specific return codes
KVM: arm64: Avoid storing the vcpu pointer on the stack
KVM: arm/arm64: Do not use kern_hyp_va() with kvm_vgic_global_state
arm64: alternatives: Add dynamic patching feature
KVM: arm64: Stop save/restoring host tpidr_el1 on VHE
arm64: alternatives: use tpidr_el2 on VHE hosts
KVM: arm64: Change hyp_panic()s dependency on tpidr_el2
KVM: arm/arm64: Convert kvm_host_cpu_state to a static per-cpu allocation
KVM: arm64: Store vcpu on the stack during __guest_enter()
net/nfc: Avoid stalls when nfc_alloc_send_skb() returned NULL.
rds: avoid unenecessary cong_update in loop transport
bdi: Fix another oops in wb_workfn()
netfilter: ipv6: nf_defrag: drop skb dst before queueing
nsh: set mac len based on inner packet
autofs: fix slab out of bounds read in getname_kernel()
tls: Stricter error checking in zerocopy sendmsg path
KEYS: DNS: fix parsing multiple options
reiserfs: fix buffer overflow with long warning messages
netfilter: ebtables: reject non-bridge targets
PCI: hv: Disable/enable IRQs rather than BH in hv_compose_msi_msg()
block: do not use interruptible wait anywhere
mtd: rawnand: denali_dt: set clk_x_rate to 200 MHz unconditionally
crypto: af_alg - Initialize sg_num_bytes in error code path
clocksource: Initialize cs->wd_list
media: rc: oops in ir_timer_keyup after device unplug
xhci: Fix USB3 NULL pointer dereference at logical disconnect.
net: lan78xx: Fix race in tx pending skb size calculation
rtlwifi: rtl8821ae: fix firmware is not ready to run
rtlwifi: Fix kernel Oops "Fw download fail!!"
net: cxgb3_main: fix potential Spectre v1
VSOCK: fix loopback on big-endian systems
vhost_net: validate sock before trying to put its fd
tcp: prevent bogus FRTO undos with non-SACK flows
tcp: fix Fast Open key endianness
strparser: Remove early eaten to fix full tcp receive buffer stall
stmmac: fix DMA channel hang in half-duplex mode
r8152: napi hangup fix after disconnect
qmi_wwan: add support for the Dell Wireless 5821e module
qed: Limit msix vectors in kdump kernel to the minimum required count.
qed: Fix use of incorrect size in memcpy call.
qed: Fix setting of incorrect eswitch mode.
qede: Adverstise software timestamp caps when PHC is not available.
net/tcp: Fix socket lookups with SO_BINDTODEVICE
net: sungem: fix rx checksum support
net_sched: blackhole: tell upper qdisc about dropped packets
net/packet: fix use-after-free
net: mvneta: fix the Rx desc DMA address in the Rx path
net/mlx5: Fix wrong size allocation for QoS ETC TC regitster
net/mlx5: Fix required capability for manipulating MPFS
net/mlx5: Fix incorrect raw command length parsing
net/mlx5: Fix command interface race in polling mode
net/mlx5: E-Switch, Avoid setup attempt if not being e-switch manager
net/mlx5e: Don't attempt to dereference the ppriv struct if not being eswitch manager
net/mlx5e: Avoid dealing with vport representors if not being e-switch manager
net: macb: Fix ptp time adjustment for large negative delta
net: fix use-after-free in GRO with ESP
net: dccp: switch rx_tstamp_last_feedback to monotonic clock
net: dccp: avoid crash in ccid3_hc_rx_send_feedback()
ixgbe: split XDP_TX tail and XDP_REDIRECT map flushing
ipvlan: fix IFLA_MTU ignored on NEWLINK
ipv6: sr: fix passing wrong flags to crypto_alloc_shash()
hv_netvsc: split sub-channel setup into async and sync
atm: zatm: Fix potential Spectre v1
atm: Preserve value of skb->truesize when accounting to vcc
alx: take rtnl before calling __alx_open from resume
crypto: crypto4xx - fix crypto4xx_build_pdr, crypto4xx_build_sdr leak
crypto: crypto4xx - remove bad list_del
PCI: exynos: Fix a potential init_clk_resources NULL pointer dereference
bcm63xx_enet: do not write to random DMA channel on BCM6345
bcm63xx_enet: correct clock usage
ocfs2: ip_alloc_sem should be taken in ocfs2_get_block()
ocfs2: subsystem.su_mutex is required while accessing the item->ci_parent
xprtrdma: Fix corner cases when handling device removal
cpufreq / CPPC: Set platform specific transition_delay_us
Btrfs: fix duplicate extents after fsync of file with prealloc extents
x86/paravirt: Make native_save_fl() extern inline
x86/asm: Add _ASM_ARG* constants for argument registers to <asm/asm.h>
compiler-gcc.h: Add __attribute__((gnu_inline)) to all inline declarations
ANDROID: Add hold functionality to schedtune CPU boost
ANDROID: sched/rt: Add schedtune accounting to rt task enqueue/dequeue
UPSTREAM: cpuidle: menu: Avoid selecting shallow states with stopped tick
UPSTREAM: cpuidle: menu: Refine idle state selection for running tick
UPSTREAM: sched: idle: Select idle state before stopping the tick
BACKPORT: time: hrtimer: Introduce hrtimer_next_event_without()
BACKPORT: time: tick-sched: Split tick_nohz_stop_sched_tick()
UPSTREAM: cpuidle: Return nohz hint from cpuidle_select()
UPSTREAM: jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC
UPSTREAM: sched: idle: Do not stop the tick before cpuidle_idle_call()
BACKPORT: sched: idle: Do not stop the tick upfront in the idle loop
BACKPORT: time: tick-sched: Reorganize idle tick management code
ANDROID: sched/fair: fix a warning
ANDROID: sched/walt: Fix compilation issue for x86_64
ANDROID: mnt: Fix next_descendent
ANDROID: sched/events: Introduce util_est trace events
ANDROID: sched/fair: schedtune: update before schedutil
FROMLIST: sched/fair: add support to tune PELT ramp/decay timings
BACKPORT: sched/fair: Update util_est before updating schedutil
BACKPORT: sched/fair: Update util_est only on util_avg updates
BACKPORT: sched/fair: Use util_est in LB and WU paths
BACKPORT: sched/fair: Add util_est on top of PELT
ANDROID: sched/fair: Cleanup cpu_util{_wake}()
ANDROID: sched: Update max cpu capacity in case of max frequency constraints
ANDROID: arm: enable max frequency capping
ANDROID: arm64: enable max frequency capping
ANDROID: implement max frequency capping
ANDROID: sched/fair: add arch scaling function for max frequency capping
ANDROID: trace: Add WALT util signal to trace event sched_load_cfs_rq
ANDROID: sched, trace: Remove trace event sched_load_avg_cpu
ANDROID: Rename and move include/linux/sched_energy.h
ANDROID: Adjust juno energy model
ANDROID: Check equality of max cap state cap and cpu scale
ANDROID: Move energy model init call into arch_topology driver
ANDROID: Streamline sched_domain_energy_f functions
ANDROID: Separate cpu_scale and energy model setup
ANDROID: update_group_capacity for single cpu in cluster
ANDROID: sched/fair: return idle CPU immediately for prefer_idle
ANDROID: sched/fair: add idle state filter to prefer_idle case
ANDROID: sched/fair: remove order from CPU selection
ANDROID: sched/fair: unify spare capacity calculation
ANDROID:sched/fair: prefer energy efficient CPUs for !prefer_idle tasks
ANDROID: sched/fair: fix CPU selection for non latency sensitive tasks
ANDROID: sched/fair: Also do misfit in overloaded groups
ANDROID: sched/fair: Don't balance misfits if it would overload local group
ANDROID: sched/fair: Attempt to improve throughput for asym cap systems
FROMLIST: sched/fair: Don't move tasks to lower capacity cpus unless necessary
FROMLIST: sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
FROMLIST: sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
FROMLIST: sched/fair: Set rq->rd->overload when misfit
FROMLIST: sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
FROMLIST: sched: Change root_domain->overload type to int
FROMLIST: sched/fair: Change prefer_sibling type to bool
FROMLIST: sched/fair: Consider misfit tasks when load-balancing
FROMLIST: sched: Add sched_group per-cpu max capacity
FROMLIST: sched/fair: Add group_misfit_task load-balance type
FROMLIST: sched: Add static_key for asymmetric cpu capacity optimizations
UPSTREAM: ANDROID: binder: change down_write to down_read
UPSTREAM: ANDROID: binder: correct the cmd print for BINDER_WORK_RETURN_ERROR
UPSTREAM: ANDROID: binder: remove 32-bit binder interface.
UPSTREAM: android: binder: Use true and false for boolean values
UPSTREAM: android: binder: Use octal permissions
UPSTREAM: android: binder: Prefer __func__ to using hardcoded function name
UPSTREAM: ANDROID: binder: make binder_alloc_new_buf_locked static and indent its arguments
UPSTREAM: android: binder: Check for errors in binder_alloc_shrinker_init().
Conflicts:
arch/arm64/Kconfig
arch/arm64/include/asm/cpucaps.h
arch/arm64/include/asm/cpufeature.h
arch/arm64/include/asm/thread_info.h
arch/arm64/kernel/cpu_errata.c
arch/arm64/kernel/cpufeature.c
arch/arm64/kernel/entry.S
arch/arm64/kernel/ssbd.c
drivers/base/arch_topology.c
drivers/md/Kconfig
drivers/scsi/ufs/ufshcd.c
drivers/usb/gadget/function/f_fs.c
include/trace/events/sched.h
kernel/sched/cpufreq_schedutil.c
kernel/sched/energy.c
kernel/sched/fair.c
kernel/sched/features.h
kernel/sched/sched.h
kernel/sched/topology.c
kernel/sched/tune.c
kernel/sched/walt.c
kernel/sched/walt.h
kernel/stop_machine.c
kernel/time/tick-sched.c
net/socket.c
sound/core/rawmidi.c
Change-Id: Ia246711317930ecd55bb42565a04e6b4fdfc26d2
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
|
||
|
|
f957456878 |
kthread, tracing: Don't expose half-written comm when creating kthreads
commit 3e536e222f2930534c252c1cc7ae799c725c5ff9 upstream. There is a window for racing when printing directly to task->comm, allowing other threads to see a non-terminated string. The vsnprintf function fills the buffer, counts the truncated chars, then finally writes the \0 at the end. creator other vsnprintf: fill (not terminated) count the rest trace_sched_waking(p): ... memcpy(comm, p->comm, TASK_COMM_LEN) write \0 The consequences depend on how 'other' uses the string. In our case, it was copied into the tracing system's saved cmdlines, a buffer of adjacent TASK_COMM_LEN-byte buffers (note the 'n' where 0 should be): crash-arm64> x/1024s savedcmd->saved_cmdlines | grep 'evenk' 0xffffffd5b3818640: "irq/497-pwr_evenkworker/u16:12" ...and a strcpy out of there would cause stack corruption: [224761.522292] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffff9bf9783c78 crash-arm64> kbt | grep 'comm\|trace_print_context' #6 0xffffff9bf9783c78 in trace_print_context+0x18c(+396) comm (char [16]) = "irq/497-pwr_even" crash-arm64> rd 0xffffffd4d0e17d14 8 ffffffd4d0e17d14: 2f71726900000000 5f7277702d373934 ....irq/497-pwr_ ffffffd4d0e17d24: 726f776b6e657665 3a3631752f72656b evenkworker/u16: ffffffd4d0e17d34: f9780248ff003231 cede60e0ffffff9b 12..H.x......`.. ffffffd4d0e17d44: cede60c8ffffffd4 00000fffffffffd4 .....`.......... The workaround in |
||
|
|
61ca60932d |
kthread, sched/wait: Fix kthread_parkme() wait-loop
[ Upstream commit 741a76b350897604c48fb12beff1c9b77724dc96 ]
Gaurav reported a problem with __kthread_parkme() where a concurrent
try_to_wake_up() could result in competing stores to ->state which,
when the TASK_PARKED store got lost bad things would happen.
The comment near set_current_state() actually mentions this competing
store, but only mentions the case against TASK_RUNNING. This same
store, with different timing, can happen against a subsequent !RUNNING
store.
This normally is not a problem, because as per that same comment, the
!RUNNING state store is inside a condition based wait-loop:
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!need_sleep)
break;
schedule();
}
__set_current_state(TASK_RUNNING);
If we loose the (first) TASK_UNINTERRUPTIBLE store to a previous
(concurrent) wakeup, the schedule() will NO-OP and we'll go around the
loop once more.
The problem here is that the TASK_PARKED store is not inside the
KTHREAD_SHOULD_PARK condition wait-loop.
There is a genuine issue with sleeps that do not have a condition;
this is addressed in a subsequent patch.
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
251385e3f2 |
kthread, sched: Fix kthread_parkme() (again...)
Gaurav reports that commit:
85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue")
isn't working for him. Because of the following race:
> controller Thread CPUHP Thread
> takedown_cpu
> kthread_park
> kthread_parkme
> Set KTHREAD_SHOULD_PARK
> smpboot_thread_fn
> set Task interruptible
>
>
> wake_up_process
> if (!(p->state & state))
> goto out;
>
> Kthread_parkme
> SET TASK_PARKED
> schedule
> raw_spin_lock(&rq->lock)
> ttwu_remote
> waiting for __task_rq_lock
> context_switch
>
> finish_lock_switch
>
>
>
> Case TASK_PARKED
> kthread_park_complete
>
>
> SET Running
Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
handling is buggered because the TASK_DEAD thing is done with
preemption disabled, the current code can still complete early on
preemption :/
So basically revert that earlier fix and go with a variant of the
alternative mentioned in the commit. Promote TASK_PARKED to special
state to avoid the store-store issue on task->state leading to the
WARN in kthread_unpark() -> __kthread_bind().
But in addition, add wait_task_inactive() to kthread_park() to ensure
the task really is PARKED when we return from kthread_park(). This
avoids the whole kthread still gets migrated nonsense -- although it
would be really good to get this done differently.
Change-Id: I11d6a8c72fcf433c637f44e7b39adc979c8ddd7b
Cc: Oleg Nesterov <oleg@redhat.com>
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Fixes: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Patch-mainline: linux-kernel@ 06/07/18, 05:33AM
Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
|
||
|
|
0e12dd7496 |
kthread: Allow kthread_park() on a parked kthread
The following commit:
85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue")
added a WARN() in the case where we call kthread_park() on an already
parked thread, because the old code wasn't doing the right thing there
and it wasn't at all clear that would happen.
It turns out, this does in fact happen, so we have to deal with it.
Instead of potentially returning early, also wait for the completion.
This does however mean we have to use complete_all() and re-initialize
the completion on re-use.
Change-Id: I35f6723ea8cab6251f66e7b127d501fa09f6a086
Reported-by: LKP <lkp@01.org>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kernel test robot <lkp@intel.com>
Cc: wfg@linux.intel.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 85f1abe0019f ("kthread, sched/wait: Fix kthread_parkme() completion issue")
Link: http://lkml.kernel.org/r/20180504091142.GI12235@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: b1f5b378e126133521df668379249fb8265121f1
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
|
||
|
|
b0af9a45e9 |
kthread, sched/wait: Fix kthread_parkme() completion issue
Even with the wait-loop fixed, there is a further issue with kthread_parkme(). Upon hotplug, when we do takedown_cpu(), smpboot_park_threads() can return before all those threads are in fact blocked, due to the placement of the complete() in __kthread_parkme(). When that happens, sched_cpu_dying() -> migrate_tasks() can end up migrating such a still runnable task onto another CPU. Normally the task will have hit schedule() and gone to sleep by the time we do kthread_unpark(), which will then do __kthread_bind() to re-bind the task to the correct CPU. However, when we loose the initial TASK_PARKED store to the concurrent wakeup issue described previously, do the complete(), get migrated, it is possible to either: - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate the park and set TASK_RUNNING, or - __kthread_bind()'s wait_task_inactive() to observe the competing TASK_RUNNING store. Either way the WARN() in __kthread_bind() will trigger and fail to correctly set the CPU affinity. Fix this by only issuing the complete() when the kthread has scheduled out. This does away with all the icky 'still running' nonsense. The alternative is to promote TASK_PARKED to a special state, this guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING and we'll end up doing the right thing, but this preserves the whole icky business of potentially migating the still runnable thing. Change-Id: I9706fe03ceb8df1f93a718cbb3e0ce890de12afe Reported-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Git-commit: 85f1abe0019fcb3ea10df7029056cf42702283a8 Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git [gkohli@codeaurora: Resolve trivial merge conflicts] Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org> Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org> |
||
|
|
c4f0392705 |
kthread, sched/wait: Fix kthread_parkme() wait-loop
Gaurav reported a problem with __kthread_parkme() where a concurrent
try_to_wake_up() could result in competing stores to ->state which,
when the TASK_PARKED store got lost bad things would happen.
The comment near set_current_state() actually mentions this competing
store, but only mentions the case against TASK_RUNNING. This same
store, with different timing, can happen against a subsequent !RUNNING
store.
This normally is not a problem, because as per that same comment, the
!RUNNING state store is inside a condition based wait-loop:
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!need_sleep)
break;
schedule();
}
__set_current_state(TASK_RUNNING);
If we loose the (first) TASK_UNINTERRUPTIBLE store to a previous
(concurrent) wakeup, the schedule() will NO-OP and we'll go around the
loop once more.
The problem here is that the TASK_PARKED store is not inside the
KTHREAD_SHOULD_PARK condition wait-loop.
There is a genuine issue with sleeps that do not have a condition;
this is addressed in a subsequent patch.
Change-Id: I508e74ea0fbd7e83b1712c0e2f4d069a09ab12ab
Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Git-commit: 741a76b350897604c48fb12beff1c9b77724dc96
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Gaurav Kohli <gkohli@codeaurora.org>
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
|
||
|
|
22cf8bc6cb |
kernel/kthread.c: kthread_worker: don't hog the cpu
If the worker thread continues getting work, it will hog the cpu and rcu stall complains. Make it a good citizen. This is triggered in a loop block device test. Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com Signed-off-by: Shaohua Li <shli@fb.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
77f88796ce |
cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator. Once the new kthread starts
running, it initializes itself and wakes up the creator. The creator
then can further configure the kthread and then let it start doing its
job by waking it up.
In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland. When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity. This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.
Unfortunately, the cgroup side of protection is racy. While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.
This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one. Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.
This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland. kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed. The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.
It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side. Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.
v2: Switch to a simpler implementation using a new task_struct bit
field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
|
||
|
|
299300258d |
sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task.h>
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/task.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
ae7e81c077 |
sched/headers: Prepare for new header dependencies before moving code to <uapi/linux/sched/types.h>
We are going to move scheduler ABI details to <uapi/linux/sched/types.h>, which will be used from a number of .c files. Create empty placeholder header that maps to <linux/types.h>. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
dfb4357da6 |
time: Remove CONFIG_TIMER_STATS
Currently CONFIG_TIMER_STATS exposes process information across namespaces:
kernel/time/timer_list.c print_timer():
SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);
/proc/timer_list:
#11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570
Given that the tracer can give the same information, this patch entirely
removes CONFIG_TIMER_STATS.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: linux-doc@vger.kernel.org
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Xing Gao <xgao01@email.wm.edu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jessica Frazelle <me@jessfraz.com>
Cc: kernel-hardening@lists.openwall.com
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
||
|
|
c0b942a763 |
kthread: add __printf attributes
When commit
|
||
|
|
8fb9dcbdc3 |
kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
kthread_create_on_cpu() sets KTHREAD_IS_PER_CPU and kthread->cpu, this only makes sense if this kthread can be parked/unparked by cpuhp code. kthread workers never call kthread_parkme() so this has no effect. Change __kthread_create_worker() to simply call kthread_bind(task, cpu). The very fact that kthread_create_on_cpu() doesn't accept a generic fmt shows that it should not be used outside of smpboot.c. Now, the only reason we can not unexport this helper and move it into smpboot.c is that it sets kthread->cpu and struct kthread is not exported. And the only reason we can not kill kthread->cpu is that kthread_unpark() is used by drivers/gpu/drm/amd/scheduler/gpu_scheduler.c and thus we can not turn _unpark into kthread_unpark(struct smp_hotplug_thread *, cpu). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Petr Mladek <pmladek@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Chunming Zhou <David1.Zhou@amd.com> Cc: Roman Pen <roman.penyaev@profitbricks.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20161129175110.GA5342@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> |
||
|
|
cf380a4a96 |
kthread: Don't use to_live_kthread() in kthread_[un]park()
Now that to_kthread() is always validm change kthread_park() and kthread_unpark() to use it and kill to_live_kthread(). The conversion of kthread_unpark() is trivial. If KTHREAD_IS_PARKED is set then the task has called complete(&self->parked) and there the function cannot race against a concurrent kthread_stop() and exit. kthread_park() is more tricky, because its semantics are not well defined. It returns -ENOSYS if the thread exited but this can never happen and as Roman pointed out kthread_park() can obviously block forever if it would race with the exiting kthread. The usage of kthread_park() in cpuhp code (cpu.c, smpboot.c, stop_machine.c) is fine. It can never see an exiting/exited kthread, smpboot_destroy_threads() clears *ht->store, smpboot_park_thread() checks it is not NULL under the same smpboot_threads_lock. cpuhp_threads and cpu_stop_threads never exit, so other callers are fine too. But it has two more users: - watchdog_park_threads(): The code is actually correct, get_online_cpus() ensures that kthread_park() can't race with itself (note that kthread_park() can't handle this race correctly), but it should not use kthread_park() directly. - drivers/gpu/drm/amd/scheduler/gpu_scheduler.c should not use kthread_park() either. kthread_park() must not be called after amd_sched_fini() which does kthread_stop(), otherwise even to_live_kthread() is not safe because task_struct can be already freed and sched->thread can point to nowhere. The usage of kthread_park/unpark should either be restricted to core code which is properly protected against the exit race or made more robust so it is safe to use it in drivers. To catch eventual exit issues, add a WARN_ON(PF_EXITING) for now. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chunming Zhou <David1.Zhou@amd.com> Cc: Roman Pen <roman.penyaev@profitbricks.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20161129175107.GA5339@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> |
||
|
|
efb29fbfa5 |
kthread: Don't use to_live_kthread() in kthread_stop()
kthread_stop() had to use to_live_kthread() simply because it was not possible to access kthread->exited after the exiting task clears task_struct->vfork_done. Now that to_kthread() is always valid, wake_up_process() + wait_for_completion() can be done ununconditionally. It's not an issue anymore if the task has already issued complete_vfork_done() or died. The exiting task can get the spurious wakeup after mm_release() but this is possible without this change too and is fine; do_task_dead() ensures that this can't make any harm. As a further enhancement this could be converted to task_work_add() later, so ->vfork_done can be avoided completely. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chunming Zhou <David1.Zhou@amd.com> Cc: Roman Pen <roman.penyaev@profitbricks.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20161129175103.GA5336@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> |
||
|
|
eff9662547 |
Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
This reverts commit
|
||
|
|
1da5c46fa9 |
kthread: Make struct kthread kmalloc'ed
commit
|
||
|
|
dbf52682cb |
kthread: better support freezable kthread workers
This patch allows to make kthread worker freezable via a new @flags parameter. It will allow to avoid an init work in some kthreads. It currently does not affect the function of kthread_worker_fn() but it might help to do some optimization or fixes eventually. I currently do not know about any other use for the @flags parameter but I believe that we will want more flags in the future. Finally, I hope that it will not cause confusion with @flags member in struct kthread. Well, I guess that we will want to rework the basic kthreads implementation once all kthreads are converted into kthread workers or workqueues. It is possible that we will merge the two structures. Link: http://lkml.kernel.org/r/1470754545-17632-12-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9a6b06c8d9 |
kthread: allow to modify delayed kthread work
There are situations when we need to modify the delay of a delayed kthread work. For example, when the work depends on an event and the initial delay means a timeout. Then we want to queue the work immediately when the event happens. This patch implements kthread_mod_delayed_work() as inspired workqueues. It cancels the timer, removes the work from any worker list and queues it again with the given timeout. A very special case is when the work is being canceled at the same time. It might happen because of the regular kthread_cancel_delayed_work_sync() or by another kthread_mod_delayed_work(). In this case, we do nothing and let the other operation win. This should not normally happen as the caller is supposed to synchronize these operations a reasonable way. Link: http://lkml.kernel.org/r/1470754545-17632-11-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
37be45d49d |
kthread: allow to cancel kthread work
We are going to use kthread workers more widely and sometimes we will need to make sure that the work is neither pending nor running. This patch implements cancel_*_sync() operations as inspired by workqueues. Well, we are synchronized against the other operations via the worker lock, we use del_timer_sync() and a counter to count parallel cancel operations. Therefore the implementation might be easier. First, we check if a worker is assigned. If not, the work has newer been queued after it was initialized. Second, we take the worker lock. It must be the right one. The work must not be assigned to another worker unless it is initialized in between. Third, we try to cancel the timer when it exists. The timer is deleted synchronously to make sure that the timer call back is not running. We need to temporary release the worker->lock to avoid a possible deadlock with the callback. In the meantime, we set work->canceling counter to avoid any queuing. Fourth, we try to remove the work from a worker list. It might be the list of either normal or delayed works. Fifth, if the work is running, we call kthread_flush_work(). It might take an arbitrary time. We need to release the worker-lock again. In the meantime, we again block any queuing by the canceling counter. As already mentioned, the check for a pending kthread work is done under a lock. In compare with workqueues, we do not need to fight for a single PENDING bit to block other operations. Therefore we do not suffer from the thundering storm problem and all parallel canceling jobs might use kthread_flush_work(). Any queuing is blocked until the counter gets zero. Link: http://lkml.kernel.org/r/1470754545-17632-10-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
22597dc3d9 |
kthread: initial support for delayed kthread work
We are going to use kthread_worker more widely and delayed works will be pretty useful. The implementation is inspired by workqueues. It uses a timer to queue the work after the requested delay. If the delay is zero, the work is queued immediately. In compare with workqueues, each work is associated with a single worker (kthread). Therefore the implementation could be much easier. In particular, we use the worker->lock to synchronize all the operations with the work. We do not need any atomic operation with a flags variable. In fact, we do not need any state variable at all. Instead, we add a list of delayed works into the worker. Then the pending work is listed either in the list of queued or delayed works. And the existing check of pending works is the same even for the delayed ones. A work must not be assigned to another worker unless reinitialized. Therefore the timer handler might expect that dwork->work->worker is valid and it could simply take the lock. We just add some sanity checks to help with debugging a potential misuse. Link: http://lkml.kernel.org/r/1470754545-17632-9-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
8197b3d43b |
kthread: detect when a kthread work is used by more workers
Nothing currently prevents a work from queuing for a kthread worker when it is already running on another one. This means that the work might run in parallel on more than one worker. Also some operations are not reliable, e.g. flush. This problem will be even more visible after we add kthread_cancel_work() function. It will only have "work" as the parameter and will use worker->lock to synchronize with others. Well, normally this is not a problem because the API users are sane. But bugs might happen and users also might be crazy. This patch adds a warning when we try to insert the work for another worker. It does not fully prevent the misuse because it would make the code much more complicated without a big benefit. It adds the same warning also into kthread_flush_work() instead of the repeated attempts to get the right lock. A side effect is that one needs to explicitly reinitialize the work if it must be queued into another worker. This is needed, for example, when the worker is stopped and started again. It is a bit inconvenient. But it looks like a good compromise between the stability and complexity. I have double checked all existing users of the kthread worker API and they all seems to initialize the work after the worker gets started. Just for completeness, the patch adds a check that the work is not already in a queue. The patch also puts all the checks into a separate function. It will be reused when implementing delayed works. Link: http://lkml.kernel.org/r/1470754545-17632-8-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
35033fe9cb |
kthread: add kthread_destroy_worker()
The current kthread worker users call flush() and stop() explicitly. This function does the same plus it frees the kthread_worker struct in one call. It is supposed to be used together with kthread_create_worker*() that allocates struct kthread_worker. Link: http://lkml.kernel.org/r/1470754545-17632-7-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
fbae2d44aa |
kthread: add kthread_create_worker*()
Kthread workers are currently created using the classic kthread API, namely kthread_run(). kthread_worker_fn() is passed as the @threadfn parameter. This patch defines kthread_create_worker() and kthread_create_worker_on_cpu() functions that hide implementation details. They enforce using kthread_worker_fn() for the main thread. But I doubt that there are any plans to create any alternative. In fact, I think that we do not want any alternative main thread because it would be hard to support consistency with the rest of the kthread worker API. The naming and function of kthread_create_worker() is inspired by the workqueues API like the rest of the kthread worker API. The kthread_create_worker_on_cpu() variant is motivated by the original kthread_create_on_cpu(). Note that we need to bind per-CPU kthread workers already when they are created. It makes the life easier. kthread_bind() could not be used later for an already running worker. This patch does _not_ convert existing kthread workers. The kthread worker API need more improvements first, e.g. a function to destroy the worker. IMPORTANT: kthread_create_worker_on_cpu() allows to use any format of the worker name, in compare with kthread_create_on_cpu(). The good thing is that it is more generic. The bad thing is that most users will need to pass the cpu number in two parameters, e.g. kthread_create_worker_on_cpu(cpu, "helper/%d", cpu). To be honest, the main motivation was to avoid the need for an empty va_list. The only legal way was to create a helper function that would be called with an empty list. Other attempts caused compilation warnings or even errors on different architectures. There were also other alternatives, for example, using #define or splitting __kthread_create_worker(). The used solution looked like the least ugly. Link: http://lkml.kernel.org/r/1470754545-17632-6-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
255451e453 |
kthread: allow to call __kthread_create_on_node() with va_list args
kthread_create_on_node() implements a bunch of logic to create the kthread. It is already called by kthread_create_on_cpu(). We are going to extend the kthread worker API and will need to call kthread_create_on_node() with va_list args there. This patch does only a refactoring and does not modify the existing behavior. Link: http://lkml.kernel.org/r/1470754545-17632-5-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a65d40961d |
kthread/smpboot: do not park in kthread_create_on_cpu()
kthread_create_on_cpu() was added by the commit
|
||
|
|
3989144f86 |
kthread: kthread worker API cleanup
A good practice is to prefix the names of functions by the name
of the subsystem.
The kthread worker API is a mix of classic kthreads and workqueues. Each
worker has a dedicated kthread. It runs a generic function that process
queued works. It is implemented as part of the kthread subsystem.
This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:
__init_kthread_worker() -> __kthread_init_worker()
init_kthread_worker() -> kthread_init_worker()
init_kthread_work() -> kthread_init_work()
insert_kthread_work() -> kthread_insert_work()
queue_kthread_work() -> kthread_queue_work()
flush_kthread_work() -> kthread_flush_work()
flush_kthread_worker() -> kthread_flush_worker()
Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.
Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:
+ "init" in the function names stands for the verb "initialize"
aka "initialize worker". While "INIT" in the macro names
stands for the noun "INITIALIZER" aka "worker initializer".
+ INIT() macros are used only in DEFINE() macros
+ init() functions are used close to the other kthread()
functions. It looks much better if all the functions
use the same scheme.
+ There will be also kthread_destroy_worker() that will
be used close to kthread_cancel_work(). It is related
to the init() function. Again it looks better if all
functions use the same naming scheme.
+ there are several precedents for such init() function
names, e.g. amd_iommu_init_device(), free_area_init_node(),
jump_label_init_type(), regmap_init_mmio_clk(),
+ It is not an argument but it was inconsistent even before.
[arnd@arndb.de: fix linux-next merge conflict]
Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
e700591ae0 |
kthread: rename probe_kthread_data() to kthread_probe_data()
Patch series "kthread: Kthread worker API improvements" The intention of this patchset is to make it easier to manipulate and maintain kthreads. Especially, I want to replace all the custom main cycles with a generic one. Also I want to make the kthreads sleep in a consistent state in a common place when there is no work. This patch (of 11): A good practice is to prefix the names of functions by the name of the subsystem. This patch fixes the name of probe_kthread_data(). The other wrong functions names are part of the kthread worker API and will be fixed separately. Link: http://lkml.kernel.org/r/1470754545-17632-2-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
23196f2e5f |
kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
get_task_struct(tsk) no longer pins tsk->stack so all users of to_live_kthread() should do try_get_task_stack/put_task_stack to protect "struct kthread" which lives on kthread's stack. TODO: Kill to_live_kthread(), perhaps we can even kill "struct kthread" too, and rework kthread_stop(), it can use task_work_add() to sync with the exiting kernel thread. Message-Id: <20160629180357.GA7178@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jann Horn <jann@thejh.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/cb9b16bbc19d4aea4507ab0552e4644c1211d130.1474003868.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
e9f069868d |
kernel/kthread.c:kthread_create_on_node(): clarify documentation
- Make it clear that the `node' arg refers to memory allocations only: kthread_create_on_node() does not pin the new thread to that node's CPUs. - Encourage the use of NUMA_NO_NODE. [nzimmer@sgi.com: use NUMA_NO_NODE in kthread_create() also] Cc: Nathan Zimmer <nzimmer@sgi.com> Cc: Tejun Heo <tj@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a1d8561172 |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The biggest change in this cycle is the rewrite of the main SMP load
balancing metric: the CPU load/utilization. The main goal was to make
the metric more precise and more representative - see the changelog of
this commit for the gory details:
|
||
|
|
25834c73f9 |
sched: Fix a race between __kthread_bind() and sched_setaffinity()
Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
without locks, a caller might observe an old value and race with the
set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
it:
__kthread_bind()
do_set_cpus_allowed()
<SYSCALL>
sched_setaffinity()
if (p->flags & PF_NO_SETAFFINITIY)
set_cpus_allowed_ptr()
p->flags |= PF_NO_SETAFFINITY
Fix the bug by putting everything under the regular scheduler locks.
This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
||
|
|
18896451ea |
kthread: export kthread functions
The s-Par visornic driver, currently in staging, processes a queue being serviced by the an s-Par service partition. We can get a message that something has happened with the Service Partition, when that happens, we must not access the channel until we get a message that the service partition is back again. The visornic driver has a thread for processing the channel, when we get the message, we need to be able to park the thread and then resume it when the problem clears. We can do this with kthread_park and unpark but they are not exported from the kernel, this patch exports the needed functions. Signed-off-by: David Kershner <david.kershner@unisys.com> Acked-by: Ingo Molnar <mingo@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
109228389a |
kernel/kthread.c: partial revert of 81c98869fa ("kthread: ensure locality of task_struct allocations")
After discussions with Tejun, we don't want to spread the use of
cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into
callers, but would rather ensure the callees correctly handle memoryless
nodes. With the previous patches ("topology: add support for
node_to_mem_node() to determine the fallback node" and "slub: fallback to
node_to_mem_node() node if allocating on memoryless node") adding and
using node_to_mem_node(), we can safely undo part of the change to the
kthread logic from
|
||
|
|
ed1403ec2b |
kthread_work: wake up worker only when the worker is idle
If the worker is already executing a work item when another is queued, we can safely skip wakeup without worrying about stalling queue thus avoiding waking up the busy worker spuriously. Spurious wakeups should be fine but still isn't nice and avoiding it is trivial here. tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> |
||
|
|
8fe6929cfd |
kthread: fix return value of kthread_create() upon SIGKILL.
Commit
|
||
|
|
81c98869fa |
kthread: ensure locality of task_struct allocations
In the presence of memoryless nodes, numa_node_id() will return the current CPU's NUMA node, but that may not be where we expect to allocate from memory from. Instead, we should rely on the fallback code in the memory allocator itself, by using NUMA_NO_NODE. Also, when calling kthread_create_on_node(), use the nearest node with memory to the cpu in question, rather than the node it is running on. Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Anton Blanchard <anton@samba.org> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
786235eeba |
kthread: make kthread_create() killable
Any user process callers of wait_for_completion() except global init process might be chosen by the OOM killer while waiting for completion() call by some other process which does memory allocation. See CVE-2012-4398 "kernel: request_module() OOM local DoS" can happen. When such users are chosen by the OOM killer when they are waiting for completion() in TASK_UNINTERRUPTIBLE, the system will be kept stressed due to memory starvation because the OOM killer cannot kill such users. kthread_create() is one of such users and this patch fixes the problem for kthreadd by making kthread_create() killable - the same approach used for fixing CVE-2012-4398. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
cd42d559e4 |
kthread: implement probe_kthread_data()
One of the problems that arise when converting dedicated custom threadpool to workqueue is that the shared worker pool used by workqueue anonimizes each worker making it more difficult to identify what the worker was doing on which target from the output of sysrq-t or debug dump from oops, BUG() and friends. For example, after writeback is converted to use workqueue instead of priviate thread pool, there's no easy to tell which backing device a writeback work item was working on at the time of task dump, which, according to our writeback brethren, is important in tracking down issues with a lot of mounted file systems on a lot of different devices. This patchset implements a way for a work function to mark its execution instance so that task dump of the worker task includes information to indicate what the work item was doing. An example WARN dump would look like the following. WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0() Modules linked in: CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24 Hardware name: empty empty/S3992, BIOS 080011 10/26/2007 Workqueue: writeback bdi_writeback_workfn (flush-8:16) ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8 ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0 ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08 Call Trace: [<ffffffff81c61855>] dump_stack+0x19/0x1b [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0 ... This patch: Implement probe_kthread_data() which returns kthread_data if accessible. The function is equivalent to kthread_data() except that the specified @task may not be a kthread or its vfork_done is already cleared rendering struct kthread inaccessible. In the former case, probe_kthread_data() may return any value. In the latter, NULL. This will be used to safely print debug information without affecting synchronization in the normal paths. Workqueue debug info printing on dump_stack() and friends will make use of it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
46d9be3e5e |
Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
"A lot of activities on workqueue side this time. The changes achieve
the followings.
- WQ_UNBOUND workqueues - the workqueues which are per-cpu - are
updated to be able to interface with multiple backend worker pools.
This involved a lot of churning but the end result seems actually
neater as unbound workqueues are now a lot closer to per-cpu ones.
- The ability to interface with multiple backend worker pools are
used to implement unbound workqueues with custom attributes.
Currently the supported attributes are the nice level and CPU
affinity. It may be expanded to include cgroup association in
future. The attributes can be specified either by calling
apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
the workqueue in question is exported through sysfs.
The backend worker pools are keyed by the actual attributes and
shared by any workqueues which share the same attributes. When
attributes of a workqueue are changed, the workqueue binds to the
worker pool with the specified attributes while leaving the work
items which are already executing in its previous worker pools
alone.
This allows converting custom worker pool implementations which
want worker attribute tuning to use workqueues. The writeback pool
is already converted in block tree and there are a couple others
are likely to follow including btrfs io workers.
- WQ_UNBOUND's ability to bind to multiple worker pools is also used
to make it NUMA-aware. Because there's no association between work
item issuer and the specific worker assigned to execute it, before
this change, using unbound workqueue led to unnecessary cross-node
bouncing and it couldn't be helped by autonuma as it requires tasks
to have implicit node affinity and workers are assigned randomly.
After these changes, an unbound workqueue now binds to multiple
NUMA-affine worker pools so that queued work items are executed in
the same node. This is turned on by default but can be disabled
system-wide or for individual workqueues.
Crypto was requesting NUMA affinity as encrypting data across
different nodes can contribute noticeable overhead and doing it
per-cpu was too limiting for certain cases and IO throughput could
be bottlenecked by one CPU being fully occupied while others have
idle cycles.
While the new features required a lot of changes including
restructuring locking, it didn't complicate the execution paths much.
The unbound workqueue handling is now closer to per-cpu ones and the
new features are implemented by simply associating a workqueue with
different sets of backend worker pools without changing queue,
execution or flush paths.
As such, even though the amount of change is very high, I feel
relatively safe in that it isn't likely to cause subtle issues with
basic correctness of work item execution and handling. If something
is wrong, it's likely to show up as being associated with worker pools
with the wrong attributes or OOPS while workqueue attributes are being
changed or during CPU hotplug.
While this creates more backend worker pools, it doesn't add too many
more workers unless, of course, there are many workqueues with unique
combinations of attributes. Assuming everything else is the same,
NUMA awareness costs an extra worker pool per NUMA node with online
CPUs.
There are also a couple things which are being routed outside the
workqueue tree.
- block tree pulled in workqueue for-3.10 so that writeback worker
pool can be converted to unbound workqueue with sysfs control
exposed. This simplifies the code, makes writeback workers
NUMA-aware and allows tuning nice level and CPU affinity via sysfs.
- The conversion to workqueue means that there's no 1:1 association
between a specific worker, which makes writeback folks unhappy as
they want to be able to tell which filesystem caused a problem from
backtrace on systems with many filesystems mounted. This is
resolved by allowing work items to set debug info string which is
printed when the task is dumped. As this change involves unifying
implementations of dump_stack() and friends in arch codes, it's
being routed through Andrew's -mm tree."
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
workqueue: use kmem_cache_free() instead of kfree()
workqueue: avoid false negative WARN_ON() in destroy_workqueue()
workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
workqueue: implement NUMA affinity for unbound workqueues
workqueue: introduce put_pwq_unlocked()
workqueue: introduce numa_pwq_tbl_install()
workqueue: use NUMA-aware allocation for pool_workqueues
workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
workqueue: map an unbound workqueues to multiple per-node pool_workqueues
workqueue: move hot fields of workqueue_struct to the end
workqueue: make workqueue->name[] fixed len
workqueue: add workqueue->unbound_attrs
workqueue: determine NUMA node of workers accourding to the allowed cpumask
workqueue: drop 'H' from kworker names of unbound worker pools
workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
workqueue: fix memory leak in apply_workqueue_attrs()
workqueue: fix unbound workqueue attrs hashing / comparison
workqueue: fix race condition in unbound workqueue free path
workqueue: remove pwq_lock which is no longer used
...
|
||
|
|
b5c5442bb6 |
kthread: kill task_get_live_kthread()
task_get_live_kthread() looks confusing and unneeded. It does get_task_struct() but only kthread_stop() needs this, it can be called even if the calller doesn't have a reference when we know that this kthread can't exit until we do kthread_stop(). kthread_park() and kthread_unpark() do not need get_task_struct(), the callers already have the reference. And it can not help if we can race with the exiting kthread anyway, kthread_park() can hang forever in this case. Change kthread_park() and kthread_unpark() to use to_live_kthread(), change kthread_stop() to do get_task_struct() by hand and remove task_get_live_kthread(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Namhyung Kim <namhyung@kernel.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |