81 Commits

Author SHA1 Message Date
Nathan Chancellor
83a3e3539c Merge 4.4.155 into android-msm-wahoo-4.4
Changes in 4.4.155: (48 commits)
        net: 6lowpan: fix reserved space for single frames
        net: mac802154: tx: expand tailroom if necessary
        9p/net: Fix zero-copy path in the 9p virtio transport
        net: lan78xx: Fix misplaced tasklet_schedule() call
        spi: davinci: fix a NULL pointer dereference
        drm/i915/userptr: reject zero user_size
        powerpc/fadump: handle crash memory ranges array index overflow
        powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
        fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
        9p/virtio: fix off-by-one error in sg list bounds check
        net/9p/client.c: version pointer uninitialized
        net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
        x86/mm/pat: Fix L1TF stable backport for CPA, 2nd call
        dm cache metadata: save in-core policy_hint_size to on-disk superblock
        iio: ad9523: Fix displayed phase
        iio: ad9523: Fix return value for ad952x_store()
        vmw_balloon: fix inflation of 64-bit GFNs
        vmw_balloon: do not use 2MB without batching
        vmw_balloon: VMCI_DOORBELL_SET does not check status
        vmw_balloon: fix VMCI use when balloon built into kernel
        tracing: Do not call start/stop() functions when tracing_on does not change
        tracing/blktrace: Fix to allow setting same value
        kthread, tracing: Don't expose half-written comm when creating kthreads
        uprobes: Use synchronize_rcu() not synchronize_sched()
        9p: fix multiple NULL-pointer-dereferences
        PM / sleep: wakeup: Fix build error caused by missing SRCU support
        pnfs/blocklayout: off by one in bl_map_stripe()
        ARM: tegra: Fix Tegra30 Cardhu PCA954x reset
        mm/tlb: Remove tlb_remove_table() non-concurrent condition
        iommu/vt-d: Add definitions for PFSID
        iommu/vt-d: Fix dev iotlb pfsid use
        osf_getdomainname(): use copy_to_user()
        sys: don't hold uts_sem while accessing userspace memory
        userns: move user access out of the mutex
        ubifs: Fix memory leak in lprobs self-check
        Revert "UBIFS: Fix potential integer overflow in allocation"
        ubifs: Check data node size before truncate
        ubifs: Fix synced_i_size calculation for xattr inodes
        pwm: tiehrpwm: Fix disabling of output of PWMs
        fb: fix lost console when the user unplugs a USB adapter
        udlfb: set optimal write delay
        getxattr: use correct xattr length
        bcache: release dc->writeback_lock properly in bch_writeback_thread()
        perf auxtrace: Fix queue resize
        fs/quota: Fix spectre gadget in do_quotactl
        x86/io: add interface to reserve io memtype for a resource range. (v1.1)
        drm/drivers: add support for using the arch wc mapping API.
        Linux 4.4.155

Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
2018-09-09 11:22:27 -07:00
Snild Dolkow
f6db350c9a kthread, tracing: Don't expose half-written comm when creating kthreads
commit 3e536e222f2930534c252c1cc7ae799c725c5ff9 upstream.

There is a window for racing when printing directly to task->comm,
allowing other threads to see a non-terminated string. The vsnprintf
function fills the buffer, counts the truncated chars, then finally
writes the \0 at the end.

	creator                     other
	vsnprintf:
	  fill (not terminated)
	  count the rest            trace_sched_waking(p):
	  ...                         memcpy(comm, p->comm, TASK_COMM_LEN)
	  write \0

The consequences depend on how 'other' uses the string. In our case,
it was copied into the tracing system's saved cmdlines, a buffer of
adjacent TASK_COMM_LEN-byte buffers (note the 'n' where 0 should be):

	crash-arm64> x/1024s savedcmd->saved_cmdlines | grep 'evenk'
	0xffffffd5b3818640:     "irq/497-pwr_evenkworker/u16:12"

...and a strcpy out of there would cause stack corruption:

	[224761.522292] Kernel panic - not syncing: stack-protector:
	    Kernel stack is corrupted in: ffffff9bf9783c78

	crash-arm64> kbt | grep 'comm\|trace_print_context'
	#6  0xffffff9bf9783c78 in trace_print_context+0x18c(+396)
	      comm (char [16]) =  "irq/497-pwr_even"

	crash-arm64> rd 0xffffffd4d0e17d14 8
	ffffffd4d0e17d14:  2f71726900000000 5f7277702d373934   ....irq/497-pwr_
	ffffffd4d0e17d24:  726f776b6e657665 3a3631752f72656b   evenkworker/u16:
	ffffffd4d0e17d34:  f9780248ff003231 cede60e0ffffff9b   12..H.x......`..
	ffffffd4d0e17d44:  cede60c8ffffffd4 00000fffffffffd4   .....`..........

The workaround in e09e28671 (use strlcpy in __trace_find_cmdline) was
likely needed because of this same bug.

Solved by vsnprintf:ing to a local buffer, then using set_task_comm().
This way, there won't be a window where comm is not terminated.

Link: http://lkml.kernel.org/r/20180726071539.188015-1-snild@sony.com

Cc: stable@vger.kernel.org
Fixes: bc0c38d139 ("ftrace: latency tracer infrastructure")
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Snild Dolkow <snild@sony.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[backported to 3.18 / 4.4 by Snild]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09 20:04:34 +02:00
Thierry Strudel
b11ab24fe6 Merged linux-4.4.70 into android-msm-wahoo-4.4
Linux 4.4.70
    drivers: char: mem: Check for address space wraparound with mmap()
    nfsd: encoders mustn't use unitialized values in error cases
    drm/edid: Add 10 bpc quirk for LGD 764 panel in HP zBook 17 G2
    PCI: Freeze PME scan before suspending devices
    PCI: Fix pci_mmap_fits() for HAVE_PCI_RESOURCE_TO_USER platforms
    tracing/kprobes: Enforce kprobes teardown after testing
    osf_wait4(): fix infoleak
    genirq: Fix chained interrupt data ordering
    uwb: fix device quirk on big-endian hosts
    metag/uaccess: Check access_ok in strncpy_from_user
    metag/uaccess: Fix access_ok()
    iommu/vt-d: Flush the IOTLB to get rid of the initial kdump mappings
    staging: rtl8192e: rtl92e_get_eeprom_size Fix read size of EPROM_CMD.
    staging: rtl8192e: fix 2 byte alignment of register BSSIDR.
    mm/huge_memory.c: respect FOLL_FORCE/FOLL_COW for thp
    xc2028: Fix use-after-free bug properly
    arm64: documentation: document tagged pointer stack constraints
    arm64: uaccess: ensure extension of access_ok() addr
    arm64: xchg: hazard against entire exchange variable
    ARM: dts: at91: sama5d3_xplained: not all ADC channels are available
    ARM: dts: at91: sama5d3_xplained: fix ADC vref
    powerpc/64e: Fix hang when debugging programs with relocated kernel
    powerpc/pseries: Fix of_node_put() underflow during DLPAR remove
    powerpc/book3s/mce: Move add_taint() later in virtual mode
    cx231xx-cards: fix NULL-deref at probe
    cx231xx-audio: fix NULL-deref at probe
    cx231xx-audio: fix init error path
    dvb-frontends/cxd2841er: define symbol_rate_min/max in T/C fe-ops
    zr364xx: enforce minimum size when reading header
    dib0700: fix NULL-deref at probe
    s5p-mfc: Fix unbalanced call to clock management
    gspca: konica: add missing endpoint sanity check
    ceph: fix recursion between ceph_set_acl() and __ceph_setattr()
    iio: proximity: as3935: fix as3935_write
    ipx: call ipxitf_put() in ioctl error path
    USB: hub: fix non-SS hub-descriptor handling
    USB: hub: fix SS hub-descriptor handling
    USB: serial: io_ti: fix div-by-zero in set_termios
    USB: serial: mct_u232: fix big-endian baud-rate handling
    USB: serial: qcserial: add more Lenovo EM74xx device IDs
    usb: serial: option: add Telit ME910 support
    USB: iowarrior: fix info ioctl on big-endian hosts
    usb: musb: tusb6010_omap: Do not reset the other direction's packet size
    ttusb2: limit messages to buffer size
    mceusb: fix NULL-deref at probe
    usbvision: fix NULL-deref at probe
    net: irda: irda-usb: fix firmware name on big-endian hosts
    usb: host: xhci-mem: allocate zeroed Scratchpad Buffer
    xhci: apply PME_STUCK_QUIRK and MISSING_CAS quirk for Denverton
    usb: host: xhci-plat: propagate return value of platform_get_irq()
    sched/fair: Initialize throttle_count for new task-groups lazily
    sched/fair: Do not announce throttled next buddy in dequeue_task_fair()
    fscrypt: avoid collisions when presenting long encrypted filenames
    f2fs: check entire encrypted bigname when finding a dentry
    fscrypt: fix context consistency check when key(s) unavailable
    net: qmi_wwan: Add SIMCom 7230E
    ext4 crypto: fix some error handling
    ext4 crypto: don't let data integrity writebacks fail with ENOMEM
    USB: serial: ftdi_sio: add Olimex ARM-USB-TINY(H) PIDs
    USB: serial: ftdi_sio: fix setting latency for unprivileged users
    pid_ns: Fix race between setns'ed fork() and zap_pid_ns_processes()
    pid_ns: Sleep in TASK_INTERRUPTIBLE in zap_pid_ns_processes
    iio: dac: ad7303: fix channel description
    of: fix sparse warning in of_pci_range_parser_one
    proc: Fix unbalanced hard link numbers
    cdc-acm: fix possible invalid access when processing notification
    drm/nouveau/tmr: handle races with hw when updating the next alarm time
    drm/nouveau/tmr: avoid processing completed alarms when adding a new one
    drm/nouveau/tmr: fix corruption of the pending list when rescheduling an alarm
    drm/nouveau/tmr: ack interrupt before processing alarms
    drm/nouveau/therm: remove ineffective workarounds for alarm bugs
    drm/amdgpu: Make display watermark calculations more accurate
    drm/amdgpu: Avoid overflows/divide-by-zero in latency_watermark calculations.
    ath9k_htc: fix NULL-deref at probe
    ath9k_htc: Add support of AirTies 1eda:2315 AR9271 device
    s390/cputime: fix incorrect system time
    s390/kdump: Add final note
    regulator: tps65023: Fix inverted core enable logic.
    KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation
    KVM: x86: Fix load damaged SSEx MXCSR register
    ima: accept previously set IMA_NEW_FILE
    mwifiex: pcie: fix cmd_buf use-after-free in remove/reset
    rtlwifi: rtl8821ae: setup 8812ae RFE according to device type
    md: update slab_cache before releasing new stripes when stripes resizing
    dm space map disk: fix some book keeping in the disk space map
    dm thin metadata: call precommit before saving the roots
    dm bufio: make the parameter "retain_bytes" unsigned long
    dm cache metadata: fail operations if fail_io mode has been established
    dm bufio: check new buffer allocation watermark every 30 seconds
    dm bufio: avoid a possible ABBA deadlock
    dm raid: select the Kconfig option CONFIG_MD_RAID0
    dm btree: fix for dm_btree_find_lowest_key()
    infiniband: call ipv6 route lookup via the stub interface
    tpm_crb: check for bad response size
    ARM: tegra: paz00: Mark panel regulator as enabled on boot
    USB: core: replace %p with %pK
    char: lp: fix possible integer overflow in lp_setup()
    watchdog: pcwd_usb: fix NULL-deref at probe
    USB: ene_usb6250: fix DMA to the stack
    usb: misc: legousbtower: Fix memory leak
    usb: misc: legousbtower: Fix buffers on stack
Linux 4.4.69
    ipmi: Fix kernel panic at ipmi_ssif_thread()
    wlcore: Add RX_BA_WIN_SIZE_CHANGE_EVENT event
    wlcore: Pass win_size taken from ieee80211_sta to FW
    mac80211: RX BA support for sta max_rx_aggregation_subframes
    mac80211: pass block ack session timeout to to driver
    mac80211: pass RX aggregation window size to driver
    Bluetooth: hci_intel: add missing tty-device sanity check
    Bluetooth: hci_bcm: add missing tty-device sanity check
    Bluetooth: Fix user channel for 32bit userspace on 64bit kernel
    tty: pty: Fix ldisc flush after userspace become aware of the data already
    serial: omap: suspend device on probe errors
    serial: omap: fix runtime-pm handling on unbind
    serial: samsung: Use right device for DMA-mapping calls
    arm64: KVM: Fix decoding of Rt/Rt2 when trapping AArch32 CP accesses
    padata: free correct variable
    CIFS: add misssing SFM mapping for doublequote
    cifs: fix CIFS_IOC_GET_MNT_INFO oops
    CIFS: fix mapping of SFM_SPACE and SFM_PERIOD
    SMB3: Work around mount failure when using SMB3 dialect to Macs
    Set unicode flag on cifs echo request to avoid Mac error
    fs/block_dev: always invalidate cleancache in invalidate_bdev()
    ceph: fix memory leak in __ceph_setxattr()
    fs/xattr.c: zero out memory copied to userspace in getxattr
    ext4: evict inline data when writing to memory map
    IB/mlx4: Reduce SRIOV multicast cleanup warning message to debug level
    IB/mlx4: Fix ib device initialization error flow
    IB/IPoIB: ibX: failed to create mcg debug file
    IB/core: Fix sysfs registration error flow
    vfio/type1: Remove locked page accounting workqueue
    dm era: save spacemap metadata root after the pre-commit
    crypto: algif_aead - Require setkey before accept(2)
    block: fix blk_integrity_register to use template's interval_exp if not 0
    KVM: arm/arm64: fix races in kvm_psci_vcpu_on
    KVM: x86: fix user triggerable warning in kvm_apic_accept_events()
    um: Fix PTRACE_POKEUSER on x86_64
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    selftests/x86/ldt_gdt_32: Work around a glibc sigaction() bug
    x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
    usb: hub: Do not attempt to autosuspend disconnected devices
    usb: hub: Fix error loop seen after hub communication errors
    usb: Make sure usb/phy/of gets built-in
    usb: misc: add missing continue in switch
    staging: comedi: jr3_pci: cope with jiffies wraparound
    staging: comedi: jr3_pci: fix possible null pointer dereference
    staging: gdm724x: gdm_mux: fix use-after-free on module unload
    staging: vt6656: use off stack for out buffer USB transfers.
    staging: vt6656: use off stack for in buffer USB transfers.
    USB: Proper handling of Race Condition when two USB class drivers try to call init_usb_class simultaneously
    USB: serial: ftdi_sio: add device ID for Microsemi/Arrow SF2PLUS Dev Kit
    usb: host: xhci: print correct command ring address
    iscsi-target: Set session_fall_back_to_erl0 when forcing reinstatement
    target: Convert ACL change queue_depth se_session reference usage
    target/fileio: Fix zero-length READ and WRITE handling
    target: Fix compare_and_write_callback handling for non GOOD status
    xen: adjust early dom0 p2m handling to xen hypervisor behavior
Linux 4.4.68
    block: get rid of blk_integrity_revalidate()
    drm/ttm: fix use-after-free races in vm fault handling
    f2fs: sanity check segment count
    bnxt_en: allocate enough space for ->ntp_fltr_bmap
    ipv6: reorder ip6_route_dev_notifier after ipv6_dev_notf
    ipv6: initialize route null entry in addrconf_init()
    rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string
    ipv4, ipv6: ensure raw socket message is big enough to hold an IP header
    tcp: do not inherit fastopen_req from parent
    tcp: fix wraparound issue in tcp_lp
    bpf, arm64: fix jit branch offset related to ldimm64
    tcp: do not underestimate skb->truesize in tcp_trim_head()
    ALSA: hda - Fix deadlock of controller device lock at unbinding
    staging: emxx_udc: remove incorrect __init annotations
    staging: wlan-ng: add missing byte order conversion
    brcmfmac: Make skb header writable before use
    brcmfmac: Ensure pointer correctly set if skb data location changes
    MIPS: R2-on-R6 MULTU/MADDU/MSUBU emulation bugfix
    scsi: mac_scsi: Fix MAC_SCSI=m option when SCSI=m
    serial: 8250_omap: Fix probe and remove for PM runtime
    phy: qcom-usb-hs: Add depends on EXTCON
    USB: serial: io_edgeport: fix descriptor error handling
    USB: serial: mct_u232: fix modem-status error handling
    USB: serial: quatech2: fix control-message error handling
    USB: serial: ftdi_sio: fix latency-timer error handling
    USB: serial: ark3116: fix open error handling
    USB: serial: ti_usb_3410_5052: fix control-message error handling
    USB: serial: io_edgeport: fix epic-descriptor handling
    USB: serial: ssu100: fix control-message error handling
    USB: serial: digi_acceleport: fix incomplete rx sanity check
    USB: serial: keyspan_pda: fix receive sanity checks
    usb: chipidea: Handle extcon events properly
    usb: chipidea: Only read/write OTGSC from one place
    usb: host: ohci-exynos: Decrese node refcount on exynos_ehci_get_phy() error paths
    usb: host: ehci-exynos: Decrese node refcount on exynos_ehci_get_phy() error paths
    KVM: nVMX: do not leak PML full vmexit to L1
    KVM: nVMX: initialize PML fields in vmcs02
    Revert "KVM: nested VMX: disable perf cpuid reporting"
    x86/platform/intel-mid: Correct MSI IRQ line for watchdog device
    kprobes/x86: Fix kernel panic when certain exception-handling addresses are probed
    clk: Make x86/ conditional on CONFIG_COMMON_CLK
    x86/pci-calgary: Fix iommu_free() comparison of unsigned expression >= 0
    x86/ioapic: Restore IO-APIC irq_chip retrigger callback
    mwifiex: Avoid skipping WEP key deletion for AP
    mwifiex: remove redundant dma padding in AMSDU
    mwifiex: debugfs: Fix (sometimes) off-by-1 SSID print
    ARM: OMAP5 / DRA7: Fix HYP mode boot for thumb2 build
    leds: ktd2692: avoid harmless maybe-uninitialized warning
    power: supply: bq24190_charger: Handle fault before status on interrupt
    power: supply: bq24190_charger: Don't read fault register outside irq_handle_thread()
    power: supply: bq24190_charger: Call power_supply_changed() for relevant component
    power: supply: bq24190_charger: Install irq_handler_thread() at end of probe()
    power: supply: bq24190_charger: Call set_mode_host() on pm_resume()
    power: supply: bq24190_charger: Fix irq trigger to IRQF_TRIGGER_FALLING
    powerpc/powernv: Fix opal_exit tracepoint opcode
    cpupower: Fix turbo frequency reporting for pre-Sandy Bridge cores
    ARM: 8452/3: PJ4: make coprocessor access sequences buildable in Thumb2 mode
    9p: fix a potential acl leak
Linux 4.4.67
    dm ioctl: prevent stack leak in dm ioctl call
    nfsd: stricter decoding of write-like NFSv2/v3 ops
    nfsd4: minor NFSv2/v3 write decoding cleanup
    ext4/fscrypto: avoid RCU lookup in d_revalidate
    ext4 crypto: use dget_parent() in ext4_d_revalidate()
    ext4 crypto: revalidate dentry after adding or removing the key
    ext4: require encryption feature for EXT4_IOC_SET_ENCRYPTION_POLICY
    IB/ehca: fix maybe-uninitialized warnings
    IB/qib: rename BITS_PER_PAGE to RVT_BITS_PER_PAGE
    netlink: Allow direct reclaim for fallback allocation
    8250_pci: Fix potential use-after-free in error path
    scsi: cxlflash: Improve EEH recovery time
    scsi: cxlflash: Fix to avoid EEH and host reset collisions
    scsi: cxlflash: Scan host only after the port is ready for I/O
    net: tg3: avoid uninitialized variable warning
    mtd: avoid stack overflow in MTD CFI code
    drbd: avoid redefinition of BITS_PER_PAGE
    ALSA: ppc/awacs: shut up maybe-uninitialized warning
    ASoC: intel: Fix PM and non-atomic crash in bytcr drivers
    Handle mismatched open calls
    timerfd: Protect the might cancel mechanism proper
Linux 4.4.66
    ftrace/x86: Fix triple fault with graph tracing and suspend-to-ram
    ARCv2: save r30 on kernel entry as gcc uses it for code-gen
    nfsd: check for oversized NFSv2/v3 arguments
    Input: i8042 - add Clevo P650RS to the i8042 reset list
    p9_client_readdir() fix
    MIPS: Avoid BUG warning in arch_check_elf
    MIPS: KGDB: Use kernel context for sleeping threads
    ALSA: seq: Don't break snd_use_lock_sync() loop by timeout
    ALSA: firewire-lib: fix inappropriate assignment between signed/unsigned type
    ipv6: check raw payload size correctly in ioctl
    ipv6: check skb->protocol before lookup for nexthop
    macvlan: Fix device ref leak when purging bc_queue
    ip6mr: fix notification device destruction
    netpoll: Check for skb->queue_mapping
    net: ipv6: RTF_PCPU should not be settable from userspace
    dp83640: don't recieve time stamps twice
    tcp: clear saved_syn in tcp_disconnect()
    sctp: listen on the sock only when it's state is listening or closed
    net: ipv4: fix multipath RTM_GETROUTE behavior when iif is given
    l2tp: fix PPP pseudo-wire auto-loading
    l2tp: take reference on sessions being dumped
    net/packet: fix overflow in check for tp_reserve
    net/packet: fix overflow in check for tp_frame_nr
    l2tp: purge socket queues in the .destruct() callback
    net: phy: handle state correctly in phy_stop_machine
    net: neigh: guard against NULL solicit() method
    sparc64: Fix kernel panic due to erroneous #ifdef surrounding pmd_write()
    sparc64: kern_addr_valid regression
    xen/x86: don't lose event interrupts
    usb: gadget: f_midi: Fixed a bug when buflen was smaller than wMaxPacketSize
    regulator: core: Clear the supply pointer if enabling fails
    RDS: Fix the atomicity for congestion map update
    net_sched: close another race condition in tcf_mirred_release()
    net: cavium: liquidio: Avoid dma_unmap_single on uninitialized ndata
    MIPS: Fix crash registers on non-crashing CPUs
    md:raid1: fix a dead loop when read from a WriteMostly disk
    ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
    drm/amdgpu: fix array out of bounds
    crypto: testmgr - fix out of bound read in __test_aead()
    clk: sunxi: Add apb0 gates for H3
    ARM: OMAP2+: timer: add probe for clocksources
    xc2028: unlock on error in xc2028_set_config()
    f2fs: do more integrity verification for superblock
Linux 4.4.65
    perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
    ping: implement proper locking
    staging/android/ion : fix a race condition in the ion driver
    vfio/pci: Fix integer overflows, bitmask check
    tipc: check minimum bearer MTU
    netfilter: nfnetlink: correctly validate length of batch messages
    xc2028: avoid use after free
    mnt: Add a per mount namespace limit on the number of mounts
    tipc: fix socket timer deadlock
    tipc: fix random link resets while adding a second bearer
    gfs2: avoid uninitialized variable warning
    hostap: avoid uninitialized variable use in hfa384x_get_rid
    tty: nozomi: avoid a harmless gcc warning
    tipc: correct error in node fsm
    tipc: re-enable compensation for socket receive buffer double counting
    tipc: make dist queue pernet
    tipc: make sure IPv6 header fits in skb headroom
Linux 4.4.64
    tipc: fix crash during node removal
    block: fix del_gendisk() vs blkdev_ioctl crash
    x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions
    hv: don't reset hv_context.tsc_page on crash
    Drivers: hv: balloon: account for gaps in hot add regions
    Drivers: hv: balloon: keep track of where ha_region starts
    Tools: hv: kvp: ensure kvp device fd is closed on exec
    kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd
    x86/mce/AMD: Give a name to MCA bank 3 when accessed with legacy MSRs
    powerpc/kprobe: Fix oops when kprobed on 'stdu' instruction
    ubi/upd: Always flush after prepared for an update
    mac80211: reject ToDS broadcast data frames
    mmc: sdhci-esdhc-imx: increase the pad I/O drive strength for DDR50 card
    ACPI / power: Avoid maybe-uninitialized warning
    Input: elantech - add Fujitsu Lifebook E547 to force crc_enabled
    VSOCK: Detach QP check should filter out non matching QPs.
    Drivers: hv: vmbus: Reduce the delay between retries in vmbus_post_msg()
    Drivers: hv: get rid of timeout in vmbus_open()
    Drivers: hv: don't leak memory in vmbus_establish_gpadl()
    s390/mm: fix CMMA vs KSM vs others
    CIFS: remove bad_network_name flag
    cifs: Do not send echoes before Negotiate is complete
    ring-buffer: Have ring_buffer_iter_empty() return true when empty
    tracing: Allocate the snapshot buffer before enabling probe
    KEYS: fix keyctl_set_reqkey_keyring() to not leak thread keyrings
    KEYS: Change the name of the dead type to ".dead" to prevent user access
    KEYS: Disallow keyrings beginning with '.' to be joined as session keyrings
Linux 4.4.63
    MIPS: fix Select HAVE_IRQ_EXIT_ON_IRQ_STACK patch.
    sctp: deny peeloff operation on asocs with threads sleeping on it
    net: ipv6: check route protocol when deleting routes
    tty/serial: atmel: RS485 half duplex w/DMA: enable RX after TX is done
    SUNRPC: fix refcounting problems with auth_gss messages.
    ibmveth: calculate gso_segs for large packets
    catc: Use heap buffer for memory size test
    catc: Combine failure cleanup code in catc_probe()
    rtl8150: Use heap buffers for all register access
    pegasus: Use heap buffers for all register access
    virtio-console: avoid DMA from stack
    dvb-usb-firmware: don't do DMA on stack
    dvb-usb: don't use stack for firmware load
    mm: Tighten x86 /dev/mem with zeroing reads
    rtc: tegra: Implement clock handling
    platform/x86: acer-wmi: setup accelerometer when machine has appropriate notify event
    ext4: fix inode checksum calculation problem if i_extra_size is small
    dvb-usb-v2: avoid use-after-free
    ath9k: fix NULL pointer dereference
    crypto: ahash - Fix EINPROGRESS notification callback
    powerpc: Disable HFSCR[TM] if TM is not supported
    zram: do not use copy_page with non-page aligned address
    kvm: fix page struct leak in handle_vmon
    Revert "MIPS: Lantiq: Fix cascaded IRQ setup"
    char: lack of bool string made CONFIG_DEVPORT always on
    char: Drop bogus dependency of DEVPORT on !M68K
    ftrace: Fix removing of second function probe
    irqchip/irq-imx-gpcv2: Fix spinlock initialization
    libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat
    xen, fbfront: fix connecting to backend
    scsi: sd: Fix capacity calculation with 32-bit sector_t
    scsi: sd: Consider max_xfer_blocks if opt_xfer_blocks is unusable
    scsi: sr: Sanity check returned mode data
    iscsi-target: Drop work-around for legacy GlobalSAN initiator
    iscsi-target: Fix TMR reference leak during session shutdown
    acpi, nfit, libnvdimm: fix interleave set cookie calculation (64-bit comparison)
    x86/vdso: Plug race between mapping and ELF header setup
    x86/vdso: Ensure vdso32_enabled gets set to valid values only
    perf/x86: Avoid exposing wrong/stale data in intel_pmu_lbr_read_32()
    Input: xpad - add support for Razer Wildcat gamepad
    CIFS: store results of cifs_reopen_file to avoid infinite wait
    drm/nouveau/mmu/nv4a: use nv04 mmu rather than the nv44 one
    drm/nouveau/mpeg: mthd returns true on success now
    thp: fix MADV_DONTNEED vs clear soft dirty race
    cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
Linux 4.4.62
    ibmveth: set correct gso_size and gso_type
    net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
    net/mlx4_core: Fix racy CQ (Completion Queue) free
    net/mlx4_en: Fix bad WQE issue
    usb: hub: Wait for connection to be reestablished after port reset
    blk-mq: Avoid memory reclaim when remapping queues
    net/packet: fix overflow in check for priv area size
    crypto: caam - fix RNG deinstantiation error checking
    MIPS: IRQ Stack: Fix erroneous jal to plat_irq_dispatch
    MIPS: Select HAVE_IRQ_EXIT_ON_IRQ_STACK
    MIPS: Switch to the irq_stack in interrupts
    MIPS: Only change $28 to thread_info if coming from user mode
    MIPS: Stack unwinding while on IRQ stack
    MIPS: Introduce irq_stack
    mtd: bcm47xxpart: fix parsing first block after aligned TRX
    usb: dwc3: gadget: delay unmap of bounced requests
    drm/i915: Stop using RP_DOWN_EI on Baytrail
    drm/i915: Avoid tweaking evaluation thresholds on Baytrail v3
Linux 4.4.61
    mm/mempolicy.c: fix error handling in set_mempolicy and mbind.
    MIPS: Flush wrong invalid FTLB entry for huge page
    MIPS: Lantiq: fix missing xbar kernel panic
    MIPS: End spinlocks with .insn
    MIPS: ralink: Fix typos in rt3883 pinctrl
    MIPS: Force o32 fp64 support on 32bit MIPS64r6 kernels
    s390/uaccess: get_user() should zero on failure (again)
    s390/decompressor: fix initrd corruption caused by bss clear
    nios2: reserve boot memory for device tree
    powerpc: Don't try to fix up misaligned load-with-reservation instructions
    powerpc/mm: Add missing global TLB invalidate if cxl is active
    metag/usercopy: Add missing fixups
    metag/usercopy: Fix src fixup in from user rapf loops
    metag/usercopy: Set flags before ADDZ
    metag/usercopy: Zero rest of buffer from copy_from_user
    metag/usercopy: Add early abort to copy_to_user
    metag/usercopy: Fix alignment error checking
    metag/usercopy: Drop unused macros
    ring-buffer: Fix return value check in test_ringbuffer()
    ptrace: fix PTRACE_LISTEN race corrupting task->state
    Reset TreeId to zero on SMB2 TREE_CONNECT
    iio: bmg160: reset chip when probing
    arm/arm64: KVM: Take mmap_sem in kvm_arch_prepare_memory_region
    arm/arm64: KVM: Take mmap_sem in stage2_unmap_vm
    staging: android: ashmem: lseek failed due to no FMODE_LSEEK.
    sysfs: be careful of error returns from ops->show()
    drm/vmwgfx: fix integer overflow in vmw_surface_define_ioctl()
    drm/vmwgfx: Remove getparam error message
    drm/ttm, drm/vmwgfx: Relax permission checking when opening surfaces
    drm/vmwgfx: avoid calling vzalloc with a 0 size in vmw_get_cap_3d_ioctl()
    drm/vmwgfx: NULL pointer dereference in vmw_surface_define_ioctl()
    drm/vmwgfx: Type-check lookups of fence objects

Bug: 62730977
Change-Id: I4458200bbc977cf55a134fd9fd08627604e36d95
Signed-off-by: Thierry Strudel <tstrudel@google.com>
2017-09-20 15:50:18 -07:00
Oleg Nesterov
717b77c07f UPSTREAM: kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function
get_task_struct(tsk) no longer pins tsk->stack so all users of
to_live_kthread() should do try_get_task_stack/put_task_stack to protect
"struct kthread" which lives on kthread's stack.

TODO: Kill to_live_kthread(), perhaps we can even kill "struct kthread" too,
and rework kthread_stop(), it can use task_work_add() to sync with the exiting
kernel thread.

Message-Id: <20160629180357.GA7178@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/cb9b16bbc19d4aea4507ab0552e4644c1211d130.1474003868.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Bug: 64672701
Change-Id: I2872658e56dcb1ab4173c490ef8f52affa54a404
(cherry picked from commit 23196f2e5f5d810578a772785807dcdc2b9fdce9)
Signed-off-by: Zubin Mithra <zsm@google.com>
2017-08-21 16:04:11 +01:00
Petr Mladek
0025d3d13e BACKPORT: kthread: allow to cancel kthread work
We are going to use kthread workers more widely and sometimes we will need
to make sure that the work is neither pending nor running.

This patch implements cancel_*_sync() operations as inspired by
workqueues.  Well, we are synchronized against the other operations via
the worker lock, we use del_timer_sync() and a counter to count parallel
cancel operations.  Therefore the implementation might be easier.

First, we check if a worker is assigned.  If not, the work has newer been
queued after it was initialized.

Second, we take the worker lock.  It must be the right one.  The work must
not be assigned to another worker unless it is initialized in between.

Third, we try to cancel the timer when it exists.  The timer is deleted
synchronously to make sure that the timer call back is not running.  We
need to temporary release the worker->lock to avoid a possible deadlock
with the callback.  In the meantime, we set work->canceling counter to
avoid any queuing.

Fourth, we try to remove the work from a worker list. It might be
the list of either normal or delayed works.

Fifth, if the work is running, we call kthread_flush_work().  It might
take an arbitrary time.  We need to release the worker-lock again.  In the
meantime, we again block any queuing by the canceling counter.

As already mentioned, the check for a pending kthread work is done under a
lock.  In compare with workqueues, we do not need to fight for a single
PENDING bit to block other operations.  Therefore we do not suffer from
the thundering storm problem and all parallel canceling jobs might use
kthread_flush_work().  Any queuing is blocked until the counter gets zero.

Change-Id: I8a8ece0f93c828f311d0ad5c88d80db2388e4808
Link: http://lkml.kernel.org/r/1470754545-17632-10-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry-picked from 37be45d49dec2a411e29d50c9597cfe8184b5645)
[major changes to the original patch while cherry-picking; only rebased
the sync variant]
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
2017-05-13 13:02:56 -07:00
Tejun Heo
3144d81a77 cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups
commit 77f88796cee819b9c4562b0b6b44691b3b7755b1 upstream.

Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrected CPU affinity breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland.  kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-21 09:30:04 +02:00
Andrew Morton
e9f069868d kernel/kthread.c:kthread_create_on_node(): clarify documentation
- Make it clear that the `node' arg refers to memory allocations only:
  kthread_create_on_node() does not pin the new thread to that node's
  CPUs.

- Encourage the use of NUMA_NO_NODE.

[nzimmer@sgi.com: use NUMA_NO_NODE in kthread_create() also]
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Linus Torvalds
a1d8561172 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The biggest change in this cycle is the rewrite of the main SMP load
  balancing metric: the CPU load/utilization.  The main goal was to make
  the metric more precise and more representative - see the changelog of
  this commit for the gory details:

    9d89c257df ("sched/fair: Rewrite runnable load and utilization average tracking")

  It is done in a way that significantly reduces complexity of the code:

    5 files changed, 249 insertions(+), 494 deletions(-)

  and the performance testing results are encouraging.  Nevertheless we
  need to keep an eye on potential regressions, since this potentially
  affects every SMP workload in existence.

  This work comes from Yuyang Du.

  Other changes:

   - SCHED_DL updates.  (Andrea Parri)

   - Simplify architecture callbacks by removing finish_arch_switch().
     (Peter Zijlstra et al)

   - cputime accounting: guarantee stime + utime == rtime.  (Peter
     Zijlstra)

   - optimize idle CPU wakeups some more - inspired by Facebook server
     loads.  (Mike Galbraith)

   - stop_machine fixes and updates.  (Oleg Nesterov)

   - Introduce the 'trace_sched_waking' tracepoint.  (Peter Zijlstra)

   - sched/numa tweaks.  (Srikar Dronamraju)

   - misc fixes and small cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  sched/deadline: Fix comment in enqueue_task_dl()
  sched/deadline: Fix comment in push_dl_tasks()
  sched: Change the sched_class::set_cpus_allowed() calling context
  sched: Make sched_class::set_cpus_allowed() unconditional
  sched: Fix a race between __kthread_bind() and sched_setaffinity()
  sched: Ensure a task has a non-normalized vruntime when returning back to CFS
  sched/numa: Fix NUMA_DIRECT topology identification
  tile: Reorganize _switch_to()
  sched, sparc32: Update scheduler comments in copy_thread()
  sched: Remove finish_arch_switch()
  sched, tile: Remove finish_arch_switch
  sched, sh: Fold finish_arch_switch() into switch_to()
  sched, score: Remove finish_arch_switch()
  sched, avr32: Remove finish_arch_switch()
  sched, MIPS: Get rid of finish_arch_switch()
  sched, arm: Remove finish_arch_switch()
  sched/fair: Clean up load average references
  sched/fair: Provide runnable_load_avg back to cfs_rq
  sched/fair: Remove task and group entity load when they are dead
  sched/fair: Init cfs_rq's sched_entity load average
  ...
2015-08-31 20:26:22 -07:00
Peter Zijlstra
25834c73f9 sched: Fix a race between __kthread_bind() and sched_setaffinity()
Because sched_setscheduler() checks p->flags & PF_NO_SETAFFINITY
without locks, a caller might observe an old value and race with the
set_cpus_allowed_ptr() call from __kthread_bind() and effectively undo
it:

	__kthread_bind()
	  do_set_cpus_allowed()
						<SYSCALL>
						  sched_setaffinity()
						    if (p->flags & PF_NO_SETAFFINITIY)
						    set_cpus_allowed_ptr()
	  p->flags |= PF_NO_SETAFFINITY

Fix the bug by putting everything under the regular scheduler locks.

This also closes a hole in the serialization of task_struct::{nr_,}cpus_allowed.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dedekind1@gmail.com
Cc: juri.lelli@arm.com
Cc: mgorman@suse.de
Cc: riel@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20150515154833.545640346@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-12 12:06:09 +02:00
David Kershner
18896451ea kthread: export kthread functions
The s-Par visornic driver, currently in staging, processes a queue being
serviced by the an s-Par service partition.  We can get a message that
something has happened with the Service Partition, when that happens, we
must not access the channel until we get a message that the service
partition is back again.

The visornic driver has a thread for processing the channel, when we get
the message, we need to be able to park the thread and then resume it
when the problem clears.

We can do this with kthread_park and unpark but they are not exported
from the kernel, this patch exports the needed functions.

Signed-off-by: David Kershner <david.kershner@unisys.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:41 +03:00
Nishanth Aravamudan
109228389a kernel/kthread.c: partial revert of 81c98869fa ("kthread: ensure locality of task_struct allocations")
After discussions with Tejun, we don't want to spread the use of
cpu_to_mem() (and thus knowledge of allocators/NUMA topology details) into
callers, but would rather ensure the callees correctly handle memoryless
nodes.  With the previous patches ("topology: add support for
node_to_mem_node() to determine the fallback node" and "slub: fallback to
node_to_mem_node() node if allocating on memoryless node") adding and
using node_to_mem_node(), we can safely undo part of the change to the
kthread logic from 81c98869fa.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anton Blanchard <anton@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:51 -04:00
Lai Jiangshan
ed1403ec2b kthread_work: wake up worker only when the worker is idle
If the worker is already executing a work item when another is queued,
we can safely skip wakeup without worrying about stalling queue thus
avoiding waking up the busy worker spuriously.  Spurious wakeups
should be fine but still isn't nice and avoiding it is trivial here.

tj: Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-07-28 14:07:52 -04:00
Tetsuo Handa
8fe6929cfd kthread: fix return value of kthread_create() upon SIGKILL.
Commit 786235eeba ("kthread: make kthread_create() killable") meant
for allowing kthread_create() to abort as soon as killed by the
OOM-killer.  But returning -ENOMEM is wrong if killed by SIGKILL from
userspace.  Change kthread_create() to return -EINTR upon SIGKILL.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:53:51 -07:00
Nishanth Aravamudan
81c98869fa kthread: ensure locality of task_struct allocations
In the presence of memoryless nodes, numa_node_id() will return the
current CPU's NUMA node, but that may not be where we expect to allocate
from memory from.  Instead, we should rely on the fallback code in the
memory allocator itself, by using NUMA_NO_NODE.  Also, when calling
kthread_create_on_node(), use the nearest node with memory to the cpu in
question, rather than the node it is running on.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:20:49 -07:00
Tetsuo Handa
786235eeba kthread: make kthread_create() killable
Any user process callers of wait_for_completion() except global init
process might be chosen by the OOM killer while waiting for completion()
call by some other process which does memory allocation.  See
CVE-2012-4398 "kernel: request_module() OOM local DoS" can happen.

When such users are chosen by the OOM killer when they are waiting for
completion() in TASK_UNINTERRUPTIBLE, the system will be kept stressed
due to memory starvation because the OOM killer cannot kill such users.

kthread_create() is one of such users and this patch fixes the problem
for kthreadd by making kthread_create() killable - the same approach
used for fixing CVE-2012-4398.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:08:59 +09:00
Tejun Heo
cd42d559e4 kthread: implement probe_kthread_data()
One of the problems that arise when converting dedicated custom threadpool
to workqueue is that the shared worker pool used by workqueue anonimizes
each worker making it more difficult to identify what the worker was doing
on which target from the output of sysrq-t or debug dump from oops, BUG()
and friends.

For example, after writeback is converted to use workqueue instead of
priviate thread pool, there's no easy to tell which backing device a
writeback work item was working on at the time of task dump, which,
according to our writeback brethren, is important in tracking down issues
with a lot of mounted file systems on a lot of different devices.

This patchset implements a way for a work function to mark its execution
instance so that task dump of the worker task includes information to
indicate what the work item was doing.

An example WARN dump would look like the following.

 WARNING: at fs/fs-writeback.c:1015 bdi_writeback_workfn+0x2b4/0x3c0()
 Modules linked in:
 CPU: 0 Pid: 28 Comm: kworker/u18:0 Not tainted 3.9.0-rc1-work+ #24
 Hardware name: empty empty/S3992, BIOS 080011  10/26/2007
 Workqueue: writeback bdi_writeback_workfn (flush-8:16)
  ffffffff820a3a98 ffff88015b927cb8 ffffffff81c61855 ffff88015b927cf8
  ffffffff8108f500 0000000000000000 ffff88007a171948 ffff88007a1716b0
  ffff88015b49df00 ffff88015b8d3940 0000000000000000 ffff88015b927d08
 Call Trace:
  [<ffffffff81c61855>] dump_stack+0x19/0x1b
  [<ffffffff8108f500>] warn_slowpath_common+0x70/0xa0
  ...

This patch:

Implement probe_kthread_data() which returns kthread_data if accessible.
The function is equivalent to kthread_data() except that the specified
@task may not be a kthread or its vfork_done is already cleared rendering
struct kthread inaccessible.  In the former case, probe_kthread_data() may
return any value.  In the latter, NULL.

This will be used to safely print debug information without affecting
synchronization in the normal paths.  Workqueue debug info printing on
dump_stack() and friends will make use of it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:02 -07:00
Linus Torvalds
46d9be3e5e Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "A lot of activities on workqueue side this time.  The changes achieve
  the followings.

   - WQ_UNBOUND workqueues - the workqueues which are per-cpu - are
     updated to be able to interface with multiple backend worker pools.
     This involved a lot of churning but the end result seems actually
     neater as unbound workqueues are now a lot closer to per-cpu ones.

   - The ability to interface with multiple backend worker pools are
     used to implement unbound workqueues with custom attributes.
     Currently the supported attributes are the nice level and CPU
     affinity.  It may be expanded to include cgroup association in
     future.  The attributes can be specified either by calling
     apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
     the workqueue in question is exported through sysfs.

     The backend worker pools are keyed by the actual attributes and
     shared by any workqueues which share the same attributes.  When
     attributes of a workqueue are changed, the workqueue binds to the
     worker pool with the specified attributes while leaving the work
     items which are already executing in its previous worker pools
     alone.

     This allows converting custom worker pool implementations which
     want worker attribute tuning to use workqueues.  The writeback pool
     is already converted in block tree and there are a couple others
     are likely to follow including btrfs io workers.

   - WQ_UNBOUND's ability to bind to multiple worker pools is also used
     to make it NUMA-aware.  Because there's no association between work
     item issuer and the specific worker assigned to execute it, before
     this change, using unbound workqueue led to unnecessary cross-node
     bouncing and it couldn't be helped by autonuma as it requires tasks
     to have implicit node affinity and workers are assigned randomly.

     After these changes, an unbound workqueue now binds to multiple
     NUMA-affine worker pools so that queued work items are executed in
     the same node.  This is turned on by default but can be disabled
     system-wide or for individual workqueues.

     Crypto was requesting NUMA affinity as encrypting data across
     different nodes can contribute noticeable overhead and doing it
     per-cpu was too limiting for certain cases and IO throughput could
     be bottlenecked by one CPU being fully occupied while others have
     idle cycles.

  While the new features required a lot of changes including
  restructuring locking, it didn't complicate the execution paths much.
  The unbound workqueue handling is now closer to per-cpu ones and the
  new features are implemented by simply associating a workqueue with
  different sets of backend worker pools without changing queue,
  execution or flush paths.

  As such, even though the amount of change is very high, I feel
  relatively safe in that it isn't likely to cause subtle issues with
  basic correctness of work item execution and handling.  If something
  is wrong, it's likely to show up as being associated with worker pools
  with the wrong attributes or OOPS while workqueue attributes are being
  changed or during CPU hotplug.

  While this creates more backend worker pools, it doesn't add too many
  more workers unless, of course, there are many workqueues with unique
  combinations of attributes.  Assuming everything else is the same,
  NUMA awareness costs an extra worker pool per NUMA node with online
  CPUs.

  There are also a couple things which are being routed outside the
  workqueue tree.

   - block tree pulled in workqueue for-3.10 so that writeback worker
     pool can be converted to unbound workqueue with sysfs control
     exposed.  This simplifies the code, makes writeback workers
     NUMA-aware and allows tuning nice level and CPU affinity via sysfs.

   - The conversion to workqueue means that there's no 1:1 association
     between a specific worker, which makes writeback folks unhappy as
     they want to be able to tell which filesystem caused a problem from
     backtrace on systems with many filesystems mounted.  This is
     resolved by allowing work items to set debug info string which is
     printed when the task is dumped.  As this change involves unifying
     implementations of dump_stack() and friends in arch codes, it's
     being routed through Andrew's -mm tree."

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
  workqueue: use kmem_cache_free() instead of kfree()
  workqueue: avoid false negative WARN_ON() in destroy_workqueue()
  workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
  workqueue: implement NUMA affinity for unbound workqueues
  workqueue: introduce put_pwq_unlocked()
  workqueue: introduce numa_pwq_tbl_install()
  workqueue: use NUMA-aware allocation for pool_workqueues
  workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
  workqueue: map an unbound workqueues to multiple per-node pool_workqueues
  workqueue: move hot fields of workqueue_struct to the end
  workqueue: make workqueue->name[] fixed len
  workqueue: add workqueue->unbound_attrs
  workqueue: determine NUMA node of workers accourding to the allowed cpumask
  workqueue: drop 'H' from kworker names of unbound worker pools
  workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
  workqueue: fix memory leak in apply_workqueue_attrs()
  workqueue: fix unbound workqueue attrs hashing / comparison
  workqueue: fix race condition in unbound workqueue free path
  workqueue: remove pwq_lock which is no longer used
  ...
2013-04-29 19:07:40 -07:00
Oleg Nesterov
b5c5442bb6 kthread: kill task_get_live_kthread()
task_get_live_kthread() looks confusing and unneeded.  It does
get_task_struct() but only kthread_stop() needs this, it can be called
even if the calller doesn't have a reference when we know that this
kthread can't exit until we do kthread_stop().

kthread_park() and kthread_unpark() do not need get_task_struct(), the
callers already have the reference.  And it can not help if we can race
with the exiting kthread anyway, kthread_park() can hang forever in this
case.

Change kthread_park() and kthread_unpark() to use to_live_kthread(),
change kthread_stop() to do get_task_struct() by hand and remove
task_get_live_kthread().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:25 -07:00
Oleg Nesterov
4ecdafc808 kthread: introduce to_live_kthread()
"k->vfork_done != NULL" with a barrier() after to_kthread(k) in
task_get_live_kthread(k) looks unclear, and sub-optimal because we load
->vfork_done twice.

All we need is to ensure that we do not return to_kthread(NULL).  Add a
new trivial helper which loads/checks ->vfork_done once, this also looks
more understandable.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29 15:54:25 -07:00
Thomas Gleixner
f2530dc71c kthread: Prevent unpark race which puts threads on the wrong cpu
The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:

CPU0	       	 	CPU1  	     	    CPU2
unpark(T)				    wake_up_process(T)
  clear(SHOULD_PARK)	T runs
			leave parkme() due to !SHOULD_PARK  
  bind_to(CPU2)		BUG_ON(wrong CPU)						    

We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.

The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.

Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.

The migration thread has another related issue.

CPU0	      	     	 CPU1
Bring up CPU2
create_thread(T)
park(T)
 wait_for_completion()
			 parkme()
			 complete()
sched_set_stop_task()
			 schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().

Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Dave Hansen <dave@sr71.net>
Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Peter Ziljstra <peterz@infradead.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: dhillf@gmail.com
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-12 14:18:43 +02:00
Tejun Heo
14a40ffccd sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY
PF_THREAD_BOUND was originally used to mark kernel threads which were
bound to a specific CPU using kthread_bind() and a task with the flag
set allows cpus_allowed modifications only to itself.  Workqueue is
currently abusing it to prevent userland from meddling with
cpus_allowed of workqueue workers.

What we need is a flag to prevent userland from messing with
cpus_allowed of certain kernel tasks.  In kernel, anyone can
(incorrectly) squash the flag, and, for worker-type usages,
restricting cpus_allowed modification to the task itself doesn't
provide meaningful extra proection as other tasks can inject work
items to the task anyway.

This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
sched_setaffinity() checks the flag and return -EINVAL if set.
set_cpus_allowed_ptr() is no longer affected by the flag.

This will allow simplifying workqueue worker CPU affinity management.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
2013-03-19 13:45:20 -07:00
Lai Jiangshan
aee4faa499 kthread: use N_MEMORY instead N_HIGH_MEMORY
N_HIGH_MEMORY stands for the nodes that has normal or high memory.
N_MEMORY stands for the nodes that has any memory.

The code here need to handle with the nodes which have memory, we should
use N_MEMORY instead.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 17:38:33 -08:00
Linus Torvalds
4e21fc138b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull third pile of kernel_execve() patches from Al Viro:
 "The last bits of infrastructure for kernel_thread() et.al., with
  alpha/arm/x86 use of those.  Plus sanitizing the asm glue and
  do_notify_resume() on alpha, fixing the "disabled irq while running
  task_work stuff" breakage there.

  At that point the rest of kernel_thread/kernel_execve/sys_execve work
  can be done independently for different architectures.  The only
  pending bits that do depend on having all architectures converted are
  restrictred to fs/* and kernel/* - that'll obviously have to wait for
  the next cycle.

  I thought we'd have to wait for all of them done before we start
  eliminating the longjump-style insanity in kernel_execve(), but it
  turned out there's a very simple way to do that without flagday-style
  changes."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  alpha: switch to saner kernel_execve() semantics
  arm: switch to saner kernel_execve() semantics
  x86, um: convert to saner kernel_execve() semantics
  infrastructure for saner ret_from_kernel_thread semantics
  make sure that kernel_thread() callbacks call do_exit() themselves
  make sure that we always have a return path from kernel_execve()
  ppc: eeh_event should just use kthread_run()
  don't bother with kernel_thread/kernel_execve for launching linuxrc
  alpha: get rid of switch_stack argument of do_work_pending()
  alpha: don't bother passing switch_stack separately from regs
  alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
  alpha: simplify TIF_NEED_RESCHED handling
2012-10-13 10:05:52 +09:00
Al Viro
a74fb73c12 infrastructure for saner ret_from_kernel_thread semantics
* allow kernel_execve() leave the actual return to userland to
caller (selected by CONFIG_GENERIC_KERNEL_EXECVE).  Callers
updated accordingly.
* architecture that does select GENERIC_KERNEL_EXECVE in its
Kconfig should have its ret_from_kernel_thread() do this:
	call schedule_tail
	call the callback left for it by copy_thread(); if it ever
returns, that's because it has just done successful kernel_execve()
	jump to return from syscall
IOW, its only difference from ret_from_fork() is that it does call the
callback.
* such an architecture should also get rid of ret_from_kernel_execve()
and __ARCH_WANT_KERNEL_EXECVE

This is the last part of infrastructure patches in that area - from
that point on work on different architectures can live independently.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 13:35:07 -04:00
Thomas Gleixner
2a1d446019 kthread: Implement park/unpark facility
To avoid the full teardown/setup of per cpu kthreads in the case of
cpu hot(un)plug, provide a facility which allows to put the kthread
into a park position and unpark it when the cpu comes online again.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20120716103948.236618824@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-13 17:01:06 +02:00
Tejun Heo
46f3d97621 kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed
kthread_worker provides minimalistic workqueue-like interface for
users which need a dedicated worker thread (e.g. for realtime
priority).  It has basic queue, flush_work, flush_worker operations
which mostly match the workqueue counterparts; however, due to the way
flush_work() is implemented, it has a noticeable difference of not
allowing work items to be freed while being executed.

While the current users of kthread_worker are okay with the current
behavior, the restriction does impede some valid use cases.  Also,
removing this difference isn't difficult and actually makes the code
easier to understand.

This patch reimplements flush_kthread_work() such that it uses a
flush_work item instead of queue/done sequence numbers.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-07-22 10:15:28 -07:00
Tejun Heo
9a2e03d8ed kthread_worker: reorganize to prepare for flush_kthread_work() reimplementation
Make the following two non-functional changes.

* Separate out insert_kthread_work() from queue_kthread_work().

* Relocate struct kthread_flush_work and kthread_flush_work_fn()
  definitions above flush_kthread_work().

v2: Added lockdep_assert_held() in insert_kthread_work() as suggested
    by Andy Walls.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andy Walls <awalls@md.metrocast.net>
2012-07-22 10:11:01 -07:00
Tejun Heo
34b087e483 freezer: kill unused set_freezable_with_signal()
There's no in-kernel user of set_freezable_with_signal() left.  Mixing
TIF_SIGPENDING with kernel threads can lead to nasty corner cases as
kernel threads never travel signal delivery path on their own.

e.g. the current implementation is buggy in the cancelation path of
__thaw_task().  It calls recalc_sigpending_and_wake() in an attempt to
clear TIF_SIGPENDING but the function never clears it regardless of
sigpending state.  This means that signallable freezable kthreads may
continue executing with !freezing() && stuck TIF_SIGPENDING, which can
be troublesome.

This patch removes set_freezable_with_signal() along with
PF_FREEZER_NOSIG and recalc_sigpending*() calls in freezer.  User
tasks get TIF_SIGPENDING, kernel tasks get woken up and the spurious
sigpending is dealt with in the usual signal delivery path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
2011-11-23 09:28:17 -08:00
Tejun Heo
8a32c441c1 freezer: implement and use kthread_freezable_should_stop()
Writeback and thinkpad_acpi have been using thaw_process() to prevent
deadlock between the freezer and kthread_stop(); unfortunately, this
is inherently racy - nothing prevents freezing from happening between
thaw_process() and kthread_stop().

This patch implements kthread_freezable_should_stop() which enters
refrigerator if necessary but is guaranteed to return if
kthread_stop() is invoked.  Both thaw_process() users are converted to
use the new function.

Note that this deadlock condition exists for many of freezable
kthreads.  They need to be converted to use the new should_stop or
freezable workqueue.

Tested with synthetic test case.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
2011-11-21 12:32:23 -08:00
Paul Gortmaker
9984de1a5a kernel: Map most files to use export.h instead of module.h
The changed files were only including linux/module.h for the
EXPORT_SYMBOL infrastructure, and nothing else.  Revector them
onto the isolated export header for faster compile times.

Nothing to see here but a whole lot of instances of:

  -#include <linux/module.h>
  +#include <linux/export.h>

This commit is only changing the kernel dir; next targets
will probably be mm, fs, the arch dirs, etc.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 09:20:12 -04:00
KOSAKI Motohiro
1e1b6c511d cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed
The rule is, we have to update tsk->rt.nr_cpus_allowed if we change
tsk->cpus_allowed. Otherwise RT scheduler may confuse.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-28 17:02:57 +02:00
Lucas De Marchi
25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Eric Dumazet
207205a2ba kthread: NUMA aware kthread_create_on_node()
All kthreads being created from a single helper task, they all use memory
from a single node for their kernel stack and task struct.

This patch suite creates kthread_create_on_node(), adding a 'cpu' parameter
to parameters already used by kthread_create().

This parameter serves in allocating memory for the new kthread on its
memory node if possible.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:01 -07:00
Peter Zijlstra
c9b5f501ef sched: Constify function scope static struct sched_param usage
Function-scope statics are discouraged because they are
easily overlooked and can cause subtle bugs/races due to
their global (non-SMP safe) nature.

Linus noticed that we did this for sched_param - at minimum
make the const.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: Message-ID: <AANLkTinotRxScOHEb0HgFgSpGPkq_6jKTv5CfvnQM=ee@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-07 15:55:45 +01:00
Ingo Molnar
27066fd484 Merge commit 'v2.6.37' into sched/core
Merge reason: Merge the final .37 tree.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-05 14:14:46 +01:00
Yong Zhang
4f32e9b1f8 kthread_work: make lockdep happy
spinlock in kthread_worker and wait_queue_head in kthread_work both
should be lockdep sensible, so change the interface to make it
suiltable for CONFIG_LOCKDEP.

tj: comment update

Reported-by: Nicolas <nicolas.mailhot@laposte.net>
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Andy Walls <awalls@md.metrocast.net>
Tested-by: Andy Walls <awalls@md.metrocast.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-22 10:27:53 +01:00
KOSAKI Motohiro
fe7de49f9d sched: Make sched_param argument static in sched_setscheduler() callers
Andrew Morton pointed out almost all sched_setscheduler() callers are
using fixed parameters and can be converted to static.  It reduces runtime
memory use a little.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-23 17:56:48 +02:00
Tejun Heo
82805ab77d kthread: implement kthread_data()
Implement kthread_data() which takes @task pointing to a kthread and
returns @data specified when creating the kthread.  The caller is
responsible for ensuring the validity of @task when calling this
function.

Signed-off-by: Tejun Heo <tj@kernel.org>
2010-06-29 10:07:09 +02:00
Tejun Heo
b56c0d8937 kthread: implement kthread_worker
Implement simple work processor for kthread.  This is to ease using
kthread.  Single thread workqueue used to be used for things like this
but workqueue won't guarantee fixed kthread association anymore to
enable worker sharing.

This can be used in cases where specific kthread association is
necessary, for example, when it should have RT priority or be assigned
to certain cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
2010-06-29 10:07:09 +02:00
Miao Xie
5ab116c934 cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node
cpuset_mem_spread_node() returns an offline node, and causes an oops.

This patch fixes it by initializing task->mems_allowed to
node_states[N_HIGH_MEMORY], and updating task->mems_allowed when doing
memory hotplug.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Reported-by: Nick Piggin <npiggin@suse.de>
Tested-by: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-24 16:31:21 -07:00
Anton Blanchard
301ba0457f kthread, sched: Remove reference to kthread_create_on_cpu
kthread_create_on_cpu doesn't exist so update a comment in
kthread.c to reflect this.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100209040740.GB3702@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-09 11:47:39 +01:00
Peter Zijlstra
881232b70b sched: Move kthread_bind() back to kthread.c
Since kthread_bind() lost its dependencies on sched.c, move it
back where it came from.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.039524041@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-12-16 19:01:57 +01:00
Mike Galbraith
b84ff7d6f1 sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c
Eric Paris reported that commit
f685ceacab causes boot time
PREEMPT_DEBUG complaints.

 [    4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
 [    4.593043] caller is task_hot+0x86/0xd0

Since kthread_bind() messes with scheduler internals, move the
body to sched.c, and lock the runqueue.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Tested-by: Eric Paris <eparis@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1256813310.7574.3.camel@marge.simson.net>
[ v2: fix !SMP build and clean up ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-03 07:25:00 +01:00
Mike Galbraith
61cbe54d94 sched: Keep kthreads at default priority
Removes kthread/workqueue priority boost, they increase worst-case
desktop latencies.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1252486344.28645.18.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-09 17:30:06 +02:00
Oleg Nesterov
9ae260270c update the comment in kthread_stop()
Commit 63706172f3 ("kthreads: rework
kthread_stop()") removed the limitation that the thread function mysr
not call do_exit() itself, but forgot to update the comment.

Since that commit it is OK to use kthread_stop() even if kthread can
exit itself.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-27 12:15:46 -07:00
Oleg Nesterov
63706172f3 kthreads: rework kthread_stop()
Based on Eric's patch which in turn was based on my patch.

kthread_stop() has the nasty problems:

- it runs unpredictably long with the global semaphore held.

- it deadlocks if kthread itself does kthread_stop() before it obeys
  the kthread_should_stop() request.

- it is not useable if kthread exits on its own, see for example the
  ugly "wait_to_die:" hack in migration_thread()

- it is not possible to just tell kthread it should stop, we must always
  wait for its exit.

With this patch kthread() allocates all neccesary data (struct kthread) on
its own stack, globals kthread_stop_xxx are deleted.  ->vfork_done is used
as a pointer into "struct kthread", this means kthread_stop() can easily
wait for kthread's exit.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Vitaliy Gusev <vgusev@openvz.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:54 -07:00
Oleg Nesterov
cdd140bdd6 kthreads: simplify the startup synchronization
We use two completions two create the kernel thread, this is a bit ugly.
kthread() wakes up create_kthread() via ->started, then create_kthread()
wakes up the caller kthread_create() via ->done.  But kthread() does not
need to wait for kthread(), it can just return.  Instead kthread() itself
can wake up the caller of kthread_create().

Kill kthread_create_info->started, ->done is enough.  This improves the
scalability a bit and sijmplifies the code.

The only problem if kernel_thread() fails, in that case create_kthread()
must do complete(&create->done).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Vitaliy Gusev <vgusev@openvz.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-18 13:03:53 -07:00
Miao Xie
58568d2a82 cpuset,mm: update tasks' mems_allowed in time
Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.

In order to update tasks' mems_allowed in time, we must modify the code of
memory policy.  Because the memory policy is applied in the process's
context originally.  After applying this patch, one task directly
manipulates anothers mems_allowed, and we use alloc_lock in the
task_struct to protect mems_allowed and memory policy of the task.

But in the fast path, we didn't use lock to protect them, because adding a
lock may lead to performance regression.  But if we don't add a lock,the
task might see no nodes when changing cpuset's mems_allowed to some
non-overlapping set.  In order to avoid it, we set all new allowed nodes,
then clear newly disallowed ones.

[lee.schermerhorn@hp.com:
  The rework of mpol_new() to extract the adjusting of the node mask to
  apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
  with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
  allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
  node mask to mpol_new_mpolicy().

  Remove the now unneeded 'nodes = NULL' from mpol_new().

  Note that mpol_new_mempolicy() is always called with a non-NULL
  'nodes' parameter now that it has been removed from mpol_new().
  Therefore, we don't need to test nodes for NULL before testing it for
  'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
  verify this assumption.]
[lee.schermerhorn@hp.com:

  I don't think the function name 'mpol_new_mempolicy' is descriptive
  enough to differentiate it from mpol_new().

  This function applies cpuset set context, usually constraining nodes
  to those allowed by the cpuset.  However, when the 'RELATIVE_NODES flag
  is set, it also translates the nodes.  So I settled on
  'mpol_set_nodemask()', because the comment block for mpol_new() mentions
  that we need to call this function to "set nodes".

  Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16 19:47:31 -07:00
Steven Rostedt
ad8d75fff8 tracing/events: move trace point headers into include/trace/events
Impact: clean up

Create a sub directory in include/trace called events to keep the
trace point headers in their own separate directory. Only headers that
declare trace points should be defined in this directory.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 22:05:43 -04:00
Steven Rostedt
a8d154b009 tracing: create automated trace defines
This patch lowers the number of places a developer must modify to add
new tracepoints. The current method to add a new tracepoint
into an existing system is to write the trace point macro in the
trace header with one of the macros TRACE_EVENT, TRACE_FORMAT or
DECLARE_TRACE, then they must add the same named item into the C file
with the macro DEFINE_TRACE(name) and then add the trace point.

This change cuts out the needing to add the DEFINE_TRACE(name).
Every file that uses the tracepoint must still include the trace/<type>.h
file, but the one C file must also add a define before the including
of that file.

 #define CREATE_TRACE_POINTS
 #include <trace/mytrace.h>

This will cause the trace/mytrace.h file to also produce the C code
necessary to implement the trace point.

Note, if more than one trace/<type>.h is used to create the C code
it is best to list them all together.

 #define CREATE_TRACE_POINTS
 #include <trace/foo.h>
 #include <trace/bar.h>
 #include <trace/fido.h>

Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with
the cleaner solution of the define above the includes over my first
design to have the C code include a "special" header.

This patch converts sched, irq and lockdep and skb to use this new
method.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-04-14 12:57:28 -04:00