The page allocator uses tsk_is_oom_victim() to determine when to
fast-path memory allocations in order to get an allocating process out
of the page allocator and into do_exit() quickly. Unfortunately,
tsk_is_oom_victim()'s check to see if a process is killed for OOM
purposes is to look for the presence of an OOM reaper artifact that only
the OOM killer sets. This means that for processes killed by Simple LMK,
there is no fast-pathing done in the page allocator to get them to die
faster.
Remedy this by changing tsk_is_oom_victim() to look for the existence of
the TIF_MEMDIE flag, which Simple LMK sets for its victims.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
* refs/heads/tmp-7b43d74:
Revert USB changes
Linux 4.14.206
powercap: restrict energy meter to root access
Linux 4.14.205
arm64: dts: marvell: espressobin: add ethernet alias
PM: runtime: Resume the device earlier in __device_release_driver()
Revert "ARC: entry: fix potential EFA clobber when TIF_SYSCALL_TRACE"
ARC: stack unwinding: avoid indefinite looping
usb: mtu3: fix panic in mtu3_gadget_stop()
USB: Add NO_LPM quirk for Kingston flash drive
USB: serial: option: add Telit FN980 composition 0x1055
USB: serial: option: add LE910Cx compositions 0x1203, 0x1230, 0x1231
USB: serial: option: add Quectel EC200T module support
USB: serial: cyberjack: fix write-URB completion race
serial: txx9: add missing platform_driver_unregister() on error in serial_txx9_init
serial: 8250_mtk: Fix uart_get_baud_rate warning
fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parent
vt: Disable KD_FONT_OP_COPY
ACPI: NFIT: Fix comparison to '-ENXIO'
drm/vc4: drv: Add error handding for bind
vsock: use ns_capable_noaudit() on socket create
scsi: core: Don't start concurrent async scan on same host
blk-cgroup: Pre-allocate tree node on blkg_conf_prep
blk-cgroup: Fix memleak on error path
of: Fix reserved-memory overlap detection
x86/kexec: Use up-to-dated screen_info copy to fill boot params
ARM: dts: sun4i-a10: fix cpu_alert temperature
futex: Handle transient "ownerless" rtmutex state correctly
tracing: Fix out of bounds write in get_trace_buf
ftrace: Handle tracing when switching between context
ftrace: Fix recursion check for NMI test
gfs2: Wake up when sd_glock_disposal becomes zero
mm: always have io_remap_pfn_range() set pgprot_decrypted()
kthread_worker: prevent queuing delayed work from timer_fn when it is being canceled
lib/crc32test: remove extra local_irq_disable/enable
ALSA: usb-audio: Add implicit feedback quirk for Qu-16
Fonts: Replace discarded const qualifier
i40e: Memory leak in i40e_config_iwarp_qvlist
i40e: Fix of memory leak and integer truncation in i40e_virtchnl.c
i40e: Wrong truncation from u16 to u8
i40e: add num_vectors checker in iwarp handler
i40e: Fix a potential NULL pointer dereference
blktrace: fix debugfs use after free
Blktrace: bail out early if block debugfs is not configured
sfp: Fix error handing in sfp_probe()
sctp: Fix COMM_LOST/CANT_STR_ASSOC err reporting on big-endian platforms
net: usb: qmi_wwan: add Telit LE910Cx 0x1230 composition
gianfar: Account for Tx PTP timestamp in the skb headroom
gianfar: Replace skb_realloc_headroom with skb_cow_head for PTP
tipc: fix use-after-free in tipc_bcast_get_mode
xen/events: don't use chip_data for legacy IRQs
drm/i915: Break up error capture compression loops with cond_resched()
ANDROID: Temporarily disable XFRM_USER_COMPAT filtering
Linux 4.14.204
staging: octeon: Drop on uncorrectable alignment or FCS error
staging: octeon: repair "fixed-link" support
staging: comedi: cb_pcidas: Allow 2-channel commands for AO subdevice
KVM: arm64: Fix AArch32 handling of DBGD{CCINT,SCRext} and DBGVCR
device property: Don't clear secondary pointer for shared primary firmware node
device property: Keep secondary firmware node secondary by type
ARM: s3c24xx: fix missing system reset
ARM: samsung: fix PM debug build with DEBUG_LL but !MMU
arm: dts: mt7623: add missing pause for switchport
hil/parisc: Disable HIL driver when it gets stuck
cachefiles: Handle readpage error correctly
arm64: berlin: Select DW_APB_TIMER_OF
tty: make FONTX ioctl use the tty pointer they were actually passed
rtc: rx8010: don't modify the global rtc ops
drm/ttm: fix eviction valuable range check.
ext4: fix invalid inode checksum
ext4: fix error handling code in add_new_gdb
ext4: fix leaking sysfs kobject after failed mount
vringh: fix __vringh_iov() when riov and wiov are different
ring-buffer: Return 0 on success from ring_buffer_resize()
9P: Cast to loff_t before multiplying
libceph: clear con->out_msg on Policy::stateful_server faults
ceph: promote to unsigned long long before shifting
drm/amdgpu: don't map BO in reserved region
ia64: fix build error with !COREDUMP
ubi: check kthread_should_stop() after the setting of task state
perf python scripting: Fix printable strings in python3 scripts
ubifs: dent: Fix some potential memory leaks while iterating entries
NFSD: Add missing NFSv2 .pc_func methods
NFSv4.2: support EXCHGID4_FLAG_SUPP_FENCE_OPS 4.2 EXCHANGE_ID flag
powerpc/powernv/elog: Fix race while processing OPAL error log event.
powerpc: Warn about use of smt_snooze_delay
powerpc/rtas: Restrict RTAS requests from userspace
s390/stp: add locking to sysfs functions
iio:gyro:itg3200: Fix timestamp alignment and prevent data leak.
iio:adc:ti-adc12138 Fix alignment issue with timestamp
iio:adc:ti-adc0832 Fix alignment issue with timestamp
iio:light:si1145: Fix timestamp alignment and prevent data leak.
dmaengine: dma-jz4780: Fix race in jz4780_dma_tx_status
vt: keyboard, extend func_buf_lock to readers
vt: keyboard, simplify vt_kdgkbsent
drm/i915: Force VT'd workarounds when running as a guest OS
usb: host: fsl-mph-dr-of: check return of dma_set_mask()
usb: cdc-acm: fix cooldown mechanism
usb: dwc3: core: don't trigger runtime pm when remove driver
usb: dwc3: core: add phy cleanup for probe error handling
usb: dwc3: ep0: Fix ZLP for OUT ep0 requests
btrfs: fix use-after-free on readahead extent after failure to create it
btrfs: cleanup cow block on error
btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send()
btrfs: send, recompute reference path after orphanization of a directory
btrfs: reschedule if necessary when logging directory items
scsi: mptfusion: Fix null pointer dereferences in mptscsih_remove()
w1: mxc_w1: Fix timeout resolution problem leading to bus error
acpi-cpufreq: Honor _PSD table setting on new AMD CPUs
ACPI: debug: don't allow debugging when ACPI is disabled
ACPI: video: use ACPI backlight for HP 635 Notebook
ACPI / extlog: Check for RDMSR failure
NFS: fix nfs_path in case of a rename retry
fs: Don't invalidate page buffers in block_write_full_page()
leds: bcm6328, bcm6358: use devres LED registering function
perf/x86/amd/ibs: Fix raw sample data accumulation
perf/x86/amd/ibs: Don't include randomized bits in get_ibs_op_count()
md/raid5: fix oops during stripe resizing
nvme-rdma: fix crash when connect rejected
sgl_alloc_order: fix memory leak
nbd: make the config put is called before the notifying the waiter
ARM: dts: s5pv210: remove dedicated 'audio-subsystem' node
ARM: dts: s5pv210: move PMU node out of clock controller
ARM: dts: s5pv210: remove DMA controller bus node name to fix dtschema warnings
memory: emif: Remove bogus debugfs error handling
arm64: dts: renesas: ulcb: add full-pwr-cycle-in-suspend into eMMC nodes
gfs2: add validation checks for size of superblock
ext4: Detect already used quota file early
drivers: watchdog: rdc321x_wdt: Fix race condition bugs
net: 9p: initialize sun_server.sun_path to have addr's value only when addr is valid
clk: ti: clockdomain: fix static checker warning
bnxt_en: Log unknown link speed appropriately.
md/bitmap: md_bitmap_get_counter returns wrong blocks
power: supply: test_power: add missing newlines when printing parameters by sysfs
bus/fsl_mc: Do not rely on caller to provide non NULL mc_io
drivers/net/wan/hdlc_fr: Correctly handle special skb->protocol values
ACPI: Add out of bounds and numa_off protections to pxm_to_node()
arm64/mm: return cpu_all_mask when node is NUMA_NO_NODE
uio: free uio id after uio file node is freed
USB: adutux: fix debugging
cpufreq: sti-cpufreq: add stih418 support
kgdb: Make "kgdbcon" work properly with "kgdb_earlycon"
printk: reduce LOG_BUF_SHIFT range for H8300
drm/bridge/synopsys: dsi: add support for non-continuous HS clock
mmc: via-sdmmc: Fix data race bug
media: tw5864: check status of tw5864_frameinterval_get
usb: typec: tcpm: During PR_SWAP, source caps should be sent only after tSwapSourceStart
media: platform: Improve queue set up flow for bug fixing
media: videodev2.h: RGB BT2020 and HSV are always full range
drm/brige/megachips: Add checking if ge_b850v3_lvds_init() is working correctly
ath10k: fix VHT NSS calculation when STBC is enabled
ath10k: start recovery process when payload length exceeds max htc length for sdio
video: fbdev: pvr2fb: initialize variables
xfs: fix realtime bitmap/summary file truncation when growing rt volume
ARM: 8997/2: hw_breakpoint: Handle inexact watchpoint addresses
um: change sigio_spinlock to a mutex
f2fs: fix to check segment boundary during SIT page readahead
f2fs: add trace exit in exception path
sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
powerpc/powernv/smp: Fix spurious DBG() warning
futex: Fix incorrect should_fail_futex() handling
mlxsw: core: Fix use-after-free in mlxsw_emad_trans_finish()
x86/unwind/orc: Fix inactive tasks with stack pointer in %sp on GCC 10 compiled kernels
fscrypt: return -EXDEV for incompatible rename or link into encrypted dir
ata: sata_rcar: Fix DMA boundary mask
mtd: lpddr: Fix bad logic in print_drs_error
p54: avoid accessing the data mapped to streaming DMA
fuse: fix page dereference after free
x86/xen: disable Firmware First mode for correctable memory errors
arch/x86/amd/ibs: Fix re-arming IBS Fetch
tipc: fix memory leak caused by tipc_buf_append()
ravb: Fix bit fields checking in ravb_hwtstamp_get()
gtp: fix an use-before-init in gtp_newlink()
efivarfs: Replace invalid slashes with exclamation marks in dentries.
arm64: link with -z norelro regardless of CONFIG_RELOCATABLE
scripts/setlocalversion: make git describe output more reliable
UPSTREAM: mm/sl[uo]b: export __kmalloc_track(_node)_caller
BACKPORT: xfrm/compat: Translate 32-bit user_policy from sockptr
BACKPORT: xfrm/compat: Add 32=>64-bit messages translator
UPSTREAM: xfrm/compat: Attach xfrm dumps to 64=>32 bit translator
UPSTREAM: xfrm/compat: Add 64=>32-bit messages translator
BACKPORT: xfrm: Provide API to register translator module
ANDROID: Publish uncompressed Image on aarch64
Linux 4.14.203
powerpc/powernv/opal-dump : Use IRQ_HANDLED instead of numbers in interrupt handler
usb: gadget: f_ncm: allow using NCM in SuperSpeed Plus gadgets.
eeprom: at25: set minimum read/write access stride to 1
USB: cdc-wdm: Make wdm_flush() interruptible and add wdm_fsync().
usb: cdc-acm: add quirk to blacklist ETAS ES58X devices
tty: serial: fsl_lpuart: fix lpuart32_poll_get_char
net: korina: cast KSEG0 address to pointer in kfree
ath10k: check idx validity in __ath10k_htt_rx_ring_fill_n()
scsi: ufs: ufs-qcom: Fix race conditions caused by ufs_qcom_testbus_config()
usb: core: Solve race condition in anchor cleanup functions
brcm80211: fix possible memleak in brcmf_proto_msgbuf_attach
mwifiex: don't call del_timer_sync() on uninitialized timer
reiserfs: Fix memory leak in reiserfs_parse_options()
ipvs: Fix uninit-value in do_ip_vs_set_ctl()
tty: ipwireless: fix error handling
scsi: qedi: Fix list_del corruption while removing active I/O
scsi: qedi: Protect active command list to avoid list corruption
Fix use after free in get_capset_info callback.
rtl8xxxu: prevent potential memory leak
brcmsmac: fix memory leak in wlc_phy_attach_lcnphy
scsi: ibmvfc: Fix error return in ibmvfc_probe()
Bluetooth: Only mark socket zapped after unlocking
usb: ohci: Default to per-port over-current protection
xfs: make sure the rt allocator doesn't run off the end
reiserfs: only call unlock_new_inode() if I_NEW
misc: rtsx: Fix memory leak in rtsx_pci_probe
ath9k: hif_usb: fix race condition between usb_get_urb() and usb_kill_anchored_urbs()
can: flexcan: flexcan_chip_stop(): add error handling and propagate error value
USB: cdc-acm: handle broken union descriptors
udf: Avoid accessing uninitialized data on failed inode read
udf: Limit sparing table size
usb: gadget: function: printer: fix use-after-free in __lock_acquire
misc: vop: add round_up(x,4) for vring_size to avoid kernel panic
mic: vop: copy data to kernel space then write to io memory
scsi: target: core: Add CONTROL field for trace events
scsi: mvumi: Fix error return in mvumi_io_attach()
PM: hibernate: remove the bogus call to get_gendisk() in software_resume()
mac80211: handle lack of sband->bitrates in rates
ntfs: add check for mft record size in superblock
media: venus: core: Fix runtime PM imbalance in venus_probe
fs: dlm: fix configfs memory leak
media: saa7134: avoid a shift overflow
mmc: sdio: Check for CISTPL_VERS_1 buffer size
media: uvcvideo: Ensure all probed info is returned to v4l2
media: media/pci: prevent memory leak in bttv_probe
media: bdisp: Fix runtime PM imbalance on error
media: platform: sti: hva: Fix runtime PM imbalance on error
media: platform: s3c-camif: Fix runtime PM imbalance on error
media: vsp1: Fix runtime PM imbalance on error
media: exynos4-is: Fix a reference count leak
media: exynos4-is: Fix a reference count leak due to pm_runtime_get_sync
media: exynos4-is: Fix several reference count leaks due to pm_runtime_get_sync
media: sti: Fix reference count leaks
media: st-delta: Fix reference count leak in delta_run_work
media: ati_remote: sanity check for both endpoints
media: firewire: fix memory leak
crypto: ccp - fix error handling
i2c: core: Restore acpi_walk_dep_device_list() getting called after registering the ACPI i2c devs
perf: correct SNOOPX field offset
NTB: hw: amd: fix an issue about leak system resources
nvmet: fix uninitialized work for zero kato
powerpc/powernv/dump: Fix race while processing OPAL dump
arm64: dts: zynqmp: Remove additional compatible string for i2c IPs
ARM: dts: owl-s500: Fix incorrect PPI interrupt specifiers
arm64: dts: qcom: msm8916: Fix MDP/DSI interrupts
memory: fsl-corenet-cf: Fix handling of platform_get_irq() error
memory: omap-gpmc: Fix a couple off by ones
KVM: x86: emulating RDPID failure shall return #UD rather than #GP
Input: sun4i-ps2 - fix handling of platform_get_irq() error
Input: twl4030_keypad - fix handling of platform_get_irq() error
Input: omap4-keypad - fix handling of platform_get_irq() error
Input: ep93xx_keypad - fix handling of platform_get_irq() error
Input: stmfts - fix a & vs && typo
Input: imx6ul_tsc - clean up some errors in imx6ul_tsc_resume()
vfio iommu type1: Fix memory leak in vfio_iommu_type1_pin_pages
vfio/pci: Clear token on bypass registration failure
ext4: limit entries returned when counting fsmap records
clk: bcm2835: add missing release if devm_clk_hw_register fails
clk: at91: clk-main: update key before writing AT91_CKGR_MOR
PCI: iproc: Set affinity mask on MSI interrupts
i2c: rcar: Auto select RESET_CONTROLLER
mailbox: avoid timer start from callback
rapidio: fix the missed put_device() for rio_mport_add_riodev
rapidio: fix error handling path
ramfs: fix nommu mmap with gaps in the page cache
lib/crc32.c: fix trivial typo in preprocessor condition
f2fs: wait for sysfs kobject removal before freeing f2fs_sb_info
IB/rdmavt: Fix sizeof mismatch
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
powerpc/perf/hv-gpci: Fix starting index value
powerpc/perf: Exclude pmc5/6 from the irrelevant PMU group constraints
overflow: Include header file with SIZE_MAX declaration
kdb: Fix pager search for multi-line strings
RDMA/hns: Set the unsupported wr opcode
perf intel-pt: Fix "context_switch event has no tid" error
powerpc/tau: Disable TAU between measurements
powerpc/tau: Remove duplicated set_thresholds() call
powerpc/tau: Use appropriate temperature sample interval
RDMA/qedr: Fix use of uninitialized field
xfs: limit entries returned when counting fsmap records
arc: plat-hsdk: fix kconfig dependency warning when !RESET_CONTROLLER
ARM: 9007/1: l2c: fix prefetch bits init in L2X0_AUX_CTRL using DT values
mtd: mtdoops: Don't write panic data twice
mtd: lpddr: fix excessive stack usage with clang
powerpc/icp-hv: Fix missing of_node_put() in success path
powerpc/pseries: Fix missing of_node_put() in rng_init()
IB/mlx4: Adjust delayed work when a dup is observed
IB/mlx4: Fix starvation in paravirt mux/demux
mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary
mm/memcg: fix device private memcg accounting
net: korina: fix kfree of rx/tx descriptor array
mwifiex: fix double free
scsi: be2iscsi: Fix a theoretical leak in beiscsi_create_eqs()
usb: dwc2: Fix INTR OUT transfers in DDMA mode.
nl80211: fix non-split wiphy information
usb: gadget: u_ether: enable qmult on SuperSpeed Plus as well
usb: gadget: f_ncm: fix ncm_bitrate for SuperSpeed and above.
iwlwifi: mvm: split a print to avoid a WARNING in ROC
mfd: sm501: Fix leaks in probe()
net: enic: Cure the enic api locking trainwreck
qtnfmac: fix resource leaks on unsupported iftype error return path
HID: hid-input: fix stylus battery reporting
quota: clear padding in v2r1_mem2diskdqb()
usb: dwc2: Fix parameter type in function pointer prototype
ALSA: seq: oss: Avoid mutex lock for a long-time ioctl
misc: mic: scif: Fix error handling path
ath6kl: wmi: prevent a shift wrapping bug in ath6kl_wmi_delete_pstream_cmd()
pinctrl: mcp23s08: Fix mcp23x17 precious range
pinctrl: mcp23s08: Fix mcp23x17_regmap initialiser
HID: roccat: add bounds checking in kone_sysfs_write_settings()
video: fbdev: sis: fix null ptr dereference
video: fbdev: vga16fb: fix setting of pixclock because a pass-by-value error
drivers/virt/fsl_hypervisor: Fix error handling path
pwm: lpss: Add range limit check for the base_unit register value
pwm: lpss: Fix off by one error in base_unit math in pwm_lpss_prepare()
pty: do tty_flip_buffer_push without port->lock in pty_write
tty: hvcs: Don't NULL tty->driver_data until hvcs_cleanup()
tty: serial: earlycon dependency
VMCI: check return value of get_user_pages_fast() for errors
backlight: sky81452-backlight: Fix refcount imbalance on error
scsi: csiostor: Fix wrong return value in csio_hw_prep_fw()
scsi: qla4xxx: Fix an error handling path in 'qla4xxx_get_host_stats()'
drm/gma500: fix error check
mwifiex: Do not use GFP_KERNEL in atomic context
brcmfmac: check ndev pointer
ASoC: qcom: lpass-cpu: fix concurrency issue
ASoC: qcom: lpass-platform: fix memory leak
wcn36xx: Fix reported 802.11n rx_highest rate wcn3660/wcn3680
ath9k: Fix potential out of bounds in ath9k_htc_txcompletion_cb()
ath6kl: prevent potential array overflow in ath6kl_add_new_sta()
Bluetooth: hci_uart: Cancel init work before unregistering
ath10k: provide survey info as accumulated data
regulator: resolve supply after creating regulator
media: ti-vpe: Fix a missing check and reference count leak
media: s5p-mfc: Fix a reference count leak
media: platform: fcp: Fix a reference count leak.
media: tc358743: initialize variable
media: mx2_emmaprp: Fix memleak in emmaprp_probe
cypto: mediatek - fix leaks in mtk_desc_ring_alloc
crypto: omap-sham - fix digcnt register handling with export/import
media: omap3isp: Fix memleak in isp_probe
media: uvcvideo: Set media controller entity functions
media: m5mols: Check function pointer in m5mols_sensor_power
media: Revert "media: exynos4-is: Add missed check for pinctrl_lookup_state()"
media: tuner-simple: fix regression in simple_set_radio_freq
crypto: ixp4xx - Fix the size used in a 'dma_free_coherent()' call
crypto: mediatek - Fix wrong return value in mtk_desc_ring_alloc()
crypto: algif_skcipher - EBUSY on aio should be an error
drivers/perf: xgene_pmu: Fix uninitialized resource struct
x86/fpu: Allow multiple bits in clearcpuid= parameter
EDAC/i5100: Fix error handling order in i5100_init_one()
crypto: algif_aead - Do not set MAY_BACKLOG on the async path
ima: Don't ignore errors from crypto_shash_update()
KVM: SVM: Initialize prev_ga_tag before use
KVM: x86/mmu: Commit zap of remaining invalid pages when recovering lpages
cifs: Return the error from crypt_message when enc/dec key not found.
cifs: remove bogus debug code
icmp: randomize the global rate limiter
tcp: fix to update snd_wl1 in bulk receiver fast path
nfc: Ensure presence of NFC_ATTR_FIRMWARE_NAME attribute in nfc_genl_fw_download()
net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup
net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device
ALSA: bebob: potential info leak in hwdep_read()
binder: fix UAF when releasing todo list
r8169: fix data corruption issue on RTL8402
net/ipv4: always honour route mtu during forwarding
tipc: fix the skb_unshare() in tipc_buf_append()
net: usb: qmi_wwan: add Cellient MPL200 card
mlx4: handle non-napi callers to napi_poll
ipv4: Restore flowi4_oif update before call to xfrm_lookup_route
ibmveth: Identify ingress large send packets.
ibmveth: Switch order of ibmveth_helper calls.
ANDROID: GKI: Enable CONFIG_X86_X2APIC
UPSTREAM: binder: fix UAF when releasing todo list
Conflicts:
drivers/mailbox/mailbox.c
drivers/scsi/ufs/ufs-qcom.c
Change-Id: Ic48321a27b8984cb769a9f11b5ce3d62efeb78fa
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
[ Upstream commit 67197a4f28d28d0b073ab0427b03cb2ee5382578 ]
Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm. This is done for any task with more that one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.
Android updates oom_score_adj whenever a tasks changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important. Such operation can happen frequently. We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression. Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us. Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.
Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes. To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex. Its scope is changed to global.
The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
the case of vfork(). To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified. Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare. Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.
With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.
[surenb@google.com: v3]
Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com
Fixes: 44a70adec9 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray <timmurray@google.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: Adrian Reber <areber@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
Debugged-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Free the pages parallely for a task that is LMK killed using the
oom_reaper. This freeing of pages will help to give the pages to
buddy system well advance there by we may achieve less number of
killings by LMK.
Change-Id: I5e1ed183437ab243f12cbbf3ae10d9ca5211fc06
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
Free the pages parallely for a task that receives SIGKILL using the
oom_reaper. This freeing of pages will help to give the pages to buddy
system well advance.
This reaps for the process which received SIGKILL through
either sys_kill from user or kill_pid from kernel and that sending
process has CAP_KILL capability.
Also sysctl interface, reap_mem_on_sigkill, is added to turn on/off this
feature.
Change-Id: I21adb95de5e380a80d7eb0b87d9b5b553f52e28a
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
commit 27ae357fa82be5ab73b2ef8d39dcb8ca2563483a upstream.
Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.
This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma. Since munlock_vma_pages_range() depends
on clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.
This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd
is zapped by the oom reaper during follow_page_mask() after the check
for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a
kernel oops.
Fix this by manually freeing all possible memory from the mm before
doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not
run on the mm anymore so the munlock is safe to do in exit_mmap(). It
also matches the logic that the oom reaper currently uses for
determining when to set MMF_OOM_SKIP itself, so there's no new risk of
excessive oom killing.
This issue fixes CVE-2018-1000200.
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com
Fixes: 2129258024 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Allow other functions to dump the list of tasks.
Useful for when debugging memory leaks.
Change-Id: I76c33a118a9765b4c2276e8c76de36399c78dbf6
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
commit 4837fe37adff1d159904f0c013471b1ecbcb455e upstream.
David Rientjes has reported the following memory corruption while the
oom reaper tries to unmap the victims address space
BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000
addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d
file: (null) fault: (null) mmap: (null) readpage: (null)
CPU: 2 PID: 1001 Comm: oom_reaper
Call Trace:
unmap_page_range+0x1068/0x1130
__oom_reap_task_mm+0xd5/0x16b
oom_reaper+0xff/0x14c
kthread+0xc1/0xe0
Tetsuo Handa has noticed that the synchronization inside exit_mmap is
insufficient. We only synchronize with the oom reaper if
tsk_is_oom_victim which is not true if the final __mmput is called from
a different context than the oom victim exit path. This can trivially
happen from context of any task which has grabbed mm reference (e.g. to
read /proc/<pid>/ file which requires mm etc.).
The race would look like this
oom_reaper oom_victim task
mmget_not_zero
do_exit
mmput
__oom_reap_task_mm mmput
__mmput
exit_mmap
remove_vma
unmap_page_range
Fix this issue by providing a new mm_is_oom_victim() helper which
operates on the mm struct rather than a task. Any context which
operates on a remote mm struct should use this helper in place of
tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
so it is stable in the exit_mmap path.
Debugged by Tetsuo Handa.
Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
Fixes: 2129258024 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Wenwei Tao has noticed that our current assumption that the oom victim
is dying and never doing any visible changes after it dies, and so the
oom_reaper can tear it down, is not entirely true.
__task_will_free_mem consider a task dying when SIGNAL_GROUP_EXIT is set
but do_group_exit sends SIGKILL to all threads _after_ the flag is set.
So there is a race window when some threads won't have
fatal_signal_pending while the oom_reaper could start unmapping the
address space. Moreover some paths might not check for fatal signals
before each PF/g-u-p/copy_from_user.
We already have a protection for oom_reaper vs. PF races by checking
MMF_UNSTABLE. This has been, however, checked only for kernel threads
(use_mm users) which can outlive the oom victim. A simple fix would be
to extend the current check in handle_mm_fault for all tasks but that
wouldn't be sufficient because the current check assumes that a kernel
thread would bail out after EFAULT from get_user*/copy_from_user and
never re-read the same address which would succeed because the PF path
has established page tables already. This seems to be the case for the
only existing use_mm user currently (virtio driver) but it is rather
fragile in general.
This is even more fragile in general for more complex paths such as
generic_perform_write which can re-read the same address more times
(e.g. iov_iter_copy_from_user_atomic to fail and then
iov_iter_fault_in_readable on retry).
Therefore we have to implement MMF_UNSTABLE protection in a robust way
and never make a potentially corrupted content visible. That requires
to hook deeper into the PF path and check for the flag _every time_
before a pte for anonymous memory is established (that means all
!VM_SHARED mappings).
The corruption can be triggered artificially
(http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp)
but there doesn't seem to be any real life bug report. The race window
should be quite tight to trigger most of the time.
Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
Fixes: aac4536355 ("mm, oom: introduce oom reaper")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 7407054209 ("oom, suspend: fix oom_reaper vs.
oom_killer_disable race") has workaround an existing race between
oom_killer_disable and oom_reaper by adding another round of
try_to_freeze_tasks after the oom killer was disabled. This was the
easiest thing to do for a late 4.7 fix. Let's fix it properly now.
After "oom: keep mm of the killed task available" we no longer have to
call exit_oom_victim from the oom reaper because we have stable mm
available and hide the oom_reaped mm by MMF_OOM_SKIP flag. So let's
remove exit_oom_victim and the race described in the above commit
doesn't exist anymore if.
Unfortunately this alone is not sufficient for the oom_killer_disable
usecase because now we do not have any reliable way to reach
exit_oom_victim (the victim might get stuck on a way to exit for an
unbounded amount of time). OOM killer can cope with that by checking mm
flags and move on to another victim but we cannot do the same for
oom_killer_disable as we would lose the guarantee of no further
interference of the victim with the rest of the system. What we can do
instead is to cap the maximum time the oom_killer_disable waits for
victims. The only current user of this function (pm suspend) already
has a concept of timeout for back off so we can reuse the same value
there.
Let's drop set_freezable for the oom_reaper kthread because it is no
longer needed as the reaper doesn't wake or thaw any processes.
Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When selecting an oom victim, we use the same heuristic for both memory
cgroup and global oom. The only difference is the scope of tasks to
select the victim from. So we could just export an iterator over all
memcg tasks and keep all oom related logic in oom_kill.c, but instead we
duplicate pieces of it in memcontrol.c reusing some initially private
functions of oom_kill.c in order to not duplicate all of it. That looks
ugly and error prone, because any modification of select_bad_process
should also be propagated to mem_cgroup_out_of_memory.
Let's rework this as follows: keep all oom heuristic related code private
to oom_kill.c and make oom_kill.c use exported memcg functions when it's
really necessary (like in case of iterating over memcg tasks).
Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_will_free_mem is rather weak. It doesn't really tell whether the
task has chance to drop its mm. 98748bd722 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step into making
it more robust for multi-threaded applications so now we know that the
whole process is going down and probably drop the mm.
This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel
use_mm().
Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper because
task_will_free_mem implies the task is reapable now. Therefore all paths
which bypass the oom killer are now reapable and so they shouldn't lock up
the oom killer.
Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the size of "struct signal_struct"->oom_flags member is
sizeof(unsigned) bytes, but only one flag OOM_FLAG_ORIGIN which is
updated by current thread is defined. We can convert OOM_FLAG_ORIGIN
into a bool, and reuse the saved bytes for updating from the OOM killer
and/or the OOM reaper thread.
By the way, do we care about a race window between run_store() and
swapoff() because it would be theoretically possible that two threads
sharing the "struct signal_struct" concurrently call respective
functions? If we care, we can make oom_flags an atomic_t.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_will_free_mem is a misnomer for a more complex PF_EXITING test for
early break out from the oom killer because it is believed that such a
task would release its memory shortly and so we do not have to select an
oom victim and perform a disruptive action.
Currently we make sure that the given task is not participating in the
core dumping because it might get blocked for a long time - see commit
d003f371b2 ("oom: don't assume that a coredumping thread will exit
soon").
The check can still do better though. We shouldn't consider the task
unless the whole thread group is going down. This is rather unlikely
but not impossible. A single exiting thread would surely leave all the
address space behind. If we are really unlucky it might get stuck on
the exit path and keep its TIF_MEMDIE and so block the oom killer.
Link: http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If either the current task is already killed or PF_EXITING or a selected
task is PF_EXITING then the oom killer is suppressed and so is the oom
reaper. This patch adds try_oom_reaper which checks the given task and
queues it for the oom reaper if that is safe to be done meaning that the
task doesn't share the mm with an alive process.
This might help to release the memory pressure while the task tries to
exit.
[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Daniel Vetter <daniel.vetter@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.
The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everyhing reclaimable has been torn down already.
This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now but further steps should make sure that mmap_sem for
write should be blocked killable which will help to reduce such a lock
contention. This is not done by this patch.
Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The force_kill member of struct oom_control isn't needed if an order of -1
is used instead. This is the same as order == -1 in struct
compact_control which requires full memory compaction.
This patch introduces no functional change.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are essential elements to an oom context that are passed around to
multiple functions.
Organize these elements into a new struct, struct oom_control, that
specifies the context for an oom condition.
This patch introduces no functional change.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zonelist locking and the oom_sem are two overlapping locks that are
used to serialize global OOM killing against different things.
The historical zonelist locking serializes OOM kills from allocations with
overlapping zonelists against each other to prevent killing more tasks
than necessary in the same memory domain. Only when neither tasklists nor
zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
bound to separate nodes) are OOM kills allowed to execute in parallel.
The younger oom_sem is a read-write lock to serialize OOM killing against
the PM code trying to disable the OOM killer altogether.
However, the OOM killer is a fairly cold error path, there is really no
reason to optimize for highly performant and concurrent OOM kills. And
the oom_sem is just flat-out redundant.
Replace both locking schemes with a single global mutex serializing OOM
kills regardless of context.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
are related in functionality, but the interface is not symmetrical at
all: one is an internal OOM killer function used during the killing, the
other is for an OOM victim to signal its own death on exit later on.
This has locking implications, see follow-up changes.
While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
is easier on the eye.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kernel panics due to oom, caused by a cgroup reaching its limit, when
'compulsory panic_on_oom' is enabled, then we will only see that the OOM
happened because of "compulsory panic_on_oom is enabled" but this doesn't
tell the difference between mempolicy and memcg. And dumping system wide
information is plain wrong and more confusing. This patch provides the
information of the cgroup whose limit triggerred panic
Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 5695be142e ("OOM, PM: OOM killed task shouldn't escape PM
suspend") has left a race window when OOM killer manages to
note_oom_kill after freeze_processes checks the counter. The race
window is quite small and really unlikely and partial solution deemed
sufficient at the time of submission.
Tejun wasn't happy about this partial solution though and insisted on a
full solution. That requires the full OOM and freezer's task freezing
exclusion, though. This is done by this patch which introduces oom_sem
RW lock and turns oom_killer_disable() into a full OOM barrier.
oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.
oom_killer_disable() takes oom_sem for writing so it waits for all
currently running OOM killer invocations. Then it disable all the further
OOMs by setting oom_killer_disabled and checks for any oom victims.
Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The
last victim wakes up all waiters enqueued by oom_killer_disable().
Therefore this function acts as the full OOM barrier.
The page fault path is covered now as well although it was assumed to be
safe before. As per Tejun, "We used to have freezing points deep in file
system code which may be reacheable from page fault." so it would be
better and more robust to not rely on freezing points here. Same applies
to the memcg OOM killer.
out_of_memory tells the caller whether the OOM was allowed to trigger and
the callers are supposed to handle the situation. The page allocation
path simply fails the allocation same as before. The page fault path will
retry the fault (more on that later) and Sysrq OOM trigger will simply
complain to the log.
Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because
- the victim cannot loop in the page fault handler (it would die
on the way out from the exception)
- it cannot loop in the page allocator because all the further
allocation would fail and __GFP_NOFAIL allocations are not
acceptable at this stage
- it shouldn't be blocked on any locks held by frozen tasks
(try_to_freeze expects lockless context) and kernel threads and
work queues are not frozen yet
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset addresses a race which was described in the changelog for
5695be142e ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen. But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situtation. In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen. This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.
The original patch hasn't closed the race window completely because that
would require a more complex solution as it can be seen by this patchset.
The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_. As Tejun pointed out, even though the race
condition is unlikely the harder it would be to debug weird bugs deep in
the PM freezer when the debugging options are reduced considerably. I can
only speculate what might happen when a task is still runnable
unexpectedly.
On a plus side and as a side effect the oom enable/disable has a better
(full barrier) semantic without polluting hot paths.
I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
echo processors > /sys/power/pm_test
while true
do
echo mem > /sys/power/state
sleep 1s
done
- simple module which allocates and frees 20M in 8K chunks. If it sees
freezing(current) then it tries another round of allocation before calling
try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
I think this should be OK because __thaw_task shouldn't interfere with any
locking down wake_up_process. Oleg?
As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled. I wasn't able to catch a race where
oom_killer_disable would really have to wait but I kinda expected the race
is really unlikely.
[ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[ 243.636072] (elapsed 2.837 seconds) done.
[ 243.641985] Trying to disable OOM killer
[ 243.643032] Waiting for concurent OOM victims
[ 243.644342] OOM killer disabled
[ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[ 243.652983] Suspending console(s) (use no_console_suspend to debug)
[ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[ 243.992600] PM: suspend of devices complete after 336.667 msecs
[ 243.993264] PM: late suspend of devices complete after 0.660 msecs
[ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[ 243.994717] ACPI: Preparing to enter system sleep state S3
[ 243.994795] PM: Saving platform NVS memory
[ 243.994796] Disabling non-boot CPUs ...
The first 2 patches are simple cleanups for OOM. They should go in
regardless the rest IMO.
Patches 3 and 4 are trivial printk -> pr_info conversion and they should
go in ditto.
The main patch is the last one and I would appreciate acks from Tejun and
Rafael. I think the OOM part should be OK (except for __thaw_task vs.
task_lock where a look from Oleg would appreciated) but I am not so sure I
haven't screwed anything in the freezer code. I have found several
surprises there.
This patch (of 5):
This patch is just a preparatory and it doesn't introduce any functional
change.
Note:
I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent from new killing. This is
just a side effect of the flag. The primary meaning is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The OOM killing invocation does a lot of duplicative checks against the
task's allocation context. Rework it to take advantage of the existing
checks in the allocator slowpath.
The OOM killer is invoked when the allocator is unable to reclaim any
pages but the allocation has to keep looping. Instead of having a check
for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM
invocation to the true branch of should_alloc_retry(). The __GFP_FS
check from oom_gfp_allowed() can then be moved into the OOM avoidance
branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test.
__alloc_pages_may_oom() can then signal to the caller whether the OOM
killer was invoked, instead of requiring it to duplicate the order and
high_zoneidx checks to guess this when deciding whether to continue.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oom_kill.c assumes that PF_EXITING task should exit and free the memory
soon. This is wrong in many ways and one important case is the coredump.
A task can sleep in exit_mm() "forever" while the coredumping sub-thread
can need more memory.
Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
we add the new trivial helper for that.
Note: this is only the first step, this patch doesn't try to solve other
problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
participate in coredump after it was already observed in PF_EXITING state,
so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
And even the name/usage of the new helper is confusing, an exiting thread
can only free its ->mm if it is the only/last task in thread group.
[akpm@linux-foundation.org: add comment]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.
Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.
Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.
Changes since v1
- push the re-check loop out of freeze_processes into
check_frozen_processes and invert the condition to make the code more
readable as per Rafael
Fixes: f660daac47 (oom: thaw threads if oom killed thread is frozen before deferring)
Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
to imply that they require locking semantics to avoid out_of_memory()
being reordered.
zone_scan_lock is required for both functions to ensure that there is
proper locking synchronization.
Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper
locking semantics.
At the same time, convert oom_zonelist_trylock() to return bool instead
of int since only success and failure are tested.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
specify that current should be killed first if an oom condition occurs in
between the two calls.
The usage is
short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
...
compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
to store the thread's oom_score_adj, temporarily change it to the maximum
score possible, and then restore the old value if it is still the same.
This happens to still be racy, however, if the user writes
OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
The compare_swap_oom_score_adj() will then incorrectly reset the old value
prior to the write of OOM_SCORE_ADJ_MAX.
To fix this, introduce a new oom_flags_t member in struct signal_struct
that will be used for per-thread oom killer flags. KSM and swapoff can
now use a bit in this member to specify that threads should be killed
first in oom conditions without playing around with oom_score_adj.
This also allows the correct oom_score_adj to always be shown when reading
/proc/pid/oom_score.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The global oom killer is serialized by the per-zonelist
try_set_zonelist_oom() which is used in the page allocator. Concurrent
oom kills are thus a rare event and only occur in systems using
mempolicies and with a large number of nodes.
Memory controller oom kills, however, can frequently be concurrent since
there is no serialization once the oom killer is called for oom conditions
in several different memcgs in parallel.
This creates a massive contention on tasklist_lock since the oom killer
requires the readside for the tasklist iteration. If several memcgs are
calling the oom killer, this lock can be held for a substantial amount of
time, especially if threads continue to enter it as other threads are
exiting.
Since the exit path grabs the writeside of the lock with irqs disabled in
a few different places, this can cause a soft lockup on cpus as a result
of tasklist_lock starvation.
The kernel lacks unfair writelocks, and successful calls to the oom killer
usually result in at least one thread entering the exit path, so an
alternative solution is needed.
This patch introduces a seperate oom handler for memcgs so that they do
not require tasklist_lock for as much time. Instead, it iterates only
over the threads attached to the oom memcg and grabs a reference to the
selected thread before calling oom_kill_process() to ensure it doesn't
prematurely exit.
This still requires tasklist_lock for the tasklist dump, iterating
children of the selected process, and killing all other threads on the
system sharing the same memory as the selected victim. So while this
isn't a complete solution to tasklist_lock starvation, it significantly
reduces the amount of time that it is held.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Sha Zhengju <handai.szj@taobao.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom_score_adj scale ranges from -1000 to 1000 and represents the
proportion of memory available to the process at allocation time. This
means an oom_score_adj value of 300, for example, will bias a process as
though it was using an extra 30.0% of available memory and a value of
-350 will discount 35.0% of available memory from its usage.
The oom killer badness heuristic also uses this scale to report the oom
score for each eligible process in determining the "best" process to
kill. Thus, it can only differentiate each process's memory usage by
0.1% of system RAM.
On large systems, this can end up being a large amount of memory: 256MB
on 256GB systems, for example.
This can be fixed by having the badness heuristic to use the actual
memory usage in scoring threads and then normalizing it to the
oom_score_adj scale for userspace. This results in better comparison
between eligible threads for kill and no change from the userspace
perspective.
Suggested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The oom killer chooses not to kill a thread if:
- an eligible thread has already been oom killed and has yet to exit,
and
- an eligible thread is exiting but has yet to free all its memory and
is not the thread attempting to currently allocate memory.
SysRq+F manually invokes the global oom killer to kill a memory-hogging
task. This is normally done as a last resort to free memory when no
progress is being made or to test the oom killer itself.
For both uses, we always want to kill a thread and never defer. This
patch causes SysRq+F to always kill an eligible thread and can be used to
force a kill even if another oom killed thread has failed to exit.
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
test_set_oom_score_adj() was introduced in 72788c3856 ("oom: replace
PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
current's oom_score_adj for ksm and swapoff without requiring an
additional per-process flag.
Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
then reinstate the previous value is racy since it's possible that
userspace can set the value to something else itself before the old value
is reinstated. That results in userspace setting current's oom_score_adj
to a different value and then the kernel immediately setting it back to
its previous value without notification.
To fix this, a new compare_swap_oom_score_adj() function is introduced
with the same semantics as the compare and swap CAS instruction, or
CMPXCHG on x86. It is used to reinstate the previous value of
oom_score_adj if and only if the present value is the same as the old
value.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The badness() function in the oom killer was renamed to oom_badness() in
a63d83f427 ("oom: badness heuristic rewrite") since it is a globally
exported function for clarity.
The prototype for the old function still existed in linux/oom.h, so remove
it. There are no existing users.
Also fixes documentation and comment references to badness() and adjusts
them accordingly.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>