* common/android-4.4-p:
Linux 4.4.293
usb: max-3421: Use driver data instead of maintaining a list of bound devices
ASoC: DAPM: Cover regression by kctl change notification fix
batman-adv: Avoid WARN_ON timing related checks
batman-adv: Don't always reallocate the fragmentation skb head
batman-adv: Reserve needed_*room for fragments
batman-adv: Consider fragmentation for needed_headroom
batman-adv: set .owner to THIS_MODULE
batman-adv: mcast: fix duplicate mcast packets from BLA backbone to mesh
batman-adv: mcast: fix duplicate mcast packets in BLA backbone from mesh
batman-adv: mcast: fix duplicate mcast packets in BLA backbone from LAN
batman-adv: Prevent duplicated softif_vlan entry
batman-adv: Fix multicast TT issues with bogus ROAM flags
batman-adv: Keep fragments equally sized
drm/amdgpu: fix set scaling mode Full/Full aspect/Center not works on vga and dvi connectors
drm/udl: fix control-message timeout
cfg80211: call cfg80211_stop_ap when switch from P2P_GO type
parisc/sticon: fix reverse colors
btrfs: fix memory ordering between normal and ordered work functions
mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
hexagon: export raw I/O routines for modules
tun: fix bonding active backup with arp monitoring
NFC: reorder the logic in nfc_{un,}register_device
NFC: reorganize the functions in nci_request
platform/x86: hp_accel: Fix an error handling path in 'lis3lv02d_probe()'
mips: bcm63xx: add support for clk_get_parent()
net: bnx2x: fix variable dereferenced before check
sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()
mips: BCM63XX: ensure that CPU_SUPPORTS_32BIT_KERNEL is set
sh: define __BIG_ENDIAN for math-emu
sh: fix kconfig unmet dependency warning for FRAME_POINTER
maple: fix wrong return value of maple_bus_init().
sh: check return code of request_irq
powerpc/dcr: Use cmplwi instead of 3-argument cmpli
ALSA: gus: fix null pointer dereference on pointer block
powerpc/5200: dts: fix memory node unit name
scsi: target: Fix alua_tg_pt_gps_count tracking
scsi: target: Fix ordered tag handling
MIPS: sni: Fix the build
tty: tty_buffer: Fix the softlockup issue in flush_to_ldisc
usb: host: ohci-tmio: check return value after calling platform_get_resource()
ARM: dts: omap: fix gpmc,mux-add-data type
scsi: advansys: Fix kernel pointer leak
usb: musb: tusb6010: check return value after calling platform_get_resource()
scsi: lpfc: Fix list_add() corruption in lpfc_drain_txq()
net: batman-adv: fix error handling
PCI/MSI: Destroy sysfs before freeing entries
parisc/entry: fix trace test in syscall exit path
PCI: Add PCI_EXP_DEVCTL_PAYLOAD_* macros
mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
ARM: 9156/1: drop cc-option fallbacks for architecture selection
USB: chipidea: fix interrupt deadlock
vsock: prevent unnecessary refcnt inc for nonblocking connect
nfc: pn533: Fix double free when pn533_fill_fragment_skbs() fails
llc: fix out-of-bound array index in llc_sk_dev_hash()
bonding: Fix a use-after-free problem when bond_sysfs_slave_add() failed
net: davinci_emac: Fix interrupt pacing disable
xen-pciback: Fix return in pm_ctrl_init()
scsi: qla2xxx: Turn off target reset during issue_lip
watchdog: f71808e_wdt: fix inaccurate report in WDIOC_GETTIMEOUT
m68k: set a default value for MEMORY_RESERVE
netfilter: nfnetlink_queue: fix OOB when mac header was cleared
dmaengine: at_xdmac: fix AT_XDMAC_CC_PERID() macro
RDMA/mlx4: Return missed an error if device doesn't support steering
scsi: csiostor: Uninitialized data in csio_ln_vnp_read_cbfn()
power: supply: rt5033_battery: Change voltage values to µV
usb: gadget: hid: fix error code in do_config()
serial: 8250_dw: Drop wrong use of ACPI_PTR()
video: fbdev: chipsfb: use memset_io() instead of memset()
memory: fsl_ifc: fix leak of irq and nand_irq in fsl_ifc_ctrl_probe
JFS: fix memleak in jfs_mount
scsi: dc395: Fix error case unwinding
ARM: s3c: irq-s3c24xx: Fix return value check for s3c24xx_init_intc()
crypto: pcrypt - Delay write to padata->info
libertas: Fix possible memory leak in probe and disconnect
libertas_tf: Fix possible memory leak in probe and disconnect
smackfs: use netlbl_cfg_cipsov4_del() for deleting cipso_v4_doi
mwifiex: Send DELBA requests according to spec
platform/x86: thinkpad_acpi: Fix bitwise vs. logical warning
net: stream: don't purge sk_error_queue in sk_stream_kill_queues()
drm/msm: uninitialized variable in msm_gem_import()
memstick: jmb38x_ms: use appropriate free function in jmb38x_ms_alloc_host()
memstick: avoid out-of-range warning
b43: fix a lower bounds test
b43legacy: fix a lower bounds test
crypto: qat - detect PFVF collision after ACK
ath9k: Fix potential interrupt storm on queue reset
cpuidle: Fix kobject memory leaks in error paths
media: si470x: Avoid card name truncation
media: dvb-usb: fix ununit-value in az6027_rc_query
parisc/kgdb: add kgdb_roundup() to make kgdb work with idle polling
parisc: fix warning in flush_tlb_all
ARM: 9136/1: ARMv7-M uses BE-8, not BE-32
ARM: clang: Do not rely on lr register for stacktrace
smackfs: use __GFP_NOFAIL for smk_cipso_doi()
iwlwifi: mvm: disable RX-diversity in powersave
PM: hibernate: Get block device exclusively in swsusp_check()
mwl8k: Fix use-after-free in mwl8k_fw_state_machine()
lib/xz: Validate the value before assigning it to an enum variable
lib/xz: Avoid overlapping memcpy() with invalid input with in-place decompression
memstick: r592: Fix a UAF bug when removing the driver
ACPI: battery: Accept charges over the design capacity as full
ath: dfs_pattern_detector: Fix possible null-pointer dereference in channel_detector_create()
tracefs: Have tracefs directories not set OTH permission bits by default
media: usb: dvd-usb: fix uninit-value bug in dibusb_read_eeprom_byte()
ACPICA: Avoid evaluating methods too early during system resume
ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK
media: mceusb: return without resubmitting URB in case of -EPROTO error.
media: s5p-mfc: fix possible null-pointer dereference in s5p_mfc_probe()
media: uvcvideo: Set capability in s_param
media: netup_unidvb: handle interrupt properly according to the firmware
media: mt9p031: Fix corrupted frame after restarting stream
x86: Increase exception stack sizes
smackfs: Fix use-after-free in netlbl_catmap_walk()
MIPS: lantiq: dma: reset correct number of channel
MIPS: lantiq: dma: add small delay after reset
platform/x86: wmi: do not fail if disabling fails
Bluetooth: fix use-after-free error in lock_sock_nested()
Bluetooth: sco: Fix lock_sock() blockage by memcpy_from_msg()
USB: iowarrior: fix control-message timeouts
USB: serial: keyspan: fix memleak on probe errors
iio: dac: ad5446: Fix ad5622_write() return value
quota: correct error number in free_dqentry()
quota: check block number when reading the block in quota file
ALSA: mixer: fix deadlock in snd_mixer_oss_set_volume
ALSA: mixer: oss: Fix racy access to slots
power: supply: max17042_battery: use VFSOC for capacity when no rsns
power: supply: max17042_battery: Prevent int underflow in set_soc_threshold
signal: Remove the bogus sigkill_pending in ptrace_stop
mwifiex: Read a PCI register after writing the TX ring write pointer
wcn36xx: Fix HT40 capability for 2Ghz band
PCI: Mark Atheros QCA6174 to avoid bus reset
ath6kl: fix control-message timeout
ath6kl: fix division by zero in send path
mwifiex: fix division by zero in fw download path
EDAC/sb_edac: Fix top-of-high-memory value for Broadwell/Haswell
hwmon: (pmbus/lm25066) Add offset coefficients
btrfs: fix lost error handling when replaying directory deletes
vmxnet3: do not stop tx queues after netif_device_detach()
spi: spl022: fix Microwire full duplex mode
xen/netfront: stop tx queues during live migration
mmc: winbond: don't build on M68K
hyperv/vmbus: include linux/bitops.h
x86/irq: Ensure PI wakeup handler is unregistered before module unload
ALSA: timer: Unconditionally unlink slave instances, too
ALSA: timer: Fix use-after-free problem
ALSA: synth: missing check for possible NULL after the call to kstrdup
ALSA: line6: fix control and interrupt message timeouts
ALSA: 6fire: fix control and bulk message timeouts
ALSA: ua101: fix division by zero at probe
media: ite-cir: IR receiver stop working after receive overflow
parisc: Fix ptrace check on syscall return
mmc: dw_mmc: Dont wait for DRTO on Write RSP error
ocfs2: fix data corruption on truncate
libata: fix read log timeout value
Input: i8042 - Add quirk for Fujitsu Lifebook T725
Input: elantench - fix misreporting trackpoint coordinates
xhci: Fix USB 3.1 enumeration issues by increasing roothub power-on-good delay
binder: use cred instead of task for selinux checks
binder: use euid from cred instead of using task
FROMGIT: binder: fix test regression due to sender_euid change
BACKPORT: binder: use cred instead of task for selinux checks
BACKPORT: binder: use euid from cred instead of using task
BACKPORT: ip_gre: add validation for csum_start
Linux 4.4.292
rsi: fix control-message timeout
staging: rtl8192u: fix control-message timeouts
staging: r8712u: fix control-message timeout
comedi: vmk80xx: fix bulk and interrupt message timeouts
comedi: vmk80xx: fix bulk-buffer overflow
comedi: vmk80xx: fix transfer-buffer overflows
staging: comedi: drivers: replace le16_to_cpu() with usb_endpoint_maxp()
comedi: ni_usb6501: fix NULL-deref in command paths
comedi: dt9812: fix DMA buffers on stack
isofs: Fix out of bound access for corrupted isofs image
usb: hso: fix error handling code of hso_create_net_device
printk/console: Allow to disable console output by using console="" or console=null
usb-storage: Add compatibility quirk flags for iODD 2531/2541
usb: gadget: Mark USB_FSL_QE broken on 64-bit
IB/qib: Protect from buffer overflow in struct qib_user_sdma_pkt fields
IB/qib: Use struct_size() helper
net: hso: register netdev later to avoid a race condition
ARM: 9120/1: Revert "amba: make use of -1 IRQs warn"
scsi: core: Put LLD module refcnt after SCSI device is released
Conflicts:
net/bluetooth/l2cap_sock.c
Change-Id: I066f7145b7245f9c95e1c78f84f0871a9825150f
782 lines
21 KiB
C
782 lines
21 KiB
C
/*
|
|
* linux/mm/oom_kill.c
|
|
*
|
|
* Copyright (C) 1998,2000 Rik van Riel
|
|
* Thanks go out to Claus Fischer for some serious inspiration and
|
|
* for goading me into coding this file...
|
|
* Copyright (C) 2010 Google, Inc.
|
|
* Rewritten by David Rientjes
|
|
*
|
|
* The routines in this file are used to kill a process when
|
|
* we're seriously out of memory. This gets called from __alloc_pages()
|
|
* in mm/page_alloc.c when we really run out of memory.
|
|
*
|
|
* Since we won't call these routines often (on a well-configured
|
|
* machine) this file will double as a 'coding guide' and a signpost
|
|
* for newbie kernel hackers. It features several pointers to major
|
|
* kernel subsystems and hints as to where to find out what things do.
|
|
*/
|
|
|
|
#include <linux/oom.h>
|
|
#include <linux/mm.h>
|
|
#include <linux/err.h>
|
|
#include <linux/gfp.h>
|
|
#include <linux/sched.h>
|
|
#include <linux/swap.h>
|
|
#include <linux/timex.h>
|
|
#include <linux/jiffies.h>
|
|
#include <linux/cpuset.h>
|
|
#include <linux/export.h>
|
|
#include <linux/notifier.h>
|
|
#include <linux/memcontrol.h>
|
|
#include <linux/mempolicy.h>
|
|
#include <linux/security.h>
|
|
#include <linux/ptrace.h>
|
|
#include <linux/freezer.h>
|
|
#include <linux/ftrace.h>
|
|
#include <linux/ratelimit.h>
|
|
|
|
#define CREATE_TRACE_POINTS
|
|
#include <trace/events/oom.h>
|
|
|
|
int sysctl_panic_on_oom;
|
|
int sysctl_oom_kill_allocating_task;
|
|
int sysctl_oom_dump_tasks = 1;
|
|
|
|
DEFINE_MUTEX(oom_lock);
|
|
|
|
#ifdef CONFIG_NUMA
|
|
/**
|
|
* has_intersects_mems_allowed() - check task eligiblity for kill
|
|
* @start: task struct of which task to consider
|
|
* @mask: nodemask passed to page allocator for mempolicy ooms
|
|
*
|
|
* Task eligibility is determined by whether or not a candidate task, @tsk,
|
|
* shares the same mempolicy nodes as current if it is bound by such a policy
|
|
* and whether or not it has the same set of allowed cpuset nodes.
|
|
*/
|
|
static bool has_intersects_mems_allowed(struct task_struct *start,
|
|
const nodemask_t *mask)
|
|
{
|
|
struct task_struct *tsk;
|
|
bool ret = false;
|
|
|
|
rcu_read_lock();
|
|
for_each_thread(start, tsk) {
|
|
if (mask) {
|
|
/*
|
|
* If this is a mempolicy constrained oom, tsk's
|
|
* cpuset is irrelevant. Only return true if its
|
|
* mempolicy intersects current, otherwise it may be
|
|
* needlessly killed.
|
|
*/
|
|
ret = mempolicy_nodemask_intersects(tsk, mask);
|
|
} else {
|
|
/*
|
|
* This is not a mempolicy constrained oom, so only
|
|
* check the mems of tsk's cpuset.
|
|
*/
|
|
ret = cpuset_mems_allowed_intersects(current, tsk);
|
|
}
|
|
if (ret)
|
|
break;
|
|
}
|
|
rcu_read_unlock();
|
|
|
|
return ret;
|
|
}
|
|
#else
|
|
static bool has_intersects_mems_allowed(struct task_struct *tsk,
|
|
const nodemask_t *mask)
|
|
{
|
|
return true;
|
|
}
|
|
#endif /* CONFIG_NUMA */
|
|
|
|
/*
|
|
* The process p may have detached its own ->mm while exiting or through
|
|
* use_mm(), but one or more of its subthreads may still have a valid
|
|
* pointer. Return p, or any of its subthreads with a valid ->mm, with
|
|
* task_lock() held.
|
|
*/
|
|
struct task_struct *find_lock_task_mm(struct task_struct *p)
|
|
{
|
|
struct task_struct *t;
|
|
|
|
rcu_read_lock();
|
|
|
|
for_each_thread(p, t) {
|
|
task_lock(t);
|
|
if (likely(t->mm))
|
|
goto found;
|
|
task_unlock(t);
|
|
}
|
|
t = NULL;
|
|
found:
|
|
rcu_read_unlock();
|
|
|
|
return t;
|
|
}
|
|
|
|
/*
|
|
* order == -1 means the oom kill is required by sysrq, otherwise only
|
|
* for display purposes.
|
|
*/
|
|
static inline bool is_sysrq_oom(struct oom_control *oc)
|
|
{
|
|
return oc->order == -1;
|
|
}
|
|
|
|
/* return true if the task is not adequate as candidate victim task. */
|
|
static bool oom_unkillable_task(struct task_struct *p,
|
|
struct mem_cgroup *memcg, const nodemask_t *nodemask)
|
|
{
|
|
if (is_global_init(p))
|
|
return true;
|
|
if (p->flags & PF_KTHREAD)
|
|
return true;
|
|
|
|
/* When mem_cgroup_out_of_memory() and p is not member of the group */
|
|
if (memcg && !task_in_mem_cgroup(p, memcg))
|
|
return true;
|
|
|
|
/* p may not have freeable memory in nodemask */
|
|
if (!has_intersects_mems_allowed(p, nodemask))
|
|
return true;
|
|
|
|
return false;
|
|
}
|
|
|
|
/**
|
|
* oom_badness - heuristic function to determine which candidate task to kill
|
|
* @p: task struct of which task we should calculate
|
|
* @totalpages: total present RAM allowed for page allocation
|
|
*
|
|
* The heuristic for determining which task to kill is made to be as simple and
|
|
* predictable as possible. The goal is to return the highest value for the
|
|
* task consuming the most memory to avoid subsequent oom failures.
|
|
*/
|
|
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
|
|
const nodemask_t *nodemask, unsigned long totalpages)
|
|
{
|
|
long points;
|
|
long adj;
|
|
|
|
if (oom_unkillable_task(p, memcg, nodemask))
|
|
return 0;
|
|
|
|
p = find_lock_task_mm(p);
|
|
if (!p)
|
|
return 0;
|
|
|
|
adj = (long)p->signal->oom_score_adj;
|
|
if (adj == OOM_SCORE_ADJ_MIN) {
|
|
task_unlock(p);
|
|
return 0;
|
|
}
|
|
|
|
/*
|
|
* The baseline for the badness score is the proportion of RAM that each
|
|
* task's rss, pagetable and swap space use.
|
|
*/
|
|
points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
|
|
atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
|
|
task_unlock(p);
|
|
|
|
/*
|
|
* Root processes get 3% bonus, just like the __vm_enough_memory()
|
|
* implementation used by LSMs.
|
|
*/
|
|
if (has_capability_noaudit(p, CAP_SYS_ADMIN))
|
|
points -= (points * 3) / 100;
|
|
|
|
/* Normalize to oom_score_adj units */
|
|
adj *= totalpages / 1000;
|
|
points += adj;
|
|
|
|
/*
|
|
* Never return 0 for an eligible task regardless of the root bonus and
|
|
* oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
|
|
*/
|
|
return points > 0 ? points : 1;
|
|
}
|
|
|
|
/*
|
|
* Determine the type of allocation constraint.
|
|
*/
|
|
#ifdef CONFIG_NUMA
|
|
static enum oom_constraint constrained_alloc(struct oom_control *oc,
|
|
unsigned long *totalpages)
|
|
{
|
|
struct zone *zone;
|
|
struct zoneref *z;
|
|
enum zone_type high_zoneidx = gfp_zone(oc->gfp_mask);
|
|
bool cpuset_limited = false;
|
|
int nid;
|
|
|
|
/* Default to all available memory */
|
|
*totalpages = totalram_pages + total_swap_pages;
|
|
|
|
if (!oc->zonelist)
|
|
return CONSTRAINT_NONE;
|
|
/*
|
|
* Reach here only when __GFP_NOFAIL is used. So, we should avoid
|
|
* to kill current.We have to random task kill in this case.
|
|
* Hopefully, CONSTRAINT_THISNODE...but no way to handle it, now.
|
|
*/
|
|
if (oc->gfp_mask & __GFP_THISNODE)
|
|
return CONSTRAINT_NONE;
|
|
|
|
/*
|
|
* This is not a __GFP_THISNODE allocation, so a truncated nodemask in
|
|
* the page allocator means a mempolicy is in effect. Cpuset policy
|
|
* is enforced in get_page_from_freelist().
|
|
*/
|
|
if (oc->nodemask &&
|
|
!nodes_subset(node_states[N_MEMORY], *oc->nodemask)) {
|
|
*totalpages = total_swap_pages;
|
|
for_each_node_mask(nid, *oc->nodemask)
|
|
*totalpages += node_spanned_pages(nid);
|
|
return CONSTRAINT_MEMORY_POLICY;
|
|
}
|
|
|
|
/* Check this allocation failure is caused by cpuset's wall function */
|
|
for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
|
|
high_zoneidx, oc->nodemask)
|
|
if (!cpuset_zone_allowed(zone, oc->gfp_mask))
|
|
cpuset_limited = true;
|
|
|
|
if (cpuset_limited) {
|
|
*totalpages = total_swap_pages;
|
|
for_each_node_mask(nid, cpuset_current_mems_allowed)
|
|
*totalpages += node_spanned_pages(nid);
|
|
return CONSTRAINT_CPUSET;
|
|
}
|
|
return CONSTRAINT_NONE;
|
|
}
|
|
#else
|
|
static enum oom_constraint constrained_alloc(struct oom_control *oc,
|
|
unsigned long *totalpages)
|
|
{
|
|
*totalpages = totalram_pages + total_swap_pages;
|
|
return CONSTRAINT_NONE;
|
|
}
|
|
#endif
|
|
|
|
enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
|
|
struct task_struct *task, unsigned long totalpages)
|
|
{
|
|
if (oom_unkillable_task(task, NULL, oc->nodemask))
|
|
return OOM_SCAN_CONTINUE;
|
|
|
|
/*
|
|
* This task already has access to memory reserves and is being killed.
|
|
* Don't allow any other task to have access to the reserves.
|
|
*/
|
|
if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
|
|
if (!is_sysrq_oom(oc))
|
|
return OOM_SCAN_ABORT;
|
|
}
|
|
if (!task->mm)
|
|
return OOM_SCAN_CONTINUE;
|
|
|
|
/*
|
|
* If task is allocating a lot of memory and has been marked to be
|
|
* killed first if it triggers an oom, then select it.
|
|
*/
|
|
if (oom_task_origin(task))
|
|
return OOM_SCAN_SELECT;
|
|
|
|
if (task_will_free_mem(task) && !is_sysrq_oom(oc))
|
|
return OOM_SCAN_ABORT;
|
|
|
|
return OOM_SCAN_OK;
|
|
}
|
|
|
|
/*
|
|
* Simple selection loop. We chose the process with the highest
|
|
* number of 'points'. Returns -1 on scan abort.
|
|
*/
|
|
static struct task_struct *select_bad_process(struct oom_control *oc,
|
|
unsigned int *ppoints, unsigned long totalpages)
|
|
{
|
|
struct task_struct *g, *p;
|
|
struct task_struct *chosen = NULL;
|
|
unsigned long chosen_points = 0;
|
|
|
|
rcu_read_lock();
|
|
for_each_process_thread(g, p) {
|
|
unsigned int points;
|
|
|
|
switch (oom_scan_process_thread(oc, p, totalpages)) {
|
|
case OOM_SCAN_SELECT:
|
|
chosen = p;
|
|
chosen_points = ULONG_MAX;
|
|
/* fall through */
|
|
case OOM_SCAN_CONTINUE:
|
|
continue;
|
|
case OOM_SCAN_ABORT:
|
|
rcu_read_unlock();
|
|
return (struct task_struct *)(-1UL);
|
|
case OOM_SCAN_OK:
|
|
break;
|
|
};
|
|
points = oom_badness(p, NULL, oc->nodemask, totalpages);
|
|
if (!points || points < chosen_points)
|
|
continue;
|
|
/* Prefer thread group leaders for display purposes */
|
|
if (points == chosen_points && thread_group_leader(chosen))
|
|
continue;
|
|
|
|
chosen = p;
|
|
chosen_points = points;
|
|
}
|
|
if (chosen)
|
|
get_task_struct(chosen);
|
|
rcu_read_unlock();
|
|
|
|
*ppoints = chosen_points * 1000 / totalpages;
|
|
return chosen;
|
|
}
|
|
|
|
/**
|
|
* dump_tasks - dump current memory state of all system tasks
|
|
* @memcg: current's memory controller, if constrained
|
|
* @nodemask: nodemask passed to page allocator for mempolicy ooms
|
|
*
|
|
* Dumps the current memory state of all eligible tasks. Tasks not in the same
|
|
* memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
|
|
* are not shown.
|
|
* State information includes task's pid, uid, tgid, vm size, rss, nr_ptes,
|
|
* swapents, oom_score_adj value, and name.
|
|
*/
|
|
void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
|
|
{
|
|
struct task_struct *p;
|
|
struct task_struct *task;
|
|
|
|
pr_info("[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name\n");
|
|
rcu_read_lock();
|
|
for_each_process(p) {
|
|
if (oom_unkillable_task(p, memcg, nodemask))
|
|
continue;
|
|
|
|
task = find_lock_task_mm(p);
|
|
if (!task) {
|
|
/*
|
|
* This is a kthread or all of p's threads have already
|
|
* detached their mm's. There's no need to report
|
|
* them; they can't be oom killed anyway.
|
|
*/
|
|
continue;
|
|
}
|
|
|
|
pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu %5hd %s\n",
|
|
task->pid, from_kuid(&init_user_ns, task_uid(task)),
|
|
task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
|
|
atomic_long_read(&task->mm->nr_ptes),
|
|
mm_nr_pmds(task->mm),
|
|
get_mm_counter(task->mm, MM_SWAPENTS),
|
|
task->signal->oom_score_adj, task->comm);
|
|
task_unlock(task);
|
|
}
|
|
rcu_read_unlock();
|
|
}
|
|
|
|
static void dump_header(struct oom_control *oc, struct task_struct *p,
|
|
struct mem_cgroup *memcg)
|
|
{
|
|
pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, oom_score_adj=%hd\n",
|
|
current->comm, oc->gfp_mask, oc->order,
|
|
current->signal->oom_score_adj);
|
|
cpuset_print_current_mems_allowed();
|
|
dump_stack();
|
|
if (memcg)
|
|
mem_cgroup_print_oom_info(memcg, p);
|
|
else
|
|
show_mem(SHOW_MEM_FILTER_NODES);
|
|
if (sysctl_oom_dump_tasks)
|
|
dump_tasks(memcg, oc->nodemask);
|
|
}
|
|
|
|
/*
|
|
* Number of OOM victims in flight
|
|
*/
|
|
static atomic_t oom_victims = ATOMIC_INIT(0);
|
|
static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
|
|
|
|
bool oom_killer_disabled __read_mostly;
|
|
|
|
/**
|
|
* mark_oom_victim - mark the given task as OOM victim
|
|
* @tsk: task to mark
|
|
*
|
|
* Has to be called with oom_lock held and never after
|
|
* oom has been disabled already.
|
|
*/
|
|
void mark_oom_victim(struct task_struct *tsk)
|
|
{
|
|
WARN_ON(oom_killer_disabled);
|
|
/* OOM killer might race with memcg OOM */
|
|
if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
|
|
return;
|
|
/*
|
|
* Make sure that the task is woken up from uninterruptible sleep
|
|
* if it is frozen because OOM killer wouldn't be able to free
|
|
* any memory and livelock. freezing_slow_path will tell the freezer
|
|
* that TIF_MEMDIE tasks should be ignored.
|
|
*/
|
|
__thaw_task(tsk);
|
|
atomic_inc(&oom_victims);
|
|
}
|
|
|
|
/**
|
|
* exit_oom_victim - note the exit of an OOM victim
|
|
*/
|
|
void exit_oom_victim(void)
|
|
{
|
|
clear_thread_flag(TIF_MEMDIE);
|
|
|
|
if (!atomic_dec_return(&oom_victims))
|
|
wake_up_all(&oom_victims_wait);
|
|
}
|
|
|
|
/**
|
|
* oom_killer_disable - disable OOM killer
|
|
*
|
|
* Forces all page allocations to fail rather than trigger OOM killer.
|
|
* Will block and wait until all OOM victims are killed.
|
|
*
|
|
* The function cannot be called when there are runnable user tasks because
|
|
* the userspace would see unexpected allocation failures as a result. Any
|
|
* new usage of this function should be consulted with MM people.
|
|
*
|
|
* Returns true if successful and false if the OOM killer cannot be
|
|
* disabled.
|
|
*/
|
|
bool oom_killer_disable(void)
|
|
{
|
|
/*
|
|
* Make sure to not race with an ongoing OOM killer
|
|
* and that the current is not the victim.
|
|
*/
|
|
mutex_lock(&oom_lock);
|
|
if (test_thread_flag(TIF_MEMDIE)) {
|
|
mutex_unlock(&oom_lock);
|
|
return false;
|
|
}
|
|
|
|
oom_killer_disabled = true;
|
|
mutex_unlock(&oom_lock);
|
|
|
|
wait_event(oom_victims_wait, !atomic_read(&oom_victims));
|
|
|
|
return true;
|
|
}
|
|
|
|
/**
|
|
* oom_killer_enable - enable OOM killer
|
|
*/
|
|
void oom_killer_enable(void)
|
|
{
|
|
oom_killer_disabled = false;
|
|
}
|
|
|
|
/*
|
|
* task->mm can be NULL if the task is the exited group leader. So to
|
|
* determine whether the task is using a particular mm, we examine all the
|
|
* task's threads: if one of those is using this mm then this task was also
|
|
* using it.
|
|
*/
|
|
static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
|
|
{
|
|
struct task_struct *t;
|
|
|
|
for_each_thread(p, t) {
|
|
struct mm_struct *t_mm = READ_ONCE(t->mm);
|
|
if (t_mm)
|
|
return t_mm == mm;
|
|
}
|
|
return false;
|
|
}
|
|
|
|
#define K(x) ((x) << (PAGE_SHIFT-10))
|
|
/*
|
|
* Must be called while holding a reference to p, which will be released upon
|
|
* returning.
|
|
*/
|
|
void oom_kill_process(struct oom_control *oc, struct task_struct *p,
|
|
unsigned int points, unsigned long totalpages,
|
|
struct mem_cgroup *memcg, const char *message)
|
|
{
|
|
struct task_struct *victim = p;
|
|
struct task_struct *child;
|
|
struct task_struct *t;
|
|
struct mm_struct *mm;
|
|
unsigned int victim_points = 0;
|
|
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
|
|
DEFAULT_RATELIMIT_BURST);
|
|
|
|
/*
|
|
* If the task is already exiting, don't alarm the sysadmin or kill
|
|
* its children or threads, just set TIF_MEMDIE so it can die quickly
|
|
*/
|
|
task_lock(p);
|
|
if (p->mm && task_will_free_mem(p)) {
|
|
mark_oom_victim(p);
|
|
task_unlock(p);
|
|
put_task_struct(p);
|
|
return;
|
|
}
|
|
task_unlock(p);
|
|
|
|
if (__ratelimit(&oom_rs))
|
|
dump_header(oc, p, memcg);
|
|
|
|
pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
|
|
message, task_pid_nr(p), p->comm, points);
|
|
|
|
/*
|
|
* If any of p's children has a different mm and is eligible for kill,
|
|
* the one with the highest oom_badness() score is sacrificed for its
|
|
* parent. This attempts to lose the minimal amount of work done while
|
|
* still freeing memory.
|
|
*/
|
|
read_lock(&tasklist_lock);
|
|
|
|
/*
|
|
* The task 'p' might have already exited before reaching here. The
|
|
* put_task_struct() will free task_struct 'p' while the loop still try
|
|
* to access the field of 'p', so, get an extra reference.
|
|
*/
|
|
get_task_struct(p);
|
|
for_each_thread(p, t) {
|
|
list_for_each_entry(child, &t->children, sibling) {
|
|
unsigned int child_points;
|
|
|
|
if (process_shares_mm(child, p->mm))
|
|
continue;
|
|
/*
|
|
* oom_badness() returns 0 if the thread is unkillable
|
|
*/
|
|
child_points = oom_badness(child, memcg, oc->nodemask,
|
|
totalpages);
|
|
if (child_points > victim_points) {
|
|
put_task_struct(victim);
|
|
victim = child;
|
|
victim_points = child_points;
|
|
get_task_struct(victim);
|
|
}
|
|
}
|
|
}
|
|
put_task_struct(p);
|
|
read_unlock(&tasklist_lock);
|
|
|
|
p = find_lock_task_mm(victim);
|
|
if (!p) {
|
|
put_task_struct(victim);
|
|
return;
|
|
} else if (victim != p) {
|
|
get_task_struct(p);
|
|
put_task_struct(victim);
|
|
victim = p;
|
|
}
|
|
|
|
/* Get a reference to safely compare mm after task_unlock(victim) */
|
|
mm = victim->mm;
|
|
atomic_inc(&mm->mm_count);
|
|
/*
|
|
* We should send SIGKILL before setting TIF_MEMDIE in order to prevent
|
|
* the OOM victim from depleting the memory reserves from the user
|
|
* space under its control.
|
|
*/
|
|
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
|
|
mark_oom_victim(victim);
|
|
pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
|
|
task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
|
|
K(get_mm_counter(victim->mm, MM_ANONPAGES)),
|
|
K(get_mm_counter(victim->mm, MM_FILEPAGES)),
|
|
K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
|
|
task_unlock(victim);
|
|
|
|
/*
|
|
* Kill all user processes sharing victim->mm in other thread groups, if
|
|
* any. They don't get access to memory reserves, though, to avoid
|
|
* depletion of all memory. This prevents mm->mmap_sem livelock when an
|
|
* oom killed thread cannot exit because it requires the semaphore and
|
|
* its contended by another thread trying to allocate memory itself.
|
|
* That thread will now get access to memory reserves since it has a
|
|
* pending fatal signal.
|
|
*/
|
|
rcu_read_lock();
|
|
for_each_process(p) {
|
|
if (!process_shares_mm(p, mm))
|
|
continue;
|
|
if (same_thread_group(p, victim))
|
|
continue;
|
|
if (unlikely(p->flags & PF_KTHREAD))
|
|
continue;
|
|
if (is_global_init(p))
|
|
continue;
|
|
if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
|
|
continue;
|
|
|
|
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
|
|
}
|
|
rcu_read_unlock();
|
|
|
|
mmdrop(mm);
|
|
put_task_struct(victim);
|
|
}
|
|
#undef K
|
|
|
|
/*
|
|
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
|
|
*/
|
|
void check_panic_on_oom(struct oom_control *oc, enum oom_constraint constraint,
|
|
struct mem_cgroup *memcg)
|
|
{
|
|
if (likely(!sysctl_panic_on_oom))
|
|
return;
|
|
if (sysctl_panic_on_oom != 2) {
|
|
/*
|
|
* panic_on_oom == 1 only affects CONSTRAINT_NONE, the kernel
|
|
* does not panic for cpuset, mempolicy, or memcg allocation
|
|
* failures.
|
|
*/
|
|
if (constraint != CONSTRAINT_NONE)
|
|
return;
|
|
}
|
|
/* Do not panic for oom kills triggered by sysrq */
|
|
if (is_sysrq_oom(oc))
|
|
return;
|
|
dump_header(oc, NULL, memcg);
|
|
panic("Out of memory: %s panic_on_oom is enabled\n",
|
|
sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
|
|
}
|
|
|
|
static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
|
|
|
|
int register_oom_notifier(struct notifier_block *nb)
|
|
{
|
|
return blocking_notifier_chain_register(&oom_notify_list, nb);
|
|
}
|
|
EXPORT_SYMBOL_GPL(register_oom_notifier);
|
|
|
|
int unregister_oom_notifier(struct notifier_block *nb)
|
|
{
|
|
return blocking_notifier_chain_unregister(&oom_notify_list, nb);
|
|
}
|
|
EXPORT_SYMBOL_GPL(unregister_oom_notifier);
|
|
|
|
/**
|
|
* out_of_memory - kill the "best" process when we run out of memory
|
|
* @oc: pointer to struct oom_control
|
|
*
|
|
* If we run out of memory, we have the choice between either
|
|
* killing a random task (bad), letting the system crash (worse)
|
|
* OR try to be smart about which process to kill. Note that we
|
|
* don't have to be perfect here, we just have to be good.
|
|
*/
|
|
bool out_of_memory(struct oom_control *oc)
|
|
{
|
|
struct task_struct *p;
|
|
unsigned long totalpages;
|
|
unsigned long freed = 0;
|
|
unsigned int uninitialized_var(points);
|
|
enum oom_constraint constraint = CONSTRAINT_NONE;
|
|
|
|
if (oom_killer_disabled)
|
|
return false;
|
|
|
|
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
|
|
if (freed > 0)
|
|
/* Got some memory back in the last second. */
|
|
return true;
|
|
|
|
/*
|
|
* If current has a pending SIGKILL or is exiting, then automatically
|
|
* select it. The goal is to allow it to allocate so that it may
|
|
* quickly exit and free its memory.
|
|
*
|
|
* But don't select if current has already released its mm and cleared
|
|
* TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
|
|
*/
|
|
if (current->mm &&
|
|
(fatal_signal_pending(current) || task_will_free_mem(current))) {
|
|
mark_oom_victim(current);
|
|
return true;
|
|
}
|
|
|
|
/*
|
|
* Check if there were limitations on the allocation (only relevant for
|
|
* NUMA) that may require different handling.
|
|
*/
|
|
constraint = constrained_alloc(oc, &totalpages);
|
|
if (constraint != CONSTRAINT_MEMORY_POLICY)
|
|
oc->nodemask = NULL;
|
|
check_panic_on_oom(oc, constraint, NULL);
|
|
|
|
if (sysctl_oom_kill_allocating_task && current->mm &&
|
|
!oom_unkillable_task(current, NULL, oc->nodemask) &&
|
|
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
|
|
get_task_struct(current);
|
|
oom_kill_process(oc, current, 0, totalpages, NULL,
|
|
"Out of memory (oom_kill_allocating_task)");
|
|
return true;
|
|
}
|
|
|
|
p = select_bad_process(oc, &points, totalpages);
|
|
/* Found nothing?!?! Either we hang forever, or we panic. */
|
|
if (!p && !is_sysrq_oom(oc)) {
|
|
dump_header(oc, NULL, NULL);
|
|
panic("Out of memory and no killable processes...\n");
|
|
}
|
|
if (p && p != (void *)-1UL) {
|
|
oom_kill_process(oc, p, points, totalpages, NULL,
|
|
"Out of memory");
|
|
/*
|
|
* Give the killed process a good chance to exit before trying
|
|
* to allocate memory again.
|
|
*/
|
|
schedule_timeout_killable(1);
|
|
}
|
|
return true;
|
|
}
|
|
|
|
/*
|
|
* The pagefault handler calls here because it is out of memory, so kill a
|
|
* memory-hogging task. If any populated zone has ZONE_OOM_LOCKED set, a
|
|
* parallel oom killing is already in progress so do nothing.
|
|
*/
|
|
void pagefault_out_of_memory(void)
|
|
{
|
|
struct oom_control oc = {
|
|
.zonelist = NULL,
|
|
.nodemask = NULL,
|
|
.gfp_mask = 0,
|
|
.order = 0,
|
|
};
|
|
|
|
if (mem_cgroup_oom_synchronize(true))
|
|
return;
|
|
|
|
if (fatal_signal_pending(current))
|
|
return;
|
|
|
|
if (!mutex_trylock(&oom_lock))
|
|
return;
|
|
|
|
if (!out_of_memory(&oc)) {
|
|
/*
|
|
* There shouldn't be any user tasks runnable while the
|
|
* OOM killer is disabled, so the current task has to
|
|
* be a racing OOM victim for which oom_killer_disable()
|
|
* is waiting for.
|
|
*/
|
|
WARN_ON(test_thread_flag(TIF_MEMDIE));
|
|
}
|
|
|
|
mutex_unlock(&oom_lock);
|
|
}
|