DISCLAIMER:
=====================================================================
This patch is intended to go upstream after collecting feedback from
Android community that it resolves the issues reported by various
partners. It is not meant to be merged into android-mainline.
=====================================================================
uclamp_max effectiveness could be easily impacted by small transient
tasks that wake up frequency to do small work then go back to sleep.
If there's a busy task that is capped by uclamp_max to run at a smaller
frequency, due to max-aggregation rule tasks that wake up on the same
cpu will increase the rq->uclamp_max value if they were higher than the
capped task. Given that all tasks by default have a uclamp_max = 1024,
this is the likely case by default.
Note that since the capped task is likely to be a busy and throttled
one, its util, and hence the rq->util, will be very high and as soon as
we lift the capping the requested frequency will be very high.
To address this issue of increasing the resilience of uclamp_max against
these transient tasks that don't really need to run at a higher
frequency, we implement a simple filter mechanism to ignore uclamp_max
for those tasks.
The algorithm looks at the runtime of the task and compares it to
sched_slice(). By default we assume any task that its runtime is 1/4th
of sched_slice() or less is a small transient task that we can ignore
its uclamp_max requirement.
runtime < sched_slice() / divider
We can tweak the divider by
/proc/sys/kernel/sched_util_uclamp_max_filter_divider sysctl. It accepts
values 0-4.
divider = 1 << sched_util_uclamp_max_filter_divider
We add a new task_tick_uclamp() function to verify this condition
periodically and ensure the conditions checked at wake up are still true
- in case this transient task suddenly becomes a busy one.
For EAS, we can't use sched_slice() there to figure out if uclamp_max
will be ignored because the task is not enqueued yet. So we leave it
as-is to figure out the placement based on worst case scenario.
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: Ie3afa93a7d70dab5b7c22e820cc078ffd0e891ef
[yaro: ported to msm-5.4 and remove sysctl parts for now]
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
commit 0e25498f8cd43c1b5aa327f373dd094e9a006da7 upstream.
There are two big uses of do_exit. The first is it's design use to be
the guts of the exit(2) system call. The second use is to terminate
a task after something catastrophic has happened like a NULL pointer
in kernel code.
Add a function make_task_dead that is initialy exactly the same as
do_exit to cover the cases where do_exit is called to handle
catastrophic failure. In time this can probably be reduced to just a
light wrapper around do_task_dead. For now keep it exactly the same so
that there will be no behavioral differences introducing this new
concept.
Replace all of the uses of do_exit that use it for catastraphic
task cleanup with make_task_dead to make it clear what the code
is doing.
As part of this rename rewind_stack_do_exit
rewind_stack_and_make_dead.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
RT tasks by default run at the highest capacity/performance level. When
uclamp is selected this default behavior is retained by enforcing the
requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
value.
This is also referred to as 'the default boost value of RT tasks'.
See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").
On battery powered devices, it is desired to control this default
(currently hardcoded) behavior at runtime to reduce energy consumed by
RT tasks.
For example, a mobile device manufacturer where big.LITTLE architecture
is dominant, the performance of the little cores varies across SoCs, and
on high end ones the big cores could be too power hungry.
Given the diversity of SoCs, the new knob allows manufactures to tune
the best performance/power for RT tasks for the particular hardware they
run on.
They could opt to further tune the value when the user selects
a different power saving mode or when the device is actively charging.
The runtime aspect of it further helps in creating a single kernel image
that can be run on multiple devices that require different tuning.
Keep in mind that a lot of RT tasks in the system are created by the
kernel. On Android for instance I can see over 50 RT tasks, only
a handful of which created by the Android framework.
To control the default behavior globally by system admins and device
integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
to change the default boost value of the RT tasks.
I anticipate this to be mostly in the form of modifying the init script
of a particular device.
To avoid polluting the fast path with unnecessary code, the approach
taken is to synchronously do the update by traversing all the existing
tasks in the system. This could race with a concurrent fork(), which is
dealt with by introducing sched_post_fork() function which will ensure
the racy fork will get the right update applied.
Tested on Juno-r2 in combination with the RT capacity awareness [1].
By default an RT task will go to the highest capacity CPU and run at the
maximum frequency, which is particularly energy inefficient on high end
mobile devices because the biggest core[s] are 'huge' and power hungry.
With this patch the RT task can be controlled to run anywhere by
default, and doesn't cause the frequency to be maximum all the time.
Yet any task that really needs to be boosted can easily escape this
default behavior by modifying its requested uclamp.min value
(p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
[1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
(cherry picked from commit 13685c4a08fca9dd76bf53bfcbadc044ab2a08cb)
Conflicts:
kernel/fork.c
kernel/sysctl.c
Upstream has commit 5a5cf5cb30d7 ("cgroup: refactor fork helpers") and
further commit ef2c41cf38a7 ("clone3: allow spawning processes into
cgroups") which affect the calls after this. Picking the first would
be easy but the 2nd would be much bigger. Also, my cherry-pick put my
sysctl in the wrong place in the table in sysctl.c, so I manually
moved it. Weird.
BUG=b:160171130
TEST=With series rt tasks don't get boosted
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Change-Id: I678d8ee899ecfbe0a1f0bb94da85d54fff924a57
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2340433
Reviewed-by: Joel Fernandes <joelaf@google.com>
Tasks without a user-defined clamp value are considered not clamped
and by default their utilization can have any value in the
[0..SCHED_CAPACITY_SCALE] range.
Tasks with a user-defined clamp value are allowed to request any value
in that range, and the required clamp is unconditionally enforced.
However, a "System Management Software" could be interested in limiting
the range of clamp values allowed for all tasks.
Add a privileged interface to define a system default configuration via:
/proc/sys/kernel/sched_uclamp_util_{min,max}
which works as an unconditional clamp range restriction for all tasks.
With the default configuration, the full SCHED_CAPACITY_SCALE range of
values is allowed for each clamp index. Otherwise, the task-specific
clamp is capped by the corresponding system default value.
Do that by tracking, for each task, the "effective" clamp value and
bucket the task has been refcounted in at enqueue time. This
allows to lazy aggregate "requested" and "system default" values at
enqueue time and simplifies refcounting updates at dequeue time.
The cached bucket ids are used to avoid (relatively) more expensive
integer divisions every time a task is enqueued.
An active flag is used to report when the "effective" value is valid and
thus the task is actually refcounted in the corresponding rq's bucket.
Bug: 120440300
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit e8f14172c6b11e9a86c65532497087f8eb0f91b1)
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Change-Id: I4f014c5ec9c312aaad606518f6e205fd0cfbcaa2
Signed-off-by: Quentin Perret <qperret@google.com>
sched_avg stats enabled for SMP, but it is used only in WALT.
So move sched_avg under WALT.
Change-Id: I19b558833e12b80f02fea27c9a9fc8b7630d8689
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
[ Upstream commit bf2c59fce4074e55d622089b34be3a6bc95484fb ]
In the CPU-offline process, it calls mmdrop() after idle entry and the
subsequent call to cpuhp_report_idle_dead(). Once execution passes the
call to rcu_report_dead(), RCU is ignoring the CPU, which results in
lockdep complaining when mmdrop() uses RCU from either memcg or
debugobjects below.
Fix it by cleaning up the active_mm state from BP instead. Every arch
which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
from AP. The only exception is parisc because it switches them to
&init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
but the patch will still work there because it calls mmgrab(&init_mm) in
smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
WARNING: suspicious RCU usage
-----------------------------
kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
other info that might help us debug this:
RCU used illegally from offline CPU!
Call Trace:
dump_stack+0xf4/0x164 (unreliable)
lockdep_rcu_suspicious+0x140/0x164
get_work_pool+0x110/0x150
__queue_work+0x1bc/0xca0
queue_work_on+0x114/0x120
css_release+0x9c/0xc0
percpu_ref_put_many+0x204/0x230
free_pcp_prepare+0x264/0x570
free_unref_page+0x38/0xf0
__mmdrop+0x21c/0x2c0
idle_task_exit+0x170/0x1b0
pnv_smp_cpu_kill_self+0x38/0x2e0
cpu_die+0x48/0x64
arch_cpu_idle_dead+0x30/0x50
do_idle+0x2f4/0x470
cpu_startup_entry+0x38/0x40
start_secondary+0x7a8/0xa80
start_secondary_resume+0x10/0x14
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
xNombre: Android modifies some scheduler parameters on boot.
Applying these manually resulted in better hackbench performance.
Signed-off-by: Juhyung Park <qkrwngud825@gmail.com>
Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com>
To identify certain apps which request max cpu freq to affine its
tasks to specific cpus, besides checking its lib name, task name is
also a factor that we can identify the suspcious task.
Test: build and test the 'perfect kick 2' game.
Bug: 163293825
Bug: 161324271
Change-Id: I4359859db743b4c9122e9df40af0b109370e8f1f
Signed-off-by: Jimmy Shiu <jimmyshiu@google.com>
This change is for general scheduler improvement.
Change-Id: I50d41aa3338803cbd45ff6314b2bb3978c59282b
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Certain userspace applications, to achieve max performance, affines its
threads to cpus that run the fastest. This is not always the
correct strategy. For e.g. in certain architectures all the
cores have the same max freq but few of them have a bigger
cache. Affining to the cpus that have bigger cache is advantageous
but such an application would end up affining them to all the cores.
Similarly if an architecture has just one cpu that runs at max freq,
it ends up crowding all its thread on that single core, which is
detrimental for performance.
To address this issue, we need to detect a suspicious looking affinity
request from userspace and check if it links in a particular library.
The latter can easily be detected by traversing executable vm areas
that map a file and checking for that library name.
When such a affinity request is found, change it to use a proper
affinity. The suspicious affinity request, the proper affinity request
and the library name can be configured by the userspace.
Change-Id: I6bb8c310ca54c03261cc721f28dfd6023ab5591a
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
With the introduction of placement hint patch, boosted tasks will not
scheduled from big cores. We tune capacity margin to let important
boosted tasks get scheduled on big cores. However, the capacity margin
affects all group of tasks, so that non-boosted tasks get more chances
to be scheduled on big cores, too. This could be solved by separating
capacity margin for boosted tasks.
Bug: 147785606
Test: margin set correctly
Signed-off-by: Rick Yiu <rickyiu@google.com>
Change-Id: I2b02e138e36a6844afbc1ade60fe86a001814b30
Energy aware feature control is previously done through debugfs,
which will be deprecated, so move the control to sysctl.
Bug: 141333728
Test: function works as expected
Change-Id: I55411d3bb2669ba1fae3225d67cdf1cf8b3b3a7f
Signed-off-by: Rick Yiu <rickyiu@google.com>
Changes in 4.14.256
xhci: Fix USB 3.1 enumeration issues by increasing roothub power-on-good delay
binder: use euid from cred instead of using task
binder: use cred instead of task for selinux checks
Input: elantench - fix misreporting trackpoint coordinates
Input: i8042 - Add quirk for Fujitsu Lifebook T725
libata: fix read log timeout value
ocfs2: fix data corruption on truncate
mmc: dw_mmc: Dont wait for DRTO on Write RSP error
parisc: Fix ptrace check on syscall return
tpm: Check for integer overflow in tpm2_map_response_body()
media: ite-cir: IR receiver stop working after receive overflow
ALSA: ua101: fix division by zero at probe
ALSA: 6fire: fix control and bulk message timeouts
ALSA: line6: fix control and interrupt message timeouts
ALSA: synth: missing check for possible NULL after the call to kstrdup
ALSA: timer: Fix use-after-free problem
ALSA: timer: Unconditionally unlink slave instances, too
x86/irq: Ensure PI wakeup handler is unregistered before module unload
cavium: Return negative value when pci_alloc_irq_vectors() fails
scsi: qla2xxx: Fix unmap of already freed sgl
cavium: Fix return values of the probe function
sfc: Don't use netif_info before net_device setup
hyperv/vmbus: include linux/bitops.h
mmc: winbond: don't build on M68K
bpf: Prevent increasing bpf_jit_limit above max
xen/netfront: stop tx queues during live migration
spi: spl022: fix Microwire full duplex mode
watchdog: Fix OMAP watchdog early handling
vmxnet3: do not stop tx queues after netif_device_detach()
btrfs: fix lost error handling when replaying directory deletes
hwmon: (pmbus/lm25066) Add offset coefficients
regulator: s5m8767: do not use reset value as DVS voltage if GPIO DVS is disabled
regulator: dt-bindings: samsung,s5m8767: correct s5m8767,pmic-buck-default-dvs-idx property
EDAC/sb_edac: Fix top-of-high-memory value for Broadwell/Haswell
mwifiex: fix division by zero in fw download path
ath6kl: fix division by zero in send path
ath6kl: fix control-message timeout
ath10k: fix control-message timeout
ath10k: fix division by zero in send path
PCI: Mark Atheros QCA6174 to avoid bus reset
rtl8187: fix control-message timeouts
evm: mark evm_fixmode as __ro_after_init
wcn36xx: Fix HT40 capability for 2Ghz band
mwifiex: Read a PCI register after writing the TX ring write pointer
libata: fix checking of DMA state
wcn36xx: handle connection loss indication
RDMA/qedr: Fix NULL deref for query_qp on the GSI QP
signal: Remove the bogus sigkill_pending in ptrace_stop
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
power: supply: max17042_battery: Prevent int underflow in set_soc_threshold
power: supply: max17042_battery: use VFSOC for capacity when no rsns
powerpc/85xx: Fix oops when mpc85xx_smp_guts_ids node cannot be found
serial: core: Fix initializing and restoring termios speed
ALSA: mixer: oss: Fix racy access to slots
ALSA: mixer: fix deadlock in snd_mixer_oss_set_volume
xen/balloon: add late_initcall_sync() for initial ballooning done
PCI: aardvark: Do not clear status bits of masked interrupts
PCI: aardvark: Do not unmask unused interrupts
PCI: aardvark: Fix return value of MSI domain .alloc() method
PCI: aardvark: Read all 16-bits from PCIE_MSI_PAYLOAD_REG
quota: check block number when reading the block in quota file
quota: correct error number in free_dqentry()
pinctrl: core: fix possible memory leak in pinctrl_enable()
iio: dac: ad5446: Fix ad5622_write() return value
USB: serial: keyspan: fix memleak on probe errors
USB: iowarrior: fix control-message timeouts
Bluetooth: sco: Fix lock_sock() blockage by memcpy_from_msg()
Bluetooth: fix use-after-free error in lock_sock_nested()
platform/x86: wmi: do not fail if disabling fails
MIPS: lantiq: dma: add small delay after reset
MIPS: lantiq: dma: reset correct number of channel
locking/lockdep: Avoid RCU-induced noinstr fail
smackfs: Fix use-after-free in netlbl_catmap_walk()
x86: Increase exception stack sizes
mwifiex: Run SET_BSS_MODE when changing from P2P to STATION vif-type
mwifiex: Properly initialize private structure on interface type changes
media: mt9p031: Fix corrupted frame after restarting stream
media: netup_unidvb: handle interrupt properly according to the firmware
media: uvcvideo: Set capability in s_param
media: s5p-mfc: fix possible null-pointer dereference in s5p_mfc_probe()
media: s5p-mfc: Add checking to s5p_mfc_probe().
media: mceusb: return without resubmitting URB in case of -EPROTO error.
ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK
ACPICA: Avoid evaluating methods too early during system resume
media: usb: dvd-usb: fix uninit-value bug in dibusb_read_eeprom_byte()
tracefs: Have tracefs directories not set OTH permission bits by default
ath: dfs_pattern_detector: Fix possible null-pointer dereference in channel_detector_create()
ACPI: battery: Accept charges over the design capacity as full
leaking_addresses: Always print a trailing newline
memstick: r592: Fix a UAF bug when removing the driver
lib/xz: Avoid overlapping memcpy() with invalid input with in-place decompression
lib/xz: Validate the value before assigning it to an enum variable
tracing/cfi: Fix cmp_entries_* functions signature mismatch
mwl8k: Fix use-after-free in mwl8k_fw_state_machine()
PM: hibernate: Get block device exclusively in swsusp_check()
iwlwifi: mvm: disable RX-diversity in powersave
smackfs: use __GFP_NOFAIL for smk_cipso_doi()
ARM: clang: Do not rely on lr register for stacktrace
gre/sit: Don't generate link-local addr if addr_gen_mode is IN6_ADDR_GEN_MODE_NONE
ARM: 9136/1: ARMv7-M uses BE-8, not BE-32
spi: bcm-qspi: Fix missing clk_disable_unprepare() on error in bcm_qspi_probe()
parisc: fix warning in flush_tlb_all
task_stack: Fix end_of_stack() for architectures with upwards-growing stack
parisc/kgdb: add kgdb_roundup() to make kgdb work with idle polling
cgroup: Make rebind_subsystems() disable v2 controllers all at once
media: dvb-usb: fix ununit-value in az6027_rc_query
media: mtk-vpu: Fix a resource leak in the error handling path of 'mtk_vpu_probe()'
media: si470x: Avoid card name truncation
media: cx23885: Fix snd_card_free call on null card pointer
cpuidle: Fix kobject memory leaks in error paths
ath9k: Fix potential interrupt storm on queue reset
crypto: qat - detect PFVF collision after ACK
crypto: qat - disregard spurious PFVF interrupts
hwrng: mtk - Force runtime pm ops for sleep ops
b43legacy: fix a lower bounds test
b43: fix a lower bounds test
memstick: avoid out-of-range warning
memstick: jmb38x_ms: use appropriate free function in jmb38x_ms_alloc_host()
hwmon: Fix possible memleak in __hwmon_device_register()
ath10k: fix max antenna gain unit
drm/msm: uninitialized variable in msm_gem_import()
net: stream: don't purge sk_error_queue in sk_stream_kill_queues()
mmc: mxs-mmc: disable regulator on error and in the remove function
platform/x86: thinkpad_acpi: Fix bitwise vs. logical warning
mwifiex: Send DELBA requests according to spec
phy: micrel: ksz8041nl: do not use power down mode
PM: hibernate: fix sparse warnings
smackfs: use netlbl_cfg_cipsov4_del() for deleting cipso_v4_doi
s390/gmap: don't unconditionally call pte_unmap_unlock() in __gmap_zap()
irq: mips: avoid nested irq_enter()
samples/kretprobes: Fix return value if register_kretprobe() failed
libertas_tf: Fix possible memory leak in probe and disconnect
libertas: Fix possible memory leak in probe and disconnect
net: amd-xgbe: Toggle PLL settings during rate change
net: phylink: avoid mvneta warning when setting pause parameters
crypto: pcrypt - Delay write to padata->info
ibmvnic: Process crqs after enabling interrupts
RDMA/rxe: Fix wrong port_cap_flags
ARM: s3c: irq-s3c24xx: Fix return value check for s3c24xx_init_intc()
ARM: dts: at91: tse850: the emac<->phy interface is rmii
scsi: dc395: Fix error case unwinding
MIPS: loongson64: make CPU_LOONGSON64 depends on MIPS_FP_SUPPORT
JFS: fix memleak in jfs_mount
ALSA: hda: Reduce udelay() at SKL+ position reporting
arm: dts: omap3-gta04a4: accelerometer irq fix
soc/tegra: Fix an error handling path in tegra_powergate_power_up()
memory: fsl_ifc: fix leak of irq and nand_irq in fsl_ifc_ctrl_probe
video: fbdev: chipsfb: use memset_io() instead of memset()
serial: 8250_dw: Drop wrong use of ACPI_PTR()
usb: gadget: hid: fix error code in do_config()
power: supply: rt5033_battery: Change voltage values to µV
scsi: csiostor: Uninitialized data in csio_ln_vnp_read_cbfn()
RDMA/mlx4: Return missed an error if device doesn't support steering
ASoC: cs42l42: Correct some register default values
ASoC: cs42l42: Defer probe if request_threaded_irq() returns EPROBE_DEFER
serial: xilinx_uartps: Fix race condition causing stuck TX
mips: cm: Convert to bitfield API to fix out-of-bounds access
power: supply: bq27xxx: Fix kernel crash on IRQ handler register error
apparmor: fix error check
rpmsg: Fix rpmsg_create_ept return when RPMSG config is not defined
pnfs/flexfiles: Fix misplaced barrier in nfs4_ff_layout_prepare_ds
drm/plane-helper: fix uninitialized variable reference
PCI: aardvark: Don't spam about PIO Response Status
NFS: Fix deadlocks in nfs_scan_commit_list()
fs: orangefs: fix error return code of orangefs_revalidate_lookup()
mtd: spi-nor: hisi-sfc: Remove excessive clk_disable_unprepare()
dmaengine: at_xdmac: fix AT_XDMAC_CC_PERID() macro
auxdisplay: img-ascii-lcd: Fix lock-up when displaying empty string
auxdisplay: ht16k33: Connect backlight to fbdev
auxdisplay: ht16k33: Fix frame buffer device blanking
netfilter: nfnetlink_queue: fix OOB when mac header was cleared
dmaengine: dmaengine_desc_callback_valid(): Check for `callback_result`
m68k: set a default value for MEMORY_RESERVE
watchdog: f71808e_wdt: fix inaccurate report in WDIOC_GETTIMEOUT
ar7: fix kernel builds for compiler test
scsi: qla2xxx: Turn off target reset during issue_lip
i2c: xlr: Fix a resource leak in the error handling path of 'xlr_i2c_probe()'
xen-pciback: Fix return in pm_ctrl_init()
net: davinci_emac: Fix interrupt pacing disable
ACPI: PMIC: Fix intel_pmic_regs_handler() read accesses
bonding: Fix a use-after-free problem when bond_sysfs_slave_add() failed
mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration()
llc: fix out-of-bound array index in llc_sk_dev_hash()
nfc: pn533: Fix double free when pn533_fill_fragment_skbs() fails
vsock: prevent unnecessary refcnt inc for nonblocking connect
USB: chipidea: fix interrupt deadlock
ARM: 9155/1: fix early early_iounmap()
ARM: 9156/1: drop cc-option fallbacks for architecture selection
powerpc/lib: Add helper to check if offset is within conditional branch range
powerpc/bpf: Validate branch ranges
powerpc/bpf: Fix BPF_SUB when imm == 0x80000000
mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
mm, oom: do not trigger out_of_memory from the #PF
s390/cio: check the subchannel validity for dev_busid
PCI: Add PCI_EXP_DEVCTL_PAYLOAD_* macros
ext4: fix lazy initialization next schedule time computation in more granular unit
tracing: Resize tgid_map to pid_max, not PID_MAX_DEFAULT
parisc/entry: fix trace test in syscall exit path
PCI/MSI: Destroy sysfs before freeing entries
arm64: zynqmp: Fix serial compatible string
scsi: lpfc: Fix list_add() corruption in lpfc_drain_txq()
usb: musb: tusb6010: check return value after calling platform_get_resource()
scsi: advansys: Fix kernel pointer leak
ARM: dts: omap: fix gpmc,mux-add-data type
usb: host: ohci-tmio: check return value after calling platform_get_resource()
tty: tty_buffer: Fix the softlockup issue in flush_to_ldisc
MIPS: sni: Fix the build
scsi: target: Fix ordered tag handling
scsi: target: Fix alua_tg_pt_gps_count tracking
powerpc/5200: dts: fix memory node unit name
ALSA: gus: fix null pointer dereference on pointer block
powerpc/dcr: Use cmplwi instead of 3-argument cmpli
sh: check return code of request_irq
maple: fix wrong return value of maple_bus_init().
sh: fix kconfig unmet dependency warning for FRAME_POINTER
sh: define __BIG_ENDIAN for math-emu
mips: BCM63XX: ensure that CPU_SUPPORTS_32BIT_KERNEL is set
sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()
net: bnx2x: fix variable dereferenced before check
iavf: Fix for the false positive ASQ/ARQ errors while issuing VF reset
MIPS: generic/yamon-dt: fix uninitialized variable error
mips: bcm63xx: add support for clk_get_parent()
mips: lantiq: add support for clk_get_parent()
platform/x86: hp_accel: Fix an error handling path in 'lis3lv02d_probe()'
net: virtio_net_hdr_to_skb: count transport header in UFO
i40e: Fix NULL ptr dereference on VSI filter sync
NFC: reorganize the functions in nci_request
NFC: reorder the logic in nfc_{un,}register_device
perf/x86/intel/uncore: Fix filter_tid mask for CHA events on Skylake Server
perf/x86/intel/uncore: Fix IIO event constraints for Skylake Server
tun: fix bonding active backup with arp monitoring
hexagon: export raw I/O routines for modules
mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag
btrfs: fix memory ordering between normal and ordered work functions
parisc/sticon: fix reverse colors
cfg80211: call cfg80211_stop_ap when switch from P2P_GO type
drm/udl: fix control-message timeout
drm/amdgpu: fix set scaling mode Full/Full aspect/Center not works on vga and dvi connectors
perf/core: Avoid put_page() when GUP fails
batman-adv: mcast: fix duplicate mcast packets in BLA backbone from LAN
batman-adv: mcast: fix duplicate mcast packets from BLA backbone to mesh
batman-adv: Consider fragmentation for needed_headroom
batman-adv: Reserve needed_*room for fragments
batman-adv: Don't always reallocate the fragmentation skb head
RDMA/netlink: Add __maybe_unused to static inline in C file
ASoC: DAPM: Cover regression by kctl change notification fix
usb: max-3421: Use driver data instead of maintaining a list of bound devices
soc/tegra: pmc: Fix imbalanced clock disabling in error code path
Linux 4.14.256
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I32f0b43f5aa192eda1aa3a220a2f348ade0536d2
commit 85b6d24646e4125c591639841169baa98a2da503 upstream.
Currently, the exit_shm() function not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.
This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
leads to use-after-free (reproducer exists).
This is an attempt to fix the problem by extending exit_shm mechanism to
handle shm's destroy from several IPC ns'es.
To achieve that we do several things:
1. add a namespace (non-refcounted) pointer to the struct shmid_kernel
2. during new shm object creation (newseg()/shmget syscall) we
initialize this pointer by current task IPC ns
3. exit_shm() fully reworked such that it traverses over all shp's in
task->sysvshm.shm_clist and gets IPC namespace not from current task
as it was before but from shp's object itself, then call
shm_destroy(shp, ns).
Note: We need to be really careful here, because as it was said before
(1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we
using special helper get_ipc_ns_not_zero() which allows to get IPC ns
refcounter only if IPC ns not in the "state of destruction".
Q/A
Q: Why can we access shp->ns memory using non-refcounted pointer?
A: Because shp object lifetime is always shorther than IPC namespace
lifetime, so, if we get shp object from the task->sysvshm.shm_clist
while holding task_lock(task) nobody can steal our namespace.
Q: Does this patch change semantics of unshare/setns/clone syscalls?
A: No. It's just fixes non-covered case when process may leave IPC
namespace without getting task->sysvshm.shm_clist list cleaned up.
Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com
Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com
Fixes: ab602f7991 ("shm: make exit_shm work proportional to task activity")
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9cc2fa4f4a92ccc6760d764e7341be46ee8aaaa1 ]
The function end_of_stack() returns a pointer to the last entry of a
stack. For architectures like parisc where the stack grows upwards
return the pointer to the highest address in the stack.
Without this change I faced a crash on parisc, because the stackleak
functionality wrote STACKLEAK_POISON to the lowest address and thus
overwrote the first 4 bytes of the task_struct which included the
TIF_FLAGS.
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Cgroup v1 implements the freezer controller, which provides an ability
to stop the workload in a cgroup and temporarily free up some
resources (cpu, io, network bandwidth and, potentially, memory)
for some other tasks. Cgroup v2 lacks this functionality.
This patch implements freezer for cgroup v2.
Cgroup v2 freezer tries to put tasks into a state similar to jobctl
stop. This means that tasks can be killed, ptraced (using
PTRACE_SEIZE*), and interrupted. It is possible to attach to
a frozen task, get some information (e.g. read registers) and detach.
It's also possible to migrate a frozen tasks to another cgroup.
This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
tried to imitate the system-wide freezer. However uninterruptible
sleep is fine when all tasks are going to be frozen (hibernation case),
it's not the acceptable state for some subset of the system.
Cgroup v2 freezer is not supporting freezing kthreads.
If a non-root cgroup contains kthread, the cgroup still can be frozen,
but the kthread will remain running, the cgroup will be shown
as non-frozen, and the notification will not be delivered.
* PTRACE_ATTACH is not working because non-fatal signal delivery
is blocked in frozen state.
There are some interface differences between cgroup v1 and cgroup v2
freezer too, which are required to conform the cgroup v2 interface
design principles:
1) There is no separate controller, which has to be turned on:
the functionality is always available and is represented by
cgroup.freeze and cgroup.events cgroup control files.
2) The desired state is defined by the cgroup.freeze control file.
Any hierarchical configuration is allowed.
3) The interface is asynchronous. The actual state is available
using cgroup.events control file ("frozen" field). There are no
dedicated transitional states.
4) It's allowed to make any changes with the cgroup hierarchy
(create new cgroups, remove old cgroups, move tasks between cgroups)
no matter if some cgroups are frozen.
Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
Cc: kernel-team@fb.com
Change-Id: I3404119678cbcd7410aa56e9334055cee79d02fa
(cherry picked from commit 76f969e8948d82e78e1bc4beb6b9465908e74873)
Bug: 154548692
Signed-off-by: Marco Ballesio <balejs@google.com>
(cherry picked from commit 666d8913b8f1adef750ae86d9acb74c9cb84c4ef)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
For us, it's most helpful to have the round-robin timeslice as low as is
allowed by the scheduler to reduce latency. Since it's limited by the
scheduler tick rate, just set the default to 1 jiffy, which is the
lowest possible value.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
If sysctl_sched_prefer_spread is enabled, then tasks would be freely
migrated to idle cpus within same cluster to reduce runnables.
By default, the feature is disabled.
User can trigger feature with:
echo 1 > /proc/sys/kernel/sched_prefer_spread
Aggressively spread tasks with in little cluster.
echo 2 > /proc/sys/kernel/sched_prefer_spread
Aggressively spread tasks with in little cluster as well as
big cluster, but not between big and little.
Change-Id: I0a4d87bd17de3525548765472e6f388a9970f13c
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[render: minor fixups]
Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
commit 85572c2c4a45a541e880e087b5b17a48198b2416 upstream.
The scheduler code calling cpufreq_update_util() may run during CPU
offline on the target CPU after the IRQ work lists have been flushed
for it, so the target CPU should be prevented from running code that
may queue up an IRQ work item on it at that point.
Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
is set for at least one cpufreq policy in the system, because that
allows the CPU going offline to run the utilization update callback
of the cpufreq governor on behalf of another (online) CPU in some
cases.
If that happens, the cpufreq governor callback may queue up an IRQ
work on the CPU running it, which is going offline, and the IRQ work
may not be flushed after that point. Moreover, that IRQ work cannot
be flushed until the "offlining" CPU goes back online, so if any
other CPU calls irq_work_sync() to wait for the completion of that
IRQ work, it will have to wait until the "offlining" CPU is back
online and that may not happen forever. In particular, a system-wide
deadlock may occur during CPU online as a result of that.
The failing scenario is as follows. CPU0 is the boot CPU, so it
creates a cpufreq policy and becomes the "leader" of it
(policy->cpu). It cannot go offline, because it is the boot CPU.
Next, other CPUs join the cpufreq policy as they go online and they
leave it when they go offline. The last CPU to go offline, say CPU3,
may queue up an IRQ work while running the governor callback on
behalf of CPU0 after leaving the cpufreq policy because of the
dvfs_possible_from_any_cpu effect described above. Then, CPU0 is
the only online CPU in the system and the stale IRQ work is still
queued on CPU3. When, say, CPU1 goes back online, it will run
irq_work_sync() to wait for that IRQ work to complete and so it
will wait for CPU3 to go back online (which may never happen even
in principle), but (worse yet) CPU0 is waiting for CPU1 at that
point too and a system-wide deadlock occurs.
To address this problem notice that CPUs which cannot run cpufreq
utilization update code for themselves (for example, because they
have left the cpufreq policies that they belonged to), should also
be prevented from running that code on behalf of the other CPUs that
belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
in that case the cpufreq_update_util_data pointer of the CPU running
the code must not be NULL as well as for the CPU which is the target
of the cpufreq utilization update in progress.
Accordingly, change cpufreq_this_cpu_can_update() into a regular
function in kernel/sched/cpufreq.c (instead of a static inline in a
header file) and make it check the cpufreq_update_util_data pointer
of the local CPU if dvfs_possible_from_any_cpu is set for the target
cpufreq policy.
Also update the schedutil governor to do the
cpufreq_this_cpu_can_update() check in the non-fast-switch
case too to avoid the stale IRQ work issues.
Change-Id: Idb7f18129f59a82485a5eb93dc26c6f1a463a76a
Fixes: 99d14d0e16 ("cpufreq: Process remote callbacks from any CPU if the platform permits")
Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
Reported-by: Anson Huang <anson.huang@nxp.com>
Tested-by: Anson Huang <anson.huang@nxp.com>
Cc: 4.14+ <stable@vger.kernel.org> # 4.14+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Git-commit: 85572c2c4a45a541e880e087b5b17a48198b2416
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Santosh Mardi <gsantosh@codeaurora.org>
[render: Account for function name differences]
Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com>
This change is for general scheduler improvement.
Change-Id: I5fbadf248c0bfe27bc761686de7a925cec2e4163
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Signed-off-by: Sai Harshini Nimmala <snimmala@codeaurora.org>
This change is for general scheduler improvement.
Change-Id: I33e9ec890f8b54d673770d5d02dba489a8e08ce7
Signed-off-by: Sai Harshini Nimmala <snimmala@codeaurora.org>
Currently, sched updown migration handler derives cluster topology
based on arch topology, the cluster information is already populated
in walt sched_cluster. So reuse it instead of deriving it again.
And move updown tunables support to under WALT.
Change-Id: Iddf4d18ddf75cc20637281d9889f671f42369513
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[render: minor fixup]
Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com>
This change is for general scheduler improvement.
Change-Id: I8459bcf7b412a5f301566054c28c910567548485
Signed-off-by: Sai Harshini Nimmala <snimmala@codeaurora.org>
This change is for general scheduler improvement.
Change-Id: Ida39a3ee5e6b4b0d3255bfef95601890afd80709
Signed-off-by: Shaleen Agrawal <shalagra@codeaurora.org>
This change is for general scheduler improvement.
Change-Id: I8ff4768d56d8e63b2cfa78e5f34cb156ee60e3da
Signed-off-by: Amir Vajid <avajid@codeaurora.org>
This change is for general scheduler improvement.
Change-Id: I737751f065df6a5ed3093e3bda5e48750a14e4c9
Signed-off-by: Amir Vajid <avajid@codeaurora.org>
Update lpm_disallowed() to receive bias_time directly
from the scheduler.
Change-Id: I756c4f5e545083b455041af5ad8ddf7ee7986365
Signed-off-by: Amir Vajid <avajid@codeaurora.org>
The current code is using a bitmap for sched_busy_hysteresis_enable_cpus
tunable. Since the feature is not enabled for any of the CPUs, the
default value is printed as "\n". This is very inconvenient for
user space applications which tries to write a new value and try to
restore the previous value. Like other scheduler tunables, use
the bitmask to avoid this issue.
Change-Id: I0c5989606352be5382dd688602aefd753fb62317
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Currently sched busy hysteresis feature is applied only for
CPUs other than the min capacity CPUs. This policy restricts
the flexibility on a system with more than 2 clusters. Add a
tunable to specify which CPUs needs this feature. By default,
the feature is turned off for all the CPUs.
The usage of this tunable:
echo 4-7 > /proc/sys/kernel/sched_busy_hysteresis_enable_cpus
Change-Id: I636575af2c42e2774007582f3d589495c6a3a9f1
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
This change is for general scheduler improvement.
Change-Id: I310bbdc19bb65a0c562ec6a208f2da713eba954d
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
[render: minor fixups]
Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com>
This change is for general scheduler improvement.
Change-Id: I7d794ad1be10a6811602fabb388facd39c8f3c53
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com>
This change is for general scheduler improvement.
Change-Id: I18364c6061ed7525755aaf187bf15a8cb9b54a8a
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
[render: minor fixup]
Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com>
This change is for general scheduler improvement.
Change-Id: I05a6645db80cc04993b45d7ec25a3fb7a112cf3e
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
This change is for general scheduler improvement.
Change-Id: Idef278a9551e6d7d3c1a945dcfd8804cbc7d6aff
Signed-off-by: Puja Gupta <pujag@codeaurora.org>
This change is for general scheduler improvement.
Change-Id: I5d89acdde73f5379d68ebc8513d0bbeaac128f5d
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
Signed-off-by: Jonathan Avila <avilaj@codeaurora.org>
[render: minor fixups]
Signed-off-by: Zachariah Kennedy <zkennedy87@gmail.com>
This change is for general scheduler improvement.
Change-Id: Ib963aef88d85e15fcd19cda3d3f0944b530239ab
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
This change is for general scheduler improvement.
Change-Id: Ie37ab752a4d69569bce506b0a12715bb45ece79e
Co-developed-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>