Now that CPU pause is gone, this is a lot more manageable. The
remaining conflicts were caused mostly by vendor hooks and Android-specific
tweaks to the EAS topology code, but easily fixable by hand.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I3665a2d78cb0b8eca6ba5110e90dc7f72030805e
This reverts commit bca99ddbf0.
CPU Pause causes major merge conflicts with the 5.11 scheduler changes
(migrate-disable specifically), so lets revert Pause temporarily as it
is not needed urgently in android-mainline.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I54df88781798cab73813a4a9c4e5aad3b7f506f8
This reverts commit 1d3a64fbd2.
CPU Pause causes major merge conflicts with the 5.11 scheduler changes
(migrate-disable specifically), so lets revert Pause temporarily as it
is not needed urgently in android-mainline.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I201898fe111d5da25c82b0f6a7b32cd0b1d6f237
This reverts commit 18695a1200.
CPU Pause causes major merge conflicts with the 5.11 scheduler changes
(migrate-disable specifically), so lets revert Pause temporarily as it
is not needed urgently in android-mainline.
Signed-off-by: Quentin Perret <qperret@google.com>
Change-Id: I467f4e74565711ec376cd5e3df9ab1731bbc0af5
Merge misc updates from Andrew Morton:
- a few random little subsystems
- almost all of the MM patches which are staged ahead of linux-next
material. I'll trickle to post-linux-next work in as the dependents
get merged up.
Subsystems affected by this patch series: kthread, kbuild, ide, ntfs,
ocfs2, arch, and mm (slab-generic, slab, slub, dax, debug, pagecache,
gup, swap, shmem, memcg, pagemap, mremap, hmm, vmalloc, documentation,
kasan, pagealloc, memory-failure, hugetlb, vmscan, z3fold, compaction,
oom-kill, migration, cma, page-poison, userfaultfd, zswap, zsmalloc,
uaccess, zram, and cleanups).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (200 commits)
mm: cleanup kstrto*() usage
mm: fix fall-through warnings for Clang
mm: slub: convert sysfs sprintf family to sysfs_emit/sysfs_emit_at
mm: shmem: convert shmem_enabled_show to use sysfs_emit_at
mm:backing-dev: use sysfs_emit in macro defining functions
mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening
mm: use sysfs_emit for struct kobject * uses
mm: fix kernel-doc markups
zram: break the strict dependency from lzo
zram: add stat to gather incompressible pages since zram set up
zram: support page writeback
mm/process_vm_access: remove redundant initialization of iov_r
mm/zsmalloc.c: rework the list_add code in insert_zspage()
mm/zswap: move to use crypto_acomp API for hardware acceleration
mm/zswap: fix passing zero to 'PTR_ERR' warning
mm/zswap: make struct kernel_param_ops definitions const
userfaultfd/selftests: hint the test runner on required privilege
userfaultfd/selftests: fix retval check for userfaultfd_open()
userfaultfd/selftests: always dump something in modes
userfaultfd: selftests: make __{s,u}64 format specifiers portable
...
Patch series "mm: memcg: deprecate cgroup v1 non-hierarchical mode", v1.
The non-hierarchical cgroup v1 mode is a legacy of early days
of the memory controller and doesn't bring any value today.
However, it complicates the code and creates many edge cases
all over the memory controller code.
It's a good time to deprecate it completely. This patchset removes
the internal logic, adjusts the user interface and updates
the documentation. The alt patch removes some bits of the cgroup
core code, which become obsolete.
Michal Hocko said:
"All that we know today is that we have a warning in place to complain
loudly when somebody relies on use_hierarchy=0 with a deeper
hierarchy. For all those years we have seen _zero_ reports that would
describe a sensible usecase.
Moreover we (SUSE) have backported this warning into old distribution
kernels (since 3.0 based kernels) to extend the coverage and didn't
hear even for users who adopt new kernels only very slowly. The only
report we have seen so far was a LTP test suite which doesn't really
reflect any real life usecase"
This patch (of 3):
The non-hierarchical cgroup v1 mode is a legacy of early days of the
memory controller and doesn't bring any value today. However, it
complicates the code and creates many edge cases all over the memory
controller code.
It's a good time to deprecate it completely.
Functionally this patch enabled is by default for all cgroups and forbids
switching it off. Nothing changes if cgroup v2 is used: hierarchical mode
was enforced from scratch.
To protect the ABI memory.use_hierarchy interface is preserved with a
limited functionality: reading always returns "1", writing of "1" passes
silently, writing of any other value fails with -EINVAL and a warning to
dmesg (on the first occasion).
Link: https://lkml.kernel.org/r/20201110220800.929549-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201110220800.929549-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steps on the way to 5.11-rc1
Resolves merge conflicts in:
include/uapi/linux/prctl.h
kernel/sys.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I85ea8cffcd22f93277b357872254fe21d68bd82c
Pull scheduler updates from Thomas Gleixner:
- migrate_disable/enable() support which originates from the RT tree
and is now a prerequisite for the new preemptible kmap_local() API
which aims to replace kmap_atomic().
- A fair amount of topology and NUMA related improvements
- Improvements for the frequency invariant calculations
- Enhanced robustness for the global CPU priority tracking and decision
making
- The usual small fixes and enhancements all over the place
* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
sched/fair: Trivial correction of the newidle_balance() comment
sched/fair: Clear SMT siblings after determining the core is not idle
sched: Fix kernel-doc markup
x86: Print ratio freq_max/freq_base used in frequency invariance calculations
x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
x86, sched: Calculate frequency invariance for AMD systems
irq_work: Optimize irq_work_single()
smp: Cleanup smp_call_function*()
irq_work: Cleanup
sched: Limit the amount of NUMA imbalance that can exist at fork time
sched/numa: Allow a floating imbalance between NUMA nodes
sched: Avoid unnecessary calculation of load imbalance at clone time
sched/numa: Rename nr_running and break out the magic number
sched: Make migrate_disable/enable() independent of RT
sched/topology: Condition EAS enablement on FIE support
arm64: Rebuild sched domains on invariance status changes
sched/topology,schedutil: Wrap sched domains rebuild
sched/uclamp: Allow to reset a task uclamp constraint value
sched/core: Fix typos in comments
Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
...
Pull misc fixes from Christian Brauner:
"This contains several fixes which felt worth being combined into a
single branch:
- Use put_nsproxy() instead of open-coding it switch_task_namespaces()
- Kirill's work to unify lifecycle management for all namespaces. The
lifetime counters are used identically for all namespaces types.
Namespaces may of course have additional unrelated counters and
these are not altered. This work allows us to unify the type of the
counters and reduces maintenance cost by moving the counter in one
place and indicating that basic lifetime management is identical
for all namespaces.
- Peilin's fix adding three byte padding to Dmitry's
PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.
- Two smal patches to convert from the /* fall through */ comment
annotation to the fallthrough keyword annotation which I had taken
into my branch and into -next before df561f6688 ("treewide: Use
fallthrough pseudo-keyword") made it upstream which fixed this
tree-wide.
Since I didn't want to invalidate all testing for other commits I
didn't rebase and kept them"
* tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
nsproxy: use put_nsproxy() in switch_task_namespaces()
sys: Convert to the new fallthrough notation
signal: Convert to the new fallthrough notation
time: Use generic ns_common::count
cgroup: Use generic ns_common::count
mnt: Use generic ns_common::count
user: Use generic ns_common::count
pid: Use generic ns_common::count
ipc: Use generic ns_common::count
uts: Use generic ns_common::count
net: Use generic ns_common::count
ns: Add a common refcount into ns_common
ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()
Below symbols would be used by vendor modules for tracking tasks
in various cgroups
1. cgroup_taskset_first
2. cgroup_taskset_next
Bug: 175045928
Change-Id: I6d89148eb4c71174a02a27acab196ff940be9082
Signed-off-by: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
This reverts commit 1243dc518c.
percpu_rwsem is an rcu based lock. Under loaded conditions
this lock will require that each cpu perform a switch, causing
exessive delays with long-running tasks on those CPUs.
In the case of hotplug/pause, this can slow down the ability
to activate/deactivate a CPU.
Revert the change from cpuset_mutex to percpu_rwsem, to
eliminate that overhead, and revert to a global lock.
Bug: 161210528
Change-Id: Id00dcaa6d601b561d1321d3e944b6c52e9663f1a
Signed-off-by: Stephen Dickey <dickey@codeaurora.org>
In the resume_cpus() path, cpus cannot be taken
advantage of until the cpus write lock is acquired,
and cpus are activated and domains rebuilt. This
can incurr significant delay in the unpause operation.
Additionally, if scheduled through the kworker thread,
the wait time for rebuilding sched domains becomes
large due to a busy system that can prevent the kworker
from executing.
Activate the cpus and call the cpuset_hotplug_workfn
directly within resume_cpus prior to getting the cpus
write lock, thereby eliminating delays associated
with scheduling this activity.
Bug: 161210528
Change-Id: Ie2521f28ed9078b22d421d792f08413016d4dd62
Signed-off-by: Stephen Dickey <dickey@codeaurora.org>
Signed-off-by: Todd Kjos <tkjos@google.com>
In the event of a partial hotunplug, a stable state with a CPU set in the
online_mask and cleared from active_mask can happen. An online CPU, from a
scheduler point of view, should be part of the cpu_active_mask.
Bug: 161210528
Change-Id: I0d0aa6fca4c6dc145634c4aad6519045e0afc8e2
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Vendors might want to change tasks affinity settings when they are
moving from one cpuset into the other. Add vendor hook to give control
to vendor to implement what they need.
Bug: 174125747
Change-Id: Icee0405be0bca432002dae4a26ebe945082ce052
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
update_cpumask had a special case for empty buf which
did not update cpus_requested. This change reduces
differences (only to parsing) in empty/non-empty codepaths
to make them consistent.
Bug: 174125747
Bug: 120444281
Fixes: 4803def4e0b2 ("ANDROID: cpuset: Make cpusets restore on hotplug")
Test: check that writes to /dev/cpuset/background/tasks
Test: work as expected, e.g.:
Test: echo $$ > /dev/cpuset/background/tasks
Test: echo > /dev/cpuset/background/tasks
Signed-off-by: Roman Kiryanov <rkir@google.com>
Change-Id: I49d320ea046636ec38bd23f053317abc59f64f8e
[satyap@codeaurora.org: port to android-mainline kernel]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
This deliberately changes the behavior of the per-cpuset
cpus file to not be effected by hotplug. When a cpu is offlined,
it will be removed from the cpuset/cpus file. When a cpu is onlined,
if the cpuset originally requested that that cpu was part of the cpuset,
that cpu will be restored to the cpuset. The cpus files still
have to be hierachical, but the ranges no longer have to be out of
the currently online cpus, just the physically present cpus.
Bug: 174125747
Bug: 120444281
Change-Id: I22cdf33e7d312117bcefba1aeb0125e1ada289a9
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
[AmitP: Refactored original changes to align with upstream commit
201af4c0fa ("cgroup: move cgroup files under kernel/cgroup/")]
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
[satyap@codeaurora.org: port to android-mainline kernel]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:
BUG: unable to handle page fault for address: 0000607f820054db
CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
Workqueue: events cpuset_hotplug_workfn
RIP: build_sched_domains
Call Trace:
partition_sched_domains_locked
rebuild_sched_domains_locked
cpuset_hotplug_workfn
It happens with cgroup2 and exclusive cpusets only. This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:
cd $UNIFIED_ROOT
mkdir cg1
echo 4-7 > cg1/cpuset.cpus
echo root > cg1/cpuset.cpus.partition
# with smt/control reading 'on',
echo off > /sys/devices/system/cpu/smt/control
RIP maps to
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
from sd_init(). sd_id is calculated earlier in the same function:
cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
sd_id = cpumask_first(sched_domain_span(sd));
tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.
The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N. cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read
N's empty cpu_sibling_map in sd_init().
rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check. Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.
Fixes: 0ccea8feb9 ("cpuset: Make generate_sched_domains() work with partition")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
Steps on the way to 5.10-rc1
Resolves conflicts in:
fs/userfaultfd.c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ie3fe3c818f1f6565cfd4fa551de72d2b72ef60af
Do not report failure on zero sized writes, and handle them as no-op.
There's issues for example in case of writev() when there's iovec
containing zero buffer as a first one. It's expected writev() on below
example to successfully perform the write to specified writable cgroup
file expecting integer value, and to return 2. For now it's returning
value -1, and skipping the write:
int writetest(int fd) {
const char *buf1 = "";
const char *buf2 = "1\n";
struct iovec iov[2] = {
{ .iov_base = (void*)buf1, .iov_len = 0 },
{ .iov_base = (void*)buf2, .iov_len = 2 }
};
return writev(fd, iov, 2);
}
This patch fixes the issue by checking if there's nothing to write,
and handling the write as no-op by just returning 0.
Signed-off-by: Jouni Roivas <jouni.roivas@tuxera.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Switch over cgroup namespaces to use the newly introduced common lifetime
counter.
Currently every namespace type has its own lifetime counter which is stored
in the specific namespace struct. The lifetime counters are used
identically for all namespaces types. Namespaces may of course have
additional unrelated counters and these are not altered.
This introduces a common lifetime counter into struct ns_common. The
ns_common struct encompasses information that all namespaces share. That
should include the lifetime counter since its common for all of them.
It also allows us to unify the type of the counters across all namespaces.
Most of them use refcount_t but one uses atomic_t and at least one uses
kref. Especially the last one doesn't make much sense since it's just a
wrapper around refcount_t since 2016 and actually complicates cleanup
operations by having to use container_of() to cast the correct namespace
struct out of struct ns_common.
Having the lifetime counter for the namespaces in one place reduces
maintenance cost. Not just because after switching all namespaces over we
will have removed more code than we added but also because the logic is
more easily understandable and we indicate to the user that the basic
lifetime requirements for all namespaces are currently identical.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/159644980994.604812.383801057081594972.stgit@localhost.localdomain
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Pull core block updates from Jens Axboe:
"Good amount of cleanups and tech debt removals in here, and as a
result, the diffstat shows a nice net reduction in code.
- Softirq completion cleanups (Christoph)
- Stop using ->queuedata (Christoph)
- Cleanup bd claiming (Christoph)
- Use check_events, moving away from the legacy media change
(Christoph)
- Use inode i_blkbits consistently (Christoph)
- Remove old unused writeback congestion bits (Christoph)
- Cleanup/unify submission path (Christoph)
- Use bio_uninit consistently, instead of bio_disassociate_blkg
(Christoph)
- sbitmap cleared bits handling (John)
- Request merging blktrace event addition (Jan)
- sysfs add/remove race fixes (Luis)
- blk-mq tag fixes/optimizations (Ming)
- Duplicate words in comments (Randy)
- Flush deferral cleanup (Yufen)
- IO context locking/retry fixes (John)
- struct_size() usage (Gustavo)
- blk-iocost fixes (Chengming)
- blk-cgroup IO stats fixes (Boris)
- Various little fixes"
* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
block: blk-timeout: delete duplicated word
block: blk-mq-sched: delete duplicated word
block: blk-mq: delete duplicated word
block: genhd: delete duplicated words
block: elevator: delete duplicated word and fix typos
block: bio: delete duplicated words
block: bfq-iosched: fix duplicated word
iocost_monitor: start from the oldest usage index
iocost: Fix check condition of iocg abs_vdebt
block: Remove callback typedefs for blk_mq_ops
block: Use non _rcu version of list functions for tag_set_list
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
...
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.
sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.
The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.
This bug seems to be introduced since the beginning, commit
d979a39d72 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not compeletely. It seems not easy to trigger until
the recent commit 090e28b229
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas <cam@neo-zeon.de>
Reported-by: Peter Geis <pgwipeout@gmail.com>
Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reported-by: Daniël Sonck <dsonck92@gmail.com>
Reported-by: Zhang Qiang <qiang.zhang@windriver.com>
Tested-by: Cameron Berkenpas <cam@neo-zeon.de>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
cgroup_rstat_updated is only used by core block code, no need to
export it.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull cgroup updates from Tejun Heo:
"Just two patches: one to add system-level cpu.stat to the root cgroup
for convenience and a trivial comment update"
* 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: add cpu.stat file to root cgroup
cgroup: Remove stale comments
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
Augusto von Dentz.
2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
device self-test. From Andrew Lunn.
5) Start moving away from custom driver versions, use the globally
defined kernel version instead, from Leon Romanovsky.
6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from
Horatiu Vultur.
10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
Dubroca. Also add ipv6 support for espintcp.
12) Lots of ReST conversions of the networking documentation, from Mauro
Carvalho Chehab.
13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
from Doug Berger.
14) Allow to dump cgroup id and filter by it in inet_diag code, from
Dmitry Yakunin.
15) Add infrastructure to export netlink attribute policies to
userspace, from Johannes Berg.
16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise
a packet scheduler init failure will make a device inoperative. From
Jesper Dangaard Brouer.
18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several
drivers, it's netdev_tx_t but many drivers were using
'int'. From Yunjian Wang.
20) Add an ethtool interface for PHY master/slave config, from Oleksij
Rempel.
21) Add BPF iterators, from Yonghang Song.
22) Add cable test infrastructure, including ethool interfaces, from
Andrew Lunn. Marvell PHY driver is the first to support this
facility.
23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper
Dangaard Brouer.
25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
eliminate set_fs/get_fs calls. From Christoph Hellwig.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
selftests: net: ip_defrag: ignore EPERM
net_failover: fixed rollback in net_failover_open()
Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
vmxnet3: allow rx flow hash ops only when rss is enabled
hinic: add set_channels ethtool_ops support
selftests/bpf: Add a default $(CXX) value
tools/bpf: Don't use $(COMPILE.c)
bpf, selftests: Use bpf_probe_read_kernel
s390/bpf: Use bcr 0,%0 as tail call nop filler
s390/bpf: Maintain 8-byte stack alignment
selftests/bpf: Fix verifier test
selftests/bpf: Fix sample_cnt shared between two threads
bpf, selftests: Adapt cls_redirect to call csum_level helper
bpf: Add csum_level helper for fixing up csum levels
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
crypto/chtls: IPv6 support for inline TLS
Crypto/chcr: Fixes a coccinile check error
Crypto/chcr: Fixes compilations warnings
...
Pull thread updates from Christian Brauner:
"We have been discussing using pidfds to attach to namespaces for quite
a while and the patches have in one form or another already existed
for about a year. But I wanted to wait to see how the general api
would be received and adopted.
This contains the changes to make it possible to use pidfds to attach
to the namespaces of a process, i.e. they can be passed as the first
argument to the setns() syscall.
When only a single namespace type is specified the semantics are
equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
equals setns(pidfd, CLONE_NEWNET).
However, when a pidfd is passed, multiple namespace flags can be
specified in the second setns() argument and setns() will attach the
caller to all the specified namespaces all at once or to none of them.
Specifying 0 is not valid together with a pidfd. Here are just two
obvious examples:
setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
setns(pidfd, CLONE_NEWUSER);
Allowing to also attach subsets of namespaces supports various
use-cases where callers setns to a subset of namespaces to retain
privilege, perform an action and then re-attach another subset of
namespaces.
Apart from significantly reducing the number of syscalls needed to
attach to all currently supported namespaces (eight "open+setns"
sequences vs just a single "setns()"), this also allows atomic setns
to a set of namespaces, i.e. either attaching to all namespaces
succeeds or we fail without having changed anything.
This is centered around a new internal struct nsset which holds all
information necessary for a task to switch to a new set of namespaces
atomically. Fwiw, with this change a pidfd becomes the only token
needed to interact with a container. I'm expecting this to be
picked-up by util-linux for nsenter rather soon.
Associated with this change is a shiny new test-suite dedicated to
setns() (for pidfds and nsfds alike)"
* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests/pidfd: add pidfd setns tests
nsproxy: attach to namespaces via pidfds
nsproxy: add struct nsset
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.
The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the root cgroup does not have a cpu.stat file. Add one which
is consistent with /proc/stat to capture global cpu statistics that
might not fall under cgroup accounting.
We haven't done this in the past because the data are already presented
in /proc/stat and we didn't want to add overhead from collecting root
cgroup stats when cgroups are configured, but no cgroups have been
created.
By keeping the data consistent with /proc/stat, I think we avoid the
first problem, while improving the usability of cgroups stats.
We avoid the second problem by computing the contents of cpu.stat from
existing data collected for /proc/stat anyway.
Signed-off-by: Boris Burkov <boris@bur.io>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
- The default root is where we can create v2 cgroups.
- The __DEVEL__sane_behavior mount option has been removed long long ago.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add a simple struct nsset. It holds all necessary pieces to switch to a new
set of namespaces without leaving a task in a half-switched state which we
will make use of in the next patch. This patch switches the existing setns
logic over without causing a change in setns() behavior. This brings
setns() closer to how unshare() works(). The prepare_ns() function is
responsible to prepare all necessary information. This has two reasons.
First it minimizes dependencies between individual namespaces, i.e. all
install handler can expect that all fields are properly initialized
independent in what order they are called in. Second, this makes the code
easier to maintain and easier to follow if it needs to be changed.
The prepare_ns() helper will only be switched over to use a flags argument
in the next patch. Here it will still use nstype as a simple integer
argument which was argued would be clearer. I'm not particularly
opinionated about this if it really helps or not. The struct nsset itself
already contains the flags field since its name already indicates that it
can contain information required by different namespaces. None of this
should have functional consequences.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com
Make bpf_link update support more generic by making it into another
bpf_link_ops methods. This allows generic syscall handling code to be agnostic
to various conditionally compiled features (e.g., the case of
CONFIG_CGROUP_BPF). This also allows to keep link type-specific code to remain
static within respective code base. Refactor existing bpf_cgroup_link code and
take advantage of this.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200429001614.1544-2-andriin@fb.com
This reverts commit 9a9e97b2f1 ("cgroup: Add memory barriers to plug
cgroup_rstat_updated() race window").
The commit was added in anticipation of memcg rstat conversion which needed
synchronous accounting for the event counters (e.g. oom kill count). However,
the conversion didn't get merged due to percpu memory overhead concern which
couldn't be addressed at the time.
Unfortunately, the patch's addition of smp_mb() to cgroup_rstat_updated()
meant that every scheduling event now had to go through an additional full
barrier and Mel Gorman noticed it as 1% regression in netperf UDP_STREAM test.
There's no need to have this barrier in tree now and even if we need
synchronous accounting in the future, the right thing to do is separating that
out to a separate function so that hot paths which don't care about
synchronous behavior don't have to pay the overhead of the full barrier. Let's
revert.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Link: http://lkml.kernel.org/r/20200409154413.GK3818@techsingularity.net
Cc: v4.18+
In a quest to divide up the 5.7-rc1 merge chunks into reviewable pieces.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2e5960415348c06e8f10e10cbefb3ee5c3745e73
Pull cgroup updates from Tejun Heo:
- Christian extended clone3 so that processes can be spawned into
cgroups directly.
This is not only neat in terms of semantics but also avoids grabbing
the global cgroup_threadgroup_rwsem for migration.
- Daniel added !root xattr support to cgroupfs.
Userland already uses xattrs on cgroupfs for bookkeeping. This will
allow delegated cgroups to support such usages.
- Prateek tried to make cpuset hotplug handling synchronous but that
led to possible deadlock scenarios. Reverted.
- Other minor changes including release_agent_path handling cleanup.
* 'for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup-v1: Document the cpuset_v2_mode mount option
Revert "cpuset: Make cpuset hotplug synchronous"
cgroupfs: Support user xattrs
kernfs: Add option to enable user xattrs
kernfs: Add removed_size out param for simple_xattr_set
kernfs: kvmalloc xattr value instead of kmalloc
cgroup: Restructure release_agent_path handling
selftests/cgroup: add tests for cloning into cgroups
clone3: allow spawning processes into cgroups
cgroup: add cgroup_may_write() helper
cgroup: refactor fork helpers
cgroup: add cgroup_get_from_file() helper
cgroup: unify attach permission checking
cpuset: Make cpuset hotplug synchronous
cgroup.c: Use built-in RCU list checking
kselftest/cgroup: add cgroup destruction test
cgroup: Clean up css_set task traversal