kernel_google_wahoo

Author	SHA1	Message	Date
Eric W. Biederman	f30288e5b7	cgroup-v1: Require capabilities to set release_agent commit 24f6008564183aa120d07c03d9289519c2fe02af upstream. The cgroup release_agent is called with call_usermodehelper. The function call_usermodehelper starts the release_agent with a full set fo capabilities. Therefore require capabilities when setting the release_agaent. Reported-by: Tabitha Sable <tabitha.c.sable@gmail.com> Tested-by: Tabitha Sable <tabitha.c.sable@gmail.com> Fixes: `81a6a5cdd2` ("Task Control Groups: automatic userspace notification of idle cgroups") Cc: stable@vger.kernel.org # v2.6.24+ Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org> [mkoutny: Adjust for pre-fs_context, duplicate mount/remount check, drop log messages.] Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-03-07 20:13:50 +08:00
Waiman Long	e122672357	cgroup: Make rebind_subsystems() disable v2 controllers all at once [ Upstream commit 7ee285395b211cad474b2b989db52666e0430daf ] It was found that the following warning was displayed when remounting controllers from cgroup v2 to v1: [ 8042.997778] WARNING: CPU: 88 PID: 80682 at kernel/cgroup/cgroup.c:3130 cgroup_apply_control_disable+0x158/0x190 : [ 8043.091109] RIP: 0010:cgroup_apply_control_disable+0x158/0x190 [ 8043.096946] Code: ff f6 45 54 01 74 39 48 8d 7d 10 48 c7 c6 e0 46 5a a4 e8 7b 67 33 00 e9 41 ff ff ff 49 8b 84 24 e8 01 00 00 0f b7 40 08 eb 95 <0f> 0b e9 5f ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 [ 8043.115692] RSP: 0018:ffffba8a47c23d28 EFLAGS: 00010202 [ 8043.120916] RAX: 0000000000000036 RBX: ffffffffa624ce40 RCX: 000000000000181a [ 8043.128047] RDX: ffffffffa63c43e0 RSI: ffffffffa63c43e0 RDI: ffff9d7284ee1000 [ 8043.135180] RBP: ffff9d72874c5800 R08: ffffffffa624b090 R09: 0000000000000004 [ 8043.142314] R10: ffffffffa624b080 R11: 0000000000002000 R12: ffff9d7284ee1000 [ 8043.149447] R13: ffff9d7284ee1000 R14: ffffffffa624ce70 R15: ffffffffa6269e20 [ 8043.156576] FS: 00007f7747cff740(0000) GS:ffff9d7a5fc00000(0000) knlGS:0000000000000000 [ 8043.164663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8043.170409] CR2: 00007f7747e96680 CR3: 0000000887d60001 CR4: 00000000007706e0 [ 8043.177539] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8043.184673] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 8043.191804] PKRU: 55555554 [ 8043.194517] Call Trace: [ 8043.196970] rebind_subsystems+0x18c/0x470 [ 8043.201070] cgroup_setup_root+0x16c/0x2f0 [ 8043.205177] cgroup1_root_to_use+0x204/0x2a0 [ 8043.209456] cgroup1_get_tree+0x3e/0x120 [ 8043.213384] vfs_get_tree+0x22/0xb0 [ 8043.216883] do_new_mount+0x176/0x2d0 [ 8043.220550] __x64_sys_mount+0x103/0x140 [ 8043.224474] do_syscall_64+0x38/0x90 [ 8043.228063] entry_SYSCALL_64_after_hwframe+0x44/0xae It was caused by the fact that rebind_subsystem() disables controllers to be rebound one by one. If more than one disabled controllers are originally from the default hierarchy, it means that cgroup_apply_control_disable() will be called multiple times for the same default hierarchy. A controller may be killed by css_kill() in the first round. In the second round, the killed controller may not be completely dead yet leading to the warning. To avoid this problem, we collect all the ssid's of controllers that needed to be disabled from the default hierarchy and then disable them in one go instead of one by one. Fixes: 334c3679ec4b ("cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Change-Id: I62fb64dec4392451fd649d6bdbb8e409858d9513	2022-11-15 21:35:31 +01:00
Anay Wadhera	89423923b3	Revert "cgroup: Disable IRQs while holding css_set_lock" This reverts commit ac7b270e91c7b0d1b1c5544532852b55177004f1. Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:59 +01:00
Colin Cross	5ed9a85104	cgroup: Add generic cgroup subsystem permission checks Rather than using explicit euid == 0 checks when trying to move tasks into a cgroup via CFS, move permission checks into each specific cgroup subsystem. If a subsystem does not specify a 'allow_attach' handler, then we fall back to doing our checks the old way. Use the 'allow_attach' handler for the 'cpu' cgroup to allow non-root processes to add arbitrary processes to a 'cpu' cgroup if it has the CAP_SYS_NICE capability set. This version of the patch adds a 'allow_attach' handler instead of reusing the 'can_attach' handler. If the 'can_attach' handler is reused, a new cgroup that implements 'can_attach' but not the permission checks could end up with no permission checks at all. Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c Original-Author: San Mehat <san@google.com> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:59 +01:00
Rom Lemarchand	c11726cada	cgroup: refactor allow_attach function into common code move cpu_cgroup_allow_attach to a common subsys_cgroup_allow_attach. This allows any process with CAP_SYS_NICE to move tasks across cgroups if they use this function as their allow_attach handler. Bug: 18260435 Change-Id: I6bb4933d07e889d0dc39e33b4e71320c34a2c90f Signed-off-by: Rom Lemarchand <romlem@android.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:58 +01:00
Tejun Heo	312b018a47	cgroup: add tracepoints for basic operations Debugging what goes wrong with cgroup setup can get hairy. Add tracepoints for cgroup hierarchy mount, cgroup creation/destruction and task migration operations for better visibility. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:58 +01:00
Daniel Bristot de Oliveira	e61d484a2d	cgroup: Disable IRQs while holding css_set_lock While testing the deadline scheduler + cgroup setup I hit this warning. [ 132.612935] ------------[ cut here ]------------ [ 132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80 [ 132.612952] Modules linked in: (a ton of modules...) [ 132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2 [ 132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014 [ 132.612982] 0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e [ 132.612984] 0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b [ 132.612985] 00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00 [ 132.612986] Call Trace: [ 132.612987] <IRQ> [<ffffffff813d229e>] dump_stack+0x63/0x85 [ 132.612994] [<ffffffff810a652b>] __warn+0xcb/0xf0 [ 132.612997] [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170 [ 132.612999] [<ffffffff810a665d>] warn_slowpath_null+0x1d/0x20 [ 132.613000] [<ffffffff810aba5b>] __local_bh_enable_ip+0x6b/0x80 [ 132.613008] [<ffffffff817d6c8a>] _raw_write_unlock_bh+0x1a/0x20 [ 132.613010] [<ffffffff817d6c9e>] _raw_spin_unlock_bh+0xe/0x10 [ 132.613015] [<ffffffff811388ac>] put_css_set+0x5c/0x60 [ 132.613016] [<ffffffff8113dc7f>] cgroup_free+0x7f/0xa0 [ 132.613017] [<ffffffff810a3912>] __put_task_struct+0x42/0x140 [ 132.613018] [<ffffffff810e776a>] dl_task_timer+0xca/0x250 [ 132.613027] [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170 [ 132.613030] [<ffffffff8111371e>] __hrtimer_run_queues+0xee/0x270 [ 132.613031] [<ffffffff81113ec8>] hrtimer_interrupt+0xa8/0x190 [ 132.613034] [<ffffffff81051a58>] local_apic_timer_interrupt+0x38/0x60 [ 132.613035] [<ffffffff817d9b0d>] smp_apic_timer_interrupt+0x3d/0x50 [ 132.613037] [<ffffffff817d7c5c>] apic_timer_interrupt+0x8c/0xa0 [ 132.613038] <EOI> [<ffffffff81063466>] ? native_safe_halt+0x6/0x10 [ 132.613043] [<ffffffff81037a4e>] default_idle+0x1e/0xd0 [ 132.613044] [<ffffffff810381cf>] arch_cpu_idle+0xf/0x20 [ 132.613046] [<ffffffff810e8fda>] default_idle_call+0x2a/0x40 [ 132.613047] [<ffffffff810e92d7>] cpu_startup_entry+0x2e7/0x340 [ 132.613048] [<ffffffff81050235>] start_secondary+0x155/0x190 [ 132.613049] ---[ end trace f91934d162ce9977 ]--- The warn is the spin_(lock\|unlock)_bh(&css_set_lock) in the interrupt context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid this problem - and other problems of sharing a spinlock with an interrupt. Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: cgroups@vger.kernel.org Cc: stable@vger.kernel.org # 4.5+ Cc: linux-kernel@vger.kernel.org Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:58 +01:00
Johannes Weiner	75eb76c4e2	FROMLIST: kernel: cgroup: add poll file operation Cgroup has a standardized poll/notification mechanism for waking all pollers on all fds when a filesystem node changes. To allow polling for custom events, add a .poll callback that can override the default. This is in preparation for pollable cgroup pressure files which have per-fd trigger configurations. Link: http://lkml.kernel.org/r/20190124211518.244221-3-surenb@google.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> (in linux-next: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=c88177361203be291a49956b6c9d5ec164ea24b2) Conflicts: include/linux/cgroup-defs.h kernel/cgroup.c 1. made changes in kernel/cgroup.c instead of kernel/cgroup/cgroup.c 2. replaced __poll_t with unsigned int Bug: 111308141 Test: modified lmkd to use PSI and tested using lmkd_unit_test Change-Id: Ie3d914197d1f150e1d83c6206865566a7cbff1b4 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:58 +01:00
Tejun Heo	3f46e4936c	UPSTREAM: cgroup add cftype->open/release() callbacks Pipe the newly added kernfs->open/release() callbacks through cftype. While at it, as cleanup operations now can be performed from ->release() instead of ->seq_stop(), make the latter optional. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Acked-by: Zefan Li <lizefan@huawei.com> (cherry picked from commit e90cbebc3fa5caea4c8bfeb0d0157a0cee53efc7) Bug: 111308141 Test: modified lmkd to use PSI and tested using lmkd_unit_test Change-Id: Iff9794cbbc2c7067c24cb2f767bbdeffa26b5180 Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:58 +01:00
Shakeel Butt	4f559d974b	cgroup: memcg: net: do not associate sock with unrelated cgroup [ Upstream commit e876ecc67db80dfdb8e237f71e5b43bb88ae549c ] We are testing network memory accounting in our setup and noticed inconsistent network memory usage and often unrelated cgroups network usage correlates with testing workload. On further inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in irq context specially for cgroup v1. mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context and kind of assumes that this can only happen from sk_clone_lock() and the source sock object has already associated cgroup. However in cgroup v1, where network memory accounting is opt-in, the source sock can be unassociated with any cgroup and the new cloned sock can get associated with unrelated interrupted cgroup. Cgroup v2 can also suffer if the source sock object was created by process in the root cgroup or if sk_alloc() is called in irq context. The fix is to just do nothing in interrupt. WARNING: Please note that about half of the TCP sockets are allocated from the IRQ context, so, memory used by such sockets will not be accouted by the memcg. The stack trace of mem_cgroup_sk_alloc() from IRQ-context: CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1 Hardware name: ... Call Trace: <IRQ> dump_stack+0x57/0x75 mem_cgroup_sk_alloc+0xe9/0xf0 sk_clone_lock+0x2a7/0x420 inet_csk_clone_lock+0x1b/0x110 tcp_create_openreq_child+0x23/0x3b0 tcp_v6_syn_recv_sock+0x88/0x730 tcp_check_req+0x429/0x560 tcp_v6_rcv+0x72d/0xa40 ip6_protocol_deliver_rcu+0xc9/0x400 ip6_input+0x44/0xd0 ? ip6_protocol_deliver_rcu+0x400/0x400 ip6_rcv_finish+0x71/0x80 ipv6_rcv+0x5b/0xe0 ? ip6_sublist_rcv+0x2e0/0x2e0 process_backlog+0x108/0x1e0 net_rx_action+0x26b/0x460 __do_softirq+0x104/0x2a6 do_softirq_own_stack+0x2a/0x40 </IRQ> do_softirq.part.19+0x40/0x50 __local_bh_enable_ip+0x51/0x60 ip6_finish_output2+0x23d/0x520 ? ip6table_mangle_hook+0x55/0x160 __ip6_finish_output+0xa1/0x100 ip6_finish_output+0x30/0xd0 ip6_output+0x73/0x120 ? __ip6_finish_output+0x100/0x100 ip6_xmit+0x2e3/0x600 ? ipv6_anycast_cleanup+0x50/0x50 ? inet6_csk_route_socket+0x136/0x1e0 ? skb_free_head+0x1e/0x30 inet6_csk_xmit+0x95/0xf0 __tcp_transmit_skb+0x5b4/0xb20 __tcp_send_ack.part.60+0xa3/0x110 tcp_send_ack+0x1d/0x20 tcp_rcv_state_process+0xe64/0xe80 ? tcp_v6_connect+0x5d1/0x5f0 tcp_v6_do_rcv+0x1b1/0x3f0 ? tcp_v6_do_rcv+0x1b1/0x3f0 __release_sock+0x7f/0xd0 release_sock+0x30/0xa0 __inet_stream_connect+0x1c3/0x3b0 ? prepare_to_wait+0xb0/0xb0 inet_stream_connect+0x3b/0x60 __sys_connect+0x101/0x120 ? __sys_getsockopt+0x11b/0x140 __x64_sys_connect+0x1a/0x20 do_syscall_64+0x51/0x200 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The stack trace of mem_cgroup_sk_alloc() from IRQ-context: Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking") Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:58 +01:00
Yang Yingliang	2f26723dee	cgroup: add missing skcd->no_refcnt check in cgroup_sk_clone() Add skcd->no_refcnt check which is missed when backporting ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()"). This patch is needed in stable-4.9, stable-4.14 and stable-4.19. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:58 +01:00
Cong Wang	2c46e4aee8	cgroup: fix cgroup_sk_alloc() for sk_clone_lock() [ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ] When we clone a socket in sk_clone_lock(), its sk_cgrp_data is copied, so the cgroup refcnt must be taken too. And, unlike the sk_alloc() path, sock_update_netprioidx() is not called here. Therefore, it is safe and necessary to grab the cgroup refcnt even when cgroup_sk_alloc is disabled. sk_clone_lock() is in BH context anyway, the in_interrupt() would terminate this function if called there. And for sk_alloc() skcd->val is always zero. So it's safe to factor out the code to make it more readable. The global variable 'cgroup_sk_alloc_disabled' is used to determine whether to take these reference counts. It is impossible to make the reference counting correct unless we save this bit of information in skcd->val. So, add a new bit there to record whether the socket has already taken the reference counts. This obviously relies on kmalloc() to align cgroup pointers to at least 4 bytes, ARCH_KMALLOC_MINALIGN is certainly larger than that. This bug seems to be introduced since the beginning, commit d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets") tried to fix it but not compeletely. It seems not easy to trigger until the recent commit 090e28b229af ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged. Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Reported-by: Cameron Berkenpas <cam@neo-zeon.de> Reported-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reported-by: Daniël Sonck <dsonck92@gmail.com> Reported-by: Zhang Qiang <qiang.zhang@windriver.com> Tested-by: Cameron Berkenpas <cam@neo-zeon.de> Tested-by: Peter Geis <pgwipeout@gmail.com> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Zefan Li <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:58 +01:00
Anay Wadhera	f9b71acb71	cgroup: replace out_idr_free with actual code Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:57 +01:00
Tejun Heo	999dec49e4	BACKPORT: cgroup: misc changes Misc trivial changes to prepare for future changes. No functional difference. * Expose cgroup_get(), cgroup_tryget() and cgroup_parent(). * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp. * Rename cgroup_stats_show() to cgroup_stat_show() for consistency with the file name. Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit 3e48930cc74f0c212ee1838f89ad0ca7fcf2fea1) Conflicts: kernel/cgroup/cgroup.c (1. manual merge because kernel/cgroup/cgroup.c is under kernel/cgroup.c 2. cgroup_stats_show change is skipped because the function dos not exist) Bug: 111308141 Test: modified lmkd to use PSI and tested using lmkd_unit_test Signed-off-by: Suren Baghdasaryan <surenb@google.com> Change-Id: I756ee3dcf0d0f3da69cd1b58e644271625053538 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:57 +01:00
Alexei Starovoitov	e7f13c458c	BACKPORT: bpf: multi program support for cgroup+bpf introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple bpf programs to a cgroup. The difference between three possible flags for BPF_PROG_ATTACH command: - NONE(default): No further bpf programs allowed in the subtree. - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program, the program in this cgroup yields to sub-cgroup program. - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program, that cgroup program gets run in addition to the program in this cgroup. NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't change their behavior. It only clarifies the semantics in relation to new flag. Only one program is allowed to be attached to a cgroup with NONE or BPF_F_ALLOW_OVERRIDE flag. Multiple programs are allowed to be attached to a cgroup with BPF_F_ALLOW_MULTI flag. They are executed in FIFO order (those that were attached first, run first) The programs of sub-cgroup are executed first, then programs of this cgroup and then programs of parent cgroup. All eligible programs are executed regardless of return code from earlier programs. To allow efficient execution of multiple programs attached to a cgroup and to avoid penalizing cgroups without any programs attached introduce 'struct bpf_prog_array' which is RCU protected array of pointers to bpf programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> for cgroup bits Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> (cherry picked from commit 324bda9e6c5add86ba2e1066476481c48132aca0) Signed-off-by: Connor O'Brien <connoro@google.com> Bug: 121213201 Bug: 138317270 Test: build & boot cuttlefish Change-Id: I06b71c850b9f3e052b106abab7a4a3add012a3f8 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:55 +01:00
Alexei Starovoitov	d265c35bc9	BACKPORT: bpf: introduce BPF_F_ALLOW_OVERRIDE flag If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command to the given cgroup the descendent cgroup will be able to override effective bpf program that was inherited from this cgroup. By default it's not passed, therefore override is disallowed. Examples: 1. prog X attached to /A with default prog Y fails to attach to /A/B and /A/B/C Everything under /A runs prog X 2. prog X attached to /A with allow_override. prog Y fails to attach to /A/B with default (non-override) prog M attached to /A/B with allow_override. Everything under /A/B runs prog M only. 3. prog X attached to /A with allow_override. prog Y fails to attach to /A with default. The user has to detach first to switch the mode. In the future this behavior may be extended with a chain of non-overridable programs. Also fix the bug where detach from cgroup where nothing is attached was not throwing error. Return ENOENT in such case. Add several testcases and adjust libbpf. Fixes: 3007098494be ("cgroup: add support for eBPF programs") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Daniel Mack <daniel@zonque.org> Signed-off-by: David S. Miller <davem@davemloft.net> Fixes: Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28 ("cgroup: add support for eBPF programs") [AmitP: Refactored original patch for android-4.9 where libbpf sources are in samples/bpf/ and test_cgrp2_attach2, test_cgrp2_sock, and test_cgrp2_sock2 sample tests do not exist.] (cherry picked from commit 7f677633379b4abb3281cdbe7e7006f049305c03) Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:50 +01:00
Daniel Mack	58761abdf3	UPSTREAM: cgroup: add support for eBPF programs Cherry-pick from commit 3007098494bec614fb55dee7bc0410bb7db5ad18 This patch adds two sets of eBPF program pointers to struct cgroup. One for such that are directly pinned to a cgroup, and one for such that are effective for it. To illustrate the logic behind that, assume the following example cgroup hierarchy. A - B - C \ D - E If only B has a program attached, it will be effective for B, C, D and E. If D then attaches a program itself, that will be effective for both D and E, and the program in B will only affect B and C. Only one program of a given type is effective for a cgroup. Attaching and detaching programs will be done through the bpf(2) syscall. For now, ingress and egress inet socket filtering are the only supported use-cases. Signed-off-by: Daniel Mack <daniel@zonque.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Bug: 30950746 Change-Id: I3df35d8d3b1261503f9b5bcd90b18c9358f1ac28 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:49 +01:00
Tejun Heo	e6a42daec1	cgroup: don't online subsystems before cgroup_name/path() are operational commit 07cd12945551b63ecb1a349d50a6d69d1d6feb4a upstream. While refactoring cgroup creation, a5bca2152036 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems before the new cgroup is associated with it kernfs_node. This is fine for cgroup proper but cgroup_name/path() depend on the associated kernfs_node and if a subsystem makes the new cgroup_subsys_state visible, which they're allowed to after onlining, it can lead to NULL dereference. The current code performs cgroup creation and subsystem onlining in cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems visible afterwards. There's no reason to online the subsystems early and we can simply drop cgroup_apply_control_enable() call from cgroup_create() so that the subsystems are onlined and made visible at the same time. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: a5bca2152036 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:49 +01:00
Tejun Heo	e6b410585e	cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() 4c737b41de7f ("cgroup: make cgroup_path() and friends behave in the style of strlcpy()") broke error handling in proc_cgroup_show() and cgroup_release_agent() by not handling negative return values from cgroup_path_ns_locked(). Fix it. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: I30fefb0167ca977eabbf7cbb77cd2ca230cf25b1	2022-03-04 20:16:49 +01:00
Tejun Heo	b22828bb93	cgroup: fix invalid controller enable rejections with cgroup namespace On the v2 hierarchy, "cgroup.subtree_control" rejects controller enables if the cgroup has processes in it. The enforcement of this logic assumes that the cgroup wouldn't have any css_sets associated with it if there are no tasks in the cgroup, which is no longer true since a79a908fd2b0 ("cgroup: introduce cgroup namespaces"). When a cgroup namespace is created, it pins the css_set of the creating task to use it as the root css_set of the namespace. This extra reference stays as long as the namespace is around and makes "cgroup.subtree_control" think that the namespace root cgroup is not empty even when it is and thus reject controller enables. Fix it by making cgroup_subtree_control() walk and test emptiness of each css_set instead of testing whether the list_head is empty. While at it, update the comment of cgroup_task_count() to indicate that the returned value may be higher than the number of tasks, which has always been true due to temporary references and doesn't break anything. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Evgeny Vereshchagin <evvers@ya.ru> Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com> Cc: Aditya Kali <adityakali@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: stable@vger.kernel.org # v4.6+ Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541 Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:49 +01:00
Andrey Vagin	293b50a101	kernel: add a helper to get an owning user namespace for a namespace Return -EPERM if an owning user namespace is outside of a process current user namespace. v2: In a first version ns_get_owner returned ENOENT for init_user_ns. This special cases was removed from this version. There is nothing outside of init_user_ns, so we can return EPERM. v3: rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:49 +01:00
Johannes Weiner	e0ea866e20	cgroup: duplicate cgroup reference when cloning sockets When a socket is cloned, the associated sock_cgroup_data is duplicated but not its reference on the cgroup. As a result, the cgroup reference count will underflow when both sockets are destroyed later on. Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup") Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:48 +01:00
Tejun Heo	0ed2e64615	cgroup: make cgroup_path() and friends behave in the style of strlcpy() cgroup_path() and friends used to format the path from the end and thus the resulting path usually didn't start at the start of the passed in buffer. Also, when the buffer was too small, the partial result was truncated from the head rather than tail and there was no way to tell how long the full path would be. These make the functions less robust and more awkward to use. With recent updates to kernfs_path(), cgroup_path() and friends can be made to behave in strlcpy() style. * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now always return the length of the full path. If buffer is too small, it contains nul terminated truncated output. * All users updated accordingly. v2: cgroup_path() usage in kernel/sched/debug.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: Ibb5c7544237e0ab6cde03c8bec80b3376917c0a8	2022-03-04 20:16:48 +01:00
Eric W. Biederman	74aa401e5b	cgroupns: Only allow creation of hierarchies in the initial cgroup namespace Unprivileged users can't use hierarchies if they create them as they do not have privilieges to the root directory. Which means the only thing a hiearchy created by an unprivileged user is good for is expanding the number of cgroup links in every css_set, which is a DOS attack. We could allow hierarchies to be created in namespaces in the initial user namespace. Unfortunately there is only a single namespace for the names of heirarchies, so that is likely to create more confusion than not. So do the simple thing and restrict hiearchy creation to the initial cgroup namespace. Cc: stable@vger.kernel.org Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:48 +01:00
Eric W. Biederman	def9bd5bcd	cgroupns: Fix the locking in copy_cgroup_ns If "clone(CLONE_NEWCGROUP...)" is called it results in a nice lockdep valid splat. In __cgroup_proc_write the lock ordering is: cgroup_mutex -- through cgroup_kn_lock_live cgroup_threadgroup_rwsem In copy_process the guts of clone the lock ordering is: cgroup_threadgroup_rwsem -- through threadgroup_change_begin cgroup_mutex -- through copy_namespaces -- copy_cgroup_ns lockdep reports some a different call chains for the first ordering of cgroup_mutex and cgroup_threadgroup_rwsem but it is harder to trace. This is most definitely deadlock potential under the right circumstances. Fix this by by skipping the cgroup_mutex and making the locking in copy_cgroup_ns mirror the locking in cgroup_post_fork which also runs during fork under the cgroup_threadgroup_rwsem. Cc: stable@vger.kernel.org Fixes: a79a908fd2b0 ("cgroup: introduce cgroup namespaces") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:48 +01:00
Martin KaFai Lau	506c202af1	cgroup: Add cgroup_get_from_fd Add a helper function to get a cgroup2 from a fd. It will be stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will be introduced in the later patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Tejun Heo <tj@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:48 +01:00
Tejun Heo	72aa4eb255	cgroup: allow NULL return from ss->css_alloc() cgroup core expected css_alloc to return an ERR_PTR value on failure and caused NULL deref if it returned NULL. It's an easy mistake to make from an alloc function and there's no ambiguity in what's being indicated. Update css_create() so that it interprets NULL return from css_alloc as -ENOMEM. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:48 +01:00
Johannes Weiner	738d60a480	cgroup: remove unnecessary 0 check from css_from_id() css_idr allocation starts at 1, so index 0 will never point to an item. css_from_id() currently filters that before asking idr_find(), but idr_find() would also just return NULL, so this is not needed. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:48 +01:00
Johannes Weiner	e8798eb464	cgroup: fix idr leak for the first cgroup root The valid cgroup hierarchy ID range includes 0, so we can't filter for positive numbers when freeing it, or it'll leak the first ID. No big deal, just disruptive when reading the code. The ID is freed during error handling and when the reference count hits zero, so the double-free test is not necessary; remove it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:48 +01:00
Wenwei Tao	7de549b633	cgroup: remove redundant cleanup in css_create When create css failed, before call css_free_rcu_fn, we remove the css id and exit the percpu_ref, but we will do these again in css_free_work_fn, so they are redundant. Especially the css id, that would cause problem if we remove it twice, since it may be assigned to another css after the first remove. tj: This was broken by two commits updating the free path without synchronizing the creation failure path. This can be easily triggered by trying to create more than 64k memory cgroups. Signed-off-by: Wenwei Tao <ww.tao0320@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Fixes: `9a1049da9b` ("percpu-refcount: require percpu_ref to be exited explicitly") Fixes: `01e586598b` ("cgroup: release css->id after css_free") Cc: stable@vger.kernel.org # v3.17+ Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:48 +01:00
Felipe Balbi	15398649e0	cgroup: fix compile warning commit 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") added the following compile warning: kernel/cgroup.c: In function ‘cgroup_show_path’: kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable] int len = 0, ret = 0; ^ fix it. Fixes: 4f41fc59620f ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") Signed-off-by: Felipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:48 +01:00
Serge E. Hallyn	215cbf575a	cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup \| grep freezer)" cat /proc/self/mountinfo \| grep freezer \| awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> Tested-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	bca450043c	cgroup: implement cgroup_subsys->implicit_on_dfl Some controllers, perf_event for now and possibly freezer in the future, don't really make sense to control explicitly through "cgroup.subtree_control". For example, the primary role of perf_event is identifying the cgroups of tasks; however, because the controller also keeps a small amount of state per cgroup, it can't be replaced with simple cgroup membership tests. This patch implements cgroup_subsys->implicit_on_dfl flag. When set, the controller is implicitly enabled on all cgroups on the v2 hierarchy so that utility type controllers such as perf_event can be enabled and function transparently. An implicit controller doesn't show up in "cgroup.controllers" or "cgroup.subtree_control", is exempt from no internal process rule and can be stolen from the default hierarchy even if there are non-root csses. v2: Reimplemented on top of the recent updates to css handling and subsystem rebinding. Rebinding implicit subsystems is now a simple matter of exempting it from the busy subsystem check. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	19e33e8776	cgroup: use css_set->mg_dst_cgrp for the migration target cgroup Migration can be multi-target on the default hierarchy when a controller is enabled - processes belonging to each child cgroup have to be moved to the child cgroup itself to refresh css association. This isn't a problem for cgroup_migrate_add_src() as each source css_set still maps to single source and target cgroups; however, cgroup_migrate_prepare_dst() is called once after all source css_sets are added and thus might not have a single destination cgroup. This is currently worked around by specifying NULL for @dst_cgrp and using the source's default cgroup as destination as the only multi-target migration in use is self-targetting. While this works, it's subtle and clunky. As all taget cgroups are already specified while preparing the source css_sets, this clunkiness can easily be removed by recording the target cgroup in each source css_set. This patch adds css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and used by cgroup_migrate_prepare_dst(). This also makes migration code ready for arbitrary multi-target migration. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	a34b4ea7cd	cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup On the default hierarchy, a migration can be multi-source and/or multi-destination. cgroup_taskest_migrate() used to incorrectly assume single destination cgroup but the bug has been fixed by `1f7dd3e5a6` ("cgroup: fix handling of multi-destination migration from subtree_control enabling"). Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used to determine which subsystems are affected or which cgroup_root the migration is taking place in. As such, @dst_cgrp is misleading. This patch replaces @dst_cgrp with @root. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	2e85875d05	cgroup: move migration destination verification out of cgroup_migrate_prepare_dst() cgroup_migrate_prepare_dst() verifies whether the destination cgroup is allowable; however, the test doesn't really belong there. It's too deep and common in the stack and as a result the test itself is gated by another test. Separate the test out into cgroup_may_migrate_to() and update cgroup_attach_task() and cgroup_transfer_tasks() to perform the test directly. This doesn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	3f11f0b79f	cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses() cgroup_update_dfl_csses() should move each task in the subtree to self; however, it was incorrectly calling cgroup_migrate_add_src() with the root of the subtree as @dst_cgrp. Fortunately, cgroup_migrate_add_src() currently uses @dst_cgrp only to determine the hierarchy and the bug doesn't cause any actual breakages. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	a4004c1510	cgroup: update css iteration in cgroup_update_dfl_csses() The existing sequences of operations ensure that the offlining csses are drained before cgroup_update_dfl_csses(), so even though cgroup_update_dfl_csses() uses css_for_each_descendant_pre() to walk the target cgroups, it doesn't end up operating on dead cgroups. Also, the function explicitly excludes the subtree root from operation. This is fragile and inconsistent with the rest of css update operations. This patch updates cgroup_update_dfl_csses() to use cgroup_for_each_live_descendant_pre() instead and include the subtree root. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	8bde48a7f2	cgroup: allocate 2x cgrp_cset_links when setting up a new root During prep, cgroup_setup_root() allocates cgrp_cset_links matching the number of existing css_sets to later link the new root. This is fine for now as the only operation which can happen inbetween is rebind_subsystems() and rebinding of empty subsystems doesn't create new css_sets. However, while not yet allowed, with the recent reimplementation, rebind_subsystems() can rebind subsystems with descendant csses and thus can create new css_sets. This patch makes cgroup_setup_root() allocate 2x of the existing css_sets so that later use of live subsystem rebinding doesn't blow up. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	18ca54b364	cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask cgroup_calc_subtree_ss_mask() currently takes @cgrp and @subtree_control. @cgrp is used for two purposes - to decide whether it's for default hierarchy and the mask of available subsystems. The former doesn't matter as the results are the same regardless. The latter can be specified directly through a subsystem mask. This patch makes cgroup_calc_subtree_ss_mask() perform the same calculations for both default and legacy hierarchies and take @this_ss_mask for available subsystems. @cgrp is no longer used and dropped. This is to allow using the function in contexts where available controllers can't be decided from the cgroup. v2: cgroup_refres_subtree_ss_mask() is removed by a previous patch. Updated accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	84ad08b2df	cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends rebind_subsystem() open codes quite a bit of css and interface file manipulations. It tries to be fail-safe but doesn't quite achieve it. It can be greatly simplified by using the new css management helpers. This patch reimplements rebind_subsytsems() using cgroup_apply_control() and friends. * The half-baked rollback on file creation failure is dropped. It is an extremely cold path, failure isn't critical, and, aside from kernel bugs, the only reason it can fail is memory allocation failure which pretty much doesn't happen for small allocations. * As cgroup_apply_control_disable() is now used to clean up root cgroup on rebind, make sure that it doesn't end up killing root csses. * All callers of rebind_subsystems() are updated to use cgroup_lock_and_drain_offline() as the apply_control functions require drained subtree. * This leaves cgroup_refresh_subtree_ss_mask() without any user. Removed. * css_populate_dir() and css_clear_dir() no longer needs @cgrp_override parameter. Dropped. * While at it, add WARN_ON() to rebind_subsystem() calls which are expected to always succeed just in case. While the rules visible to userland aren't changed, this reimplementation not only simplifies rebind_subsystems() but also allows it to disable and enable csses recursively. This can be used to implement more flexible rebinding. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com> Change-Id: I23d1a815cc4c412e83331aebd323f111d1046dd0	2022-03-04 20:16:47 +01:00
Tejun Heo	8a3eb115a6	cgroup: use cgroup_apply_enable_control() in cgroup creation path cgroup_create() manually updates control masks and creates child csses which cgroup_mkdir() then manually populates. Both can be simplified by using cgroup_apply_enable_control() and friends. The only catch is that it calls css_populate_dir() with NULL cgroup->kn during cgroup_create(). This is worked around by making the function noop on NULL kn. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:47 +01:00
Tejun Heo	a546b0b6ef	cgroup: combine cgroup_mutex locking and offline css draining cgroup_drain_offline() is used to wait for csses being offlined to uninstall itself from cgroup->subsys[] array so that new csses can be installed. The function's only user, cgroup_subtree_control_write(), calls it after performing some checks and restarts the whole process via restart_syscall() if draining has to release cgroup_mutex to wait. This can be simplified by draining before other synchronized operations so that there's nothing to restart. This patch converts cgroup_drain_offline() to cgroup_lock_and_drain_offline() which performs both locking and draining and updates cgroup_kn_lock_live() use it instead of cgroup_mutex() if requested. This combined locking and draining operations are easier to use and less error-prone. While at it, add WARNs in control_apply functions which triggers if the subtree isn't properly drained. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:46 +01:00
Tejun Heo	79fd68cdd7	cgroup: factor out cgroup_{apply\|finalize}_control() from cgroup_subtree_control_write() Factor out cgroup_{apply\|finalize}_control() so that control mask update can be done in several simple steps. This patch doesn't introduce behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:46 +01:00
Tejun Heo	3e1f1b93d6	cgroup: introduce cgroup_{save\|propagate\|restore}_control() While controllers are being enabled and disabled in cgroup_subtree_control_write(), the original subsystem masks are stashed in local variables so that they can be restored if the operation fails in the middle. This patch adds dedicated fields to struct cgroup to be used instead of the local variables and implements functions to stash the current values, propagate the changes and restore them recursively. Combined with the previous changes, this makes subsystem management operations fully recursive and modularlized. This will be used to expand cgroup core functionalities. While at it, remove now unused @css_enable and @css_disable from cgroup_subtree_control_write(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:46 +01:00
Tejun Heo	cdae46302b	cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable\|enable}() recursive The three factored out css management operations - cgroup_drain_offline() and cgroup_apply_control_{disable\|enable}() - only depend on the current state of the target cgroups and idempotent and thus can be easily made to operate on the subtree instead of the immediate children. This patch introduces the iterators which walk live subtree and converts the three functions to operate on the subtree including self instead of the children. While this leads to spurious walking and be slightly more expensive, it will allow them to be used for wider scope of operations. Note that cgroup_drain_offline() now tests for whether a css is dying before trying to drain it. This is to avoid trying to drain live csses as there can be mix of live and dying csses in a subtree unlike children of the same parent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:46 +01:00
Tejun Heo	c4c9da1296	cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write() Factor out css enabling and showing into cgroup_apply_control_enable(). * Nest subsystem walk inside child walk. The child walk will later be converted to subtree walk which is a bit more expensive. * Instead of operating on the differential masks @css_enable, simply enable or show csses according to the current cgroup_control() and cgroup_ss_mask(). This leads to the same result and is simpler and more robust. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:46 +01:00
Tejun Heo	54360e29d5	cgroup: factor out cgroup_apply_control_disable() from cgroup_subtree_control_write() Factor out css disabling and hiding into cgroup_apply_control_disable(). * Nest subsystem walk inside child walk. The child walk will later be converted to subtree walk which is a bit more expensive. * Instead of operating on the differential masks @css_enable and @css_disable, simply disable or hide csses according to the current cgroup_control() and cgroup_ss_mask(). This leads to the same result and is simpler and more robust. * This allows error handling path to share the same code. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:46 +01:00
Tejun Heo	ad4afa4c82	cgroup: factor out cgroup_drain_offline() from cgroup_subtree_control_write() Factor out async css offline draining into cgroup_drain_offline(). * Nest subsystem walk inside child walk. The child walk will later be converted to subtree walk which is a bit more expensive. * Relocate the draining above subsystem mask preparation, which doesn't create any behavior differences but helps further refactoring. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:46 +01:00
Tejun Heo	a8d64b0f8e	cgroup: introduce cgroup_control() and cgroup_ss_mask() When a controller is enabled and visible on a non-root cgroup is determined by subtree_control and subtree_ss_mask of the parent cgroup. For a root cgroup, by the type of the hierarchy and which controllers are attached to it. Deciding the above on each usage is fragile and unnecessarily complicates the users. This patch introduces cgroup_control() and cgroup_ss_mask() which calculate and return the [visibly] enabled subsyste mask for the specified cgroup and conver the existing usages. * cgroup_e_css() is restructured for simplicity. * cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no longer need to distinguish root and non-root cases. * With cgroup_control(), cgroup_controllers_show() can now handle both root and non-root cases. cgroup_root_controllers_show() is removed. v2: cgroup_control() updated to yield the correct result on v1 hierarchies too. cgroup_subtree_control_write() converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Chatur27 <jasonbright2709@gmail.com>	2022-03-04 20:16:46 +01:00

1 2 3 4 5 ...

944 Commits