Commit 3309b63

Browse files
shakeelb authored and htejun committed
cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated
On x86-64, this_cpu_cmpxchg() uses CMPXCHG without the LOCK prefix, which means it is only safe for the local CPU and not against concurrent access from multiple CPUs. Recently, commit 36df6e3 ("cgroup: make css_rstat_updated nmi safe") made css_rstat_updated lockless and used a lockless list to allow reentrancy. Since css_rstat_updated can be invoked from process context, IRQ and NMI, it used this_cpu_cmpxchg() to select the winner that will insert the lockless node (lnode) into the global per-cpu lockless list.

However, the commit missed one case where the lockless node of a cgroup can be accessed and modified by another CPU doing the flushing, namely llist_del_first_init() in css_process_update_tree(). On a cursory look, it can be questioned how css_process_update_tree() can see a lockless node in the global lockless list while the updater is at this_cpu_cmpxchg() and before the llist_add() call in css_rstat_updated(). This can indeed happen in the presence of IRQs/NMIs. Consider this scenario:

The updater for cgroup stat C on CPU A in process context is past the llist_on_list() check and before this_cpu_cmpxchg() in css_rstat_updated() when it gets interrupted by an IRQ/NMI. In the IRQ/NMI context, a new updater calls css_rstat_updated() for the same cgroup C and successfully inserts rstatc_pcpu->lnode. Now, concurrently, CPU B is running the flusher: it calls llist_del_first_init() for CPU A and gets rstatc_pcpu->lnode of cgroup C, which was added by the IRQ/NMI updater. Now imagine CPU B calling init_llist_node() on cgroup C's rstatc_pcpu->lnode of CPU A while, on CPU A, the process-context updater calls this_cpu_cmpxchg(rstatc_pcpu->lnode) concurrently. The CMPXCHG without LOCK on CPU A is not safe against CPU B's write, and thus we need the LOCK prefix.

In Meta's fleet running the kernel with commit 36df6e3, we are observing on some machines that the memcg stats are getting skewed by more than the actual memory on the system. On close inspection, we noticed that the lockless node for a workload on a specific CPU was in a bad state and thus all the updates on that CPU for that cgroup were being lost. To confirm that this skew was indeed due to the CMPXCHG without LOCK in css_rstat_updated(), we created a repro (using AI) at [1] which shows that CMPXCHG without LOCK creates almost the same lnode corruption as seen in Meta's fleet, and that with LOCK CMPXCHG the issue does not reproduce.

Link: http://lore.kernel.org/efiagdwmzfwpdzps74fvcwq3n4cs36q33ij7eebcpssactv3zu@se4hqiwxcfxq [1]
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: stable@vger.kernel.org # v6.17+
Fixes: 36df6e3 ("cgroup: make css_rstat_updated nmi safe")
Signed-off-by: Tejun Heo <tj@kernel.org>
1 parent c2f2b01 commit 3309b63

1 file changed

kernel/cgroup/rstat.c

Lines changed: 8 additions & 5 deletions
@@ -71,7 +71,6 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
 	struct llist_head *lhead;
 	struct css_rstat_cpu *rstatc;
-	struct css_rstat_cpu __percpu *rstatc_pcpu;
 	struct llist_node *self;
 
 	/*
@@ -104,18 +103,22 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 	/*
 	 * This function can be renentered by irqs and nmis for the same cgroup
 	 * and may try to insert the same per-cpu lnode into the llist. Note
-	 * that llist_add() does not protect against such scenarios.
+	 * that llist_add() does not protect against such scenarios. In addition
+	 * this same per-cpu lnode can be modified through init_llist_node()
+	 * from css_rstat_flush() running on a different CPU.
 	 *
 	 * To protect against such stacked contexts of irqs/nmis, we use the
 	 * fact that lnode points to itself when not on a list and then use
-	 * this_cpu_cmpxchg() to atomically set to NULL to select the winner
+	 * try_cmpxchg() to atomically set to NULL to select the winner
 	 * which will call llist_add(). The losers can assume the insertion is
 	 * successful and the winner will eventually add the per-cpu lnode to
 	 * the llist.
+	 *
+	 * Please note that we can not use this_cpu_cmpxchg() here as on some
+	 * archs it is not safe against modifications from multiple CPUs.
 	 */
 	self = &rstatc->lnode;
-	rstatc_pcpu = css->rstat_cpu;
-	if (this_cpu_cmpxchg(rstatc_pcpu->lnode.next, self, NULL) != self)
+	if (!try_cmpxchg(&rstatc->lnode.next, &self, NULL))
 		return;
 
 	lhead = ss_lhead_cpu(css->ss, cpu);
