The verification challenge is the interesting part. The kernel verifier has to ensure that every path through the eBPF program properly acquires and releases locks. It does this through conservative static analysis, since answering that question exactly for arbitrary programs runs into the halting problem. False positives (rejecting valid programs) are acceptable; false negatives (allowing deadlocks) are not.
TL;DR: the main issue arises because context switch events and sampling events both need to be written to the same `ringBuffer` eBPF map. The sampling handler has to take the ring buffer lock inside an NMI, which by definition cannot be masked, so it can fire while the context switch handler is trying to do the same thing. This leads to lock contention, recursive locking, etc., as explained.
Why not have context switches write to ringBuffer1 and sampling events write to ringBuffer2 (i.e. use different ring buffers)? This way buggy kernels should work properly too !?
That would work, but at the cost of doubling memory usage, since you then have two fixed-size ring buffers instead of one. Also, in our particular case, the correct ordering of events is important, which is ~automatic with a single ring buffer but gets much trickier with two.
> This way buggy kernels should work properly too !?
We have a workaround for older/buggy kernels in place. We simply guard against same-CPU recursion by maintaining per-CPU state that indicates whether a given CPU is currently in the process of adding data to the ring buffer. If that state is set, we discard events, which prevents the recursion too.
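To make the idea concrete, here is a rough userspace simulation in plain C, not the actual eBPF code. Every name in it (`cpu_state`, `emit_event`, `sampling_nmi`, `NUM_CPUS`) is made up for illustration, and the "NMI" is just a callback invoked in the middle of a write. In a real eBPF program the flag would presumably live in per-CPU storage such as a per-CPU array map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CPUS 4              /* hypothetical CPU count for the simulation */

/* Hypothetical per-CPU state (assumption: one slot per CPU). */
struct cpu_state {
    bool in_ringbuf_write;      /* set while this CPU is adding data */
    unsigned long written;      /* events that reached the ring buffer */
    unsigned long dropped;      /* events discarded by the guard */
};

static struct cpu_state per_cpu[NUM_CPUS];

/* Callback type used to simulate an NMI arriving in the middle of a write. */
typedef void (*interrupt_fn)(int cpu);

/*
 * Try to emit one event on `cpu`. If `nmi` is non-NULL it is invoked while
 * the write is in progress, standing in for a sampling NMI on the same CPU.
 * Returns true if the event was written, false if it was discarded.
 */
static bool emit_event(int cpu, interrupt_fn nmi)
{
    struct cpu_state *st = &per_cpu[cpu];

    if (st->in_ringbuf_write) {
        /* Same-CPU recursion detected: drop the event instead of locking. */
        st->dropped++;
        return false;
    }
    st->in_ringbuf_write = true;

    /* ... the actual ring buffer write would happen here ... */
    if (nmi)
        nmi(cpu);

    st->written++;
    assert(st->in_ringbuf_write);   /* nobody else may clear our flag */
    st->in_ringbuf_write = false;
    return true;
}

/* Simulated sampling handler running in NMI context on the same CPU. */
static void sampling_nmi(int cpu)
{
    emit_event(cpu, NULL);
}
```

Calling `emit_event(0, sampling_nmi)` models a context switch write on CPU 0 that gets interrupted by a sampling NMI: the outer event is written and the nested one is counted as dropped. The cost of the workaround is exactly that lost sample.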