43 points by manish_gill 3 days ago | 2 comments
  • araes 2 days ago
    The article was eventually kind of interesting, although it involved so much investigation that I forgot what I was even reading by the end.

    General idea was interesting, and probably something to look at (apparently there's an issue open). Final result was (I think...) that the Least Recently Used (LRU) page reclaim path requires a spinlock to actually swap out memory pages, and there's a huge amount of contention on it.

      "during 3 seconds there were at least 138 threads active. 84% of stacktraces have 'evict_folios' frame according to the flamegraph, so it is very likely that more than 100 threads are constantly trying to do something with the spinlock."
    
    So, basically 100 threads constantly fighting over evict_folios and lru_lock, and it at least seems (though I admit my eyes started glazing over) that they're all contending for the same spinlock every time they try to take lru_lock (the lock that guards the LRU lists during page eviction).

    Note: Totally way outside of my standard programming realm, so if somebody has a clearer / better explanation / summary ...

    • csense 20 hours ago
      What you're describing was actually the second problem.

      The first problem (that ~80% of the article was about) was having ~1k threads reading from mmap'ed files. When you run out of memory, the kernel's supposed to drop some of those mmap'ed pages (you can always read the file again if you need the data).

      Since it's a cgroup (Docker), the kernel only scans for droppable pages when the cgroup memory limit's reached. The scan is single-threaded, and the kernel needs two scans with a zero access bit to decide a page is "droppable".

      OP's container runs ~1k threads on 4 CPUs, so the scan takes minutes of wall-clock time because of thread contention, and it has to scan all pages (having never scanned them before). By the time the second scan runs, the application's (database) memory access pattern has already set most of the access bits again.

      Upshot is, even though the container would have plenty of free memory if all its mmap'ed pages were evicted, the kernel ends up repeatedly doing very slow scans that find very few evictable pages. The lock held by the scan causes some other symptoms (like ps hanging).
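
      To make the two-scan logic concrete, here's a toy sketch of the accessed-bit check described above (not the kernel's actual code; the struct and field names are invented for illustration):

        /* A page is only evictable if its accessed bit is still clear on
         * the second scan, i.e. nothing touched it between scans. The
         * first scan clears the bit ("second chance"); a busy access
         * pattern re-sets it and defeats eviction. */
        #include <stddef.h>

        struct page_meta {
            int accessed;    /* set when the page is touched */
            int evictable;
        };

        static void scan_pages(struct page_meta *pages, size_t n) {
            for (size_t i = 0; i < n; i++) {
                if (pages[i].accessed) {
                    pages[i].accessed = 0;   /* clear; re-check next scan */
                    pages[i].evictable = 0;
                } else {
                    pages[i].evictable = 1;  /* idle across two scans */
                }
            }
        }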

      As far as a kernel patch goes, I would suggest these mitigation strategies (a userspace approximation of (2) and (3) is sketched below):

      - (1) Boost the priority of the scan thread if it doesn't seem to be getting at least ~0.5 core worth of runtime

      - (2) Spontaneously initiate a scan of cgroup memory when ~75% (configurable) of its memory limit is used

      - (3) Always bring memory usage down to at most ~95% (configurable) of the cgroup limit, randomly picking pages to be evicted if necessary.

      I say "would suggest" because OP eventually admits "Oh by the way, this issue stopped happening when we upgraded our kernel, and the new version's release notes said they completely redesigned this whole subsystem."

      • araes 18 hours ago
        Thanks for the clarification and explanation. Missed the part about ~1k threads fighting over mmap'ed file pages, and how little CPU time the scans were getting.

        The article would have been a bit clearer to read with something like your summary up near the front, to provide at least a framework for what to look for further on.

        After re-reading, it does sound like Linux 6.1 ended up having a fix, per this portion near the end:

          A significant change in memory management is the introduction of Multi-Gen LRU (MGLRU), a complete redesign of the LRU-based page reclaim mechanism. It was introduced in Linux kernel 6.1 and since then more and more distributions have enabled it by default. I decided to give it a try and enabled this feature. And it turned out to be the cure for COS 109!
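
        For anyone wanting to try it: MGLRU is exposed through the lru_gen sysfs knob on kernels built with CONFIG_LRU_GEN (6.1+). A quick sketch to check whether it's available and enabled (error handling trimmed; writing "y" to the same file as root turns it on):

          #include <stdio.h>

          int main(void) {
              /* Nonzero contents mean MGLRU is on. */
              FILE *f = fopen("/sys/kernel/mm/lru_gen/enabled", "r");
              char buf[32];
              if (!f) {
                  puts("MGLRU not available on this kernel");
                  return 1;
              }
              if (fgets(buf, sizeof buf, f))
                  printf("lru_gen enabled: %s", buf);
              fclose(f);
              return 0;
          }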
      • serxa 18 hours ago
        Vmscan is not single-threaded. Every thread that tries to allocate more memory when current memory consumption equals the memory limit triggers vmscan. So lru_lock is used only to isolate some pages to scan, and later to return the pages that were not reclaimed back to the cgroup. The scan itself is done without lru_lock.
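
        In code terms, the pattern looks roughly like the following: a simplified userspace sketch of isolate/scan/putback using a pthread spinlock, not the kernel's actual shrink path, with all names invented:

          #include <pthread.h>
          #include <stddef.h>

          struct lru {
              pthread_spinlock_t lock;  /* stands in for lru_lock */
              int pages[1024];          /* toy "LRU list" of page ids */
              size_t npages;
          };

          /* Detach up to 'max' pages into a private batch under the lock. */
          static size_t isolate(struct lru *l, int *batch, size_t max) {
              pthread_spin_lock(&l->lock);
              size_t n = l->npages < max ? l->npages : max;
              for (size_t i = 0; i < n; i++)
                  batch[i] = l->pages[--l->npages];
              pthread_spin_unlock(&l->lock);
              return n;
          }

          /* Return pages that could not be reclaimed, again under the lock. */
          static void putback(struct lru *l, const int *batch, size_t n) {
              pthread_spin_lock(&l->lock);
              for (size_t i = 0; i < n; i++)
                  l->pages[l->npages++] = batch[i];
              pthread_spin_unlock(&l->lock);
          }

          int main(void) {
              struct lru l = { .npages = 0 };
              pthread_spin_init(&l.lock, PTHREAD_PROCESS_PRIVATE);
              for (int i = 0; i < 64; i++) l.pages[l.npages++] = i;

              int batch[32];
              size_t n = isolate(&l, batch, 32);
              /* ...scan/reclaim 'batch' here, with no lock held... */
              putback(&l, batch, n);  /* pretend nothing was reclaimed */
              return 0;
          }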

        The first problem was that the vmscans are done under mmap_read_lock, and there is no good reason for it. It just happens that vmscan is called from the page fault path, which requires mmap_lock. At least some parts of page fault handling require mmap_lock, but definitely not the reclaiming itself.

        > Always bring memory usage down to at most ~95% (configurable) of the cgroup limit

        I agree. It could sometimes be done in the application itself.
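
        For example, an application that knows a region of an mmap'ed file has gone cold can ask the kernel to reclaim those pages immediately with madvise(MADV_PAGEOUT) (Linux 5.4+), instead of waiting for the cgroup limit to force a scan. A sketch, with a hypothetical file path and illustrative sizes:

          #define _GNU_SOURCE
          #include <sys/mman.h>
          #include <fcntl.h>
          #include <unistd.h>

          int main(void) {
              int fd = open("/data/cold.file", O_RDONLY);  /* hypothetical */
              if (fd < 0) return 1;
              size_t len = 1 << 20;  /* 1 MiB, illustrative */
              char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
              if (p == MAP_FAILED) { close(fd); return 1; }

              volatile char sum = 0;
              for (size_t i = 0; i < len; i += 4096)
                  sum += p[i];  /* "use" the data */

              /* Done with this region: reclaim its pages proactively. */
              madvise(p, len, MADV_PAGEOUT);

              munmap(p, len);
              close(fd);
              return 0;
          }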

  • nasretdinov 2 days ago
    Really nice illustration of how to use eBPF; I wish more people would share investigations like this.