The tool works, but there's a subtle detail I didn't anticipate: AST analysis alone produces the wrong answer, and in both directions.
Consider a function that calls std::atomic<int>::fetch_add with memory_order_seq_cst. At the AST level, that's a clear signal: the fetch_add compiles to a LOCK XADD on x86-64, and on weakly ordered targets like AArch64 the seq_cst ordering adds full-barrier semantics that a relaxed RMW avoids. Flag it, suggest memory_order_relaxed or acquire/release depending on the use case.
Except: the optimizer frequently eliminates these. A seq_cst fetch_add on a local atomic that doesn't escape (common in code that uses atomics defensively inside a hot function) gets promoted to a register or eliminated entirely at O2. The AST doesn't know this. You've flagged a hardware cost that doesn't exist at runtime.
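A minimal illustration of the pattern (my own sketch, not from the lshaz corpus): the atomic below is local and never escapes, so per the argument above, the seq_cst RMW the AST sees need not survive into the binary at O2.

```cpp
#include <atomic>

// At the AST level: a seq_cst fetch_add per iteration -- looks expensive.
// But the atomic is a non-escaping local, so the optimizer is free to
// promote it to a register; no LOCK XADD need appear in the output.
int sum_squares(int n) {
    std::atomic<int> acc{0};  // defensive atomic, purely local
    for (int i = 0; i < n; ++i)
        acc.fetch_add(i * i, std::memory_order_seq_cst);
    return acc.load();
}
```

An AST-only analyzer flags the fetch_add; the optimized IR for this function may contain no atomic instruction at all.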
The inverse problem is worse: virtual dispatch. At the AST level, a virtual call looks like a virtual call. But if the callee type is fully determined at the call site, which the devirtualizer handles routinely at O2, the indirect branch is gone. Direct call, inlineable, no BTB pressure. Flag it and you're wrong again.
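A sketch of the inverse case, assuming nothing beyond standard C++: the dynamic type at the call site is fully determined, so an O2 devirtualizer can replace the indirect call with a direct, inlineable one, even though the AST records a virtual call.

```cpp
struct Shape {
    virtual int sides() const { return 0; }
    virtual ~Shape() = default;
};

struct Triangle final : Shape {
    int sides() const override { return 3; }
};

// The AST sees a virtual call through a Shape reference. But the dynamic
// type is known to be Triangle at the call site, so the devirtualizer can
// emit a direct call (and then inline it to the constant 3).
int count_sides() {
    Triangle t;
    const Shape& s = t;  // virtual call at the AST level...
    return s.sides();    // ...trivially devirtualized at O2
}
```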
The same problem appears for heap allocations (small vector optimization, stack promotion), stack frame size estimates (IR alloca sizes diverge from AST estimates after inlining), and fence emission (whether a fence actually appears in the optimized binary depends on the surrounding optimization context).
The solution: a separate IR refinement pass
After the AST analysis runs per-TU, lshaz emits LLVM IR for each translation unit by invoking the compiler with -emit-llvm at the same optimization level used for production builds (configurable; default O2). The IR is cached under a hash of the TU's source content and compiler flags, so subsequent runs skip re-emission for unchanged TUs.
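A hypothetical sketch of such a cache key (irCacheKey is an illustrative name, not lshaz's actual API): hashing the TU's source bytes together with the flag string means a change to either invalidates the cached IR, while unchanged TUs hit the cache.

```cpp
#include <functional>
#include <string>

// Illustrative cache key: combine the TU's source *content* with the exact
// compiler flags. Edit the file or change the flags, and the key changes,
// so the stale .ll file is never reused.
size_t irCacheKey(const std::string& sourceContent, const std::string& flags) {
    // '\x1f' (unit separator) keeps ("ab","c") distinct from ("a","bc")
    return std::hash<std::string>{}(sourceContent + '\x1f' + flags);
}
```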
A separate DiagnosticRefiner pass then loads the IR and cross-references AST findings against what actually survived optimization:
seq_cst operation flagged by FL010 → check if the AtomicRMWInst or AtomicCmpXchgInst is present in the optimized IR. If the optimizer eliminated it → suppress with -0.20 confidence adjustment.
Virtual call flagged by FL030 → count remaining indirect calls in the IR function. If zero survive → suppress (-0.25). If indirect calls remain → boost (+0.10).
Heap allocation flagged by FL020 → check if the malloc/new call site survived inlining. Eliminated → suppress (-0.15).
Stack frame flagged by FL021 → cross-reference against IR alloca sizes. IR-confirmed → boost confidence.
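The rules above amount to signed confidence deltas plus a traced reason. A hypothetical sketch of that shape (Finding, adjust, and refineVirtualCall are illustrative names, not lshaz's actual types), using the FL030 rule as the example:

```cpp
#include <string>
#include <vector>

// Illustrative model of a diagnostic with an auditable escalation log.
struct Finding {
    std::string rule;              // e.g. "FL030"
    double confidence;             // clamped to [0, 1]
    std::vector<std::string> log;  // one entry per adjustment, with reason
};

void adjust(Finding& f, double delta, const std::string& reason) {
    f.confidence += delta;
    if (f.confidence < 0.0) f.confidence = 0.0;
    if (f.confidence > 1.0) f.confidence = 1.0;
    f.log.push_back(reason);
}

// FL030 refinement: count indirect calls surviving in the optimized IR.
void refineVirtualCall(Finding& f, int survivingIndirectCalls) {
    if (survivingIndirectCalls == 0)
        adjust(f, -0.25, "devirtualized: no indirect calls in optimized IR");
    else
        adjust(f, +0.10, "indirect call survived optimization");
}
```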
Every adjustment is traced into the diagnostic's escalation log, so the output is auditable: you can see exactly why a finding was suppressed or boosted.
I ran it against Abseil. 256 diagnostics. The highest-confidence finding is ThreadIdentity in absl/base/internal/thread_identity.h: three atomics (ticker, wait_start, is_idle) sharing a cache line, confirmed across 36 TUs, with a compound hazard (cache spanning + false sharing + wide write surface) on the same struct. The Abseil authors documented the cross-thread access explicitly in comments, so it's a deliberate trade-off, not a bug. But it's invisible to any reader who hasn't computed the byte offsets manually.
Full writeup with the layout analysis and the rest of the findings: https://abokhalill.github.io/lshaz-writeup/writeups/abseil-d...
Repo: github.com/abokhalill/lshaz