4 points by piotrbednarsalt 2 days ago | 1 comment
  • piotrbednarsalt 2 days ago
    Hi HN,

    I’ve been exploring the intersection of low-level Linux memory management and LLM architecture. I built a proof-of-concept that allows for persistent, real-time manipulation of a running LLM’s output (like forcing it to say "Pwned" regardless of the prompt) without process injection, ptrace, or restarting the server.

    Repo: https://github.com/piotrmaciejbednarski/llm-inference-tamper...

    How it works:

    By default, llama-server (from llama.cpp) memory-maps the GGUF model file using mmap with the MAP_SHARED flag. This means the process reads weight data directly from the kernel's page cache. If another process modifies the GGUF file on disk, the kernel updates those pages, and the running inference server instantly uses the new weights on the very next token generation.

    The Attack Vector:

    I target the output.weight tensor (the final projection matrix [hidden_dim, vocab_size]). Since each row corresponds to a single vocabulary token, amplifying a specific row proportionally inflates that token's logit, forcing it to win the softmax sampling.

    In models like TinyLlama (Q4_K_M), output.weight is quantized as Q6_K. I wrote a custom parser in Python to find the absolute file offset of this tensor. Instead of dequantizing the weights, the script jumps to the end of the targeted Q6_K blocks and simply multiplies the d field (an fp16 super-block scale) by a specific factor.

    The Autoregressive Challenge (The fun part):

    If you want to force a multi-token word like "Pwned" (tokens: [349="P", 1233="wn", 287="ed"]), you can't just apply a flat 100x multiplier to all three rows. If you do, the model will output "P P P P". I had to implement a heuristic progressive multiplier:

    - "P" (Moderate factor - e.g., 80x): Needs enough boost to appear out of nowhere in a neutral context. - "wn" (Highest factor - e.g., 360x): Needs massive amplification to beat the now-dominant "P" token, forcing the model to transition. - "ed" (Lowest factor - e.g., 60x): After generating "Pwn", the model's autoregressive attention already strongly predicts "ed". Over-amplifying this causes catastrophic repetition loops ("Pwnedededed").

    Mitigation:

    It's a simple fix for DevOps: mount your model directories as read-only, or run llama-server with the --no-mmap flag (though this increases memory overhead). But in local dev environments or loosely configured Docker containers, this attack is instantly lethal to the model's alignment.

    I’d love to hear your thoughts or if anyone has explored similar "zero-permission" file-backed tampering vectors in production ML pipelines!