PromptForest: Fast Ensemble Detection of Malicious Prompts for LLMs(github.com)

1 pointby appleroll10 days ago1 comment

appleroll10 days ago
PromptForest — a fast, ensemble-based prompt injection detector for real-world AI safety
Prompt injection is an adversarial attack in LLM systems: malicious inputs that manipulate model behavior by slipping in hidden instructions. As AI usage grows in products, pipelines, and public APIs, detecting and mitigating these injections becomes a practical production problem.
PromptForest is an open-source ensemble detector that emphasizes speed, uncertainty awareness, and reliability without relying on massive models.
How it works - Runs multiple lightweight prompt-injection detectors in parallel. - Uses a voting/discrepancy mechanism to flag risky prompts. - Generates uncertainty scores: disagreement between models can trigger human review or stricter handling. - Small ensemble → faster inference (~100 ms per request) and lower resource usage. - Better-calibrated confidence estimates reduce overconfident mistakes compared to some existing detectors.
Why it matters
Prompt injection can leak private prompts or subvert agent workflows. Most current defenses rely on large classifiers or hard-coded heuristics:
- Big models are slow and expensive at scale. - Single detectors can be overconfident on edge cases. - Zero-risk doesn’t exist, but better calibration helps trigger sensible defenses.
PromptForest aims to be practical, open, and easy to run without a massive GPU footprint.
Technical Highlights
- Ensemble with voting/discrepancy scoring for ambiguous cases. - Supports multiple detection backends (e.g., LLaMA prompt guard variants). - Python-first with CLI and server mode for easy integration. - Optimized for latency and confidence calibration.
Who is this for
- Developers integrating LLMs in user-generated content pipelines - AI researchers focused on adversarial safety - Infrastructure teams needing fast, explainable detection - Community contributors who prefer open source tools over black boxes
Repo: https://github.com/appleroll-research/promptforest Try it out here: https://colab.research.google.com/drive/1EW49Qx1ZlaAYchqplDI...
Feedback is welcome, especially on integration patterns, benchmarks, or potential improvements.