The frustration that started this: I kept context-switching between Grafana dashboards, log files, and docs every time something broke in production. I wanted to just ask what was wrong and get an actual answer.
Argus is an open-source AI agent that monitors your infrastructure and investigates anomalies autonomously using a ReAct loop with 18+ tools, reading logs, querying metrics, tracing requests then proposes fixes for your approval before anything executes. The human-in-the-loop part was intentional; I didn't want an agent that could nuke a database without asking first.
It's LLM-agnostic (OpenAI, Anthropic, Gemini) and runs entirely in one Docker container. SQLite + DuckDB under the hood, no external dependencies. Still early but already handles the case I built it for — would love brutal feedback, especially from anyone running their own infra.