10 points by peteralbert 4 hours ago | 3 comments
  • fryd_w 3 hours ago
    Co-founder here. Want to add some context on a decision the article doesn't cover: why Slack.

    We tested two surfaces early on - a standalone web app and Slack. The web app gave us a clean UI and full control over the experience, but a lot of friction - why open a browser when you can talk to Viktor the way you talk to your team? Web apps also train users to expect immediate answers, and Viktor needs time to think and work - like a coworker. In Slack we're used to longer wait times. In the end, it's how we talk to humans.

    Slack won because it's where work already happens. The agent reads the same channels your team does (crucial for the magical moments!), responds in threads, reacts to messages. There's no context switch. When someone asks the agent to "check what John said about the Q3 budget," it can actually go look - because it's already in the channel where John said it.

    The tradeoff is real though. You inherit every Slack UX limitation. You can't build custom UI components. Your entire interaction model is text, threads, buttons, and emoji. We've had to get creative - approval workflows through button clicks, rich output through uploaded files, progress updates through emoji reactions. It's constraining, but the constraint forces simplicity that users actually prefer.
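    To make the "approval workflows through button clicks" concrete: here's a rough sketch of what such a prompt could look like using Slack's Block Kit. This is purely illustrative - the function name and payload wording are my own, not Viktor's actual implementation:

```python
def approval_blocks(summary: str, action_id: str) -> list[dict]:
    """Build a Block Kit payload: a text section plus Approve/Reject buttons.

    When the user clicks a button, Slack posts the action_id back to the
    app, which is how the agent learns the outcome.
    """
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Approve"},
                    "style": "primary",
                    "action_id": f"{action_id}:approve",
                },
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Reject"},
                    "style": "danger",
                    "action_id": f"{action_id}:reject",
                },
            ],
        },
    ]

blocks = approval_blocks("Deploy *staging* to production?", "deploy-123")
```

    The constraint cuts both ways: you only get a handful of interaction primitives, but every approval ends up as a plain, auditable message in the channel.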

    The other thing I'd highlight: the skill system is a compounding moat we didn't fully appreciate at first. Every time any user on a team corrects the agent or teaches it something, that knowledge persists for everyone. Six months in, a team's Viktor knows their project IDs, their naming conventions, which endpoints are broken, who prefers what format. A new hire gets the benefit of all that accumulated context on day one. That's not something you get from a chatbot with a system prompt.

  • mattswulinski 4 hours ago
    The "treat your context window like RAM" framing resonates. We've been running into this exact tension building agentic workflows with Claude Code - the more tools you make available, the worse selection accuracy gets, even with very capable models.

    Curious what others have found: does the code-generation approach to tool calling (agent writes Python instead of picking from JSON schemas) actually hold up at scale? It seems elegant for composition, but I'd worry about hallucinated function names or incorrect arguments being harder to catch than a malformed structured call. With JSON schemas you at least get validation for free.

    Also interested in the "use intelligence once to create automation that runs forever without intelligence" pattern for cron jobs. Has anyone found a good middle ground between fully scripted automations and full LLM-every-loop? The cost blowup they describe ($5k/month from a 5-minute cron) seems like it would kill most production deployments before they prove value.

    • peteralbert 3 hours ago
      We validate tool calls with Pydantic models built directly from the JSON schemas, running inside the container. So the agent gets instant feedback if it passes the wrong type, misses a required field, or hallucinates a parameter — before anything hits the external API. You get the composability of code generation with the validation guarantees of structured calls.

      In practice the self-correction rate is high. The agent writes a script, gets a traceback or validation error, reads it, and fixes the issue — usually within one retry. The skill files help a lot here because they contain the exact function signatures and known gotchas, so the model isn't guessing from memory. It's closer to a developer with good docs open than a model hallucinating API calls.
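      The retry loop itself is simple - something like this sketch, where `fix_fn` stands in for the model rewriting its script after reading the traceback (names and structure are mine, not the production code):

```python
import subprocess
import sys
import tempfile

def run_with_retry(code: str, fix_fn, max_retries: int = 2) -> str:
    """Execute generated code; on failure, hand the traceback to fix_fn
    (the model) and retry with the rewritten script."""
    for _ in range(max_retries + 1):
        # Write the current version of the script to a temp file.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True
        )
        if result.returncode == 0:
            return result.stdout
        # The model sees the real traceback and produces a corrected script.
        code = fix_fn(code, result.stderr)
    raise RuntimeError("agent failed to self-correct")
```

      Most failures resolve on the first pass through `fix_fn`, because the traceback pinpoints exactly what to change.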

      On the cron middle ground: the three-tier system is exactly that, and the conditional tier is where most automations end up. A typical example: "alert me when a competitor publishes a new blog post." The agent writes a Python script that checks the RSS feed every 30 minutes. If there's a new post, it spins up an LLM to summarize it and decide if it's worth alerting about. The check costs fractions of a cent. The LLM only runs when there's actually something to reason about.
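      The shape of that conditional tier, sketched with stand-in names (the real check is a full RSS parse; this just shows the cheap-check/expensive-reason split):

```python
import hashlib

def new_items(feed_text: str, seen: set) -> list:
    """Cheap scripted tier: split out feed items, return only unseen ones."""
    fresh = []
    for chunk in feed_text.split("<item>")[1:]:
        digest = hashlib.sha256(chunk.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            fresh.append(chunk)
    return fresh

def cron_tick(fetch_fn, seen: set, llm_summarize) -> None:
    """One cron run: fetching and diffing are scripted; llm_summarize
    (the expensive LLM call) fires only when something is actually new."""
    for item in new_items(fetch_fn(), seen):
        llm_summarize(item)
```

      Ninety-plus percent of runs never reach `llm_summarize`, which is where the cost difference versus an LLM-every-loop cron comes from.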

      The key insight we had is that the agent itself is often the best judge of which tier a cron should be. When a user describes what they want, the agent decides whether it needs reasoning every run or just a script with a conditional trigger. And if you ask it to audit its own crons, it'll often downgrade full-agent crons to conditional or scripted ones on its own. Turns out "look at this thing you're doing every hour and figure out if you actually need to think each time" is a prompt that works surprisingly well.

  • peteralbert 4 hours ago
    Author here (Peter, CTO). We've been iterating on this for about a year. Happy to answer questions on any section.