So, to give an example of what's worked really well for me: I work for an app hosting startup named Wasmer, and we host a decent number of apps. Some of these are malicious. To detect the malicious apps effectively, we have an app-vetting agent named Herman. Herman reads the index page of every newly created app, alongside a screenshot of that page, then flags the app if he thinks it's malicious. A human (usually me) then inspects the app and makes the final decision on whether it should be banned or not.
This allows us to scan quite a large number of apps and filter out the noisy non-malicious ones. Doing this with a 'dumb' service wouldn't really be feasible, and an LLM's context fits perfectly here, since it gets both an image and the source code. An LLM is also quite 'omniscient', in that it knows, for example, that DANA is a bank in Malaysia, something I personally had no idea about.
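For a sense of how little glue this needs, here's a rough sketch of the vetting call, assuming the OpenAI Python SDK; the model name, prompt, and function are illustrative, not our actual code:

```python
# Rough sketch of the vetting call (illustrative, not our production setup).
# A human still makes the final ban decision based on the returned verdict.
import base64
from openai import OpenAI

client = OpenAI()

def vet_app(index_html: str, screenshot_png: bytes) -> str:
    image_b64 = base64.b64encode(screenshot_png).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "You vet newly deployed apps for phishing and malware. "
                         "Reply with VERDICT: MALICIOUS or VERDICT: OK, plus a one-line reason.\n\n"
                         f"Index page HTML:\n{index_html[:20000]}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The screenshot goes in as a data URL alongside the HTML, which is the whole "context fits perfectly" point.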
I think tedious and time-consuming chores like this are a great use of agents. Next in line for my experimentation is using agents for 'fuzzy' integration testing, where the LLM simply has access to a browser plus CLI tools and the UAT specifications, and may (in an isolated environment) do whatever it wants. It should then report back any findings and suggested improvements via an MCP integration with our ticketing system. The idea is to harness the hallucinations to find issues.
To be honest, my experience with agents is still pretty limited, so I’d really appreciate any advice, especially around best practices or a roadmap for implementation. The goal is to build something that can learn and reflect the company’s culture, answer situational questions like “what to do in this case,” assist with document error checking, and generally serve as a helpful internal assistant.
All of this came from the client's desire to have a tool that aligns with their internal knowledge and workflows.
Is something like this feasible in terms of quality and reliability? And beyond hallucinations, are there major security concerns or roadblocks I should be thinking about?
I really like the OpenAI approach and how they outlined the thought process of when and how to use agents.
[1] https://www.willowtreeapps.com/craft/retrieval-augmented-gen...
[2] https://www.willowtreeapps.com/craft/building-ai-agents-with...
In this case, the agent would also need to learn from new events, like lessons learned from projects.
Just curious: can a RAG[1] system actually learn from new situations over time in this kind of setup, or is it purely pulling from what's already there?
"Learning" happens when initially training the llm or arguably when fine-tuning. Neither of which are needed for your use case as presented.
In my case, there will be a large amount of initial data fed into the system as context. But the client also expects the agent to act more like a smart assistant or teacher, one that can respond to new, evolving scenarios.
Without getting into too much detail, imagine I feed the system an instruction like: “Box A and Box B should fit into Box 1 with at least 1" clearance.” Later, a user gives the agent Box A, Box B, and now adds Box D and E, and asks it to fit everything into Box 1, which is too small. The expected behavior would be that the agent infers that an additional Box 2 is needed to accommodate everything.
So I understand this isn't "learning" in the training sense, but rather pattern recognition and contextual reasoning based on prior examples and constraints.
Basically, I should be saying "contextual reasoning" instead of "learning."
Does that framing make sense?
In practice you have to send the entire conversation history with every prompt, so you should think of it as appending to an expanding list of rules that you send every time.
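A minimal sketch of what that looks like in practice; `call_llm` is a stand-in for whatever chat client you use:

```python
# The model is stateless, so each turn re-sends the whole history.
def call_llm(messages: list[dict]) -> str:
    raise NotImplementedError  # wire this to your provider of choice

history = [
    {"role": "system", "content": "Rule: Box A and Box B must fit into Box 1 with 1\" clearance."},
]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    answer = call_llm(history)   # the *entire* list goes out every single time
    history.append({"role": "assistant", "content": answer})
    return answer                # nothing is learned; the "rules" only exist in `history`
```

Nothing persists between calls except what you put back into that list.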
What you're attempting to do, integrating an agent into your business, is difficult. It is, however, relatively easy to fake: just set up a quick RAG tool, plug it into your LLM, and you're done. From the outside, the only difference between a quick-n-dirty integration and a much more robust approach will be in the numbers. One will be more accurate than the other, but you need to actually measure performance to establish that as a fact and not just a vibe.
First piece of advice: build up a dataset and measure performance as you develop your agent. Or don't, and just deliver what the hype demands.
As for advice ... and looking at what other commenters have left ... if you want to do this seriously, I'd recommend hiring someone who has already done this kind of integration, at least as a consultant. Someone whose first reflex won't be to just tell you that LLMs are fixed and can't learn, but who will also add that this isn't a limitation, since RAG pipelines are better suited to this task than fine-tuning [1].
Also, RAG isn't a monolithic solution; there are many, many variations. For your use case, I'd consider more elaborate solutions than baseline RAG, such as GraphRAG [2]. For the box problem above, you might want to integrate symbolic reasoning tools such as Prolog, or consider using reasoning models and developing your own reinforcement learning environments. Needless to say, all of these aspects need to be carefully balanced and optimized to work together, and you need to follow a benchmark/dataset-centric approach to developing your solution. For this, consider frameworks that were designed to optimize LLM/agentic workflows as a whole [3][4].
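For reference, baseline RAG is roughly the loop below; `embed` and `call_llm` are stand-ins for your embedding and chat endpoints, and the fancier variants (GraphRAG included) layer on top of or replace parts of it:

```python
# Baseline RAG sketch: embed the corpus once, retrieve top-k by cosine similarity,
# stuff the hits into the prompt. Illustrative only.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # your embedding model here

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # your chat model here

def retrieve(query: str, docs: list[str], doc_vecs: np.ndarray, k: int = 5) -> list[str]:
    q = embed(query)
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(-scores)[:k]]

def answer(query: str, docs: list[str], doc_vecs: np.ndarray) -> str:
    context = "\n---\n".join(retrieve(query, docs, doc_vecs))
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```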
Shit is complex really.
[1] https://arxiv.org/abs/2505.24832 tells us generalization happens in LLMs once their capacity for remembering things is saturated, and this might explain why fine-tuning has been less effective than RAG so far.
[2] https://microsoft.github.io/graphrag/
Supposing you hired a consultant to be "culture keeper" for this company -- and she or he said, "I'm just going to reason about context by treating this culture as a body of text" -- you would instantly assume that they didn't have skin in the game and didn't understand how culture actually grows and accretes, let alone how to monitor and validate its eventual quality or reliability. We can't read about what rules apply in a foreign culture's situations and then remotely prescribe what to do socially in a culture we've never set foot in. We can't accurately anticipate even the second-order effects of our recommendations in that situation.
We simply have to participate first. It would be better for this to be a role that involves someone inside of the company who does participate in navigating the culture themselves so that they make accurate observations from experience. A person trustworthy enough to steward this culture would also necessarily be trustworthy enough not to alarm the chief of HR. Based on my model of how work works, from experience, I am wondering if they imagine they want this sensitive role filled with a nonhuman 'trusted' advisor so that it can't ever become a social shadow power center within the firm.
Or maybe they don't want to admit that modeling culture is beyond the reach of their matter-of-fact internal process models and simulations, and they're just wishfully hoping you can abstract away all of the soft elements without producing social fever dreams or ever having to develop a costly true soft element model. But then you absolutely abstract away where the rubber meets the road! That's quite a roadblock, to be honest with you.
It definitely will never be a replacement for HR or top executive thinking. At best, I'll be proposing something much lighter, more like a glorified internal search tool for real user examples. To be honest, I'm still figuring it all out. Best case: a helpful resource guide. Worst case: it adds no real value.
The tricky part is, if I don’t provide something, even just a prototype, they’re already looking at other consultants who’ll happily promise the moon. And that’s my bigger concern: if I’m not involved, someone else might introduce a half-baked solution that interferes with the SaaS I’ve already built for them.
So now I’m in a position where I need to put together a clear, honest demo that shows what this tech can and can’t do, just to avoid further complications down the line.
Ironically, this all started when another “AI expert” sold them the idea.
I've been saying the same thing all along: we're not quite there yet. Maybe one day, but not now.
I also get that businesses want to take full advantage of this tech when it's pitched as a money-saving opportunity; the pressure to act fast is real.
I wonder how many other devs and consultants are facing similar situations?
It’s still highly experimental and needs to be observed, corrected, and tweaked constantly, kind of like teaching a child, where feedback and reinforcement are key.
I may share my experience with the HN community down the line. Thanks again!
In particular, you can reduce most concerns around security and reliability when you treat your LLM call as a library method with structured output (Factor 4) and own your control flow (Factor 8). There should never be a case where your agent calls a tool with unconstrained input.
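A minimal sketch of what those two factors look like together, using pydantic for the structured output; `call_llm` and `open_review_ticket` are hypothetical stand-ins:

```python
# The LLM call returns validated, structured data, and plain code -- not the model --
# decides which tool runs next.
from typing import Literal
from pydantic import BaseModel

class Verdict(BaseModel):
    action: Literal["flag", "ignore"]
    reason: str

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # your chat client here

def open_review_ticket(reason: str) -> None:
    raise NotImplementedError  # your ticketing integration here

def check_app(app_summary: str) -> Verdict:
    raw = call_llm(
        "Classify this app. Respond only with JSON: "
        f'{{"action": "flag" | "ignore", "reason": "..."}}\n\n{app_summary}'
    )
    return Verdict.model_validate_json(raw)  # malformed output fails here, not inside a tool

def handle(app_summary: str) -> None:
    verdict = check_app(app_summary)
    if verdict.action == "flag":              # control flow lives in your code,
        open_review_ticket(verdict.reason)    # so the tool only ever sees constrained input
```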
Definitely bookmarking this for reference. Appreciate you sharing it.
1) Search GitHub for an issue in their repo.
2) Fix the issue and push it to GitHub.
3) Search Jira for any tasks related to this bug, and update their status.
4) Post a message to Slack, notifying the team that you fixed it.
Now, let’s assume this agent is available to 1000 users at a company. How does the system obtain the necessary GitHub, Jira, and Slack permissions for a specific user?
The answer is fairly obvious if the user approves each action as the task propagates between agents, but how do you do this in a hands-free manner? Let’s assume the user is only willing to approve the necessary permissions once, after submitting their initial prompt and before the orchestrator agent attempts to call the GitHub agent.
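Concretely, the hands-free version I'm picturing looks something like the sketch below, where that one-time approval yields a per-user credential bundle that the orchestrator threads through every downstream call; `github_agent` and the other callables are made-up names for hypothetical sub-agents:

```python
# Sketch of the setup in question: no shared service account; every sub-agent acts
# with tokens scoped to the requesting user, obtained from a one-time approval.
from dataclasses import dataclass

@dataclass
class UserCredentials:
    github_token: str   # obtained via the user's one-time OAuth grant
    jira_token: str
    slack_token: str

def handle_request(prompt: str, creds: UserCredentials) -> None:
    fix = github_agent(prompt, token=creds.github_token)   # acts as the user on GitHub
    ticket = jira_agent(fix, token=creds.jira_token)       # so each service enforces
    slack_agent(ticket, token=creds.slack_token)           # that user's own permissions
```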
If anyone could offer any advice on this, I would really appreciate it. Thank you!
Then I would control who had permission to tell it what to do, and log everything in detail.
Google: https://ia600601.us.archive.org/15/items/google-ai-agents-wh...
Anthropic: https://www.anthropic.com/engineering/building-effective-age...
Also, the examples provided are not only impractical but potentially bad practice. Why do you need a manager pattern to control a bunch of language-translation agents when most models will do fine on their own, especially for Latin-based languages? In practice a single LLM will not only be more cost-effective but also better for the overall user experience.
Also, prompting is the real unsung hero that barely gets a mention. In practice you cannot get away with just a couple of lines describing the problem and solution at a high level. Prompts are complex and very much an art form because, frankly, let's be honest, there is no science whatsoever behind them, just intuition. Yet in practice they have an enormous effect on overall agent performance.
This guide isn't really aimed at educating developers on how to build agents, but at business executives and decision-makers who need a high-level understanding without getting into the practical implementation details. It glosses over the technical challenges and complexity that developers actually face when building useful agent systems in production environments.
Usually that's how these agent tutorials go. I don't think anyone has published a large-scale, open-source agent application online yet, especially the kind where agents hand off to other agents.
The answer to this is agent-explorable interfaces a la HATEOAS. Define a single entry point to your MCP gateway and give the agent clues to drill down and find what it needs to use.
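A rough sketch of what that single entry point could return; the paths and descriptions are made up for illustration:

```python
# One discovery tool returns links and hints, and the agent drills down from there
# instead of being handed every tool up front.
def discover(path: str = "/") -> dict:
    catalog = {
        "/":       {"description": "MCP gateway root",
                    "links": ["/github", "/jira", "/slack"]},
        "/github": {"description": "Repo search, issues and pull requests",
                    "links": ["/github/search_issues", "/github/create_pr"]},
        "/jira":   {"description": "Ticket search and status updates",
                    "links": ["/jira/search", "/jira/update_status"]},
    }
    return catalog.get(path, {"error": "unknown path", "links": ["/"]})
```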