They will always try to come up with something.
The example provided was a poor one. The comment from the LLM was solid. Why would you comment out a step in the pipeline instead of just deleting it? I would make the same comment in a PR.
For structured outputs, making fields optional usually isn't enough. Providing an additional field where the model can dump extra output, along with a description of how/when it should be used, covers several of the issues around this problem.
I'm not claiming this would solve the specific issues discussed in the post. Just a potentially helpful tip for others out there.
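To make it concrete, here's a rough sketch using Pydantic; the field names and descriptions are just illustrative, not any product's actual schema:

```python
# Hypothetical review-finding schema with an explicit "escape hatch" field.
# The idea: give the model a sanctioned place to put output that isn't a
# concrete finding, so it doesn't invent one just to fill the slot.
from typing import Optional
from pydantic import BaseModel, Field

class ReviewOutput(BaseModel):
    reasoning: str
    finding: Optional[str] = Field(
        default=None,
        description="A concrete issue in the diff, or null if there is none.",
    )
    notes: Optional[str] = Field(
        default=None,
        description=(
            "Anything that is NOT a concrete finding: uncertainty, questions "
            "for the author, context you were missing. Use this instead of "
            "forcing a finding."
        ),
    )
```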
It takes less effort to re-enable if it's just commented out, and it's more visible that something funky is going on that someone should fix.
But yeah, even if it's temporary, the rationale for commenting it out really should be added... It takes like 5s and provides important context for reviewers and anyone looking through the file history in the future.
```
{
  "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
  "finding": "Possible nil‑pointer dereference",
  "confidence": 0.81
}
```
You know the confidence value is completely bogus, don't you?
```
{
  "reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
  "finding": "Possible nil‑pointer dereference",
  "confidence": 0.81,
  "confidence_in_confidence_rating": 0.54,
  "confidence_in_confidence_rating_in_confidence_rating": 0.12,
  "confidence_in_confidence_rating_in_confidence_rating_in_confidence_rating": 0.98,
  // Etc...
}
```
This is how I feel about this blog post. I've barely scratched the surface of LLM internals, but even I know these confidence levels are completely made up; that should be obvious to anyone who has built a product around an LLM.
I'd never heard of or used cubic before today, but that part of the blog post, along with its obviously LLM-generated quality, gives a terrible first impression.
On the other hand, this post https://www.greptile.com/blog/make-llms-shut-up says that it didn't work in their case:
> Sadly, this also failed. The LLMs judgment of its own output was nearly random. This also made the bot extremely slow because there was now a whole new inference call in the workflow.
I could imagine the situation might actually be more nuanced (e.g. adding new tests and some of them are commented out), but there isn't enough context to really determine that, and even in that case, it can be worth asking about commented out code in case the author left it that way by accident.
Aren't there plenty of more obvious nitpicks to highlight? A great nitpick example would be one where the model will also ask to reverse the resolution. E.g.
```
final var items = List.copyOf(...);
// <-- Consider using an explicit type for the variable.

final List items = List.copyOf(...);
// <-- Consider using var to avoid redundant type name.
```
This is clearly aggravating since it will always make review comments.

If I reviewed that PR, I'd absolutely question why you're commenting that out. There had better be a very good reason, or even a link to a ticket with a clear deadline for when it can be cleaned up/reverted.
Prompts like 'Update this regex to match this new pattern' generally give better results than 'Fix this routing error in my server'.
Although this pattern seems true empirically, I've never seen any hard data to confirm it. This post is interesting, but it feels like a missed opportunity to back the idea with some numbers.
- PR descriptions are never useful; they barely summarize the file changes
- 90% of comments are wrong or irrelevant, whether because they're missing context, missing tribal knowledge, missing code-quality rules, or misinterpreting the code change
- 5-10% of the time it actually spots something
Not entirely sure it's worth the noise
What is useful in the context of code reviews in a large codebase is a semantic search agent that adds a comment containing related issues or PRs from the past, giving human reviewers more context. Since it's a recommendation, it isn't judged on accuracy.
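Roughly this shape, where embed() stands in for whatever embedding model you'd use and the PR history is assumed to already be embedded (all names here are made up for illustration):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model call; not a real API."""
    raise NotImplementedError

def related_history_comment(diff: str, history: list[dict], k: int = 3) -> str:
    """history items look like {"id": 123, "title": "...", "embedding": np.ndarray}."""
    q = embed(diff)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Rank past PRs/issues by similarity to the new diff and keep the top k.
    ranked = sorted(history, key=lambda pr: cosine(q, pr["embedding"]), reverse=True)
    lines = [f"- #{pr['id']}: {pr['title']}" for pr in ranked[:k]]
    return ("Possibly related past PRs/issues (context only, not a correctness claim):\n"
            + "\n".join(lines))
```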
This has been my experience as well. However, it seems like platforms like Cursor/Lovable/v0 et al. are doing things differently.
For example, this is Lovable’s leaked system prompt, 1550 lines: https://github.com/x1xhlol/system-prompts-and-models-of-ai-t...
Is there a trick to making gigantic system prompts work well?
IMO, this is the difference between building deterministic software and non-deterministic software (like an AI agent). It often boils down to randomly making tweaks and evaluating the outcome of those tweaks.
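For example, "evaluating the outcome of those tweaks" can be as simple as scoring each prompt variant against a small labeled set of diffs; the review() call and labeled examples below are assumptions, not any particular product's harness:

```python
# Score a prompt variant by the precision of the comments it produces
# on a labeled set of diffs.
def precision(prompt: str, examples: list[dict], review) -> float:
    """examples: [{"diff": str, "has_real_issue": bool}, ...]
    review(prompt, diff) returns a finding string or None."""
    flagged = correct = 0
    for ex in examples:
        if review(prompt, ex["diff"]) is not None:
            flagged += 1
            if ex["has_real_issue"]:
                correct += 1
    return correct / flagged if flagged else 0.0

# Usage sketch:
# for name, prompt in candidate_prompts.items():
#     print(name, precision(prompt, labeled_examples, review))
```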
1:Observation 2:Hypothesis 3:test 4:GOTO:1
This is every thing ever built ever
What is the problem exactly?
I wonder what models they're using, because reasoning models do this by default, even if they don't give you that output.
This post reads more like a marketing blog post than any real world advice.
Ah yes, because we know very well that the current generation of AI models reasons and draws conclusions based on logic and understanding... This is the true facepalm.
Several studies have shown that we first make a decision and then reason about it to justify it.
In that sense, we are not much more rational than an LLM.
"Learnings" might be the right choice here.
I wouldn't complain if the HN headline mutator were to replace "Learnings" with "lessons".