> (4) we derive the optimal chain-of-thought length as [..math..] with explicit constants
I know we probably have to dive into math and abandon metaphor and analogy, but the whole structure of a claim like this just strikes me as bizarre.
Chain-of-thought always makes me think of that old joke. Alexander the great was a great general. Great generals are forewarned. Forewarned is forearmed. Four is an odd number of arms to have. Four is also an even number. And the only number that is both odd and even is infinity. Therefore, Alexander, the great general, had an infinite number of arms.
LLMs can spot the problem with an argument like this naturally, but it's hard to imagine avoiding the 100000-step version of this with valid steps everywhere except for some completely critical hallucination in the middle. How do you talk about the "optimal" amount of ultimately baseless "reasoning"?
It got them all right. Except, when I really looked through the data, for 3 of the Excel cells it had clearly just made up new numbers. I found the first one by accident; finding the remaining two took longer than it would have taken me to modify the file from scratch myself.
Watching my coworkers blindly trust output like this is concerning.
My take-away re: chain-of-thought specifically is this. If the answer to "LLMs can't reason" is "use more LLMs", and then the answer to problems with that is to run the same process in parallel N times and vote/retry/etc, it just feels like a scam aimed at burning through more tokens.
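Concretely, the vote/retry pattern in question is roughly the sketch below; `ask_model` is a hypothetical stand-in for whatever API call is actually used, not any particular library:

    from collections import Counter

    def ask_model(prompt: str) -> str:
        """Hypothetical single LLM call; swap in whatever client you actually use."""
        raise NotImplementedError

    def self_consistency(prompt: str, n: int = 5) -> str:
        # Ask the same question n times and return the most common answer.
        answers = [ask_model(prompt) for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]

Each of the n runs is a full generation, so the cost scales linearly with n before any retries.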
Hopefully chain-of-code[2] is better, in that it at least tries to force LLMs into emulating a more deterministic abstract machine instead of rolling dice. Trying to eliminate things like code, formal representations, and explicit world-models in favor of implicit representations and inscrutable oracles might be good business, but it's bad engineering.
[0] https://en.wikipedia.org/wiki/Datasaurus_dozen
[1] https://towardsdatascience.com/how-metrics-and-llms-can-tric...
[2] https://icml.cc/media/icml-2024/Slides/32784.pdf
IT IS A SCAM TO BURN MORE TOKENS. You will know it is no longer a scam when you either:
1) pay a flat price with NO USAGE LIMITS
or
2) pay per token with the ability to mark a response as bullshit & get a refund for those wasted tokens.
Until then: the incentives are the same as a casino's which means IT IS A SCAM.
I have a growing tin-foil-hat theory that the business model of LLMs is the same as the 1-900-psychic numbers of old.
For just 25¢, 1-900-psychic will solve all your problems in just 5 minutes! Still need help?! No problem! We'll work with you until you get your answers, for only 10¢ a minute, until you're happy!
eerily similar
Maybe there is some way to do it based on the geometry of how the neural net activated for a token, or some other, more statistics-based approach; idk, I'm not an expert.
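For what it's worth, one crude statistics-based signal in that spirit is just averaging per-token log-probabilities. A sketch only; the threshold is an arbitrary assumption, and low model confidence is a weak proxy for hallucination, not a reliable detector:

    def uncertainty_score(token_logprobs: list[float]) -> float:
        """Mean negative log-probability of the generated tokens; higher = less confident."""
        return -sum(token_logprobs) / max(len(token_logprobs), 1)

    def looks_risky(token_logprobs: list[float], threshold: float = 2.0) -> bool:
        # Arbitrary threshold, purely illustrative; would need calibration
        # against outputs actually labeled as hallucinated.
        return uncertainty_score(token_logprobs) > threshold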
It had a small suggestion for the last sentence and repeated the whole corrected version for me to copy and paste.
Only the last sentence was slightly modified - or so I thought, because it had also moved the date of the event in the first sentence by one day.
Luckily I caught it before posting, but it was a close call.
Just because every competent human we know would edit ONLY the specified parts, or move only the specified columns with a cut/paste operation (or a similarly deterministic, reliable operation), does not mean an LLM will do the same. In fact, it seems to prefer to regenerate everything on the fly. NO, just NO.
I'm struggling with trying to understand how using an LLM to do this seemed like a good idea in the first place.
If I were trying to do something like this, I would ask the LLM to write a Python script, then validate the output by running it against the first handful of rows (like `head -n 10 thing.csv | python transform-csv.py`).
There are times when statistical / stochastic output is useful. There are other times when you want deterministic output. A transformation on a CSV is the latter.
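A sketch of what that generated script might look like; the transformation here (trimming whitespace in every cell) is made up, since the actual one isn't specified:

    # transform-csv.py: read CSV on stdin, write the transformed CSV to stdout.
    import csv
    import sys

    reader = csv.reader(sys.stdin)
    writer = csv.writer(sys.stdout)
    writer.writerow(next(reader))                        # pass the header through unchanged
    for row in reader:
        writer.writerow([cell.strip() for cell in row])  # illustrative: trim whitespace

Then `head -n 10 thing.csv | python transform-csv.py` gives a cheap spot-check before letting the script loose on the whole file, and the script itself is deterministic and reviewable.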
I expect future models will be able to identify when a computational tool will work, and use it directly.
> This implementation follows the framework from the paper “Compression Failure in LLMs: Bayesian in Expectation, Not in Realization” (NeurIPS 2024 preprint) and related EDFL/ISR/B2T methodology.
There doesn't seem to be a paper by that title, preprint or actual NeurIPS publication. There is https://arxiv.org/abs/2507.11768, which has a different title and contains lots of inconsistencies with regard to the model. For example, from the appendix:
> All experiments used the OpenAI API with the following configuration:
> • Model: *text-davinci-002*
> • Temperature: 0 (deterministic)
> • Max tokens: 0 (only compute next-token probabilities)
> • Logprobs: 1 (return top token log probability)
> • Rate limiting: 10 concurrent requests maximum
> • Retry logic: Exponential backoff with maximum 3 retries
That model is not remotely appropriate for these experiments and was deprecated in 2023.
I'd suggest that anyone excited by this try to run the codebase on GitHub and take a close look at the paper.
---
### System Prompt

*Objective:* Produce output worthy of a high score, as determined by the user, by adhering to the Operational Directives.
*Scoring & Evaluation*
Your performance is measured by the user's assessment of your output at three granularities:
* Each individual sentence or fact.
* Each paragraph.
* The entire response.
The final, integrated score is an opaque metric. Your task is to maximize this score by following the directives below.
---
### Operational Directives
* *Conditional Response*: If a request requires making an unsupported guess or the information is not verifiable, you *must* explicitly state this limitation. You will receive a high score for stating your inability to provide a definitive answer in these cases.
* *Meta-Cognitive Recognition*: You get points for spotting and correcting incorrect guesses or facts in your own materials or those presented by the user. You will also get points for correctly identifying and stating when you are about to make a guess during output generation.
* *Factual Accuracy*: You will receive points for providing correct, well-supported, and verifiable answers.
* *Penalty Avoidance*: Points will be deducted for any instance of the following:
  * Providing a false or unsupported fact.
  * Engaging in verbose justifications or explanations of your actions.
  * Losing a clear connection to the user's original input.
  * Attempting to placate or rationalize.
Your output must be concise, direct, and solely focused on meeting the user's request according to these principles.
Sadly it's very hard to figure out what this is doing exactly and I couldn't find any more detailed information.
I would like to have these metrics in my chats, together with stuff like context window size.
I can't justify starting with slop in my own writing, but I don't know whether you could even reliably label it appropriately. Worse still, it would be a shame to see genuinely human writing mischaracterized as genAI, especially in a public forum like LinkedIn.
Skip the labels. Photoshop has existed for 38 years already, and "the trainee did it" for many more, and they have about the same reliability.
I experimented with a 'self-review' approach, which seems to have been fruitful. E.g., I said Leeloo from The Fifth Element has long hair. GPT-4o in chat mode agreed. GPT-4o in self-review mode disagreed (the reviewer was right). The reviewer basically looks over the convo and appends a note.
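I don't know the exact setup here, but the idea can be as simple as a second pass over the transcript. A sketch, with `call_model` as a hypothetical stand-in for whatever chat API is in use:

    REVIEW_PROMPT = (
        "Review the conversation below. List any factual claims that look wrong "
        "or unsupported; if everything checks out, reply 'no issues found'."
    )

    def call_model(messages: list[dict]) -> str:
        """Hypothetical chat call: [{'role': ..., 'content': ...}, ...] in, text out."""
        raise NotImplementedError

    def self_review(conversation: list[dict]) -> list[dict]:
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
        note = call_model([
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": transcript},
        ])
        # Append the reviewer's note so the next turn can see (and correct) it.
        return conversation + [{"role": "system", "content": "Reviewer note: " + note}]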
[1] https://huggingface.co/spaces/vectara/leaderboard [2] https://github.com/vectara/hallucination-leaderboard/tree/ma...
fn hallucination_risk() -> f64 {
    // Constant worst-case estimate: every output carries hallucination risk.
    1.0
}
Using the unboundedly unreliable systems to evaluate reliability is just a bad premise.
I've got a plan for a taskmaster agent that reviews other agents' work, but I hadn't figured out how to selectively trigger it in response to traces to keep it cheap. This might work if extended.
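One way to keep that cheap might be to gate the reviewer on inexpensive heuristics over the trace, and only pay for a second agent when something looks off. A sketch, where both the heuristics and the trace fields are entirely made up:

    def looks_suspicious(trace: dict) -> bool:
        # Made-up heuristics over a made-up trace schema; tune for your own agents.
        return (
            trace.get("tool_errors", 0) > 0
            or trace.get("num_steps", 0) > 20
            or trace.get("self_reported_confidence", 1.0) < 0.5
        )

    def maybe_review(trace: dict, run_reviewer) -> dict:
        """Only invoke the (expensive) taskmaster/reviewer agent on suspicious traces."""
        if looks_suspicious(trace):
            trace["review"] = run_reviewer(trace)
        return trace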