Multi-Stream LLMs: new paper on parallelizing/separating prompts, thinking, I/O (arxiv.org)

59 points by atomicthumbs 6 hours ago | 4 comments

fprog 16 minutes ago
I strongly suspect this will scale up very well and become the standard approach to building models before too long. The advantages are manifold. The paper highlights some UX improvements, like that the model can start thinking sooner, one time step after the first token arrives. It also mentions increased safety: models trained this way better resist adversarial attempts to divulge secrets. I suspect the learnable stream embedding (sec. 3.3) is doing heavy lifting there. It also seems conceptually simpler than the recent Thinking Machines micro-turn based approach.
With the context advantage gained, maybe future stream models could finally operate directly on bytes, solving some of the odd tokenization-specific challenges of LLMs. Perhaps the thinking stream could drop the LM head and operate purely in embedding space, as with Meta’s Coconut. The multiplicative effect the combined techniques might have is tantalizing.
jhack 2 hours ago
This sounds like a gamechanger for speed and efficiency if it can scale up.
"However, our models are nevertheless relatively small and trained on tiny amounts of instruction examples, compared to the scale of modern instruction data and multiple post-training stages used to reinforce the default message-based format. We do think that parallel streams are a conceptually enticing format, and that future work on a larger scale will go further to show these benefits."
Eextra953 4 hours ago
Am I understanding correctly that one implication of this is reduced context? Since the input is split into streams, the total context is now divided among those streams, so a particular stream's context would be shortened to roughly context / streams?
- danlenton an hour ago
  I think the main benefit is improved speed and parallelism. Very similar to https://thinkingmachines.ai/blog/interaction-models/
atomicthumbs 6 hours ago
New paper out of the Max Planck Institute for Intelligent Systems. If this holds up, it seems big.
Abstract: The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.
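For anyone trying to picture the format, here is a toy sketch of the timestep loop the abstract describes. It is not the paper's implementation: there is no model here, and `PAD`, `run_streams`, `toy_step`, and the stream names `user`, `think`, and `out` are all hypothetical stand-ins. The only thing it tries to show is that every step reads the input streams and writes to every output stream, with each output depending only on earlier timesteps.

```python
from typing import Callable, Dict, List

PAD = "<pad>"  # hypothetical filler token for a stream with nothing to emit

Row = Dict[str, str]  # one timestep: stream name -> token


def run_streams(
    inputs: Dict[str, List[str]],
    step_fn: Callable[[List[Row]], Row],
    num_steps: int,
) -> List[Row]:
    """Advance all streams in lockstep, one token per stream per timestep."""
    history: List[Row] = []
    for t in range(num_steps):
        # Output tokens for step t may only look at rows 0..t-1 (causality).
        outputs = step_fn(history)
        # Input streams deliver their token for step t, or PAD once exhausted.
        row = {
            name: (tokens[t] if t < len(tokens) else PAD)
            for name, tokens in inputs.items()
        }
        row.update(outputs)
        history.append(row)
    return history


def toy_step(history: List[Row]) -> Row:
    """Stand-in for a forward pass: 'think' reacts to the previous user token,
    'out' reacts to the previous thinking token."""
    if not history:
        return {"think": PAD, "out": PAD}
    prev = history[-1]
    think = PAD if prev["user"] == PAD else "saw:" + prev["user"]
    out = PAD if prev["think"] == PAD else prev["think"].upper()
    return {"think": think, "out": out}


history = run_streams({"user": ["hi", "there"]}, toy_step, num_steps=4)
for t, row in enumerate(history):
    print(t, row)
```

In this toy run the thinking stream starts producing tokens one step after the first user token arrives, while the user stream is still delivering input, which is the kind of overlap a single sequential message stream cannot express.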