1 point by DARSHANFOFADIYA 8 hours ago | 2 comments
  • DARSHANFOFADIYA 8 hours ago
    I've been working on optimizing training for long-context models (70B+) and found that while Tensor Parallelism is well-documented, the newer "Unified" Sequence Parallelism techniques (like DeepSpeed Ulysses) are often treated as black boxes.

    I wrote this deep dive to visualize exactly how we shard the Q, K, V projections and how the All-to-All communication primitives work during the attention step to handle 1M+ tokens.
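    To make the head-scatter / sequence-gather concrete, here's a toy single-process simulation of Ulysses' first All-to-All (NumPy stand-in, no real GPUs or DeepSpeed API — all shapes and names are illustrative): each of P ranks starts with all heads for a slice of the sequence, and after the exchange holds the full sequence for H/P heads, so attention can run locally per head.

```python
import numpy as np

P = 4                 # number of simulated devices
S, H, D = 16, 8, 5    # total sequence length, num heads, head_dim
rng = np.random.default_rng(0)

# After the local QKV projection, rank p holds q_shards[p]:
# shape [S // P, H, D] -- ALL heads, but only a slice of the sequence.
q_shards = [rng.normal(size=(S // P, H, D)) for _ in range(P)]

def all_to_all_seq_to_head(shards, P, H):
    """Simulate Ulysses' first All-to-All: each rank sends head-block j
    of its sequence slice to rank j; rank j ends up with the FULL
    sequence for its H // P heads."""
    out = []
    for j in range(P):  # receiving rank j
        pieces = [s[:, j * (H // P):(j + 1) * (H // P), :] for s in shards]
        out.append(np.concatenate(pieces, axis=0))  # [S, H // P, D]
    return out

q_by_head = all_to_all_seq_to_head(q_shards, P, H)
print(q_by_head[0].shape)  # (16, 2, 5): full sequence, H/P heads
```

    The second All-to-All after attention is just the inverse mapping, restoring the [S // P, H, D] sequence-sharded layout for the output projection.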

    The post covers:

    - The architectural difference between Ring Attention and Ulysses (and why Ulysses often wins on H100 clusters).
    - Diagrams of the specific "All-to-All" communication steps.
    - How to handle the KV-cache bottleneck without exploding memory.
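    For a sense of why the KV-cache is the bottleneck at this scale, here's the standard back-of-envelope size calculation (the model config below is an illustrative 70B-class GQA setup I picked, not numbers from the post):

```python
# Hypothetical 70B-class config with grouped-query attention (GQA):
layers, kv_heads, head_dim = 80, 8, 128
seq_len = 1_000_000
bytes_per_elem = 2  # fp16 / bf16

# K and V each store layers * kv_heads * head_dim * seq_len elements,
# hence the leading factor of 2.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # 305.2 GiB per sequence
```

    Hundreds of GiB for a single 1M-token sequence — far beyond one GPU's HBM, which is why the cache has to be sharded along the sequence dimension across the cluster.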

    Happy to answer questions about the implementation or the communication cost analysis!

  • ClaireGz 8 hours ago
    This is super helpful — most writeups skip over the actual communication steps, so seeing the All-to-All flow laid out makes it much clearer.

    Curious from your experiments: at 1M+ context, does communication start dominating vs compute?

    I keep seeing cases where bigger context windows are technically possible but don’t translate into better results unless the context is very structured, so I wonder where the real scaling limit ends up being in practice.

    • DARSHANFOFADIYA 41 minutes ago
      As we scale to 1M context length (inference), the biggest bottleneck is memory, and to tackle that at scale we pay the price of communication overhead. Fortunately, the GPUs prefetch the data for the next step while the previous step is still computing, which masks the communication overhead and keeps responses at that scale feeling responsive.
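      A toy latency model makes the masking effect explicit: with prefetching, only the part of the communication that exceeds the compute time is exposed per step (the millisecond figures below are made up for illustration):

```python
# If the next step's all-to-all is prefetched while the current step
# computes, the exposed comm time per step drops from t_comm to
# max(0, t_comm - t_compute).
t_compute = 12.0  # ms per attention step (illustrative)
t_comm = 9.0      # ms per all-to-all (illustrative)

naive = t_compute + t_comm
overlapped = t_compute + max(0.0, t_comm - t_compute)
print(naive, overlapped)  # 21.0 12.0
```

      So as long as the interconnect keeps t_comm below t_compute, the communication is effectively free; once it crosses that line, it starts dominating step time.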

      The quality degradation as context length increases is a whole other research problem.