12 pointsby simianwords5 hours ago5 comments
  • pyentropy2 hours ago
    Take a look at the harmony repo which specifies the internal OpenAI format - the effort level is specified in the context after the <|start|> tag - https://github.com/openai/harmony

    Note that inference libs also have parsers that put hard limits on reasoning tokens with separate counters (similar to how you can put a limit on token generation per completion versus waiting for an <eos>). For that, take a look at vllm reasoning docs.

  • sometimelurkeran hour ago
    they use multitoken prediction behind the scenes, that might interact with the CoT in a strange way. maybe for different thinking modes they have different MTP models? if so thats interesting
    • pyentropyan hour ago
      The number of tokens you predict at time (multi or not) has nothing to do with whether the model wants to emit any, some or a lot of reasoning tokens in reasoning tag -- similar to how branch prediction will not really change the for loop iteration count.
  • __patchbit__4 hours ago
    At a guess. May be associated with token length context window. Down selecting is consistent with warning message, forcing cutoff to context window. The technical term cache being a synonym. Increasing the headroom for more "thinking" should allow the implementation to access more resources without warning about the cache breaking.
  • aabdi3 hours ago
    Different models do slight variants.

    Usually it’s done in post training to enforce behavior based on prompt. Ie. System prompt with thinking:max or low or wtv.

    Enforcement then goes via constrained decoding, checking for think token start and end with max lengths, or other variations

  • shanewei4 hours ago
    [dead]