  • radarsat 12 days ago
    Recently I got interested in how to "compile" a program definition directly into the weights of a Transformer. I settled on distilling the MLPs individually, while the attention weights are fully "calculated".
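
    To give a sense of what "distilling the MLPs individually" means, here is a minimal, self-contained sketch; it is not the code from the repo, and the slot layout, the toy scalar encoding, and the target operation are all made up for illustration. The idea is to train one MLP block in isolation so that it reads its operands from fixed slots of the residual stream and writes the result into another slot.

      # Minimal PyTorch sketch of distilling a single MLP block against a
      # fixed target operation. Slot layout and encoding are illustrative only.
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      D = 64                    # residual stream width (arbitrary here)
      A_SLOT = slice(0, 16)     # hypothetical slot holding operand A
      B_SLOT = slice(16, 32)    # hypothetical slot holding operand B
      OUT_SLOT = slice(32, 48)  # hypothetical slot the block should write the sum into

      def encode(x, slot):
          # Toy linear encoding of a scalar into one slot of the residual stream.
          v = torch.zeros(x.shape[0], D)
          v[:, slot] = x.unsqueeze(-1) / 10.0
          return v

      mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))
      opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

      for step in range(2000):
          a = torch.randint(0, 10, (256,)).float()
          b = torch.randint(0, 10, (256,)).float()
          resid = encode(a, A_SLOT) + encode(b, B_SLOT)   # what the block sees
          target = encode(a + b, OUT_SLOT)                # what it should add to the stream
          loss = F.mse_loss(mlp(resid), target)
          opt.zero_grad(); loss.backward(); opt.step()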

    The example program [1] generates a Transformer that executes an RPN expression, using "breadcrumb" tokens to track its progress. The output looks like:

      Prompt: 3 4 + EXEC
      Output: c2 c1 c0 7
      
      Prompt: 3 4 + 3 3 + * EXEC
      Output: c2 c1 c0 7 c5 c4 c3 6 c6 c5 c2 42
      
      Prompt: 10 2 3 * + 2 + EXEC
      Output: c3 c2 c1 6 c4 c3 c0 16 c6 c5 c4 18
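
    To make the trace format concrete: each step emits the position of the operator being executed, the source positions of its right and left operands (either a literal input token or the earlier operator step that produced the value), and then the numeric result. A plain-Python reference evaluator, illustrative only and obviously not the Transformer itself, reproduces the same traces:

      # Plain-Python reference for the breadcrumb trace format shown above.
      # Only the operators used in the examples are included.
      def rpn_trace(tokens):
          ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
          stack, out = [], []               # stack holds (value, source_position) pairs
          for pos, tok in enumerate(tokens):
              if tok == "EXEC":
                  break
              if tok in ops:
                  (b, bpos), (a, apos) = stack.pop(), stack.pop()
                  val = ops[tok](a, b)
                  out += [f"c{pos}", f"c{bpos}", f"c{apos}", str(val)]
                  stack.append((val, pos))  # a result is addressed by its operator's position
              else:
                  stack.append((int(tok), pos))
          return " ".join(out)

      print(rpn_trace("3 4 + 3 3 + * EXEC".split()))
      # c2 c1 c0 7 c5 c4 c3 6 c6 c5 c2 42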
    
    I think there's still a lot that could be improved, but I wanted to document what I have so far. It turned out to be very interesting and made me think about transformers, attention, and particularly the structure of the residual stream in a new way.

    [1]: https://github.com/radarsat1/rpn_transformer/blob/main/src/p...