Differential Transformer(arxiv.org)

562 pointsby weirdcat9 months ago32 comments

Imnimo9 months ago
I feel like I'm missing a key insight here. I understand the problem that regular softmax attention struggles to approach assigning zero attention to irrelevant stuff. And I get that having this subtraction formula makes it possible to assign exactly (or near) zero attention weight without having crazy outlier activations. But it seems like it also makes it very easy to have negative attention weight (which is equivalent to having positive attention weight on the negation of your value vectors). Intuitively, it just feels like a difficult balancing act to keep all the stuff you don't care about so close to zero.
But Figure 1 clearly shows that it works, so I don't doubt that it is in fact possible. I'm just struggling to build a picture of how exactly the network accomplishes this.
- Grosvenor9 months ago
  Regular softmax (and attention) has an error in it.
  softmax should be exp()/1+∑exp()
  Notice the 1 added to the denominator.
  The difference is at the negative limit, softmax can be 0, instead of some epsilon. The same could be done by adding an extra zero value in x.
  Downside is, you have to retrain your model from scratch to fix this.
  - brrrrrm9 months ago
    that silly softmax1 blog post is not worth the read. no one uses it in practice
    if you think about it, the "escape hatch" is the design of the entire transformer dictionary. if Key/Query attention misaligns with Value's weights, you get a layer head that does not attend to anything...
    timlarshanson9 months ago
    Yep. From what I've seen, if the head wants to do nothing, it can attend to itself = no inter-token communication.
    Still, differential attention is pretty interesting & the benchmarking good, seems worth a try! It's in the same vein as linear or non-softmax attention, which also can work.
    Note that there is an error below Eq. 1: W^V should be shape [d_model x d_model] not [d_model, 2*d_model] as in the Q, K matrices.
    Idea: why not replace the lambda parameterization between softmax operations with something more general, like a matrix or MLP? E.g: Attention is the affine combination of N softmax attention operations (say, across heads). If the transformer learns an identity matrix here, then you know the original formulation was correct for the data; if it's sparse, these guys were right; if it's something else entirely then who knows...
  - impossiblefork9 months ago
    I've tried that in a small transformer that I trained from scratch and it didn't really make any difference. I also made a version where I made this trainable somehow, probably by replacing the 1 with a constant associated with the layer, and that didn't make any difference either.
    I didn't follow Miller's proposal quite as he wrote it though and I put the mechanism in all the layers rather than avoiding it at the end.
    My test doesn't absolutely rule out usefulness-- there's always different ways of applying something, but I saw no indication of it.
    Grosvenor9 months ago
    I guess the next step is to see if you're getting those mega activations as he describes.
    A/B test the two models and compare?
    Would be interesting to see if these activations only show up on larger models, or they're some relation to model size.
    Grosvenor9 months ago
    https://news.ycombinator.com/item?id=36871528
    Hah. Yes. It looks like they only show up in models with 6.7B parameters or more.
    The problem can start at 125M. Small enough to test on a whim.
    So train a model that exhibits these behaviours, then try it out.
  - godelski9 months ago
    > softmax should be exp()/1+∑exp()
    You referring to Miller's blogpost?[0] There's not an error in attention. Adding the +1 actually makes it not attention because you no longer generate a probability distribution[1]. There's nothing really preventing attention to have a zero in any of the entries, the thing is that you probably won't get -inf (very large negative number) inside inner product and you're going to have a difficult time updating those weights via gradient descent.
    I've also tested it on many networks and different types of attention and I've yet to see a meaningful improvement (or even an improvement), even in generalization.
    It really is the training method...
    As to the paper, I'm also still at a big lost and honestly, if reviewing could not accept it. The results look good, but I can't tell why and there's some "black magic" going on here.
    - Figure 3 has "Transformer" and doesn't specify. Is this StableLM-3B-4E1T? - What fucking dataset is this on? Stable has a WandB link[2] for that project and I don't see any experiment with similar (presumably entropy?) loss values (come on... this is fucking research... label your fucking graphs...) - Where the fuck is the ablation? (Yes, I saw Fig 6 and Sec 3.8) - How do I know that (assuming this is Stable) that the difference isn't just hyperparemeters? Or worse, GPUs! (yes, number of GPUs can change results due to sharding and this changing the statistics) - How do I know it isn't down to 1k warmup steps instead of 5k? - What about hidden size, layers, heads, or FFN size? Stable has 32/2560/32/? and this has 28/3072/12/8192 (these all will mess with sharding statistics too). Is the head dimension the same? - How do I know it isn't down to the tokenizer? - What is this magic? `0.8 - 0.6 * math.exp(-0.3 * depth)` - Was this learned? Hand picked? This is a huge factor - Any information about the learned parameters? Their final values? Trajectories? - The code does not seem to be the same as whats in the algos...
    Obviously they improved something, but there is nothing in the paper that is convincing me that it is the differential attention. There are too many parameters at play and how am I supposed to know that the difference is by the thing they are proposing. And more importantly, how much it is improved by that specific thing and not by other things.
    [0] https://www.evanmiller.org/attention-is-off-by-one.html [1] This is a bit convoluted but without this condition many "alternative forms" you see would be equivalent to other architectures like linear layers or gated units. Term is not well defined, but this really appears to be the only agreed upon aspect, even if only implicitly stated. This is a much longer conversation though. [2] https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo [2.1] The config: https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-3b-4e1t.yml
    chessgecko9 months ago
    I feel like that blogpost was almost just ragebait for ai researchers. It goes between calling not including the +1 an error (which to me implies it would improve training losses, which it doesn't really https://news.ycombinator.com/item?id=36854613) and saying possibly it could help with some types of quantization (which could very well be true but is a much weaker statement) and the author provides basically no evidence for either.
    godelski9 months ago
    It's the stereotypical computer scientist who thinks they know something others don't and don't feel the need to prove their claim. Specifically when it disagrees with experts. And unsurprisingly it's been something others have already investigated and even written about. Definitely not all CS people, but it is a stereotype many other fields believe.
    I know he's an economist btw. I was also surprised he got a job at anthropic a few months after. I wonder if they're related.
    tananan9 months ago
    Haven't gone through the paper fully, but just looking at the functional form of their attention, it seems more like a constraint on a standard MHA than an architectural discovery.
    Take a vanilla MHA, tie the V projection between consecutive heads, make the output projection subtract consecutive heads, with some fixed prefactor and voila, you're most if not all of the way there.
- qz949 months ago
  It does sound like we're hindering the model a bit by allowing negative weights to exist instead of sending them through, say, a ReLU. But, dealing with this might be an easier problem than you think for the model.
  In the first diagram with the attention weights, there actually are some negative scores in the noise section. But, the attention to that section is very small anyway. All the second attention map needs to do is predict the noise in the first one -- a task that can be done very accurately, because it has full access to the input of the first.
  To refer back to their real-world comparison, noise-canceling headphones have access to what your ear hears through a microphone, so they can output exactly the right cancellation signal. Similarly, the second attention map knows what's being input into the first one, so it can output a corresponding cancellation signal. It's not perfect -- just as noise-canceling headphones aren't perfect -- but it still gets you 99% of the way there, which is enough to boost performance.
  - chessgecko9 months ago
    Are you one of the authors?
- 9 months ago
  undefined
- sigmoid109 months ago
  >I'm just struggling to build a picture of how exactly the network accomplishes this.
  I mean, intuitively it would be trivial for the model to just optimise lambda to zero during training. Then you essentially have built a vanilla transformer with an overcomplicated parameter pruning mechanism. Pruning is already pretty well established in the literature as something that works surprisingly good for reducing parameter counts up to (hold on to your papers)... about 40%. In practice the model probably doesn't work exactly like that, but I wouldn't be surprised if it just approximates the normal transformer in the end anyways.
- watsonmusic9 months ago
  negative values can enhance the expressibility
  - Jerrrrrrry9 months ago
    doubt is the seed of reason
aDyslecticCrow9 months ago
Very clever. I like this kind of nitty-gritty detail work, and the change is small enough to be adapted easily by others. Bravo!
I'm a little concerned about the last sentence of the section introduction of "2 Differential Transformer". It mentions using improvements from previous papers, but in the grammatical context, it's unclear if this improvement is added to both the normal transformer and their diff transformer. This would otherwise sully the comparisons. It's the "main difference" wording in the previous sentence that raised a flag for me.
Of course, a good-faith researcher would know this and may not feel the need to clarify. But you can never be too careful about some published research in this field.
- vessenes9 months ago
  Yes. This looks really, really good to me. Cross the board improvements in training time, perplexity improvements per both token trained and per model size. I'm reminded of MoE architectures, in that world we're choosing an optimal small model to process part or all of the inference job; I wonder if MoE got some of the same benefits from forcing the Transformer to distinguish between alternate possibilities.
  In any event, I'd imagine that this will get widely adopted if the numbers hold up; like I said, this seems to be basically no downside, and should be easy to replicate.
  - f38zf5vdt9 months ago
    There is a downside, every attention layer has to effectively compute attention twice (run scaled_dot_product_attention). As scaled_dot_product_attention is usually one of the most expensive operations in training and inference of a model, it seems like networks using this may be substantially slow and perhaps should considered against larger networks with more attention layers.
    https://github.com/microsoft/unilm/blob/master/Diff-Transfor...
    vessenes9 months ago
    Interesting. This is one of those areas where edge inference needs might be different than data center: to get an 11b quality model in 7b at the cost of 30% more inference time is probably a full yes for anyone doing local inference. And let’s remember that memory bandwidth is a huge factor as well; 30% smaller equals 30% boost in memory based time costs. Anyway I’m interested in trying this out.
    I wonder if the specific setup might be extra effective for coding tuned models as well - you get one coding transformer and one ‘bad habits/chat/other non coding stuff’ negative transformer.
    aDyslecticCrow9 months ago
    In Big-O notation, O(2n) = O(n). Two times slower is actually not that much. If this slowdown results in better inference in the same number of training rounds or better-tuned weights with fewer redundant features, that can be a very worthwhile sacrifice.
    It's also a complex optimization problem, not just about computing. Two times, the parameters take more than two times the time to tune and two times the working memory to train and use. There are also plenty of model training scenarios where data throughput from the dataset into memory and back out is the final bottleneck.
    So, though I agree it is indeed a downside, I think it's a worthwhile sacrifice if the results they show are reproducible.
    godeldirac9 months ago
    Glad to see your ideas here. Could you clarify a point to me? The W matrix in the paper is d_model x 2d. Does this mean a differential attention model will double the W matrix of a standard attention model, which is d_model x d? E g. Suppose llama3 has W of 8192 x 1024, does the diffattn model of the same architecture have W of 8192 x (1024 x 2)?
    cuteboy199 months ago
    The O for any transformer is always quadratic
- Chirono9 months ago
  The two other changes they mention have been widely adopted, and are included in at least some of the models they benchmark against. It seems they list them for completeness as changes to the original transformer architecture.
  - aDyslecticCrow9 months ago
    Nicely spotted! Then, I really look forward to seeing this method tested by others! Epic stuff.
msoad9 months ago
Like most things in this new world of Machine Learning, I'm really confused why this works?
The analogy to noise-cancelling headphones is helpful but in that case we clearly know which is signal and which is noise. Here, if we knew why would we even bother to the noise-cancelling work?
- blackbear_9 months ago
  With a single softmax you cannot predict exactly 0, but only very small numbers. When you have a large number of values to add up, this "poisons" the output with a lot of irrelevant stuff (the noise mentioned in the paper).
  To make things worse, low attention values will have very low gradient, thus needing a lot of weight updates to undo that kind of mistakes. On the other hand, subtracting the output of two softmax allows the model to predict a weight of exactly zero for some of the values, while keeping a reasonable gradient flowing through.
  So the model already knows what is noise, but a single softmax makes it harder to exclude it.
  Moreover, with a single softmax the output of all heads is forced to stay in the convex hull of the value vectors, whereas with this variant each head can choose its own lambda, thus shifting the "range" of the outputs outside the convex hull pre-determined by the values. This makes the model as a whole more expressive.
  - nyrikki9 months ago
    While I don't discount the value of this, can you expand on the meaning of your claim that it makes the model 'more expressive'
    Everything I am seeing in this paper is related to reduced size and noise, which implies a reduction in expressiveness.
    The improvement in needle and a haystack, benchmarks on multi-hop questions of in corpus data and multishot in-context learning points to this.
    This is a wonderful thing if robustness is more important than generality, but it doesn't address trimming away activations that may be spurious in the general use case but may improve an individual domain specificity.
    Context would dramatically impact what tradeoffs and more desireble, and noise is probably never desirable. But the ability of this paper to enable bit size for inference points to a reduction in expressiveness.
    Perhaps I am too focused on generalization?
    blackbear_9 months ago
    What I meant is that by changing lambda each attention head is able to put its outputs in a subspace that is different than that of the other heads. This means that the outputs of different heads do not mingle with each other, and it's easier for the following layer to pick them apart. So I was thinking at increased expressiveness because the attention output can in principle cover a larger volume.
    Maybe expressiveness is not the right term, or not the main consequence. I could imagine that having different subspaces like that also introduces a degree of robustness to out-of-distribution inputs, as this would make it harder for the outputs of one attention head to shift towards the in-distribution outputs of another head, and thus for the following layer to confuse them.
  - freeqaz9 months ago
    I'm able to follow most of what you're saying. It's unclear to me what "convex hull" means though.
    Also, where is each softmax happening here? For each attention head?
    Majromax9 months ago
    > It's unclear to me what "convex hull" means though.
    The convex hull (https://en.wikipedia.org/wiki/Convex_hull) of a set is the smallest convex shape that includes that set. Geometrically, it's what you'd get if you "shrink wrapped" the thing you're looking at: edges still protrude, but any indentations get smoothed over.
    In this context, the grandparent comment is pointing out that with a traditional transformer block, the resulting computed value for a token can never "stick out" past some weighted average of the values of attended-to tokens, but this differential attention formalism allows that result.
    sdenton49 months ago
    The softmax value y is a linear combination of the vectors you're attending over: y = a1v1 + a2v2 + ... + an*vn where a_i >= 0 and sum(a_i) = 1.
    Then y is a convex combination of the v_i, and sits in the convex hull of the v_i.
    blackbear_9 months ago
    The convex hull of a set of points is the region "between" those points. So the convex hull of three points (that do not lie on the same line) is a triangle with those three points as vertices. If you add a fourth point inside the triangle, the convex hull remains the same, but if you add it outside then the convex hull becomes the four-sided region with those points as vertices.
    In the context of standard transformer attention, each output lies in the convex hull ("somewhere between") the input values. With the modification of this paper, the input values can be scaled a little so that the output of different heads can be in different "regions" and thus do not interfere with each other (so yes to your third question, the two softmaxes are performed separately for each head).
    pizza9 months ago
    O_i = softmax(...) * V_i and softmax is between 0 and 1, so O_i = alpha * V_i for some alpha between 0 and 1 so that makes it convex, and it makes the O_i just a shrunken version of V_i. Whereas if you have the diff of softmaxes, you get O_i = (alpha - beta) * V_i, which can range from -V_i to +V_i, so its output could rescale /or/ flip V_i. And yes this is happening in every head in parallel, then they get summed.
    kridsdale39 months ago
    By simply inputting your comment in to 4o, with no other context about the paper, I was able to get a pretty good analysis of the dual-head concept's implications.
    https://chatgpt.com/share/67058973-ba94-8008-bed7-c7f9d08dc5...
    spwa49 months ago
    Uh, this is extracting a LOT from very little data. I don't understand where it's coming from but it's explanation just keeps going into more and more detail ... that doesn't seem to follow from the data it's got.
    I just don't see how you could answer these questions without trying it out. And chatgtp DEFINITELY isn't doing that.
    Plus the obvious question I'd pose is not in there. What's the difference in performance between this trick and just "softmax() - 0.5 * 2" ? That seems very relevant.
    robertsdionne9 months ago
    It means one of these things: https://en.wikipedia.org/wiki/Simplex#Standard_simplex
  - x10009 months ago
    Could you help explain how we would achieve an attention score of exactly 0, in practice? Here’s my take:
    If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, with a probability of effectively 0 for any single entry to exactly equal 0.
    What’s more, the learnable parameter \lambda allows for negative values. This would allow the model to learn to actually add the attention scores, making a score of exactly 0 impossible.
    jszymborski9 months ago
    Your comment brings up two interesting variants that could be interesting if your goal is to increase the sparsity of the attention:
    - Rectify the difference of the softmaxes. (min(0, s(A1) - lambda s(A2)))
    - Apply the Heaviside function to the second softmax. (softmax(A1) - lambda H(s(A1) - lambda s(A2))
    The second one being a bit more drastic and maybe harder to train.
  - espadrine9 months ago
    It is a neat approach, but one that comes with a tradeoff, IIUC: doubling the key heads.
    I wonder if a different approach without that issue exists. For instance, using max(0, exp(x)-1) instead of exp(x) in the softmax attention formula. That way when the query is orthogonal to the key (or worse), it does not contribute.
    smallnamespace9 months ago
    > using max(0, exp(x)-1) instead of exp(x)
    Won't this cause the gradient to vanish on the left half, causing problems with training?
    espadrine9 months ago
    That is a concern that is shared with ReLU. But since the weights are shared across the context/minibatch, perhaps that would not be an issue, similar to ReLU.
  - 9 months ago
    undefined
  - dartos9 months ago
    > predict a weight of exactly zero for some of the values
    Wouldn’t this be pretty unlikely, though?
    schopra9099 months ago
    Quite the opposite — if you have a long sequence only a smattering of the words will influence the meaning of the current word. Everything else is “noise”.
    Attention is really good at finding this smattering of words (ie assign most weight there). But it struggles to put exactly 0 on the other words.
    dartos9 months ago
    I mean wouldn’t it be unlikely that
    SoftmaxA[n] - SoftmaxB[n] is exactly 0?
    Even if 2 attention layers learn two different things, I would imagine the corresponding weights in each layer wouldn’t exactly cancel each other out.
    absoflutely9 months ago
    why say lot word when few word do
    dartos9 months ago
    Few word no do tho
    kridsdale39 months ago
    U+1FAE5
    1024core9 months ago
    Phew!
  - mik099 months ago
    is this the same reason why fp4 with more parameters beats fp16 with less?
  - watsonmusic9 months ago
    [dead]
- phire9 months ago
  Noise cancelling headphones are probably the wrong analogy here.
  The better example is the differential signalling used in professional audio and many digital signaling protocols like Ethernet, HDMI and USB.
  Instead of using one wire, referencing to ground, they send the signal as the difference between both wires. Both wires end up carrying the same signal with inverted polarity. Because both wires are running next to each-other any external noice will be applied to both equally.
  The voltage will change, but the difference in voltage between both wires is untouched. And when you subtract the two voltages at the receiver end, any noise simply gets subtracted out.
  - seamossfet9 months ago
    I think when they bring up differential amplifiers they're referring more to the DSP technique of how headphone noise cancelling works but the actual electrical properties of how a differential amplifier does that muddies the message a bit.
    It sort of feels closer to heterodyning and "demodulating" the signal encoded in the softmax. Those tiny little errors we're trying to denoise with this technique are almost closer to carrier waves (when encoded to softmax) than noise imo. This wouldn't get rid of noise in the training data or noise in the dimensionality of the key / value space. It's really only removing noise introduced by the process itself.
  - theGnuMe9 months ago
    This is a cool idea!
  - watsonmusic9 months ago
    [dead]
- WithinReason9 months ago
  Don't look for an analogy, this just adds a new mathematical capability. It enables "negative attention", the network can say "I want to subtract the contribution of this token" in the attention calculation. Previously it could only reduce how much it adds.
  The simple way of doing this would be to just remove the softmax or use a sigmoid instead, but in practice a softmax works better it seems.
- chessgecko9 months ago
  My hypothesis for why this works that it mitigates the downsides of rope
  to eli5:
  rope is the modern strategy used to give information to the model about how far a query and a key are apart when doing attention. It's the best strategy we have now, but has a major downside, where it makes some connections between tokens that are far apart much stronger than you would like them to be. Xpos (https://arxiv.org/pdf/2212.10554) is another paper by microsoft tackling issues with rope and you can see figure 1 on page 4 to get a visual interpretation of the sinusoidal attention strength (you would like it to be smooth).
  I think a big reason differential transformers is working so well, especially on long sequence stuff, because when both q1 and q2 don't match a token, the rope relative strength will still have the same value and the noise will cancel out. Leaving intended matches, but at the cost of somewhat dampening the original value rope brought.
  Just a hypothesis though. It would be easy to test by running this experiment against a baseline where both use alibi attention (https://arxiv.org/pdf/2108.12409) which has a different set of tradeoffs this wouldn't mitigate, but still a really interesting result.
- _hl_9 months ago
  Some of the "prior art" here is ladder networks and to some handwavy extent residual nets, both of which can be interpreted as training the model on reducing the error to its previous predictions as opposed to predicting the final result directly. I think some intuition for why it works has to do with changing the gradient descent landscape to be a bit friendlier towards learning in small baby steps, as you are now explicitly designing the network around the idea that it will start off making lots of errors in its predictions and then get better over time.
- seamossfet9 months ago
  It sounds like they're just splitting the query / key space down the middle. We don't know which dimensions are encoded in each matrix, but they're assuming the "noise" introduced in one query / key space is equivalent to noise introduced in the other space.
  If that is the case, then the "signal" in this case would be the softmax that encodes the dimensions captured by the query / key space. Since the noise ideally is the same in both softmax encodings, subtracting them should "cancel out" the noise.
- mistercheph9 months ago
  I think common mode filtering in balanced audio cables is a much better analogy than noise canceling headphones (and where this paper gets its name from I assume), you don't know what the noise is ahead of time, but if you take two samples with one positive and one negative, noise displaces both absolutely, which you can take advantage of to denoise the signal (find the differential mode).
  For example, if you are trying to send a +1V signal on one wire, and a -1V signal on the other and a +0.5V noise exists, one wire will have +1.5V and the other will have -0.5V,
  Take the difference and divide by 2:
  (+1.5V - -0.5V) / 2 = +1V or, if your setup is different (-0.5V - +1.5V) / 2 = -1V
- HarHarVeryFunny9 months ago
  I don't understand either. It seems the general idea is that they calculate attention twice, which due to random initialization might be expected to give two slightly different results. I'd have thought that what these two attention maps would have in common would be the signal, and where they would differ would be noise, so rather than subtracting them (resulting in all noise?!) what you really want is to add (so the common signal gets reinforced) and normalize.
  - Carlseymanh9 months ago
    I think there might be some communalities with system engineering, where you subtract the output from the input in order to get a control signal that steers the plant to the target values. I too fail to see how that would be supposed to work in practice.
  - kelseyfrog9 months ago
    The values between the groups are also going to diverge during training due to the structure of the DiffAttn equation.
    The analogy I can think of is when you're paying attention to a variety of things and you actively avoid concentrating on something because it will distract you. You don't give it zero attention, you give it negative attention.
- watsonmusic9 months ago
  the model is supposed to learn this
islewis9 months ago
> Differential attention takes the difference between two softmax attention functions to eliminate attention noise
If I understand correctly, this architecture trades twice as much attention memory in exchange for either a higher quality model, or less parameters at a similar quality.
> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters
This raises a few questions for me:
- Would having only 60% of the parameters negate the double space for attention, leaving a similar memory profile as a traditional transformer?
- Does that tradeoff change noticeably between training and inference?
- _hl_9 months ago
  My understanding was that the extra parameters required for the second attention mechanism are included in those 6.8B parameters (i.e. those are the total parameters of the model, not some made-up metric of would-be parameter count in a standard transformer). This makes the result doubly impressive!
  Here's the bit from the paper:
  > We set the number of heads h = dmodel/2d, where d is equal to the head dimension of Transformer. So we can align the parameter counts and computational complexity.
  In other words, they make up for it by having only half as many attention heads per layer.
- chessgecko9 months ago
  I think they mitigated the extra memory/compute from this by using half the number of overall heads and doubling V and O. Without actually checking the math I think it should be equivalent in flops, not counting the extra (cheap) multiply by const and subtract.
- entropicdrifter9 months ago
  I think it would negate the RAM savings, but it would also reduce the amount of storage needed at rest and possibly reduce initial start up times depending on storage speed and model size. So, possibly good for low-end models on consumer devices?
- Kubuxu9 months ago
  It would double the size of the KV cache, which can be significant (multi-GB) at larger context sizes.
WithinReason9 months ago
We empirically find that the setting λᵢₙᵢₜ = 0.8 − 0.6 × exp(−0.3 · (l − 1)) works well in practice
I wonder about the story behind that formula...
- Kubuxu9 months ago
  Hmm, 0.8 works well, but let's try setting lower layers to lower initial value. Let's say 0.2. Ok, I need a formula that will go between 0.2 and 0.8, slowly approaching 0.8. Starts fiddling with numbers for 20min, I guess this can work.
  - godelski9 months ago
    Sure, but in research show some comparisons
    physicsguy9 months ago
    In practice there's always a trade off between getting some result out and published and rigorously exploring every avenue of optimisation in research. Sometimes you have to say 'this is good enough and long enough already'.
    godelski9 months ago
    Right, but it has to be *good enough*. My issue isn't that they didn't do more work, my issue is that they didn't even report work that they did do that communicates the impact of the literal proposed method.
    https://news.ycombinator.com/item?id=41783013
- kridsdale39 months ago
  A whole lot of things are tuned optimally by rotating an analog dial until things look / sound right.
- stellalo9 months ago
  Looks like this makes (at least initially in training) the “negative” attention term smaller in the early layers (smaller l) compared to later layers (larger l). Which I guess makes sense: you probably want to attend a little bit to everything before concluding that it’s really a few spots you should look at.
  (Although it seems the author do not discuss this choice anywhere in the paper?)
iandanforth9 months ago
The key bit I didn't understand at first was what happens if the two groups of attention learn the same thing; because their attention masks are subtracted from one another if they both output similar values the attention across the board will drop to zero and this will lead to high loss. So the only way to reduce loss is if they learn to attend to different things. One of the simplest strategies they could learn (and this paper claims that they do) is for one group to focus on relevant context and the other to focus on irrelevant context. Thus one group learns the noise and the other the signal (it's not this cut and dry but is a useful simplification for understanding IMO).
- magicalhippo9 months ago
  An interesting aspect is that they don't do a plain subtraction, but rather subtract a portion of the second softmax.
  This makes sense, if one considers that the two copies are identical then the softmax outputs would be identical and the difference is zero everywhere. However, by subtracting a scaled copy, the normalization of the difference seems to really boost the signal value(s) over the "noise", making the signal stand out compared to pre-normalization.
  - testdfkjahdfh9 months ago
    if two attentions A, B are identical, would (A - lambda * B) be just (1-lambda) * A, how does it "boost the signal value(s) over the "noise""?
    magicalhippo9 months ago
    How embarrassing, I had one of those "autocorrect moments". I somehow put the lambda inside the softmax when thinking and trying it without realizing. So what I was playing with in a spreadsheet (so not so obvious as plain code) was
    softmax(A) - softmax(lambda * A)
    And as so happens, normalizing the output of that that with my test vectors seems to really boost the output the largest component if A and B are equal.
- patcon9 months ago
  > what happens if the two groups of attention learn the same thing
  I wonder if there's a metaphor here for our own experience and utility in "surprise".
  Like if one attention head is surprised by what another learns, up-weight it. But if they both find the same, assume it's not very surprising and down-weight it.
  Admittedly, "surprise" is something that has a big section of my knowledgebase[1][2][3] (both as a subjective feeling and adaptive function of our minds, one of the most complex adaptive system we know of)
  [1] https://plus.maths.org/content/information-surprise
  [2] https://blakeelias.name/papers/Multi-Agent-Cooperation-Intri...
  [3] https://complexity.simplecast.com/episodes/81/transcript
- dartos9 months ago
  There’s probably a small chance that they could both learn the same thing, but it’s probably not likely enough to be a major issue.
- nextaccountic9 months ago
  Maybe the loss function could penalize them learning the same thing?
patcon9 months ago
I wonder what is lost here. Surely there's a trade-off...
I'm wondering if there's any effect of "creativity", or ability to interpolate between concepts. Hallucination and creativity feel very related to me. I understand hallucinating as simply being misaligned with the space humans feel appropriate to interpolate between
- dartos9 months ago
  > Hallucination and creativity feel very related to me.
  Why? I see them as just sampling errors.
  Sure a mistake can spark inspiration sometimes, but creativity is much more than mistakes.
  > I understand hallucinating as simply being misaligned with the space humans feel appropriate to interpolate between
  These language models are next-token predictors. The way the next token is predicted is by sampling a probability space outputted by the model.
  That sampling process can be non deterministic.
  Hallucinations are when that sampling results in tokens that come together to create a false or otherwise unintended statement.
  You can just as well think of everything a model outputs as a hallucination, but we train the model to output a space what we want them to hallucinate is more likely. Otherwise it just outputs meaningless noise.
  “Hallucinate” is really an awful word for what it’s trying to describe.
  - slashdave9 months ago
    > You can just as well think of everything a model outputs as a hallucination
    Exactly. Don't forget that an important factor in the success of GPT3 was RLHF, which is essentially training the model to produce "hallucinations" that are more acceptable on average to human trainers.
  - radarsat19 months ago
    Often see this argument but it doesn't hold water for me. What we call hallucination is usually when the model says something confidently wrong. Yes the sampling procedure is nondeterministic but this is unrelated to hallucinations. The model can generate a distribution to sample with very little weight on the "wrong" output and then this should be ignored by procedures like top-k sampling. The fact that this doesn't easily solve the problem shows that hallucination is a deeper problem in the model itself and not just a byproduct of sampling.
    dartos9 months ago
    > What we call hallucination is usually when the model says something confidently wrong
    This is a poor definition that only applies to language models trained to be truthful. If you trained a language model to lie, and it told the truth, that would also be a hallucination.
    Or if a model was trained to never sound confident, and it made confident, but correct, claims.
    My definition is more accurate.
    > Yes the sampling procedure is nondeterministic but this is unrelated to hallucinations.
    It’s not the only factor, but it’s absolutely related. It’s also really easy to explain in a comment.
    For example, if you always sampled the lowest ranked token, the model would always hallucinate (by output mostly garbage)
    Top-k sampling doesn’t eliminate all errors, unless you’re just always picking the most likely token. At that point the sampling process is deterministic, but we’ve seen model output be poor with that setting for reasons I explain next.
    > that hallucination is a deeper problem
    Of course, it’s because the training process itself is nondeterministic. We can’t make a perfect model, it’s just not how statistical models work.
    seanhunter9 months ago
    Yes exactly. It seems intuitive that the model could generate a better distribution and thus cure hallucination but that doesn't actually match what the model does.
    The model doesn't sample a probability distribution of individual "facts"[1] it samples a probability distribution of tokens which are generally parts of words, bits of punctuation etc. That we get "facts" out of it which may even be wrong in the first place is an emergent behaviour because of the attention mechanism.
    Totally agree that it's a deeper problem and may be intrinsic to the design of the models and the fact that they are trained on a next word prediction task. Karpathy talks about the models as "dreaming text". In that sense it's not surprising that some of it is whacky. Our dreams are too.
    [1] By which I mean atomic things that can be right or wrong
    radarsat19 months ago
    Agreed. I have a loose idea that hallucination is related to training to maximize the probability of individual tokens while ignoring the joint sequence probability, which is along the lines of what you are saying -- it is not trained to output the most probable final sequence, so it gets stuck in the "wrong place" half way through.
  - thomastjeffery9 months ago
    Hallucinate is an awful word because of what it is trying to describe.
    Hallucination describes the same feature you just called "non deterministic sampling", but exclusively the cases that we don't like. It would be really convenient if we could actually draw that line, but we can't. If non-determinism is a core feature, then that feature will be present in every case; including the ones we find desirable, and the ones we find undesirable.
  - nextaccountic9 months ago
    > Sure a mistake can spark inspiration sometimes, but creativity is much more than mistakes.
    It looks like creativity has many steps but being able to come with novel, unprompted stuff is important, as long as you are able to discard the bullshit earlier.
    "Hallucination" is only a problem if later layers (or additional networks) can't detect and remove it
    dartos9 months ago
    > "Hallucination" is only a problem if later layers (or additional networks) can't detect and remove it
    Yeah I mean sure. Anything is only a problem if it goes undetected. The issue is that if you rely on statistical model, you’ll always have hallucinations, so you can’t filter statistical output with another statistical model if you need real guarantees.
    Many products don’t need those guarantees though.
  - skybrian9 months ago
    LLM’s are too unpredictable for many practical uses so I’d guess better predictability is better. Hopefully the change the paper proposes will help!
    But here’s a case for the other side: sure, most mistakes are just errors, but evolution happens via “mistakes.” Also, LLM’s often deliberately add add randomness at inference time.
    dartos9 months ago
    > evolution happens via “mistakes.”
    That’s a nice slogan, but it’s a gross oversimplification.
    In the natural world, you can say that mistakes in DNA replication leads to evolution, but that’s discounting the entire process of natural selection.
    Same with creativity. Look at Picasso. His was a technically brilliant realistic painter at 15, but his work later in life evolved to be more abstract and weird. I don’t think that was the result of mistakes, but rather intentionally breaking patterns he learned in his youth.
    skybrian9 months ago
    To oversimplify, evolution is a generate-and-test process and the evaluation step is critical. Something needs to decide which variations are better. Often, with generative AI, it’s people who judge the results. Still, generating interesting examples (the brainstorming phase) plays some role in that.
    I don’t know a whole lot about Picasso’s art, but I imagine the way he evaluated his own work played an important role, in being able to see that sometimes creative accidents are interesting.
- magicalhippo9 months ago
  > Surely there's a trade-off...
  For one, speed and memory. They have twice as many Q and K weights in the attention blocks, leading to a ~10% reduction in throughput on their H100 (table 7 in appendix A).
  - lennxa9 months ago
    they mention similar performance to vanilla transformer with significantly reduced param count though
  - karmasimida9 months ago
    I mean it doesn’t necessarily needs 2x QK to match that performance, in terms of accuracy, of a regular transformer right?
- watsonmusic9 months ago
  not all hallucinations are creativity Imaginate that for a RAG application, the model is supposed to follow the given documents
chessgecko9 months ago
I wonder how much of the value here is from canceling out the positional noise rope produces. I would love to see a table comparing an alibi version of this to an alibi baseline in addition to the rope models here.
Crazy gains though congrats to the researchers
vsroy9 months ago
Is the thing that's going on here that softmax can't push a value to 0, but by subtracting 2 softmax maps we can output 0s?
- vsroy9 months ago
  Follow-up question is: Isn't it extremely unlikely to output 0?
- pkoird9 months ago
  Or negatives
machinelearning9 months ago
This is a good problem to solve but the approach is wrong imo.
It has to be done in a hierarchical way to know what you attended to + full context.
If the differential vector is being computed with the same input as the attention vector how do you know how to modify the attention vector correctly
- quantadev9 months ago
  Doesn't everything just get tweaked in whatever direction the back-propagation derivative says and proportionally to that "slope"? In other words, simply by having back-propagation system in effect there's never any question about which way to adjust the weights, right?
pxdm9 months ago
What's the comparison with conventional attention using a more aggressive (lower temperature) softmax? I can imagine that for the multi-needle retrieval test this may also give a performance boost, although at some cost other more creative tasks.
- mota79 months ago
  I had the same thought: Just eye-balling the graphs, the result of the subtraction looks very close to just reducing the temperature.
  They're effectively doing softmax with a fixed temperature, but it's unclear that this work is going to do better than just learning a per-head temperature parameter.
  c.f. https://arxiv.org/abs/2010.04245 which shows an improvement by learning per-head temperature.
  The other way to think about this is that it looks like a hacked-up kinda-sorta gated attention. If that's the case, then doing softmax(alphaq_1k_1^T - log_sigmoid(betaq_2k_2^T)) might be better? (where alpha,beta are learned temperatures).
nmacias9 months ago
AdderaLLM was right there
miven9 months ago
Is there an intuitive reason why this ends up working this well compared to, say, applying some kind of thresholding to attention activations that are below average for a given head to filter that same attention noise out?
pizza9 months ago
Was just going to mention that it seems that it should be possible to make a Flash Attention version of this algorithm and was pleasantly surprised to see they already included an implementation of one :)
watsonmusic9 months ago
The modification is simple and beautiful. And the improvements are quite significant.
singularity20019 months ago
Anyone remember siamese networks?
slashdave9 months ago
I don't get it. Arbitrary linear combinations are already accommodated via feed forward. What am I missing?
- michalsustr9 months ago
  My hunch is that this effectively creates a differentiable minimax “search” “tree” that can be backpropagated through. Not a tree — a dag really — and not search, but learning. :)
- thatsadude9 months ago
  True, that's a yellow flag for me.
WithinReason9 months ago
Hmmm, this could be expressed as 2 consecutive attentions in a residual branch:
Simplified differential T. looks like: (softmax(Q₁K₁) − λ softmax(Q₂K₂)) V
You can factor this into:
```
    x = softmax(Q₁K₁)V
    x += -λ softmax(Q₂K₂)V
```
which is like 2 subsequent regular attentions added that are sharing V
- kelseyfrog9 months ago
  You could also extrapolate this into more than two terms by squinting your eyes and saying that λ ∈ {1, -1} is close enough to λi ∈R^d ∣ ∥λi ∥=1. No idea if it would result in better performance, but that's research babyyyy!
- tananan9 months ago
  Now I'm wondering, isn't there usually a `num_heads x value_dim -> model_dim` projection that goes after a MHA? The W in `softmax(QK)VW`? That one can play the role of this subtraction in a vanilla transformer, no? So I wonder what kind of advantage does splitting things up like this bring.
h_tbob9 months ago
I wish they didn’t use swiGLU and preRMSnorm so we could have a better comparison.
Then we would know how much this transformer innovation helps by itself.
digdugdirk9 months ago
Is there any way to replicate this with existing models, or are we going to need to wait for models to be trained in this style?
I'm imagining a smaller model examining the output tokens of a larger model and metaphorically slapping it on the wrist with a ruler if the output tokens start drifting off topic. Not quite the same, but an entertaining thought nonetheless.
- bionhoward9 months ago
  Yes, I believe this is possible, you could clone weights of one or more existing models and fine tune them in groups with different random seeds for noise/drop to produce reasonable outputs under a differential transformer decoding scheme whereby tokens with disagreement receive more attention (surprisal analysis)
- causal9 months ago
  It's a different attention mechanism with a different map setup, so fundamentally a different type of model
  - om89 months ago
    Looks like it is a drop in replacement for attention, but models will need to be retrained for this one, yes.
    aDyslecticCrow9 months ago
    It may not need to be entirely retrained. The value spans and input are the same, and no extra weights are needed. You may be able to tune an existing model with this attention mechanism and get some of the benefits.
    But overall... it's mainly a training change, so training is needed to make a difference.
dartos9 months ago
> By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization
I’m very interested in this claim. I was under the impression that hallucination is unavoidable in these kinds of models. IIRC proof for that was trending on HN a couple weeks ago.
- pshc9 months ago
  More broadly I think hallucination is inevitable in pure text models. We need model architectures incorporating a stream of real-world ground truth such as a live video feed or embodiment.
- moffkalast9 months ago
  It's not possible to get rid of it entirely, but if you can get the model to bullshit only 0.1% of the time instead of 5% of the time it's a massive improvement.
  Most of it should be happening when there's no data to draw conclusions from. E.g. STT models make up words in silence, vision models find things in lens cap noise, LLMs make up explanations when they have no data to pull from.
  The real solution would be more along the lines of training models to specifically ignore these cases, or in the case of LLMs to just know when to say "I don't know".
- ErikBjare9 months ago
  Mitigate, not completely fix.
mik099 months ago
r/machine learning comment thread has some interesting ideas, one of them linking this one with similar work in CV: https://www.reddit.com/r/MachineLearning/comments/1g0lnij/r_...
lucidrains9 months ago
does this not mean we should explore usage of talking heads (Shazeer et al) a bit more? https://arxiv.org/abs/2003.02436
- watsonmusic9 months ago
  [dead]
x49asvk9 months ago
This concept is really interesting to me, I am very very new to transformers but would love to learn more about normal transformers and differential too. Can anyone suggest any resources?
pikseladam9 months ago
Did this mean they solved the hallucination problem of transformers?
edit: not fully but it gives promising results. quiet an improvement actually.
- HarHarVeryFunny9 months ago
  I don't think there's any narrow definition of what "hallucination" means. It generally refers to the model giving non-factual answers in contexts that are meant to be factual, but not all causes of this are going to be fixable without very major changes.
  The fundamental issue is that most of the time LLMs are going to be combining statistics derived from many training samples when generating a single continuation, and there is just no guarantee that this will result in a semantically coherent response. Of course the model's depth of parsing and semantic analysis usually means that each generated word is highly plausible, but this isn't the same as being factually correct, especially so in these cases where the model is drawing on multiple sources to create a mashup response, which is the normal mode of operation.
- lafreb9 months ago
  The paper says that they've improved hallucination mitigation, but not really "solved" the issue.
  - Rhapso9 months ago
    "Hallucination" isn't really a problem that can be "fixed". Its just model error.
    The root problem is simply that the model doesn't capture reality, just an approximation. What we are incorrectly calling "hallucination" is just the best the model has to offer.
    spencerchubb9 months ago
    it's not "just" model error
    during pre-training, there is never an incentive for the model to say "I don't know" because it would be penalized. the model is incentivized to make an educated guess
    large transformer models are really good at approximating their dataset. there is no data on the internet about what LLMs know. and even if there were such data, it would probably become obsolete soon
    that being said, maybe a big shift in the architecture could solve this. I hope!
    happypumpkin9 months ago
    > it would probably become obsolete soon
    Suppose there are many times more posts about something one generation of LLMs can't do (arithmetic, tic-tac-toe, whatever), than posts about how the next generation of models can do that task successfully. I think this is probably the case.
    While I doubt it will happen, it would be somewhat funny if training on that text caused a future model to claim it can't do something that it "should" be able to because it internalized that it was an LLM and "LLMs can't do X."
    spencerchubb9 months ago
    also presumes that the LLM knows it is an LLM
    adwn9 months ago
    System prompts sometimes contain the information that "it" is an LLM.
    Maybe in the future, those prompts will include motivational phrases, like "You can do it!" or "Believe in yourself, then you can achieve anything."
    Vecr9 months ago
    They're generally fine tuned not to. I'm not sure how long that will hold though.
    ykonstant9 months ago
    - Are you an LLM?
    - As a Large Language Model, I am fine tuned to be unable to answer this question.
    singularity20019 months ago
    in another paper which popped up recently they approximated uncertainty with Entropy and inserted "wait!" tokens whenever Entropy was high, simulating chain of thought within the system.
    spywaregorilla9 months ago
    > during pre-training, there is never an incentive for the model to say "I don't know" because it would be penalized. the model is incentivized to make an educated guess
    The guess can be "I don't know". The base LLM would generally only say I don't know if it "knew" that it didn't know, which is not going to be very common. The tuned LLM would be the level responsible for trying to equate a lack of understanding to saying "I don't know"
    tucnak9 months ago
    I'm led to believe this is mostly because "known unknowns" are not well-represented in the training datasets... I think, instead of bothering with refusals and enforcing a particular "voice" with excessive RL, they ought to focus more on identifying "gaps" in the datasets and feeding them back, perhaps they're already doing this with synthetic data / distillation.
    dilap9 months ago
    it can be fixed in theory if the model knows-what-it-knows, to avoid saying things its uncertain about (this is what (some) humans do to reduce the frequency w which they say untrue things).
    theres some promising research using this idea, tho i dont have it at hand.
    hoosieree9 months ago
    LLMs can't hallucinate. They generate the next most likely token in a sequence. Whether that sequence matches any kind of objective truth is orthogonal to how models work.
    I suppose depending on your point of view, LLMs either can't hallucinate, or that's all they can do.
    ToValueFunfetti9 months ago
    >Whether that sequence matches any kind of objective truth is orthogonal to how models work.
    Empirically, this cannot be true. If it were, it would be statistically shocking how often models coincidentally say true things. The training does not perfectly align the model with truth, but 'orthogonal' is off by a minimum of 45 degrees.
    viraptor9 months ago
    It matches the training data. Whether the training data matches truth (and whether it's correctly understood - sarcasm included) is a completely separate thing.
    > The training does not perfectly align the model with truth, but 'orthogonal'
    Nitpicky, but the more dimensions you have, the easier it is for almost everything to be orthogonal. (https://softwaredoug.com/blog/2022/12/26/surpries-at-hi-dime...) That's why averaging embeddings works.
    ToValueFunfetti9 months ago
    I went to school to learn about the world and the overwhelming majority of that learning was from professors and textbooks. Whether the professors' beliefs and the textbooks' contents reflected the true properties of the world was a completely separate thing, entirely outside of my control. But I did come away with a better understanding of the world and few would say that education is orthogonal to that goal.
    If you add two vectors that don't have a truth component (ie. are orthogonal to the truth), the resulting vector should be no closer to the truth. If you start with random weights and perform some operation on them such that the new weights have a higher likelihood of producing true statements, the operation must not have been orthogonal to the truth. Am I wrong there?
    viraptor9 months ago
    > But I did come away with a better understanding of the world and few would say that education is orthogonal to that goal.
    That's due to the reward function / environment. But even outside extremes like North Korea, lots of education environments value conformity over independent analysis.
    ToValueFunfetti9 months ago
    Certainly an AI trained on North Korean data would emerge with some very suspect beliefs regarding Kim Jong-Un. My point is just that aligning something with training data is aligning it with truth, to the degree that the training data is true and regardless of why it is true. educate(me, truth) can hardly be called orthogonal to the truth, even if the 'educate' and 'me' terms do nothing to prevent educate(me, falsehood).
    timcobb9 months ago
    Isn't this the same thing that happens when you train a human on truths vs falsehoods?
    CooCooCaCha9 months ago
    Whenever someone takes issue with using the word “hallucinate” with LLMs I get the impression they’re trying to convince me that hallucination is good.
    Why do you care so much about this particular issue? And why can’t hallucination be something we can aim to improve?
    AnimalMuppet9 months ago
    I'm pretty sure there's something I don't understand, but:
    Doesn't an LLM pick the "most probable next symbol" (or, depending on temperature, one of the most probable next symbols)? To do that, doesn't it have to have some idea of what the probability is? Couldn't it then, if the probability falls below some threshold, say "I don't know" instead of giving what it knows is a low-probability answer?
    dTal9 months ago
    It doesn't really work like that.
    1) The model outputs a ranked list of all tokens; the probability always sums to 1. Sometimes there is a clear "#1 candidate", very often there are a number of plausible candidates. This is just how language works - there are multiple ways to phrase things, and you can't have the model give up every time there is a choice of synonyms.
    2) Probability of a token is not the same as probability of a fact. Consider a language model that knows the approximate population of Paris (2 million) but is not confident about the exact figure. Feed such a model the string "The exact population of Paris is" and it will begin with "2" but halfway through the number it will have a more or less arbitrary choice of 10 digits. "2.1I don't know" is neither a desirable answer, nor a plausible one from the model's perspective.
    darkPotato9 months ago
    My understanding is that the hallucination is, out of all the possibilities, the most probable one (ignoring temperature). So the hallucination is the most probable sequence of tokens at that point. The model may be able to predict an "I don't have that information" given the right context. But ensuring that in general is an open question.
    viraptor9 months ago
    > Doesn't an LLM pick the "most probable next symbol"
    Yes, but that very rarely matters. (Almost never when it's brought up in discussions)
    > Couldn't it then, if the probability falls below some threshold, say "I don't know" instead of giving what it knows is a low-probability answer?
    A low probability doesn't necessarily mean something's incorrect. Responding to your question in French would also have very low probability, even if it's correct. There's also some nuance around what's classified as a hallucination... Maybe something in the training data did suggest that answer as correct.
    There are ideas similar to this one though. It's just a bit more complex than pure probabilities going down. https://arxiv.org/abs/2405.19648
    anticensor9 months ago
    > Responding to your question in French would also have very low probability, even if it's correct.
    It's actually a common utterance in Paris.
    anon2919 months ago
    You need to separate out the LLM, which only produces a set of probabilities, from the system, which includes the LLM and the sampling methodology. Sampling is currently not very intelligent at all.
    The next bit of confusion is that the 'probability' isn't 'real'. It's not an actual probability but a weight that sums up to one, which is close enough to how probability works that we call it that. However, sometimes there are several good answers and so all the good answers get a lower probability because there are 5 of them. A fixed threshold is not a good idea in this case. Instead, smarter sampling methods are necessary. One possibility is that if we do have seeming confusion, to put a 'confusion marker' into the text and predict the next output and train models to refine the answer as they go along. Not sure if any work has been done here, but this seems to go along with what you're interested in
    viraptor9 months ago
    > However, sometimes there are several good answers and so all the good answers get a lower probability because there are 5 of them.
    That's the result after softmax. If you want to act on the raw results, you can still do that.
    anon2919 months ago
    The results before softmax don't sum to one so don't even act like a probability distribution. And that's the point. When you have the pre-softmax activations, there are infinitely many ways to convert them to something probability-like. You can normalize them after taking the square root, the square, raising to three, etc. Or you can exponentiate and for some reason that does better. Either way it's not a 'real' probability distribution.
    ithkuil9 months ago
    This may work when the next token is a key concept but when it's a filler word or a part of one of many sequences of words that can convey the same meaning but in different ways (synonyms but not only at the word also at the sentence levels) then it's harder to know whether the probability is low because the word is absolutely unlikely or because it's likelihood is spread/shared among other truthful statements
    skydhash9 months ago
    You would need some kind of referential facts that you hold as true, then some introspection method to align sentences to those. if it can’t be done, the output may be “I don’t know”. But even for programming languages (simplest useful languages), it would be hard to do.
    PaulHoule9 months ago
    My guess is the problem is words with high probabilities that happen to be part of a wrong answer.
    For one thing the probability of a word occurring is just a probability of the word occurring in a certain sample, it's not an indicator of truth. (e.g. the most problematic concept in philosophy in that just introducing it undermines the truth, see "9/11 truther") It's also not sufficient to pick a "true" word or always pick a "true" word but rather the truthfulness of a statement needs to be evaluated based on the statement as a whole.
    A word might have a low probability because it competes with a large number of alternatives that are equally likely which is not a reason to stop generation.
    visarga9 months ago
    This reminds me it's easy to train similarity models, hard to train identity/equivalence prediction. Two strings can be similar in many ways, like "Address Line 1" and "Address Line 2" or "Position_X" and "Position_Y", yet distinct in meaning. That one character makes all the difference. On the other hand "Vendor Name" is equivalent with "Seller Company" even though they are pretty different lexically.
    The dot product, which is at the core of attention, is good for similarity not identity. I think this is why models hallucinate - how can they tell the distinction between "I have trained on this fact" and "Looks like something I trained on".
    atrus9 months ago
    I don't think that fixes it, even in theory, since there's always some uncertainty.
- watsonmusic9 months ago
  that would be huge!
nowayno5839 months ago
Does anyone understand why they are taking the difference between transformers instead of the sum? It seems to me that in a noise reducing solution we would be more interested in the sum, as random noise would cancel out and signal would be constructive.
Of course, even if I'm right proper training would account to that by inverting signs where appropriate. Still, it seems weird to present it as the difference, especially seeing as they compare this directly to noise cancelling headphones, where we sum both microphones inputs.
- aDyslecticCrow9 months ago
  The noise isn't truly random; it's just a matrix of small values that shouldn't be taken into account. Subtracting them cancels them out.
  As pointed out by a different comment, it's actually the attention we are interested in that is cancelled out *if they are both equal*. This is what the paper mentions in its abstract;
  > promoting the emergence of sparse attention patterns
  In theory, it is quite clever, and their results seem to back it up.
  - watsonmusic9 months ago
    [dead]
- thegeomaster9 months ago
  I suspect that plus vs minus is arbitrary in this case (as you said, due to being able to learn a simple negation during training), but they are presenting it in this way because it is more intuitive. Indeed, adding two sources that are noisy in the same way just doubles the noise, whereas subtracting cancels it out. It's how balanced audio cables work, for example.
  But with noise cancelling headphones, we don't sum anything directly---we emit an inverted sound, and to the human ear, this sounds like a subtraction of the two signals. (Audio from the audio source, and noise from the microphone.)
  - nowayno5839 months ago
    Oh! It's been a good while since I've worked in noise cancelling. I didn't know current tech was at the point where we could do direct reproduction of the outside noise, instead of just using mic arrays! That's very cool, it used to be considered totally sci fi to do it fast enough in a small headset.
badsandwitch9 months ago
What is purpose of the lambda parameter? Why isn't it a constant of 1?
esafak9 months ago
How is this different than using a sparsity-inducing prior?
magicalhippo9 months ago
The visualization reveals that Transformer tends to allocate only a small proportion of attention scores to the correct answer, while disproportionately focusing on irrelevant context.
[...] Specifically, we partition the query and key vectors into two groups and compute two separate softmax attention maps. Then the result of subtracting these two maps is regarded as attention scores.
[...] The approach is analogous to noise-canceling headphones and differential amplifiers in electrical engineering, where the difference between two signals cancels out common-mode noise.
Simple change, with seemingly decent improvements across the board.
- 9 months ago
  undefined
- watsonmusic9 months ago
  [flagged]
campers9 months ago
The tl;dr on high level performance improvements
"The scaling curves indicate that Diff Transformer requires only about 65% of model size or training tokens needed by Transformer to achieve comparable language modeling performance."
"Diff Transformer retains high performance even at reduced bit-widths, ranging from 16 bits to 6 bits. In comparison, Transformer’s accuracy significantly drops with 6-bit quantization. The 4-bit Diff Transformer achieves comparable accuracy as the 6-bit Transformer, and outperforms the 4-bit Transformer by about 25% in accuracy."
- 9 months ago
  undefined
breukh9 months ago
[flagged]
ExxKA9 months ago
Very interesting. Currently working on timeseries with Transformers. Let me know if anyone else out there is also reading it from that context.
- d3m0t3p9 months ago
  Really cool, I'm a CS majoring in AI, but I'm also interested in that domain, would you have any recommendation to get started ?
  - ExxKA9 months ago
    Get a lot of data, and just dig in :) No better way to learn.