1. Replace the DDN layer with a flow between images and a latent variable. During training, compute in the direction image -> latent. During inference, compute in the direction latent -> image.
2. For your discrete options 1, ..., k, have trainable latent variables z_1, ..., z_k. This is a "codebook".
Training looks like the following: Start with an image and run a flow from the image to the latent space (with conditioning, etc.). Find the closest option z_i, and compute the L2 loss between z_i and your flowed latent variable. Additionally, add a loss corresponding to the log determinant of the Jacobian of the flow. This second loss is the way a normalizing flow avoids mode collapse. Finally, I think you should divide the resulting gradient by the softmax of the negative L2 losses for all the latent variables. This gradient division is done for the same reason as dividing the gradient when training a mixture-of-experts model.
During inference, choose any latent variable z_i and flow from that to a generated image.
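Roughly, as a minimal PyTorch sketch of the training step I'm proposing (the single coupling layer is a toy stand-in for a real flow, and all names and sizes here are made up):

```python
import torch
import torch.nn as nn

class AffineCouplingFlow(nn.Module):
    """Toy invertible flow: a single affine coupling layer (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, x):                        # image -> latent (training)
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        z2 = x2 * torch.exp(log_s) + t
        return torch.cat([x1, z2], dim=1), log_s.sum(dim=1)  # z, log|det J|

    def inverse(self, z):                        # latent -> image (inference)
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=1)
        return torch.cat([z1, (z2 - t) * torch.exp(-log_s)], dim=1)

dim, k = 8, 4
flow = AffineCouplingFlow(dim)
codebook = nn.Parameter(torch.randn(k, dim))     # trainable z_1, ..., z_k

def training_step(x):
    z, logdet = flow(x)
    d2 = ((z.unsqueeze(1) - codebook) ** 2).sum(-1)        # (B, k) squared L2
    i = d2.argmin(dim=1)                                   # closest option z_i
    l2 = d2.gather(1, i[:, None]).squeeze(1)
    # MoE-style gradient scaling: divide by softmax of the negative L2 losses.
    w = torch.softmax(-d2, dim=1).gather(1, i[:, None]).squeeze(1).detach()
    # Subtracting log|det J| is the standard normalizing-flow term that
    # discourages collapsing all inputs onto the codebook points.
    return (l2 / w - logdet).mean()

# Inference: pick any z_i and invert, e.g. image = flow.inverse(codebook[0:1])
```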
1. DDN doesn't need to be invertible.
2. Its latent is discrete, not continuous.
3. As far as I know, a flow keeps input and output the same size so it can compute log|detJ|. DDN's latent is 1-D and discrete, so that condition fails.
4. To me, "hierarchical many-shot generation + split-and-prune" is simpler and more general than "invertible design + log|detJ|."
5. Your design seems to have abandoned the defining characteristics of DDN (ZSCG, the 1-D tree latent, lossy compression).
The two designs start from different premises and are built differently. Your proposal would change so much that whatever came out wouldn’t be DDN any more.
The best way I can summarize it is a Mixture-of-Experts combined with an 'x0-target' latent diffusion model. The main innovations are the guided sampler (rather than a router) and the Split-and-Prune optimizer, which make it easier to train.
(This is mentioned in Q1 in the "Common Questions About DDN" section at the bottom.)
- There are no experts. The outputs are approximating random samples from the distribution.
- There is no latent diffusion going on. It's using convolutions similar to a GAN.
- At inference time, you select the sample index ahead of time, so you don't discard any computation.
Supplement for @f_devd:
During training, the K outputs share the stem feature from the NN blocks, so generating the K outputs costs only a small amount of extra computation. After L2-distance sampling, discarding the other K-1 outputs therefore incurs a negligible cost and is not comparable to discarding K-1 MoE experts (which would be very expensive).
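A minimal sketch of why the K outputs are cheap (illustrative shapes only, not the actual DDN code):

```python
import torch
import torch.nn as nn

B, C, H, W, K = 8, 64, 32, 32, 16

stem = nn.Sequential(nn.Conv2d(3, C, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(C, C, 3, padding=1), nn.ReLU())
head = nn.Conv2d(C, 3 * K, 1)          # one cheap 1x1 conv emits all K samples

x = torch.randn(B, 3, H, W)
feat = stem(x)                          # the expensive part, computed once
outs = head(feat).view(B, K, 3, H, W)   # K candidate outputs, nearly free

gt = torch.randn(B, 3, H, W)            # ground truth guides the sampling
d2 = ((outs - gt.unsqueeze(1)) ** 2).flatten(2).sum(-1)   # (B, K) L2 distances
pick = d2.argmin(dim=1)                 # keep 1 output; the other K-1 cost ~0
chosen = outs[torch.arange(B), pick]
```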
The ahead-of-time sampling doesn't make much sense to me mechanically, and isn't really discussed much. But I will reserve judgement for future versions, since the FID performance of this first iteration is still not that great.
Why is this possible? I tend to think of it as reflecting the ability to 'memorize' all possible data, and the independent generation is just when you 'remember' a specific part of a memory. The latent space is a Platonic object which doesn't change, so why should your generative process for materializing any specific point in the latent space have to? It's not surprising if you could generate arbitrary points from a function like 'y = mx + b' without generating every other point, right? It's just an atemporal mathematical object. Similarly with 'generating images from a random seed'. They too are just (complicated) functions mapping one number to another number.
(You might wonder if this is limited to images? It is not. In fact, you can generate even natural language like this to some degree: https://github.com/ethan-w-roland/AUNN based on my proposal for taking the 'independent generation' idea to a pathological extreme: https://gwern.net/aunn )
3x3 convolution is the most common operator used to provide basic computational power.
Looking at the examples below the abstract, there are several details that surprise me with how correct the model is. For example: the hairline in row 2, column 3; the shirt color in row 2, columns 7, 8, 9, and 11; the lipstick throughout rows 4 and 6; the face/hair position and shape in row 6, column 4. Of particular note is the red in the bottom left of row 6, column 4. It's a bit surprising—but still understandable—that the model realized there is something red there, but it's very surprising that it chose to put the red blob in exactly the right spot.
I think some of this can be explained by bias in the dataset (e.g. lipstick) and cherry picking on my part (I'm ignoring the ones it got wildly wrong), but I can't explain the red shoulder strap/blob. Is there any possibility of data leakage and/or overfitting of a biased dataset, or are these just coincidences?
Incidentally, I feel the red shoulder strap/blob is reconstructed rather poorly.
Discrete hierarchical representations are super cool. The pattern of activations across layers amounts to a “parse tree” for each input. You have effectively compressed the image into a short sequence of integers.
The one shown on their page is L=3.
A genuinely interesting and novel approach, I'm very curious how it will perform when scaled up and applied to non-image domains! Where's the best place to follow your work?
Is that right?
If so, how is this increased size handled by each downstream DDL?
Or, is there a 2x pooling in the "concat" step so that the final size remains unchanged?
Much like DiffusionDet, which applies diffusion models to detection, DDN can adopt the same philosophy. I expect DDN to offer several advantages over diffusion-based approaches:
- Single forward pass to obtain results, no iterative denoising required.
- If multiple samples are needed (e.g., for uncertainty estimation), DDN can directly produce multiple outputs in one forward pass.
- Easy to impose constraints during generation thanks to DDN's Zero-Shot Conditional Generation capability.
- DDN supports more efficient end-to-end optimization, and is thus more suitable for integration with discriminative models and reinforcement learning.
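As a rough sketch of the uncertainty-estimation point (hypothetical shapes, not tied to any particular detector):

```python
import torch

# K candidate detections from a single DDN forward pass, e.g. K boxes per
# object as a (B, K, 4) tensor. The spread across the K samples gives a
# per-coordinate uncertainty estimate with no iterative denoising.
boxes = torch.rand(2, 16, 4)            # stand-in for DDN's K outputs
mean_box = boxes.mean(dim=1)            # point estimate, shape (B, 4)
uncertainty = boxes.std(dim=1)          # spread across samples, shape (B, 4)
```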
> Inspired by the theory of *evolution and genetic algorithms*, we propose the Split-and-Prune algorithm to address the above issues, as outlined in algorithm 1.
Now I just need a time-turner.
The authors benefit from having "testimonials" of how anonymous reviewers interpreted their work, and it also opens the door for people outside the classic academic pipeline to see the behind-the-scenes arguments for accepting or rejecting a paper.
Here are the reviews for this paper btw: https://openreview.net/forum?id=xNsIfzlefG
And here's a list of all the rejected papers: https://openreview.net/group?id=ICLR.cc/2025/Conference#tab-...
I've added it to my reading list.
Thank you for sharing it on HN.
I made an initial attempt to combine [DDN with GPT](https://github.com/Discrete-Distribution-Networks/Discrete-D...), aiming to remove tokenizers and let LLMs directly model binary strings. In each forward pass, the model adaptively adjusts the byte length of generated content based on generation difficulty (naturally supporting speculative sampling).
> To our knowledge, Taiji-DDN is the first generative model capable of directly transforming data into a semantically meaningful binary string which represents a leaf node on a balanced binary tree.
This property excites me just as much.
And, their work is far more polished; I’ve only put together a quick GPT+DDN proof-of-concept.
Thank you for sharing.
The current goal of the research is scaling up. Here are some thoughts in the blog about future directions: https://github.com/Discrete-Distribution-Networks/Discrete-D...
One of the examples is even a very effective re-colorizer that beats other approaches I've seen, with less risk of modifying the subject. It's clever and simple.
It's compared more with GANs in the article than with diffusion, and that excites me. GANs are badly behaved, but they are really powerful reinforcement learners. If this method can compensate for the greatest bane of GANs (mode collapse), it could be very useful.
- The DDN single-shot generator architecture is more efficient than diffusion.
- DDN is fully end-to-end differentiable, allowing for more efficient optimization when integrated with discriminative models or reinforcement learning.
- Moreover, DDN inherently avoids mode collapse.
These points are all mentioned in the blog: https://github.com/Discrete-Distribution-Networks/Discrete-D...
In DDN, the GT is only used to calculate the loss and guide sampling; it never becomes an input to the model.
Is the inference cost of generating this tree to be pruned something of a hindrance? In particular, I'm looking at your MNIST example and thinking: does each cell in that video require a full inference? Or is this at least done in parallel? In any case, you're basically trading memory for "faster" runtime (for more correct outputs), no?
And it's recommended to combine it with an autoregressive model (GPT) for more powerful modeling capabilities.
International Conference on Learning Representations
https://en.wikipedia.org/wiki/International_Conference_on_Le...
Similarities:
- Both map data to a discrete latent space.
Differences:
- VQ-VAE needs an external prior over code indices (e.g. PixelCNN or a hierarchical prior) to model the distribution. DDN builds its own hierarchical discrete distribution and can even act as the prior for a VQ-VAE-like system.
- DDN's K outputs are features that change with the input; VQ-VAE's codebook is a set of independent parameters (embeddings) that remain fixed regardless of the input.
- VQ-VAE produces a 2-D grid of code indices; DDN yields a 1-D/tree-structured latent.
- VQ-VAE needs the Straight-Through Estimator.
- DDN supports zero-shot conditional generation.
So I'd call them complementary rather than "80% the same." (See the paper's "Connections to VQ-VAE.")
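To make the second difference concrete, a toy sketch (hypothetical shapes, not the real implementation):

```python
import torch
import torch.nn as nn

B, D, K = 4, 32, 8
x = torch.randn(B, D)                    # input (or the previous layer's feature)
target = torch.randn(B, D)               # ground truth that guides selection

# VQ-VAE: the K candidate vectors are fixed parameters, identical for every input.
vq_codebook = nn.Embedding(K, D)
idx_vq = torch.cdist(x, vq_codebook.weight).argmin(dim=1)

# DDN: the K candidates are features *produced from the input*, so they differ
# per input; the network effectively regenerates its "codebook" at every layer.
ddn_head = nn.Linear(D, K * D)
cands = ddn_head(x).view(B, K, D)        # input-dependent candidates
idx_ddn = (cands - target.unsqueeze(1)).pow(2).sum(-1).argmin(dim=1)
```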
Both the "feature" and the "selected output" are designed to be passed to the next layer.
I never even ran an ablation that disabled the stem features; I assume the network would still train without them, but since the previous layer has already computed the features, it would be wasteful not to reuse them. Retaining the stem features also lets DDN adopt the more efficient single-shot-generator architecture.
Another deeper reason is that, unlike diffusion models, DDN does not need the Markov-chain property between adjacent layers.
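A rough sketch of the wiring (my simplification, not the actual code):

```python
import torch
import torch.nn as nn

class DDNLayerSketch(nn.Module):
    """Each layer consumes the previous stem feature *and* the previously
    selected output, so adjacent layers need not squeeze all information
    through the selected output alone (no Markov-chain requirement)."""
    def __init__(self, C, K):
        super().__init__()
        self.K = K
        self.body = nn.Sequential(nn.Conv2d(C + 3, C, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(C, 3 * K, 1)

    def forward(self, feat, prev_out, gt):
        feat = self.body(torch.cat([feat, prev_out], dim=1))  # reuse stem feature
        B, H, W = feat.size(0), feat.size(2), feat.size(3)
        outs = self.head(feat).view(B, self.K, 3, H, W)
        d2 = ((outs - gt.unsqueeze(1)) ** 2).flatten(2).sum(-1)
        pick = d2.argmin(dim=1)
        return feat, outs[torch.arange(B), pick]              # both go onward
```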
> Many high rated papers would have been done by someone else if their authors never published them or were rejected. However, if this paper is not published, it is not likely that anyone would come up with this approach. This is real publication value. I am reminding again the original diffusion paper from 2015 (Sohl-Dickstein) that was almost not noticed for 5 years. Had it not been published, would we have had the amazing generative models we have today?
Quoted from: https://openreview.net/forum?id=xNsIfzlefG&noteId=Dl4bXmujh1
Also, we compared DDN with other approaches, including VQ-VAE, in Table 1 of the original paper.
Here are my thoughts on the statistics behind this. First, let D be the data sample, and write O_N for the model output at step N. Start with the expectation of -Log[P(D)] (the standard generative-model objective).
We then condition on the model output at step N:
= - expectation of Log[Sum over O_N of P(D | O_N) * P(O_N)]
Now use Jensen's inequality (the sum over O_N is an expectation under P(O_N)) to transform this to
<= - expectation of Sum over O_N of P(O_N) * Log[P(D | O_N) * P(O_N)]
Apply the log product-to-sum rule:
= - expectation of Sum over O_N of P(O_N) * (Log[P(D | O_N)] + Log[P(O_N)])
If we assume there is some normally distributed noise, the first term becomes the standard L2 objective (up to scale and an additive constant), since -Log[P(D | O_N)] is proportional to the L2 distance. Note this flips the sign on the L2 term:
= expectation of Sum over O_N of P(O_N) * (L2 distance(D, O_N) - Log[P(O_N)])
Apply linearity of expectation:
= expectation of Sum over O_N of P(O_N) * L2 distance(D, O_N) - expectation of Sum over O_N of P(O_N) * Log[P(O_N)]
and the weighted sums can be replaced with sampling O_N:
= expectation of L2 distance(D, O_N) - expectation of Log[P(O_N)]
Now, focusing on just the - expectation of Log[P(O_N)] term,
condition on the prior step to get
= - expectation of Log[Sum over O_{N-1} of P(O_N | O_{N-1}) * P(O_{N-1})]
Now, each conditional P(O_T | O_{T-1}) is approximately equal to 1/K. This is enforced by the Split-and-Prune operations, which try to keep each of the K outputs sampled at roughly equal frequency.
So this is approximately equal to
≈ - expectation of Log[Sum over O_{N-1} of (1/K) * P(O_{N-1})]
And you get an upper bound by keeping only the actually-sampled O_{N-1} in the sum:
<= - expectation of Log[(1/K) * P(O_{N-1})]
And applying some log rules you get
= Log(K) - expectation of Log[P(O_{N-1})]
So we have (approximately) - expectation of Log[P(O_N)] <= Log(K) - expectation of Log[P(O_{N-1})]. You can apply this transformation repeatedly down to step 0 to get
(approximately) - expectation of Log[P(O_N)] <= N * Log(K) - expectation of Log[P(O_0)]
and, assuming the step-0 output is deterministic so that P(O_0) = 1, this gives
- expectation of Log[P(O_N)] <= N * Log(K)
Plugging this back into the main objective, we get (assuming the Split-and-Prune is perfect):
expectation of -Log[P(D)] <= expectation of L2 distance(D, O_N) + N * Log(K)
And this makes sense. You are providing the model with an additional Log_2(K) bits of information every time you perform an argmin operation, so in total you have provided the model with N * Log_2(K) bits of information. However, this term is constant, so you can ignore it in the gradient-based optimizer.
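For readability, here is the final chain restated compactly in LaTeX (my notation: $o_t$ is the sampled output at step $t$, and the L2 term absorbs the Gaussian-noise constants):

```latex
\begin{align}
\mathbb{E}\left[-\log P(D)\right]
  &\le \mathbb{E}\left[\lVert D - o_N \rVert_2^2\right]
     - \mathbb{E}\left[\log P(o_N)\right], \\
-\mathbb{E}\left[\log P(o_t)\right]
  &\le \log K - \mathbb{E}\left[\log P(o_{t-1})\right],
     \qquad t = 1, \dots, N, \\
\intertext{so unrolling the recursion with $P(o_0) = 1$ gives}
\mathbb{E}\left[-\log P(D)\right]
  &\le \mathbb{E}\left[\lVert D - o_N \rVert_2^2\right] + N \log K .
\end{align}
```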
So, given this analysis my conclusions are:
1) The Split-and-Prune is a load-bearing component of the architecture with regard to its statistical correctness. I'm not entirely sure how it fits with the gradient-based optimizer. Is it working with the gradient-based optimizer, fighting it, or somewhere in the middle? I think the answer to this question will strongly affect this approach's scalability. A more in-depth analysis is also needed to study how deviations from perfect splitting affect the upper bound on the loss.
2) With regard to statistical correctness, the L2 distance between the output at step N and D is the only one that matters; the L2 losses in the middle layers can be considered auxiliary losses. Maybe the final L2 loss, or the L2 losses deeper in the model, should be weighted more heavily? In the final evaluation the intermediate L2 losses can be ignored.
3) Future possibilities could include some sort of RL to determine the number of samples K and the depth N on a dynamic basis. Even a split with K=2 increases the NLL bound by Log_2(2) = 1 bit. For many samples, past a certain depth the increase in loss due to the additional information outweighs the decrease in L2 loss. This also points to another difficulty: it is hard to provide fractional information in this Discrete Distribution Network architecture. In contrast, diffusion models and autoregressive models can handle fractional bits. This could be another point of future development.
The model can be thought of as N Discrete Distribution Networks, one of each depth from 1 to N, stacked on top of each other and trained simultaneously.
This is the bad case I am concerned about.
Layer 1 -> (A, B)
Layer 2 -> (C, D)
Let's say Layer 1 outputs A and B each with probability 1/2 (a perfect split). Now, Layer 2 outputs C when it gets A as input and D when it gets B as input. Layer 2 is then outputting each option with probability 1/2, but not with probability 1/2 when conditioned on the output of layer 1.
If this happens, the claim of exponential increase in diversity each layer breaks down.
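A toy simulation of this failure mode: layer 2's marginal looks perfectly balanced even though, conditioned on layer 1, it has collapsed.

```python
from collections import Counter
import random

random.seed(0)
joint = Counter()
for _ in range(10_000):
    l1 = random.choice(["A", "B"])     # perfect 50/50 split at layer 1
    l2 = "C" if l1 == "A" else "D"     # layer 2 collapses conditionally
    joint[(l1, l2)] += 1

print(joint)      # only (A, C) and (B, D) ever occur: 2 distinct paths, not 4
marginal2 = Counter(l2 for (_, l2) in joint.elements())
print(marginal2)  # yet the layer-2 marginal is a balanced ~50/50
```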
It could be that the first-order approximation provided by Split-and-Prune is good enough. My guess though is that the gradient and the Split-and-Prune are helping each other to keep the outputs reasonably balanced on the datasets you are working on. The Split-and-Prune lets the optimization process "tunnel" through regions of the loss landscape that would otherwise make it hard to balance the classes.
The zero-shot conditional generation part is wild. Most methods rely on gradients or fine-tuning, so I wonder what makes DDN tick there. Maybe the tree structure of the latent space helps navigate to specific conditions without needing retraining? Also, I'm intrigued by the 1D discrete representation—how does that even work in practice? Does it make the model more interpretable?
The Split-and-Prune optimizer sounds new—I'd love to see how it performs against Adam or SGD on similar tasks. And the fact that it's fully differentiable end-to-end is a big plus for training stability.
I also wonder about scalability—can this handle high-res images without blowing up computationally? The hierarchical approach seems promising, but I'm not sure how it holds up when moving from simple distributions to something complex like natural images.
Overall though, this feels like one of those papers that could really shift the direction of generative models. Excited to dig into the code and see what kind of results people get with it!
1. The comparison with GANs and the issue of mode collapse are addressed in Q2 at the end of the blog: https://github.com/Discrete-Distribution-Networks/Discrete-D...
2. Regarding scalability, please see “Future Research Directions” in the same blog: https://github.com/Discrete-Distribution-Networks/Discrete-D...
3. Answers or relevant explanations to any other questions can be found directly in the original paper (https://arxiv.org/abs/2401.00036), so I won’t restate them here.
Very nitpicky comment, but I personally find such things to make for a bad impression. To be more specific, branching structures are a fairly universal idea, so the choice of relating them to ancient proverbs instead of something more mundane raises an eyebrow.
Common names will always be influenced by the author's culture, so it seems unfair to exclude a name based on any individual's opinion that it is or isn't a mundane choice.
Unless you would also exclude names based on their relationship to old but culturally relevant texts in the Western tradition, e.g., the Bible.