If you can create a general machine that can take 3 examples and synthesize a program that predicts the 4th, you've just solved oracle synthesis. If you train a network on all human knowledge, including puzzle making, and then fine-tune it on 99% of the dataset and give it a dozen attempts for the last 1%, you've just made an expensive compressor for test-maker's psychology.
The true test of an AGI is its ability to assimilate disparate information into a coherent world-view, which is effectively what pretraining is doing. And even then, it is likely that any intelligence capable of doing that will need to be structurally "preloaded" with assumptions about the world it will occupy, similar to the regions of the brain which are adept at understanding spatial relationships, or language, or interpreting our senses, etc.
It doesn't have to actually "think" as long as it can present an indistinguishable facsimile, but if you have to rebuild its training set for each task, that does not qualify. We don't reset human brains from scratch to pick up new skills.
Is not
https://en.wikipedia.org/wiki/Artificial_general_intelligenc...
Also
People keep thinking "General" means one AI can "do everything that any human can do everywhere all at once".
When really, humans are also pretty specialized. Humans have years of 'training' to do a 'single job', and they do not easily switch tasks.
What? Humans switch tasks constantly and incredibly easily. Most "jobs" involve doing so rapidly many times over the course of a few minutes. Our ability to accumulate knowledge of countless tasks and execute them while improving on them is a large part of our fitness as a species.
You probably did so 100+ times before you got to work. Are you misunderstanding the context of what a task is in ML/AI? An AI does not get the default set of skills humans take for granted; it's starting as a blank slate.
You don't have a human spend years getting an MBA, then drop them in a Physics Lab and expect them to perform.
But that is what we want from AI: to do 'all' jobs as well as any individual human does their one job.
There are steps of automation that could fulfill that requirement without ever being AGI - it’s theoretically possible (and far more likely) that we achieve that result without making a machine or program that emulates human cognition.
It just so happens that our most recent attempts are very good at mimicking human communication, and thus are anthropomorphized as being near human cognition.
I'm just making the point that, for AI "General" Intelligence, humans are also not as "general" as we assume in these discussions. Humans are also limited in a lot of ways, narrowly trained, make stuff up, etc...
So even a human isn't necessarily a good example of what AGI would mean. A human is not a good target either.
Humans are extremely general. Every single type of thing we want an AGI to do is a type of thing that a human is good at doing, and none of those humans were designed specifically to do that thing. It is difficult for humans to move from specialization to specialization, but we do learn them with only the structure to "learn, generally" being our scaffolding.
What I mean by this is that we do want AGI to be general in the way a human is. We just want it to be more scalable. Its capacity for learning does not need to be limited by material issues (i.e. physical brain matter constraints), time, or timescale.
So where a human might take 16 years to learn how to perform surgery well, and then need another 12 years to switch to electrical engineering, an AGI should be able to do it the same way, but with the timescale only limited by the amount of hardware we can throw at it.
If it has to be structured from the ground up for each task, it is not a general intelligence, it's not even comparable to humans, let alone scalable beyond us.
Whereas today those things are being done, but by specialized architectures, models, and combinations of methods.
Then that would be a 'general' intelligence: one type of model that can do either, trained to be an engineer or a doctor. And like a human, once trained, it might not do the other job well. But both did start with the same 'tech', just as humans all have the same architecture in the 'brain'.
I don't think it will be an LLM, it will be some combo of methods in use today.
Ok. I'll buy that. I'm not sure everyone is using 'general' in that way. I think more often people imagine a single AI instance that can do everything/everywhere/all at once: be an engineer and a doctor at the same time. Since it can do all the tasks at the same time, it is 'general'. Since we are making AIs that can do everything, it could have a case statement inside to switch models (half joking). At some point all the different AI methods will be incorporated together and will appear even more human/general.
Humans are a good target since we know human intelligence is possible; it's much easier to target something that is possible rather than some imaginary intelligence.
The model should learn the rule, don't make a model based on the rules. When you make a model based on the rules then it isn't a general model.
Human DNA isn't made to play tennis, but a human can still learn to play it. The same should be for a model, it should learn it, the model shouldn't be designed by humans to play tennis.
> humans CAN do.
I think people often get confused with claims:
- Humans CAN generalize
- Humans CAN reason
- Humans CAN be intelligent
- Humans CAN be conscious
Generalization[0] is something MOST humans CAN do, but MOST humans DO NOT do. Do not confuse "can" and "are". One of my pet peeves is how often qualifying words are just ignored. They are critical parts of any communication.[1]
Another pet peeve is over anthropomorphization. Anthropomorphism is a useful tool, but.. well... we CAN over generalize ;)
[0] I don't know what you mean by "true generalization". I'm not going to address that because you can always raise the bar for what is "true" and let's try to be more concrete. Maybe I misunderstand. I definitely misunderstand.
[1] Classic example: someone says "most x are y", then there's a rebuttal of "but x_1 isn't y" or "I'm x and I'm not y" or some variation. Great! Most isn't all. This is not engaging in good faith, and there are examples like this with any qualifying word. It is quite common to see.
A short list of abilities that cannot be performed by CompressARC includes:
Assigning two colors to each other (see puzzle 0d3d703e)
Repeating an operation in series many times (see puzzle 0a938d79)
Counting/numbers (see puzzle ce9e57f2)
Translation, rotation, reflections, rescaling, image duplication (see puzzles 0e206a2e, 5ad4f10b, and 2bcee788)
Detecting topological properties such as connectivity (see puzzle 7b6016b9)
Note: I am not saying newborns can solve the corresponding ARC problems! The point is there is a lot of evidence that many of the concepts ARC-AGI is (allegedly) measuring are innate in humans, and maybe most animals; e.g. cockroaches can quickly identify connected/disconnected components when it comes to pathfinding. Again, not saying cockroaches can solve ARC :) OTOH even if orcas were smarter than humans they would struggle with ARC - it would be way too baffling and obtuse if your culture doesn't have the concept of written standardized tests. (I was solving state-mandated ARCish problems since elementary school.) This also applies to hunter-gatherers, and note the converse: if you plopped me down among the Khoisan in the Kalahari, they would think I was an ignorant moron. But it makes as much sense scientifically to say "human-level intelligence" entails "human-level hunter-gathering" instead of "human-level IQ problems."

I'd argue that "innate" here still includes a brain structure/nervous system that evolved on 3.5 billion years worth of data. Extensive pre-training of one kind or another currently seems the best way to achieve generality.
Each new training from scratch is a perfect blank slate and the only thing ensuring words come out is the size of the corpus?
I don't think training runs are done entirely from scratch.
Most training runs in practice will start from some pretrained weights or distill an existing model - taking some model pretrained on ImageNet or Common Crawl and fine-tuning it to a specific task.
But even when the weights are randomly initialized, the hyperparameters and architectural choices (skip connections, attention, ...) will have been copied from previous models/papers by what performed well empirically, sometimes also based on trying to transfer our own intuition (like stacking convolutional layers as a rough approximation of our visual system), and possibly refined/mutated through some grid search/neural architecture search on data.
All organisms are born pre-trained because if you can't hide or survive the moment you're born, you get eaten.
uhhh... no, most newborns can't "hide or survive the moment they're born", no matter the species. I'm sure there are a few examples, but I seriously doubt it's the norm.
Many species survive by reproducing en masse, where it takes many (sometimes thousands of) eaten offspring for one to survive to adulthood.
Makes sense though, I’m pretty sure mammals don’t do well with the insects and diseases that come with a waste-saturated bed.
I’m not sure we would call anyone intelligent today if they had no education. Intelligence relies on building blocks that are only learned, and the “masters” of certain fields are drawing on decades and decades of learnings about their direct experience.
So our best examples of intelligence include experience, training, knowledge, evolutionary factors, what have you — so we probably need to draw on that to create a general intelligence. How can we expect to have an intelligence in a certain field if it hasn’t spent a lot of time “ruminating on”/experiencing/learning about/practicing/evolving/whatever, on those types of problems?
That quote about how "the only intuitive interface ever devised was the nipple"? Turns out there's still a fair bit of active training required all around to even get that going. There's no such thing as intuitive, only familiar.
Yes, they enjoy millions of years of pretraining thanks to evolution, ie. their pretrained base model has some natural propensity for visual, auditory, and tactile sensory modalities, and some natural propensity for spatial and causal reasoning.
So like humans after all but faster.
I guess it's just hard to write a book about the way you write that book.
There's a reason people with comparable intelligence operate at varying degrees of effectiveness, and it has to do with how knowledgeable they are.
This paper claimed transformers learn a gradient-descent mesa-optimizer as part of in-context learning, while being guided by the pretraining objective, and as the parent mentioned, any general reasoner can bootstrap a world model from first principles.
I guess a superset. But it doesn't really matter either way. Ultimately, there's no useful distinction between pretraining and in-context learning. They're just an artifact of the current technology.
And no, I don't think the knowledge of language is necessary. To give a concrete example, tokens from TinyStories dataset (the dataset size is ~1GB) are known to be sufficient to bootstrap basic language.
This is pretty vague. I certainly don't think mastery of any concept invented in the last thousand years would be considered encoded in genes, though we would want or expect an AGI to be able to learn calculus, for instance. In terms of "encoded in genes", I'd say most of what is asked or expected of AGI goes beyond what feral children (https://en.wikipedia.org/wiki/Feral_child) were able to demonstrate.
There are a few orders of magnitude more neural connections in a human than there are base pairs in a human genome. I would also assume that there are more than 4 possible ways for neural connections to be formed, while there are only 4 possible base pairs. Also, most genetic information corresponds to lower level functions.
For that matter, if it had no pre-training, it means it can generalize to any new programming languages, libraries, and entire tasks. You can use it to analyze the grammar of a dying African language, write stories in the style of Hemingway, and diagnose cancer on patient data. In all of these, there are only so many samples to fit on.
But I do have enough knowledge to know what an IDE is, and where that sits in a technological stack, i know what a string is, and all that it relies on etc. There's a huge body of knowledge that is required to even begin approaching the problem. If you posted that challenge to an intelligent person from 2000 years ago, they would just stare at you blankly. It doesn't matter how intelligent they are, they have no context to understand anything about the task.
Depending on how you pose it. If I give you a long enough series of ordered cards, you'll on some basic level begin to understand the spatiotemporal dynamics of them. You'll get the intuition that there's a stack of heads scanning the input, moving forward each turn, either growing the mark, falling back, or aborting. If not constrained by using matrices, I can draw you a state diagram, which would have much clearer immediate metaphors than colored squares.
Do these explanations correspond to some priors in human cognition? I suppose. But I don't think you strictly need them for effective few-shot learning. My main point is that learning itself is a skill, which generalist LLMs do possess, but only as one of their competencies.
However, I assumed what we're talking about when we discuss AGI is what we'd expect a human to be able to accomplish in the world at our scale. The examples of learning without knowledge you've given, to my mind at least, are a lower level of intelligence that doesn't really approach human level AGI.
It doesn't need to know the capital of Togo, metabolic pathways of the eukaryotic cell, or human psychology.
What if knowing those things distills down to a pattern that matches a pattern of your code and vice versa? There's a pattern in everything, so know everything, and be ready to pattern match.
If you just look at object oriented programming, you can easily see how knowing a lot translates to abstract concepts. There's no reason those concepts can't be translated bidirectionally.
I thought the knowledge is the training set, and the intelligence is the emergent side effect of reproducing that knowledge while making sure the reproduction is not rote memorisation?
That's reinforcement learning -- an algorithm which requires accurate knowledge acquisition to be effective.
The argument being advanced is that intelligence is the proposal of more parsimonious models, aka compression.
> I feel like extensive pretraining goes against the spirit of generality.
What do you mean by generality? Pretraining is fine. It is even fine in the pursuit of AGI. Humans and every other animal have "baked in" memory. You're born knowing how to breathe and with latent fears (chickens and hawks).
Generalization is the ability to learn on a subset of something and then adapt to the entire superset (or a much larger portion of it). It's always been that way. Humans do this, right? You learn some addition, subtraction, multiplication, and division, and then you can do novel problems you've never seen before that are extremely different. We are extremely general here because we've learned the causal rule set. It isn't just memorization. This is also true for things like physics, and is literally the point of science. Causality is baked into scientific learning. Of course, it is problematic when someone learns a little bit of something and thinks they know way more about it, but unfortunately ego is quite common.
But also, I'm a bit with you. At least with what I think you're getting at. These LLMs are difficult to evaluate because we have no idea what they're trained on and you can't really know what is new, what is a slight variation from the training, and this is even more difficult considering the number of dimensions involved (meaning things may be nearly identical in latent space though they don't appear so to us humans).
I think there's still a lot of ML/AI research that can and SHOULD be done at smaller scales. We should be studying more about this adaptive learning and not just in the RL setting.

One major gripe I have with the current research environment is that we are not very scientific when designing experiments. They are highly benchmark/data-set-evaluation focused. Evaluation needs to go far beyond test cases. I'll keep posting this video of Dyson recounting his work being rejected by Fermi[0][1]. You have to have a good "model". What I do not see happening in ML papers is proper variable isolation and evaluation based on this: i.e. hypothesis testing. Most papers I see are not providing substantial evidence for their claims. It may look like they are, but the devil is always in the details.

When doing extensive hyper-parameter tuning it becomes very difficult to determine whether the effect comes from architectural changes, changes in the data, changes in training techniques, or changes in hyperparameters. A proper evaluation would require huge ablations with hold-one-out style scores reported. This is obviously too expensive, but the reason it gets messy is that there's a concentration on getting good scores on whatever evaluation dataset is popular. But you can show a method's utility without beating others! This is a huge thing many don't understand.

Worse, by changing hyper-parameters to optimize for the test-set result, you are doing information leakage. Anything you change based on the result of the evaluation set is, by definition, information leakage. We can get into the nitty gritty to prove why this is, but it is just common practice these days. It is the de facto method and yes, I'm bitter about this. (a former physicist who came over to ML because I loved the math and Asimov books)
[0] https://www.youtube.com/watch?v=hV41QEKiMlM
[1] I'd also like to point out that Dyson notes that the work was still __published__. Why? Because it still provides insights and the results are useful to people even if the conclusions are bad. Modern publishing seems to be more focused on novelty and is highly misaligned from scientific progress. Even repetition helps. It is information gain. But repetition results in lower information gain with each iteration. You can't determine correctness by reading a paper, you can only determine correctness by repetition. That's the point of science, right? As I said above? Even negative results are information gain! (sorry, I can rant about this a lot)
Edit. In fact, there is a simple test for intelligence: can you read a function in C and tell how a change in the input changes the output? For complex algorithms, you have to build an internal model, because how else are you going to run qsort on a million items in your head? That's also how you'd tell if a student is faking it or really understands. A harder test would be to do the opposite: from a few input/output examples, come up with an algorithm.
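To make the first half of that test concrete, here is a toy example (in Python rather than C, purely for brevity; the function is my own invention, not from the article):

    # Toy "read the code, predict the behaviour" test.
    def mystery(xs):
        """Return the length of the longest strictly increasing run in xs."""
        if not xs:
            return 0
        best = run = 1
        for a, b in zip(xs, xs[1:]):
            run = run + 1 if b > a else 1
            best = max(best, run)
        return best

    print(mystery([3, 1, 2, 5, 4]))  # -> 3 (the run 1, 2, 5)

The question to ask the student: if we append one element larger than the last element, can the output ever decrease? (It can't: the final run grows by one and every earlier run is untouched, so the maximum can only stay the same or increase.) Answering that requires running a small internal model of the code, not pattern-matching its syntax.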
It seems like the central innovation is the construction of a "model" which can be optimized with gradient descent, and whose optimum is the "simplest" model that memorizes the input-output relationships. In their setup, "simplest" has the concrete meaning of "which can be efficiently compressed" but more generally it probably means something like "whose model complexity is lowest possible".
This is in stark contrast to what happens in standard ML: typically, we start by prescribing a complexity budget (e.g. by choosing the model architecture and all complexity parameters), and only then train on data to find a good solution that memorizes input-output relationship.
The new method turns ML on its head: we optimize the model so that we reduce its complexity as much as possible while still memorizing the input-output pairs. That this is able to generalize from 2 training examples is truly remarkable and imho hints that this is absolutely the right way of "going about" generalization.
Information theory happened to be the angle from which the authors arrived at this construction, but I'm not sure that is the essential bit. Rather, the essential bit seems to be the realization that rather than finding the best model for a fixed pre-determined complexity budget, we can find models with minimal possible complexity.
1. Minimize a weighted sum of data error and complexity.
2. Minimize the complexity, so long as the data error is kept below a limit.
3. Minimize the error on the data, so long as the complexity is kept below a limit.
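In symbols (my notation, not the authors': E for data error, C for model complexity, θ for the model):

    \begin{align*}
    \text{(1)}\quad & \min_\theta \; E(\theta) + \lambda\, C(\theta) \\
    \text{(2)}\quad & \min_\theta \; C(\theta) \quad \text{s.t.}\quad E(\theta) \le \varepsilon \\
    \text{(3)}\quad & \min_\theta \; E(\theta) \quad \text{s.t.}\quad C(\theta) \le B
    \end{align*}

Classical ML mostly does (3) (fix an architecture/complexity budget, then fit), sometimes (1) via regularizers; the blog post reads to me like an attempt at (2), with ε being the tolerance for reconstruction error.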
It does seem like classical regularization of this kind has been out of fashion lately. I don't think it plays much of a role in most Transformer architectures. It would be interesting if it makes some sort of comeback.
Other than that, I think there are so many novel elements in this approach that it is hard to tell what is doing the work. Their neural architecture, for example, seems carefully hacked to maximize performance on ARC-AGI type tasks. It's hard to see how it generalizes beyond.
To your point in the other thread, once you start optimizing both data fidelity and complexity, it's no longer that different from other approaches. Regularization has been common in neural nets, but usually in a simple "sum of sizes of parameters" type way, and seemingly not an essential ingredient in recent successful models.
I'm struggling to put a finger on it, but it feels like the approach in the blog post has the property that it finds the _minimum_ complexity solution, akin to driving the regularization strength in conventional ML higher and higher during training, and returning the solution at the highest such regularization that does not materially degrade the error (epsilon in their paper). Information theory plays the role of a measuring device that allows them to measure the error term and model complexity on a common scale, so as to trade them off against each other in training.
I haven't thought about it much, but I've seen papers speculating that what happens in double descent is finding lower-complexity solutions.
Each puzzle is kind of a similar format, and the data that changes in the puzzle is almost precisely that needed to deduce the rule. By reducing the amount of information needed to describe the rule, you almost have to reduce your codec to what the rule itself is doing - to minimise the information loss.
I feel like if there was more noise or arbitrary data in each puzzle, this technique would not work. Clearly there's a point at which that gets difficult - the puzzle should not be "working out where the puzzle is" - but this only works because each example is just pure information with respect to the puzzle itself.
But because this is clean data, I wonder if there's basically a big gap here: the codec that encodes the "correct rule" can achieve a step-change lower bandwidth requirement than similar-looking solutions. The most elegant ruleset - at least in this set of puzzles - always compresses markedly better. And so you can kind of brute-force the correct rule by trying lots of encoding strategies, and just identify which one gets you that step-change compression benefit.
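If that intuition is right, the selection procedure is basically minimum-description-length model selection. A rough sketch of "try lots of encoding strategies and keep the one with the step-change drop" (all names here are my own hypothetical placeholders, not anything from the post):

    # MDL-style selection over candidate encoding strategies (hypothetical sketch).
    # Each candidate `encode(puzzle)` is assumed to return a pair:
    #   rule_bits     - bits needed to describe the rule itself
    #   residual_bits - bits needed to patch whatever the rule gets wrong
    def description_length(encode, puzzles):
        return sum(sum(encode(p)) for p in puzzles)

    def pick_rule(candidates, puzzles):
        # The "correct" rule should stand out as a markedly shorter total code.
        scored = sorted((description_length(enc, puzzles), name)
                        for name, enc in candidates)
        return scored[0]

On noisier puzzles that step change would be much less pronounced, which is exactly the limitation described above.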
From my (admittedly sketchy, rushed) understanding of what they're doing, they're essentially trying to uncover the minimal representation of the solution/problem space. Through their tracking of the actual structure of the problem through equivariances, they're actually deriving something like the actual underlying representation of the puzzles and how to solve them, rather than hoping to pick up on this from many solved examples.
It pleases me to find this as it supports my own introspection (heck, it’s in my profile!)
> Intelligence is compressing information into irreducible representation.
https://en.wikipedia.org/wiki/Kolmogorov_complexity
https://en.wikipedia.org/wiki/Solomonoff%27s_theory_of_induc...
https://en.wikipedia.org/wiki/Minimum_description_length
Seems like these could be related, going to dive into this more! :)
I thought that was physics ;)
Well they kind of define intelligence as the ability to compress information into a set of rules, so yes, compression does that…
This phrasing suggests that each puzzle took 20 mins, so for the 100 puzzle challenge that's 33.3 hours, which exceeds the target of 12 hours for the challenge. Pretty cool approach though.
I'm curious about the focus on information compression, though. The classical view of inference as compression is beautiful and deserves more communication, but I think the real novelty here is in how the explicitly "information-constrained" code z participates in the forward pass.
About their overall method, they write:
> It isn’t obvious why such a method is performing compression. You’ll see later how we derived it from trying to compress ARC-AGI.
I must be learning something in my PhD, because the relation with compression _did_ seem obvious! Viewing prediction loss and KL divergence of a latent distribution p(z) as "information costs" of an implicit compression scheme is very classical, and I think a lot of people would feel the same. However, while they explained that a L2 regularization over model weights can be viewed (up to a constant) as an approximation of the bits needed to encode the model parameters theta, they later say (of regularization w.r.t. theta):
> We don’t use it. Maybe it matters, but we don’t know. Regularization measures the complexity of f in our problem formulation, and is native to our derivation of CompressARC. It is somewhat reckless for us to exclude it in our implementation.
So, in principle, the compression/description length minimization point of view isn't an explanation for this success any more than it explains success of VAEs or empirical risk minimization in general. (From what I understand, this model can be viewed as a VAE where the encoding layer has constant input.) That's no surprise! As I see it, our lack of an adequate notion of "description length" for a network's learned parameters is at the heart of our most basic confusions in deep learning.
Now, let's think about the input distribution p(z). In a classical VAE, the decoder needs to rely on z to know what kind of data point to produce, and "absorbing" information about the nature of a particular kind of data point is actually what's expected. If I trained a VAE on exactly two images, I'd expect the latent z to carry at most one bit of information. If CompressARC were allowed to "absorb" details of the problem instance in this way, I'd expect p(z) to degenerate to the prior N(0, 1)—that is, carry no information. The model could, for example, replace z with a constant at the very first layer and overfit the data in any way it wanted.
Why doesn't this happen? In the section on the "decoding layer" (responsible for generating z), the authors write:
> Specifically, it forces CompressARC to spend more bits on the KL whenever it uses z to break a symmetry, and the larger the symmetry group broken, the more bits it spends.
As they emphasize throughout this post, this model is _very_ equivariant and can't "break symmetries" without using the parameter z. For example, if the model wants to do something like produce all-green images, the tensors constituting the "multitensor" z can't all be constant w.r.t. the color channel---at least one of them needs to break the symmetry.
The reason the equivariant network learns a "good algorithm" (low description length, etc.) is unexplained, as usual in deep learning. The interesting result is that explicitly penalizing the entropy of the parameters responsible for breaking symmetry seems to give the network the right conditions to learn a good algorithm. If we took away equivariance and restricted our loss to prediction loss plus an L2 "regularization" of the network parameters, we could still motivate this from the point of view of "compression," but I strongly suspect the network would just learn to memorize the problem instances and solutions.
Do you think it's accurate to describe equivariance as both a strength and a weakness here? As in it allows the model to learn a useful compression, but you have to pick your set of equivariant layers up front, and there's little the model can do to "fix" bad choices.
1. Choose random samples z ~ N(μ, Σ) as the "encoding" of a puzzle, and a distribution of neural network weights p(θ) ~ N(θ, <very small variance>).
2. For a given z and θ, you can decode to get a distribution of pixel colors. We want these pixel colors to match the ones in our samples, but they're not guaranteed to, so we'll have to add some correction ε.
3. Specifying ε takes KL(decoded colors || actual colors) bits. If we had sources of randomness q(z), q(θ), specifying z and θ would take KL(p(z) || q(z)) and KL(p(θ) || q(θ)) bits.
4. The authors choose q(z) ~ N(0, 1) so KL(p(z) || q(z)) = 0.5(μ^2 + Σ^2 - 1 - 2ln Σ). Similarly, they choose q(θ) ~ N(0, 1/2λ), and since Var(θ) is very small, this gives KL(p(θ) || q(θ)) = λθ^2.
5. The fewer bits they use, the lower the Kolmogorov complexity, and the more likely it is to be correct. So, they want to minimize the number of bits
a * 0.5(μ^2 + Σ^2 - 1 - 2ln Σ) + λ * θ^2 + c * KL(decoded colors || actual colors).
6. Larger a gives a smaller latent, larger λ gives a smaller neural network, and larger c gives a more accurate solution. I think all they mention is they choose c = 10a, and that λ was pretty large.
They can then train μ, Σ, θ until the model solves the examples for a given puzzle. Decoding will then give all the answers, including the unknown answer! The main drawback to this method is that, like Gaussian splatting, they have to train an entire neural network for every puzzle. But the neural networks are pretty small, so you could train a "hypernetwork" that predicts μ, Σ, θ for a given puzzle, and even predicts how to train these parameters.
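For anyone who wants to see the shape of that training loop, here is a rough PyTorch-flavoured sketch of steps 1-6 (the decoder, the latent size, and the loss weights a, lam, c are my own placeholders; this is one reading of the summary above, not the authors' code):

    import torch
    import torch.nn.functional as F

    def train_one_puzzle(decoder, known_cells, a=1.0, lam=0.1, c=10.0, steps=2000):
        # Trainable latent distribution parameters (step 1); theta lives in `decoder`.
        mu = torch.zeros(256, requires_grad=True)
        log_sigma = torch.zeros(256, requires_grad=True)
        opt = torch.optim.Adam([mu, log_sigma, *decoder.parameters()], lr=1e-2)

        for _ in range(steps):
            sigma = log_sigma.exp()
            z = mu + sigma * torch.randn_like(mu)          # reparameterized sample of z

            # Step 4: KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.
            kl_z = 0.5 * (mu**2 + sigma**2 - 1 - 2 * log_sigma).sum()

            # Step 4: with q(theta) ~ N(0, 1/2*lambda) this reduces to an L2 penalty.
            kl_theta = sum((w**2).sum() for w in decoder.parameters())

            # Steps 2-3: decoded colour distribution vs. the known example cells,
            # measured in bits via a cross-entropy correction term.
            logits = decoder(z)                            # (num_cells, num_colors)
            recon = F.cross_entropy(logits, known_cells, reduction="sum")

            loss = a * kl_z + lam * kl_theta + c * recon   # steps 5-6
            opt.zero_grad()
            loss.backward()
            opt.step()

Once trained, decoding a fresh sample of z yields the fill-in for the unknown grid as well, since the same network reconstructs every grid in the puzzle.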
They train a new neural network from scratch for each problem. The network is trained only on the data about that problem. The loss function tries to make sure it can map the inputs to the outputs. It also tries to keep the network weights small so that the neural network is as simple as possible. Hopefully a simple function that maps the sample inputs to the sample outputs will also do the right thing on the test input. It works 20~30% of the time.
a.) Why does this work as well as it does? Why does compression/fewer-parameters encourage better answers in this instance?
b.) Will it naturally transfer to other benchmarks that evaluate different domains? If so does that imply an approach similarly robust to pre-training that can be used for different domains/modalities?
c.) It works 20-30% of the time - do the researchers find any reason to believe that this could "scale" up in some fashion so that, say, a single larger network could handle any of the problems, rather than needing a new network for each problem? If so, would it improve accuracy as well as robustness?
Emphasis mine. No one should feel obligated to answer my questions. I had hoped that was obvious.
This wavy piece can be described by algorithm #1, while this spiky piece can be described by algorithm #2, while this...
More precisely, you try to write your phenomena as a weighted sum of these algorithms:
phenomena = sum weight * algorithm
There are exponentially more algorithms that have more bits, so if you want this sum to ever converge, you need to have exponentially smaller weights for longer algorithms. Thus, most of the weight is concentrated in shorter algorithms, so a simple explanation is going to be a really good one!

What the authors are trying to do is find a simple (small number of bits) algorithm that reconstructs the puzzles and the example solutions they're given. As a byproduct, the algorithm will construct a solution to the final problem that's part of the puzzle. If the algorithm is simple enough, it won't be able to just memorize the given examples—it has to actually learn the trick to solving the puzzle.
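(For reference, that "exponentially smaller weights for longer algorithms" idea is essentially the Solomonoff prior linked elsewhere in the thread, roughly:

    P(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}

i.e. each program p that makes the universal computer U output x contributes weight 2^(-length of p), so short programs dominate the sum.)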
Now, they could have just started enumerating out programs, beginning with `0`, `1`, `00`, `01`, ..., and seeing what their computer did with the bits. Eventually, they might hit on a simple bit sequence that the computer interprets as an actual program, and in fact one that solves the puzzle. But, that's very slow, and in fact the halting problem says you can't rule out some of your programs running forever (and your search getting stuck). So the authors turned to a specialized kind of computer, one that they know will stop in a finite number of steps...
...and that "computer" is a fixed-size neural network! The bit sequence they feed in goes to determine (1) the inputs to the neural network, and (2) the weights in the neural network. Now, they cheat a little, and actually just specify the inputs/weights, and then figure out what bits would have given them those inputs/weights. That's because it's easier to search in the input/weight space—people do this all the time with neural networks.
They initialize the space of inputs/weights as random normal distributions, but they want to change these distributions to be concentrated in areas that correctly solve the puzzle. This means they need additional bits to specify how to change the distributions. How many extra bits does it take to specify a distribution q, if you started with a distribution p? Well, it's
[ - sum q(x) log p(x) ]  -  [ - sum q(x) log q(x) ]
(expected # bits to encode a sample from q using p's code)   (expected # bits using q's own optimal code)
This is known as the KL-divergence, which we write as KL(q||p). They want to minimize the length of their program, which means they want to minimize the expected number of additional bits they have to use, i.e. KL(q(inputs)||p(inputs)) + KL(q(weights)||p(weights)).

There's a final piece of the puzzle: they want their computer to exactly give the correct answer for the example solutions they know. So, if the neural network outputs an incorrect value, they need extra bits to say it was incorrect, and actually here's the correct value. Again, the expected number of bits is just going to be a KL-divergence, this time between the neural network's output and the correct answers.
Putting this altogether, they have a simple computer (neural network + corrector), and a way to measure the bitlength for various "programs" they can feed into the computer (inputs/weights). Every program will give the correct answers for the known information, but the very simplest programs are much more likely to give the correct answer for the unknown puzzles too! So, they just have to train their distributions q(inputs), q(weights) to concentrate on programs that have short bitlengths, by minimizing the loss function
KL(q(inputs)||p(inputs)) + KL(q(weights)||p(weights)) + KL(outputs||correct answers)
They specify p(inputs) as the usual normal distribution, p(weights) as a normal distribution with variance around 1/(dimension of inputs) (so the values in the neural network don't explode), and finally have trainable parameters for the mean and variance of q(inputs) and q(weights).

Although if they partition z such that each section corresponds to one input and run f_θ through those sections iteratively, then I guess it makes sense.
Maybe their current setup of keeping θ "dumb" encourages the neural network to take on the role of the "algorithm" as opposed to the higher-variance input encoded by z (the puzzle), though this separation seems fuzzy to me.
They discuss how to avoid this in the section "Joint Compression via Weight Sharing Between Puzzles".
Kurt Gödel (or maybe Douglas Hofstadter, rather) would raise an eyebrow. :)
There we go again. Claim: compression <something something> intelligence. Evidence: 34.75% on ARC AGI.
Like Carl Sagan once pointed out, "Observation: You couldn't see a thing. Conclusion: dinosaurs".
I don't see how this could possibly be controversial. Do you agree or disagree that, on a long enough timeline, intelligence lets you build a model of a system that can accurately and precisely reproduce all observations without simply remembering and regurgitating those observations? The model will always be smaller than simply remembering those observations, ergo, the intelligence maximally compressed the observations.
> Evidence: 34.75% on ARC AGI.
With no pretraining. Literally the only thing it's seen is 3 examples. And it does it with a very reasonable timeframe. I don't think you appreciate how impressive this is.
Why should I agree or disagree with that? Who says that? Is it just some arbitrary theory that you came up with on the spot?
>> The model will always be smaller than simply remembering those observations, ergo, the intelligence maximally compressed the observations.
"Maximally compressed the observations" does not follow from the model being smaller. You're missing an assumption of optimality. Even with that assumption, there is nothing in the work above to suggest that compression has something to do with intelligence.
>> I don't think you appreciate how impressive this is.
I do and it's not. 20% on unseen tasks. It only works on tasks with a very specific structure, where the solution is simply to reproduce the input with some variation, without having to capture any higher order concepts. With enough time and compute it should be possible to do even better than that by applying random permutations to the input. The result is insignificant and a waste of time.
Intelligence has some capabilities. I'm asking you if the capabilities of intelligence encompass the ability described, yes or no?
> "Maximally compressed the observations" does not follow from the model being smaller.
No other function can be smaller than the actual function that generates the observations. Correct answers indicate that the "intelligence" reproduced the model in order to make a successful prediction, ergo "maximal compression".
You mean if I write a shitty program that takes the string "hello", copies it 10 million times and stores the copies, and then generates a single copy any time I run it, there's no algorithm that can achieve a better compression than that because "no other function can be smaller than the actual function that generates the observations"?
I suggest, before coming up with fancy theories about compression and intelligence, that you read a bit about computer science, first. For instance, read up on information theory, and maybe about Kolmogorov complexity, the abstractions that people who swear that "compression <something something> intelligence" are basing their ideas on.
Or, you know what? I've had it up to here with all the wild-ass swivel-eyed loon theories about what intelligence is and isn't. If you really think you know what intelligence is all about, why don't you write it up as a program and run it on a computer, and show us all that your theory is right? Go ahead and create intelligence. Intelligence is compression? Go compress stuff and make it intelligent, or whatever it is you think works.
Put your money where your mouth is. Until then don't accuse others of bad faith.