https://twiecki.io/blog/2015/11/10/mcmc-sampling/
Note that the point of the Markov chain is that it's possible to compute the relative probability of two given points in the posterior even when you don't have a closed-form expression for the posterior (the normalizing constant cancels in the ratio).
Also, the reason for separating the proposal distribution from the acceptance probability is that it's a convenient way to make the target distribution stationary under the Markov process, something that isn't true of an arbitrary chain. (The Wikipedia page on MCMC is also useful here.)
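A minimal sketch of the idea, with an unnormalized log-density standing in for the posterior (the target, step size, and function names here are all made up for illustration — not from the linked post):

```python
import math
import random

random.seed(0)

def metropolis(log_p, x0, n_steps, step=1.0):
    """Metropolis sampler: needs only an *unnormalized* log-density log_p.

    The acceptance test uses the ratio p(x')/p(x), so the posterior's
    normalizing constant cancels and is never needed.
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = x + random.gauss(0.0, step)        # symmetric proposal
        log_ratio = log_p(x_new) - log_p(x)        # relative probability only
        if math.log(random.random()) < log_ratio:  # accept with prob min(1, ratio)
            x = x_new
        samples.append(x)
    return samples

# Unnormalized standard normal: we deliberately drop the 1/sqrt(2*pi) factor.
samples = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Note the sampler never evaluates a normalized density — only differences of `log_p`, i.e. ratios of the unnormalized posterior.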
(This is a massive pet peeve of mine - if you are going to call something "X for dummies", don't bury the lede! Tell me what "X" is as soon as possible, especially if it's an acronym!)
More generally speaking, you're asking when the iteration `x_+ = Ax` converges to a fixed point which is an eigenvector of A. This can happen a few different ways. The obvious way is that A has an eigenvector `v` with eigenvalue 1, while all other eigenvalues have magnitude < 1. Then those other components die out under repeated application of A, leaving only `v` in the limit.
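A toy sketch of that, assuming a made-up 2x2 matrix with eigenvalues 1 and 0.5:

```python
import numpy as np

# Hypothetical example: eigenvalues are 1.0 (eigenvector [1, 1]) and
# 0.5 (eigenvector [1, -1]).  Repeated application of A kills the
# eigenvalue-0.5 component and leaves only the eigenvalue-1 direction.
A = np.array([[0.75, 0.25],
              [0.25, 0.75]])

x = np.array([0.9, 0.1])   # arbitrary starting vector
for _ in range(50):
    x = A @ x              # the iteration x_+ = Ax

# x has converged to the eigenvalue-1 component of the starting vector,
# which for [0.9, 0.1] is [0.5, 0.5].
```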
For Markov chains, we can get this exact property from the Perron-Frobenius theorem, which applies to non-negative irreducible matrices. Irreducible means that the transition graph of the Markov chain is strongly connected. If that's the case, there is a unique eigenvector with eigenvalue 1, called the stationary distribution, and (as long as the chain is also aperiodic) all initial conditions will converge to it.
If A is not irreducible, you may have several connected components, and the stationary distribution may depend on which component your initial condition is in. Going back to the n x n identity matrix: it has n connected components (it's a completely disconnected graph with all self-transition probabilities = 1). So every initial condition is stationary, because nothing can change after the initial state.
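For the irreducible case, a quick sketch with an invented 3-state chain (the self-loops make it aperiodic, so convergence from any start is guaranteed):

```python
import numpy as np

# Hypothetical 3-state chain whose transition graph is strongly connected
# (irreducible), so Perron-Frobenius gives a unique stationary distribution.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

def run(pi0, steps=200):
    pi = np.asarray(pi0, dtype=float)
    for _ in range(steps):
        pi = pi @ P          # distributions are row vectors: pi_{t+1} = pi_t P
    return pi

# Two very different initial conditions end up at the same fixed point.
a = run([1.0, 0.0, 0.0])
b = run([0.0, 0.0, 1.0])
```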
Beyond just Markov chains, matrices do "move" vectors towards eigenvectors sometimes. If a matrix A has eigenvectors x1 and x2 with eigenvalues r1 and r2, then A(x1 + x2) = r1·x1 + r2·x2, because matrix multiplication is a linear transformation. If we repeatedly multiply x1 + x2 by A, we get A^n(x1 + x2) = r1^n·x1 + r2^n·x2. Then, if |r1| > |r2|, the contribution of x1 to the result grows exponentially faster than the contribution of x2, so for some large n, A^n(x1 + x2) ≈ r1^n·x1 plus an irrelevant error term.
This means the largest eigenvalues sort of dominate, and you might especially care about an eigenvalue of 1, because that means Ax = x, so x is a steady state; and if you can write down A as a matrix, you can solve for nonzero x and learn about the steady-state solutions.
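Rather than iterating, you can solve for that eigenvalue-1 eigenvector directly. A sketch with a made-up 2-state transition matrix:

```python
import numpy as np

# For a Markov transition matrix P (rows sum to 1), steady states satisfy
# pi P = pi, i.e. pi is a left eigenvector of P with eigenvalue 1 -- which
# we can get by solving an eigenproblem instead of iterating.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

eigvals, eigvecs = np.linalg.eig(P.T)     # left eigenvectors of P
i = np.argmin(np.abs(eigvals - 1.0))      # pick the eigenvalue-1 eigenvector
pi = np.real(eigvecs[:, i])
pi = pi / pi.sum()                        # normalize to a probability vector
```

For this particular matrix the balance condition 0.1·pi_1 = 0.3·pi_2 gives pi = [0.75, 0.25].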
A good way to immediately grasp the flaws of a Markov chain is to imagine it predicting a pixel in a 2D image.
Sure, in a 1D chain of states that goes something like [red green blue] repeatedly, a Markov chain is a great predictor. But imagine a 2D image where the pattern is purely in the vertical elements, while you're passing in pixels left to right, then the next row. The chain will try to make a prediction based on the recent horizontal elements, and there's no good way to give it context from the pixel above. A Markov chain doesn't mix two states; its context is the immediately preceding chain of states. It's really, really hard for a Markov chain to take the pixel above into account if you're feeding it pixels left to right (the same is true for words, but the pixel analogy is easier to grasp).
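A rough demonstration of that failure, using an invented image whose pattern runs purely down the columns (the sizes and values are arbitrary):

```python
import random
from collections import Counter, defaultdict

random.seed(0)

H, W = 40, 40
# Hypothetical image with a purely vertical pattern: each column cycles
# 0 -> 1 -> 2 -> 0 going down, with a random phase per column.  Horizontal
# neighbours look random, but each pixel is determined by the pixel above.
phase = [random.randrange(3) for _ in range(W)]
img = [[(phase[c] + r) % 3 for c in range(W)] for r in range(H)]

def accuracy(context):
    """Train a first-order chain P(pixel | context(r, c)) by counting,
    then score it on the same image (purely illustrative, not a real eval)."""
    counts = defaultdict(Counter)
    for r in range(1, H):
        for c in range(1, W):
            counts[context(r, c)][img[r][c]] += 1
    hits = total = 0
    for r in range(1, H):
        for c in range(1, W):
            pred = counts[context(r, c)].most_common(1)[0][0]
            hits += pred == img[r][c]
            total += 1
    return hits / total

left_acc = accuracy(lambda r, c: img[r][c - 1])   # scan-order (horizontal) context
above_acc = accuracy(lambda r, c: img[r - 1][c])  # vertical context
```

Conditioning on the pixel above predicts perfectly; conditioning on the scan-order predecessor barely beats chance.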
Now you may think a good solution to this is to give a Markov chain some mechanism for taking multiple previous states from multiple contexts (in this case vertical and horizontal) and somehow weighting each one to get the next state. Go down this path and you essentially go down the path of early neural networks.
I've actually thought about this from time to time and come up with things like 'skip' states that just don't change the context on meaningless words, maintaining distant states in some type of context on the Markov chain, multiple contexts in multiple directions mixed with a neural network at the end (as per the 2D image example), etc. But the big issue is in building the Markov state network.
The attention mechanism and the subsequent multi-layer neural network it feeds is honestly extremely inelegant: basically an entire-datacenter sledgehammer of contexts and back-propagated neural networks to make something that works, while Markov chains would easily run on an Apple II if you got them right (it's very fast to linearly navigate a chain of states). But the hard part is making Markov chains that can represent language well. The best we've done is very simple linear next-letter predictors.
I do honestly think there may be some value in stealing word weights from an attention mechanism and making Markov chains out of each of them, so each word starts navigating a Markov chain of nearby words to end up at new states. You'd still mix all the contexts at the end with a neural network, but it skips the computationally expensive attention mechanism. Still a hard problem to build these Markov chains sensibly, though. DeepSeek has shown us there are huge opportunities in optimizing the attention mechanism, and Markov chains are computationally very simple, but 'how do you build good Markov chains' is a very hard problem.
People really underestimate both the magnitude of the parameter space for learning this and the scale of the training data.
But, at the end of the day, the model is still just sampling one token at a time and updating its state (and of course, that state is conditioned on all previous tokens, so that's another point of departure).
(There are also types of LLMs where access to that metadata is limited; in one such type, the current state can only check metadata on previous states and weigh it against the base value of the next states in the chain.)
Then, imagine each chain of states in a Markov chain is a 2D hash map, like a grid plot. Our current LLMs are like an N-dimensional hash map instead, with a finite but extremely large depth. This is pretty near impossible to visualize as a human, but if you're familiar with array/map operations, you should get the idea.
This is a very "base level" understanding, as my learning on LLMs stopped around the time Tensorflow stopped being the new hotness, but hopefully that gives you an idea.
To make a big language model in Markov space, you need very large n-grams: "if the last text is 'th', the next letter is 'e'".
These get incredibly sparse, very quickly. Most sentences are novel in your dataset.
Neural networks (and other models too) get around this by placing tokens in a continuous multidimensional space, instead of having an individual node per n-gram.
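A quick way to see the sparsity problem, even on a tiny made-up corpus: count what fraction of distinct character n-grams occur exactly once as n grows.

```python
from collections import Counter

# Toy corpus for illustration; any real text shows the same effect more starkly.
text = ("the quick brown fox jumps over the lazy dog and then "
        "the quick red fox runs past the sleepy dog again and again")

def singleton_fraction(n):
    """Fraction of distinct character n-grams that occur exactly once."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return sum(1 for c in grams.values() if c == 1) / len(grams)

fractions = {n: singleton_fraction(n) for n in (2, 4, 8, 16)}
```

At n = 2 plenty of bigrams repeat, but by n = 16 every n-gram in this corpus is unique — the table is all singletons, so there's nothing to generalize from.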
Specific deep learning architectures and training variants reflect the problems they are solving, speeding up training, and reducing model sizes considerably. Both keep getting better, so efficiency improvements are not likely to plateau anytime soon.
They readily handle both statistically driven and pattern driven mappings in the data. Most modelling systems tend to be better at, or adapt exclusively along, one of those dimensions or the other.
Learning patterns means they don't just go input->prediction. They learn to recognize and apply relationships in the middle. So when people say they are "just" predictors, that tends to be misleading. They end up predicting, but they will do whatever they need to in between in terms of processing information, in order to predict.
They can learn both hard and soft pattern boundaries. Discrete and continuous relationships. And static and sequential patterns.
They can be trained directly on example outputs, or indirectly via reinforcement (learning what kind of outputs will get judged well and generating those). Those are only two of many flexible training schemes.
All those benefits and more make them exceptionally general, flexible, powerful and efficient tools relative to other classes of universal approximators, for large growing areas of data defined problems.
Per e.g. Wikipedia: "In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event."
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
You train it on a piece of text and then the output 'sounds' like the text it was trained on, despite being pure gibberish.
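A sketch of exactly that effect: an order-2 character-level chain trained on a short, arbitrary snippet. Every trigram it emits occurred somewhere in the training text, so it 'sounds' right locally, yet the whole is gibberish.

```python
import random
from collections import defaultdict

random.seed(1)

# Arbitrary training snippet, chosen only for illustration.
text = ("it was the best of times it was the worst of times it was the "
        "age of wisdom it was the age of foolishness")

order = 2
table = defaultdict(list)
for i in range(len(text) - order):
    table[text[i:i + order]].append(text[i + order])  # context -> next chars

state = text[:order]
out = state
for _ in range(120):
    choices = table.get(state)
    if not choices:            # ran off the end of the training data
        break
    nxt = random.choice(choices)
    out += nxt
    state = state[1:] + nxt    # slide the two-character context forward
```

By construction, every three-character window of `out` appears verbatim in `text` — locally plausible, globally meaningless.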