As someone with a background in quantum chemistry and some types of machine learning (but not neural networks so much) it was a bit striking while watching this video to see the parallels between the transformer model and quantum mechanics.

In quantum mechanics, the state of your entire physical system is encoded as a very high dimensional normalized vector (i.e., a ray in a Hilbert space). The evolution of this vector through time is given by the time-translation operator for the system, which can loosely be thought of as a unitary matrix U (i.e., a probability-preserving linear transformation) equal to exp(-iHt), where H is the Hamiltonian matrix of the system that captures its “energy dynamics”.

From the video, the author states that the prediction of the next token in the sequence is determined by computing the next context-aware embedding vector from the last context-aware embedding vector alone. Our prediction is therefore the result of a linear state function applied to a high dimensional vector. This seems to me a lot like we have produced a Hamiltonian of our overall system (generated offline via the training data), then we reparameterize our particular subsystem (the context window) to put it into an appropriate basis congruent with the Hamiltonian of the system, then we apply a one-step time translation, and finally transform the resulting vector back into its original basis.
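The unitary side of the analogy is easy to play with numerically, for what it's worth. A minimal sketch (plain NumPy, toy 8-dimensional system, H just a random Hermitian matrix, nothing learned):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random Hermitian "Hamiltonian" H for a toy 8-dimensional system.
A = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
H = (A + A.conj().T) / 2

# U = exp(-iHt) via the eigendecomposition H = V diag(w) V^dagger.
t = 0.5
w, V = np.linalg.eigh(H)
U = V @ np.diag(np.exp(-1j * w * t)) @ V.conj().T

# Evolve a random normalized state; a unitary map preserves the norm.
psi = rng.normal(size=8) + 1j * rng.normal(size=8)
psi /= np.linalg.norm(psi)
psi_next = U @ psi

norm_preserved = np.isclose(np.linalg.norm(psi_next), 1.0)
```

Of course, attention layers and MLPs are not norm-preserving (and not even linear once the softmax is in play), which is one place the analogy leaks.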

IDK, when your background involves research in a certain field, every problem looks like a nail for that particular hammer. Does anyone else see parallels here or is this a bit of a stretch?

I have found the YouTube videos by CodeEmporium simpler to follow: https://www.youtube.com/watch?v=Nw_PJdmydZY

The Transformer is hard to describe with analogies, and to be fair there is no good explanation of why it works, so it may be better to just present the mechanism, "leaving the interpretation to the viewer". Also, it's simpler to describe dot products as one vector projecting onto another.

Here's a compelling visualization of the functioning of an LLM when processing a simple request: https://bbycroft.net/llm

This complements the detailed description provided by 3blue1brown

Awesome video. This helps to show how the Q*K matrix multiplication is a bottleneck, because if you have sequence (context window) length S, then you need to store an SxS size matrix (the result of all queries times all keys) in memory.
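A quick back-of-the-envelope check of that quadratic growth (assuming fp16 scores, i.e. 2 bytes per entry, per head per layer — purely illustrative numbers):

```python
# Memory needed for one attention head's S x S score matrix,
# at 2 bytes per entry (fp16). Purely illustrative.
def score_matrix_bytes(seq_len, bytes_per_entry=2):
    return seq_len * seq_len * bytes_per_entry

for s in (2_048, 32_768, 131_072):
    gib = score_matrix_bytes(s) / 2**30
    print(f"S = {s:>7}: {gib:8.2f} GiB per head")
```

Doubling the context window quadruples the score-matrix memory, which is exactly why tricks that avoid materializing the full S×S matrix matter.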

One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it:


(I edited that article.)

His previous post 'But what is a GPT?' is also really good: https://www.3blue1brown.com/lessons/gpt
This video (with a slightly different title on YouTube) helped me realize that the attention mechanism isn't exactly a specific function so much as it is a meta-function. If I understand it correctly, attention + learned weights effectively enables a Transformer to learn a semi-arbitrary function, one which involves a matching mechanism (i.e., the scaled dot-product).
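Concretely, the matching mechanism that the learned weights get to shape is scaled dot-product attention. A minimal single-head, causally masked sketch in NumPy (all sizes invented for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence X (S x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # S x S match scores
    # Causal mask: each position attends only to itself and earlier tokens.
    S = scores.shape[0]
    scores = np.where(np.tril(np.ones((S, S), bool)), scores, -np.inf)
    return softmax(scores) @ V                  # S x d_head mixed values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                    # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)                  # shape (5, 8)
```

The "meta" part is that Wq/Wk/Wv are learned, so training gets to decide what "matching" even means for each head.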
I think what made this so digestible for me were the animations. The timing, and how they expand, contract, and unfold while he's speaking, is all very well done.
I work in a closely related space, and this instantly became part of my team's onboarding docs.

Worth noting that a lot of the visualization code is available on GitHub.


I finally understand this! Why did every other video make it so confusing!
Is there a reference which describes how the current architecture evolved? Perhaps from very simple core idea to the famous “all you need paper?”

Otherwise it feels like lots of machinery created out of nowhere. Lots of calculations and very little intuition.

Jeremy Howard made a comment on Twitter that he had seen various versions of this idea come up again and again - implying that this was a natural idea. I would love to see examples of where else this has come up so I can build an intuitive understanding.

You might also want to check out other 3b1b videos on neural networks since there are sort of progressions between each video https://www.3blue1brown.com/topics/neural-networks
It always blows my mind that Grant Sanderson can explain complex topics in such a clear, understandable way.

I've seen several tutorials, visualisations, and blogs explaining Transformers, but I didn't fully understand them until this video.

That example with the "was" token at the end of a murder novel is genius (3:58–4:28 in the video); it's really easy for a non-technical person to understand.
It seems he brushes over the positional encoding, which for me was the most puzzling part of transformers. The way I understood it, positional encoding is much like dates. Just like dates, there are repeating minutes, hours, days, months, etc., and each of these values has a longer 'wavelength' than the last. The values are then used to identify the position of each token. Like, 'oh, I'm seeing January 5th tokens. I'm January 4th. This means this is after me.' Of course the real positional encoding is much smoother and doesn't have an abrupt end like dates/times do, but I think this was the original motivation for positional encodings.
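The dates intuition maps pretty directly onto the sinusoidal encoding from the original Transformer paper: each pair of dimensions is a sin/cos "clock" with its own wavelength, from fast-ticking "minutes" up to slow "months". A sketch (sizes and the 10,000 base follow the paper's formula; nothing here is from the video):

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d_model, base=10_000.0):
    """Sinusoidal positional encoding: one sin/cos 'clock' per dimension
    pair, each with a geometrically increasing wavelength."""
    pos = np.arange(num_positions)[:, None]        # positions, column vector
    i = np.arange(d_model // 2)[None, :]           # index of each dim pair
    freq = 1.0 / base ** (2 * i / d_model)         # one tick rate per pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(pos * freq)               # 'minute hand', 'hour hand', ...
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

pe = sinusoidal_positional_encoding(64, 32)
```

Comparing two rows of `pe` is like comparing two timestamps: the fast dimensions tell nearby positions apart, the slow ones tell distant ones apart.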
This was the best explanation I’ve seen. I think it comes down to essentially two aspects: 1) he doesn’t try to hide complexity and 2) he explains what he thinks is the purpose of each computation. This really reduces the room for ambiguity that ruins so many other attempts to explain transformers.
In training we learn (a) the embeddings and (b) the KQ/MLP weights.

How well do Transformers perform given learned embeddings but only randomly initialized decoder weights? Do they produce word soup of related concepts? Anything syntactically coherent?

Once a well-trained, high-dimensional representation of tokens is established, can they learn the KQ/MLP weights significantly faster?

Hold on, every predicted token is only a function of the previous token's embedding? I must have something wrong. This would mean the whole story is packed into the embedding of "was", which is of length 12,288 in this example. Is it really possible that this space is so rich that a single point in it can encapsulate a whole novel?
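To be clear about what the "function of one vector" step looks like: by that point the attention layers have already mixed context into the last position's embedding, and the final prediction is just an unembedding matrix times that one vector, then a softmax. A toy sketch (sizes invented, not GPT-3's; the vector is random rather than produced by a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 1_000              # toy sizes, not GPT-3's 12,288 / 50k+

h_last = rng.normal(size=d_model)       # stand-in for the context-aware
                                        # embedding of the final token ("was")
W_unembed = rng.normal(size=(vocab, d_model))

logits = W_unembed @ h_last             # one score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # next-token probability distribution
```

So the richness isn't in the final linear map; it's in how much context the earlier attention layers managed to fold into `h_last`.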
What I'm now wondering about is how intuition to connect completely separate ideas works in humans. I will have very strong intuition something is true, but very little way to show it directly. Of course my feedback on that may be biased, but it does seem some people have "better" intuition than others.
I like the way he uses a low-rank decomposition of the Value matrix instead of Value+Output matrices. Much more intuitive!
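A hedged sketch of what that factorization amounts to (toy sizes invented): the conventional per-head Value and Output matrices compose into a single map on the residual stream whose rank is at most the head dimension, which is the low-rank pair the video presents directly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 128, 16                   # toy sizes

# Conventional framing: separate Value (d_model -> d_head) and
# Output (d_head -> d_model) matrices for one head...
W_value = rng.normal(size=(d_model, d_head))
W_output = rng.normal(size=(d_head, d_model))

# ...which compose into a single rank-(at most d_head) map.
W_combined = W_value @ W_output             # d_model x d_model
rank = np.linalg.matrix_rank(W_combined)
```

Seeing it as one low-rank matrix makes it obvious the head can only write along a 16-dimensional subspace of the residual stream.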
Fantastic work by Grant Sanderson, as usual.

Attention has won.[a]

It deserves to be more widely understood.


[a] Nothing has outperformed attention so far, not even Mamba: https://arxiv.org/abs/2402.01032

This is one of the best explanations that I’ve seen on the topic. I wish there was more work, however, not on how Transformers work, but on why they work. We are still figuring it out, but I feel that the exploration is not at all systematic.
Fun video. Much of my "art" lately has been dissecting models, injecting or altering attention, and creating animated visualizations of their inner workings. Some really fun shit.
The first time I really dug into transformers (back in the BERT days) I was working on a MS thesis involving link prediction in a graph of citations among academic documents. So I had graphs on the brain.

I have a spatial intuition for transformers as a sort of analog to a message passing network over a "leaky graph" in an embedding space. If each token is a node, its key vector sets the position of an outlet pipe that it spews value to diffuse out into the embedding space, while the query vector sets the position of an input pipe that sucks up value other tokens have pumped out into the same space. Then we repeat over multiple attention layers, meaning we have these higher order semantic flows through the space.

Seems to make a lot of sense to me, but I don't think I've seen this analogy anywhere else. I'm curious if anybody else thinks of transformers in this way. (Or wants to explain how wrong/insane I am?)