Are All Large Models Learning the Same Thing?
The Platonic Representation Hypothesis says yes — that as models scale, their internal representations of the world converge regardless of architecture or…

Two models train on different datasets, with different architectures, using different optimization configurations. One processes images. The other processes text. After training, you compare their internal representations — the high-dimensional vectors that describe how each model encodes concepts. The representations are strikingly similar.
This is not a thought experiment. It's an empirical observation that has appeared across enough settings that researchers have given it a name: the Platonic Representation Hypothesis. The hypothesis, formalized in a 2024 paper by Huh et al. from MIT, proposes that as neural networks scale, their internal representations of the world converge toward a shared structure — regardless of the architecture, modality, or training data they started from.
If this is right, it has significant implications for how we think about model interpretability, evaluation, fine-tuning, and multi-modal alignment. It also raises a harder question: what exactly are they all converging toward?
What Convergence Means Precisely
"Convergence" in this context has a specific technical meaning. It does not mean that two models produce the same outputs, or that they behave identically on benchmarks. It means that the internal vector spaces they use to represent concepts are linearly similar — that a mapping exists between their representations that preserves the geometric relationships between concepts.
If model A represents "dog" and "wolf" as nearby in its embedding space with "domestication" somewhere in the direction that separates them, and model B's representations have the same structure when you apply a linear transformation, those models have converged representations even if one is a vision model and one is a language model and they've never been jointly trained.
The measurement tool most commonly used is centered kernel alignment (CKA) — a technique that compares the similarity structure of representations across models without requiring the representations to live in the same vector space. High CKA between two models' representations for a shared set of inputs means they're encoding the same relational structure, even if the individual dimensions are incomparable.
The empirical finding is that CKA scores between large models — particularly at their deeper layers — are substantially higher than you would expect from random initialization with different data. And these scores increase with model scale.
The Evidence
Three categories of evidence are most compelling.
Cross-modal alignment without cross-modal training. Vision models and language models, trained separately on different data types, develop representations that are alignable. CLIP demonstrated this formally by showing that contrastive training between image and text encoders could produce a shared embedding space — but the surprising finding was that even models not trained with contrastive objectives show partial alignment. A vision model's representation of "chair" and a language model's representation of "chair" are geometrically closer than chance. Something about the statistical structure of the concepts, not the training procedure, is driving the similarity.
Model stitching experiments. If you take the early layers of one trained network and connect them to the later layers of a different network — passing activations from one into the other without any additional training — the resulting stitched model works better than random chance, and sometimes works well enough to be useful. This result is striking because it implies that the two models' internal representations are, at the layers where you stitched them, interoperable. They agree on what the intermediate representation means well enough that the second model can use the first model's activations as inputs.
Stitching succeeds most reliably when both models are large and both are trained on broad, diverse data. It fails more often with small models and specialized training distributions. This is the pattern you would expect if convergence is a property that emerges with scale and data coverage.
Representational universality across architectures. CNNs, Vision Transformers, and MLP-Mixers — architecturally very different image models — converge on similar representations when evaluated with CKA, particularly in deeper layers. The same pattern appears in language models: transformer-based models with different hyperparameters, attention mechanisms, and positional encodings show increasing representational similarity as they scale. The architecture appears to matter less than the scale and data distribution.
Why This Should Be Surprising
Gradient descent from random initialization, on a high-dimensional loss landscape, should not be expected to find the same solution twice. The landscape is non-convex, initialization matters, and the path through the optimization depends on the data ordering, the architecture, the learning rate schedule, and dozens of other factors.
That different runs converge to similar representations suggests that the structure of the representations is not primarily determined by the optimization path — it's determined by something external to the optimization. The Platonic Representation Hypothesis proposes that this external constraint is the statistical structure of reality, or more precisely, the statistical structure of data generated by processes in reality.
The intuition: if the world has a certain causal structure — objects persist, actions have consequences, words appear in contexts that reflect their meaning — then any model trained on enough data generated by that world will learn representations that reflect that causal structure. Different models are finding different paths through the optimization landscape, but they're all being pulled toward the same attractor: an accurate model of the underlying statistical regularities.
The word "Platonic" in the hypothesis name is deliberate. It alludes to the philosophical idea that there is an ideal form — a true representation of reality — and that models are all approximating the same ideal from different starting points.
Practical Implications
Understanding this has concrete consequences for how you build and evaluate ML systems.
Interpretability findings transfer. If two large models have similar representations, then mechanistic interpretability work done on one model is more likely to generalize to another than you might think. When researchers find that a certain attention head in GPT-style models implements a specific algorithm — induction heads, for instance, which implement in-context learning by attending to previous occurrences of the current token — there's reason to expect similar mechanisms in other large transformers. This makes interpretability research a more tractable enterprise than it would be if each model were an entirely different kind of system.
Benchmark similarity vs. representational similarity. Two models can have different benchmark scores while having similar internal representations, or similar benchmark scores while having different internal representations. Neither benchmark performance nor aggregate scores tell you much about whether two models are "the same kind of system." For regulated applications where you need to understand what a model is actually doing, representational analysis is more informative than capability benchmarks.
Transfer learning across domains. The success of fine-tuning large pre-trained models on specialized tasks is partially explained by representational convergence. The pre-trained model has already converged toward a representation of the world's statistical structure. Fine-tuning on a specialized domain isn't teaching the model a new representational language — it's teaching it new vocabulary and new patterns using representational infrastructure it already has. This is why a few thousand fine-tuning examples can shift model behavior dramatically: the hard representational work is already done.
Multi-modal models. The fact that vision and language representations are already partially aligned before any joint training explains why multi-modal models work as well as they do with relatively modest amounts of cross-modal supervision. CLIP-style contrastive training is finding a mapping between representations that were already structurally similar — it's doing alignment, not building the structure from scratch.
Where Convergence Breaks Down
The hypothesis is not that all models converge to the same representation. It's that scale and data diversity push representations toward convergence, and this pull is stronger at larger scales.
Convergence breaks down in predictable ways. Models trained on narrow distributions diverge from models trained on broad ones. A model trained only on medical text will develop representations of medical concepts that are more detailed and differently structured than a general-purpose language model. This is useful — domain specialization is real — but it means the convergence finding applies most strongly to large foundation models, not to specialized or small models.
There's also a question about what convergence looks like in the deeper layers versus the earlier ones. The empirical finding is that later layers converge more strongly than earlier layers. Early layers tend to reflect the statistics of the specific training data (what tokens appear near each other, what image patches co-occur) while later layers reflect more abstract structure. This suggests that "the same thing" that models are converging toward is abstract and structural, not surface-level.
Finally, convergence within a modality is stronger than convergence across modalities. Language models converge strongly with other language models; vision models converge strongly with other vision models; cross-modal convergence is real but weaker. This is what you would expect if representations are shaped both by the structure of reality and by the structure of the data format.
The Harder Question
The most important unresolved question is not whether convergence happens — the evidence that it does is substantial. The harder question is: converging toward what?
The Platonic framing suggests convergence toward truth: toward an accurate representation of the causal structure of reality. But there's an alternative explanation. Human-generated data encodes the statistical structure of human perception, human categories, and human language — which are themselves shaped by human cognitive biases, cultural assumptions, and evolutionary history. If all large models are trained primarily on human-generated data, they may all be converging toward the same thing, but that thing is the structure of human representation of reality, not reality itself.
This distinction matters for how you use and evaluate these systems. If models are converging toward accurate causal representations, then scale is a path toward better models of the world. If they're converging toward the structure of human cognition as encoded in training data, then scale is a path toward a more consistent but not necessarily more accurate model of human beliefs about the world.
The evidence doesn't cleanly distinguish these cases. The statistical structure of reality and the statistical structure of human perception are deeply intertwined, because human perception is shaped by reality and reality is what we have the most data about.
What the convergence findings do suggest is that the representations inside large models are not arbitrary. They reflect structure that is stable across architectures, training runs, and modalities. Whether that structure corresponds to reality, to human cognition, or to some combination of both is a question that mechanistic interpretability and careful evaluation methodology will have to answer.
For practitioners building systems on top of these models, the practical implication is this: the models you're working with are more structurally similar to each other — and more structurally stable — than their different benchmark scores might suggest. That's a reason to invest in understanding what that shared structure is, rather than treating each model as a new black box.
Want this implemented in your workflow?
I work with SaaS companies, real-estate, finance, and regulated-industry teams on AI adoption. Book a 20-minute strategy call — no pitch, just a focused conversation about your situation.
I publish one post like this per month. Join AI Command Room and I'll send it directly to you.