Publication version
After the universe
A research direction for substrate-faithful representations of meaning.
A longer version with full physics setup, cross-disciplinary grounding, and deeper development of each move is available in the full paper.
A slice observer in a richer host
Imagine an observer who can move only along a curve through a much larger geometric object. They model what they see, build intuitions calibrated to the local approximation along their curve, and develop a mathematics of motion, distance, and similarity that fits the slice they have access to. Then they meet a phenomenon their slice cannot account for: a relationship between two regions of the slice that behave as if they are joined through some part of the host the slice does not contain. The observer has two options. They can call the phenomenon anomalous and wait for it to go away. Or they can let the model expand until the host comes through.
We are that observer, generalised. The hyperbolic structure of spacetime and the curvature of the geometry around mass-energy are settled physics; neither has been in serious dispute for a century. Our perceptual and evolutionary history happened in a corner of that geometry where the hyperbolic structure is invisible and the curvature is too weak to register. Across every speed and energy scale that mattered for the survival of organisms our size, the corrections to flat-space three-dimensional intuition are smaller than the noise in any sense organ we have. Our intuitions are slice intuitions. Physics is the long-running project of inferring what the host the slice cuts through actually is, and what physics has been finding for a hundred years is that the host has more structure, and more kinds of structure, than the slice’s flat-space approximation can carry.
Models of meaning are built by us, with our intuitions, on hardware tuned to flat-space arithmetic. The default form of those models is a vector space whose mathematics matches the corner the intuitions were calibrated to. The default is good enough for many tasks, the way Newtonian mechanics is good enough for many tasks. It is not good enough for the substance, because meaning is just our human label for our cognitive modelling of the universe, and the universe’s geometric structure is not flat, not three-dimensional, and not exhausted by any one regime. Pretending that the universe is the corner where the model was built is a tractability decision. It is not a decision the target permits.
Models of meaning should be built after the universe, not after our linear-algebra defaults.
The reasoning is target-fidelity, not analogy. The reading on which spacetime is hyperbolic and therefore embeddings should be hyperbolic is sloppy, and a careful reader would dismiss it. The argument here is that what we call modeling meaning is, with one layer of label peeled away, modeling the universe, and any honest model has to span the regimes the universe actually presents. Aircraft engineers studied birds without claiming aircraft are birds; they took the features that made flight work and built systems that respected those features. The same move applies here.
The chain to that conclusion runs in three moves. The first is empirical: three contemporary research programs have been forced into high-dimensional geometric structure by what they study. The second is dialectical: the convergence is internal pressure, not methodological fashion, and the forcing is what containment looks like from inside the slice. The third is architectural: if the forcing is real, the next-generation system for modelling meaning has a specific shape, and that shape generates falsifiable predictions concrete enough to test.
Three programs forced into geometry
In the last fifteen years, three otherwise-unrelated research programs have walked into high-dimensional geometric descriptions of what they study. The teams do not collaborate; they publish in different journals; they would not agree on what counts as a hard problem. Yet, when each program took its substance seriously enough to model it well, it ended up describing meaning, computation, or activity as something that lives in a space with many more dimensions than three. Different substances. Same mathematical family.
Large language models
A modern language model represents every chunk of text as a list of several thousand numbers, treated as coordinates in a high-dimensional vector space. The geometry is doing the work, not the implementation. Distance in the space tracks similarity of meaning: the token for cat lands near the token for kitten and far from the token for parliament. Direction in the space tracks relationship. There is a direction that means plural, a direction that means past tense, and the canonical arithmetic that opened the word2vec literature in the early 2010s: king minus man plus woman lands near queen.
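The arithmetic can be reproduced with toy vectors; the three axes and every value below are invented for illustration, not drawn from any trained model.

```python
import math

# Toy 3-d embeddings on invented axes [royalty, male, female].
# Values are illustrative, not from any real model.
emb = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "cat":   [0.1, 0.2, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman, componentwise
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Nearest neighbour by cosine similarity, excluding the query words.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # queen
```

The direction-as-relationship claim is exactly what makes the subtraction meaningful: king minus man isolates the royalty component, and adding woman relocates it.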
Anthropic’s Scaling Monosemanticity result mapped individual directions in Claude’s residual stream to recognisable concepts: the Golden Gate Bridge, sycophancy, categories of code vulnerability (Templeton et al. 2024). The concept and the direction are not identical, but the direction is where the concept lives, in the same way the Golden Gate Bridge has a location even though the location is not the bridge. Bronstein and colleagues argued in 2021 that the major architectures of deep learning unify under a single geometric framework organised around the symmetries and invariances of the substance the architecture is built to model (Bronstein et al. 2021). Convolutional networks bake in the rotational and translational structure of images. Transformers bake in the compositional structure of language. Geometry is the medium learning happens in, not a side effect.
Meaning, when modelled by the most successful systems we have built, lives as location and direction in a high-dimensional vector space.
Quantum computation
A quantum computer with \(N\) qubits operates in a Hilbert space of dimension \(2^N\). The growth is not merely large; it is cosmically large within a few dozen components. A hundred qubits give a Hilbert space of approximately \(10^{30}\) dimensions, more than the number of stars in the observable universe. Shor’s algorithm uses this dimensional volume directly: it puts the machine into a quantum state representing many candidate factors at once, lets the wrong candidates interfere destructively, and reads out the right one when the cancellation completes. The computation cannot be performed inside the configuration space of the physical components alone. It happens in the high-dimensional state space the components together define.
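The dimension count can be verified by direct construction: combining qubit states by tensor product doubles the amplitude count with each qubit. A toy amplitude-vector sketch, assuming nothing beyond the \(2^N\) rule stated above; it is not a quantum simulator.

```python
# Tensor (Kronecker) product of amplitude vectors: the state space of a
# composite system is the product of the component spaces, so amplitudes
# double with every added qubit.
def kron(u, v):
    return [a * b for a in u for b in v]

plus = [2 ** -0.5, 2 ** -0.5]   # single-qubit |+> state, 2 amplitudes

state = [1.0]
for _ in range(10):              # ten qubits
    state = kron(state, plus)

print(len(state))   # 1024 == 2**10
print(2 ** 100)     # the hundred-qubit dimension from the text, ~1.27e30
```

The normalisation survives the products (the squared amplitudes still sum to one), which is why the construction tracks physical states rather than mere bookkeeping.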
The interpretive question of where the computation is happening is contested. Deutsch has argued for forty years that the Hilbert space dimensions are a literal description of branching reality, and that the running quantum computer is empirical evidence for the many-worlds reading (Deutsch 1985, 1997). Most working physicists do not adopt that interpretation. The structural fact, which is the same regardless, is that quantum computation operates in a high-dimensional geometric arena and the arena is doing work the components themselves cannot do. The dimensions are either physical in Deutsch’s strong sense or mathematically physical in the weaker sense that the operations the machine performs have meaning only inside that space. Either reading places physical computation in a high-dimensional geometric arena.
Neural population geometry
The third program looks at the only computational system humans had access to before they learned to build their own. Mainstream neuroscience treats the brain as a three-dimensional object with three-dimensional wiring, which it is. What lives on top of the wiring is not.
When a population of neurons is recorded firing during a task, the activity is a trajectory through a space whose axes are the firing rates of the individual neurons. A thousand-electrode recording defines a thousand-dimensional firing-rate space, and the population activity is a point that moves through that space as the task unfolds. The empirical finding that has reshaped this part of the field over the last decade is that the activity does not fill the space available to it. The trajectories live on a curved surface only a few dozen dimensions across, embedded inside the high-dimensional ambient space, and the geometry of that surface matches the structure of the task being performed (Gallego et al. 2017). Chung and Abbott argued in 2021 that the same high-dimensional geometric framework unifies biological and artificial neural networks (Chung and Abbott 2021), not by design, but because the empirical structure shares the shape.
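The low-dimensional-manifold finding can be illustrated with synthetic data: a low-dimensional trajectory embedded in a high-dimensional firing-rate space, recovered by PCA. Everything here is simulated, and the neuron and sample counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "population activity": a 1-d ring trajectory embedded in a
# 50-neuron firing-rate space, plus small noise. Illustrative only.
t = np.linspace(0, 2 * np.pi, 500)
latent = np.column_stack([np.cos(t), np.sin(t)])   # 2-d latent ring
mixing = rng.standard_normal((2, 50))              # random linear embedding
rates = latent @ mixing + 0.05 * rng.standard_normal((500, 50))

# PCA via SVD: how much variance do the top two components carry?
centered = rates - rates.mean(axis=0)
svals = np.linalg.svd(centered, compute_uv=False)
var = svals ** 2 / np.sum(svals ** 2)
print(var[:2].sum())  # close to 1: the activity occupies a low-dim surface
```

The point of the sketch is the shape of the analysis, not the numbers: the ambient space has fifty axes, the activity does not fill it, and the confined surface is where the structure lives.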
Neural connectivity carries topological structure that flat-graph descriptions cannot detect. Reimann and colleagues, working with reconstructions of cortical microcircuits, found simplicial complexes of dimension up to seven, with topological cavities enclosed by cliques of mutually-connected neurons (Reimann et al. 2017). The dimensions in question are topological, not spatial; the popular reading that “the brain operates in eleven dimensions” is a misreading of what the result shows. The technical content is that the connectivity carries shape that ordinary correlation analysis cannot see, and this shape was extracted by persistent-homology methods that had earlier been shown to detect geometric structure in neural firing correlations that distance-based methods missed (Giusti et al. 2015). The brain was using topological structure, and the structure was real.
Why the convergence is forced
The natural objection is that the convergence is selection effect, not signal. High-dimensional vector spaces are the methodological tool of the era. Linear algebra is the mathematics that runs on graphics chips. Optimisation on long lists of numbers is the only kind of computation we have learned to do at industrial scale. So when an LLM lab needs to encode word meaning, when a quantum theorist writes down the state of a many-qubit system, when a neuroscientist tries to make sense of multi-electrode firing patterns, they each reach for the same toolbox. The convergence, on this reading, says nothing about the world. It says something about the era.
The objection has a clean historical parallel in calculus. Calculus recurs across physics from Newton’s mechanics to general relativity to quantum field theory, and the recurrence does not, on its own, license the inference that “calculus is hidden in nature.” The defensible reading is that rate-of-change is a real feature of physical processes, and any tool that handles rate-of-change well will recur because the feature is real. The convergence on calculus is the world being a place where rate-of-change is a structural feature; the tools that respect the feature are the ones that work. The selection-effect reading of high-dimensional geometry is meant to run the same way.
Part of the critique is right. The three programs are three out of many. Cognitive science, formal linguistics, dynamical systems, and category theory do not reach for high-dimensional vector geometry as their first move. Selecting three programs that do is itself a choice, and the bare fact that all three can be described in such spaces is, on its own, weak evidence that they share anything deep.
What the selection-effect critique misses is the direction of the forcing. In each of the three programs, the geometric structure was not picked off the methodological bench. It was extracted from the data, the formalism, or the proposal itself.
LLM embeddings: early word-embedding work in the 2010s tried representing meaning in a few dozen dimensions and worked badly. The dimension count rose because lower-dimensional embeddings empirically lost distinctions and produced worse downstream-task performance. The volume of relationships the representation had to carry, across hundreds of thousands of tokens, forced the dimension count into the thousands. The modellers did not pick the geometry. The data filtered the geometry through an empirical sieve.
Hilbert space: the dimensional explosion of quantum mechanics is a fact about how quantum systems combine, not a modelling choice. Two qubits do not have four states because the formalism has four slots to fill. They have four states because the system has four physically distinguishable configurations, and any honest description has to track them. A theory that did not respect the growth would predict the wrong probabilities in the laboratory. Hilbert space is settled physics in the same sense the Minkowski metric is.
Neural manifolds: the high-dimensional geometric description of brain activity was discovered in the data, not imposed by the analyst. Gallego and colleagues asked where, in the high-dimensional firing-rate space, population activity sits. The answer was not “everywhere.” The activity is confined to a curved surface inside that space, and the curved surface is where the computation happens. Reimann and colleagues did not bring topology to neural connectivity; the connectivity contained topological structure that flat-graph methods could not see, and persistent-homology tools surfaced it. The data forced topology onto the analysis.
In each of the three cases, the structure is the residue of a constraint imposed by the phenomenon, not the imprint of a methodology imposed by the field.
The calculus parallel cuts the other way once that is recognised. Calculus recurs not because it is fashionable but because rate-of-change is a real structural feature of the world. Any framework that gets traction on processes has to respect the feature. The same move applies here. If the convergence across the three programs is forced by what each is studying, the convergence is telling us something about what is being modelled. The three programs are modelling things that share a structural feature, namely high-dimensional geometric organisation of the relationships among parts. The geometric tools recur because the feature is real, and any framework that gets traction has to respect it.
This narrows the worry. It does not eliminate it. There is an open question whether high-dimensional geometry is the unifying principle the convergence points at, or only a useful approximation to a deeper principle for which we do not yet have a name. The case for the convergence being meaningful is strong. The case for it being meaningful in exactly the way described, with high-dimensional geometry as the principle rather than a placeholder for it, is weaker. The rest of the paper has to keep that distinction visible.
After the target, not by analogy
What survives is this. The convergence is not a methodological coincidence. Each of the three programs was forced into geometric structure by what it was studying. The three substances share a structural feature deep enough that any honest modelling has to respect it. The question changes shape. It is no longer whether the convergence is a coincidence. It is what we are modelling when we model what we call meaning, such that doing it well forces this structure.
The literature is full of an analogical version of this move that needs to be walked past before the careful one can be stated. The analogical version runs: the universe is non-Euclidean, therefore representations of meaning should be non-Euclidean. That reasoning does not hold. The non-Euclidean structure of spacetime arose from particular physical pressures (the invariance of the speed of light, the curvature induced by mass, the gauge structure of the Standard Model), and those pressures do not transfer to language modelling. Any reader who left the previous section thinking “spacetime is hyperbolic, so embeddings should be hyperbolic” was reading it wrong.
The argument I am making is target-fidelity. Meaning is our human label for our cognitive modelling of the universe. A model of meaning is therefore, with one layer of label peeled away, a model of the universe. The question of what shape that model should take is the question of what shape its target has. A model takes the shape of its target, or at the very least is designed after it, the way aircraft were designed after birds without being claimed to be birds.
The universe has visible structural features before any theory is named. It has similarity: “cat” and “kitten” sit nearer in sense than either does to “logarithm,” and any model of the universe at human scale has to encode the fact somewhere. Distance in some space is the natural home. It has hierarchy: “animal” contains “dog” contains “Labrador,” and the structure is a tree, not a chain. Trees grow exponentially with depth. Hyperbolic geometry, where the volume of a ball grows exponentially with its radius, is the natural home for tree-shaped data; Nickel and Kiela demonstrated this concretely in 2017 by embedding the WordNet noun hierarchy in the Poincaré ball and beating Euclidean baselines by orders of magnitude on every metric they cared about (Nickel and Kiela 2017). It has correspondence: the relationship between “father” and “son” is mirrored in the relationship between “tree” and “branch,” and the recognition of analogy is the recognition that two structures share their connectivity even though their substances differ. Topology, the branch of geometry that asks what survives continuous deformation, is the natural home for correspondence; persistent-homology methods make this concrete (Carlsson 2009).
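The exponential-room claim can be checked with the standard Poincaré-ball distance formula (curvature \(-1\)); the coordinates below are arbitrary illustrative points, not embeddings from Nickel and Kiela’s model.

```python
import math

# Poincare-ball distance:
#   d(u, v) = arccosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))
def poincare_dist(u, v):
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    nu = 1 - sum(a * a for a in u)
    nv = 1 - sum(b * b for b in v)
    return math.acosh(1 + 2 * duv / (nu * nv))

# The same Euclidean gap (0.02) yields very different separations:
near_centre = poincare_dist([0.00, 0.0], [0.02, 0.0])    # ~0.04
near_boundary = poincare_dist([0.97, 0.0], [0.99, 0.0])  # much larger
print(near_centre, near_boundary)
```

Tree leaves pushed toward the boundary inherit this stretching, which is why deep hierarchies that crowd together in flat space spread out in the ball.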
The mappings from these three features to their geometric homes are not chosen for convenience. They are forced by the features themselves, in the same sense the special-relativity geometry was forced by experiment. A flat-space model of a tree-shaped concept eventually crowds together leaves it should be separating, because the volume of a flat space does not grow fast enough to give them room. A model with no topology eventually fails to recognise an analogy that would be obvious to a child. The geometry is not a stylistic choice; it is what fits or fails to fit the shape of the target the model is aimed at.
Concrete instances and emergent abstractions
Consider Kevin and George, two golden retrievers. They share roughly 99.9% of their DNA, similar morphology, similar behaviour. They are distinct instances: different worldlines through space and time, different histories, different biographies. Any honest model of meaning at human scale has to handle both their similarity and their distinctness, and these are not the same kind of fact.
Their similarity is a fact about how they deform the local structure of any meaning-substrate that contains them. Both produce the same kind of curvature signature when embedded; the way their presence reshapes the local geometry around them is, to a high approximation, identical. Their distinctness is a fact about their separate trajectories through the substrate: different paths, different events along those paths, different accumulated histories. The substrate that represents them faithfully has to carry both: the shared deformation pattern (their fiber, in the geometric sense, the local curvature signature attached to each instance) and the separate paths (their geodesics, each instance’s worldline through the larger geometry). Same fiber, different geodesics.
Now consider “Golden Retriever” as a category. It is not a thing in the world the way Kevin and George are. There is no separate Platonic Golden Retriever waiting to be embedded somewhere in the substrate. The category is what we name the dense region of fiber-space where Kevin’s fiber, George’s fiber, and every other golden retriever’s fiber concentrate. The category does not sit at any single location; it is the cluster, the density, the persistent topological feature in the geometry of how concrete instances are arranged.
The same move handles abstractions further from physical instantiation. “Justice” is not a thing in the world; it is a label English speakers apply to certain patterns in cognitive activity (deliberations, judgments, moral reasoning, indignation at violation). Each invocation is a concrete cognitive event with its own fiber. “Justice” is the cluster those event-fibers form. There is no Justice node anywhere in the substrate; there is only the cluster, dense enough that a query traversing the substrate naturally lands in its neighbourhood.
The substrate carries concrete instances as fiber/geodesic objects. Categories and abstractions emerge from the connection structure between concrete instances. This is structural realism applied to model-building under a specific further commitment: structures are real, individuals-and-categories are how that structure presents to a finite observer, and the substrate carries the structure without committing to either pure Platonism (categories as separate ontological tier) or pure nominalism (categories as arbitrary impositions). The structure is real; the carving is what an observer does to it.
The target-fidelity claim cannot be defeated the way an analogy argument can. An analogy argument falls when the analogy is shallow: the model of \(A\) does not need to share property \(X\) with \(A\) just because we noticed a resemblance. Target-fidelity is not an inference from a target’s properties to its model’s; it is a statement about what counts as a good model of a target. If the target has multiple geometric regimes, a model that respects only one is, by construction, modelling only one regime. The empirical convergence in the previous section is what target-fidelity looks like from inside the slice: three programs realizing, despite themselves, that the target has more structure than the flat-space tooling can carry.
The claim is a claim, not a derivation. A reader who declines it can still take much of the paper home: the features-to-homes argument is forced by the features themselves, and the architecture sketch rests on the features being real, not on any particular reading of why.
The standard worry about a structural account is that it dissolves correctness. If a cognizer’s internal model is just a structure, and “meaning” is just our word for what the model produces, then any internal structure is meaning of something, and the cognizer is incapable of being wrong. The objection has a clean answer once target-fidelity is the framing. A cognizer is itself a structure inside the universe. Its internal model is another structure, the representational machinery considered as a system of relations. The cognizer’s model is correct about a target to the extent that the structure of the model matches the structure of the target. Correctness is structure-to-structure fit. It is not arbitrary; the universe’s relations either are or are not preserved by the model. What is configuration-dependent is which target the cognizer is modelling and which structural features matter for its purposes. Different cognizers, configured differently by genetics, environment, and training, fit different aspects of the target. They get different “correct” models because they are modelling different sub-structures of the same target, not because correctness has dissolved into preference. The view does not collapse into relativism; it anchors at the substrate.
The position has identifiable kin in the philosophy-of-physics literature. The closest is structural realism. Worrall noticed that when one physical theory replaces another, the entities the old theory posited often get discarded outright (caloric, phlogiston, the aether) while the structural relationships those theories captured between observable quantities tend to survive into the new theory in modified form (Worrall 1989). Maxwell’s equations were written down for an aether-based theory of electromagnetism; the aether went, the equations stayed. Worrall’s diagnosis is that what science tracks across theory change is structure, not the underlying entities. Ladyman and Ross sharpened this into ontic structural realism, the claim that structure is not merely what science best tracks but what fundamentally exists (Ladyman and Ross 2007). The position taken here can fairly be described as ontic structural realism applied to model-building for the universe, with the further specification that concrete instances are the substrate’s primitives and categories emerge from the relations among them.
The closest distant neighbour is Tegmark’s mathematical universe hypothesis (Tegmark 2008). The position here is narrower. This universe has whatever structure it has, and our cognitive models of it (which we label “meaning”) share that structure because they are aimed at it. Whether every consistent mathematical structure exists physically is left open. The convergence and the target-fidelity move are compatible with “this universe is structural and the only universe” and with “one structural universe among many,” and the narrower claim does not need the broader one. Ellis’s standard objection to Tegmark, that the identification of physical reality with mathematical structure throws out the observational content of physical theories as baggage (Ellis 2009), does not transfer here, because observation is itself a structural process inside the universe rather than extra-structural content laid on top of it.
What to build
If meaning shares the host’s regimes, the engineering implication is direct: build the systems that model meaning to span those regimes, not the regimes that happen to fit a graphics chip. Current language models put meaning in flat Euclidean space; the features that have natural homes outside flat space (hierarchy, correspondence, cross-configuration relationships) are forced into the flat home, and the failures the field has spent the last several years cataloguing (compositional brittleness, hierarchical confusion, weak cross-domain transfer) are what that forcing produces. The instinctive response has been to scale through. The substrate-faithful direction is a different bet: if some failures come from geometry not fitting substance, changing the geometry should help in proportion to how much of the failure was geometric.
The substrate
The substrate is one geometric object: a Lorentzian manifold where time is a dimension of the geometry rather than a parameter outside it, with curvature that varies locally based on what is currently embedded. Concrete instances enter the substrate as fiber/geodesic objects. Each instance has a fiber: a local curvature contribution, a deformation signature attached at its location. Each instance has a geodesic: a worldline through the substrate’s spacetime. The fiber is the instance’s structural identity; the geodesic is its history. Same fiber means same kind-of-thing; same geodesic means same instance.
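The fiber/geodesic split maps naturally onto a data structure. The field names below are hypothetical stand-ins, since the text fixes roles, not representations.

```python
from dataclasses import dataclass

# Data-structure sketch of the fiber/geodesic split. All fields are
# illustrative placeholders, not a specification.
@dataclass
class Instance:
    fiber: tuple              # local deformation signature: kind-of-thing
    geodesic: list            # worldline samples: history
    mass: float = 1.0         # how strongly it curves the substrate

kevin = Instance(fiber=(0.9, 0.8), geodesic=[(0.0, 0.0), (1.0, 0.3)])
george = Instance(fiber=(0.9, 0.8), geodesic=[(5.0, 1.0), (6.0, 1.2)])

# Same fiber, different geodesics: same kind-of-thing, distinct instances.
print(kevin.fiber == george.fiber)        # True
print(kevin.geodesic == george.geodesic)  # False
```

The comparison at the end is the whole point of the split: equality of fibers answers “same kind?”, equality of geodesics answers “same individual?”, and the two questions come apart.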
Variable curvature is load-bearing, not a refinement. The mass of a fiber, its significance or weight in the substrate’s economy, drives how sharply it deforms the local geometry. A heavy fiber bends nearby geodesics more; a light one bends them less. Without variable curvature, every instance contributes equally and the substrate cannot represent the difference between a structurally significant concept and a passing reference. Constant-curvature embeddings lose this distinction by construction; they are useful working approximations for restricted domains, but they cannot carry the full type/token machinery. The substrate-faithful version has curvature responding to mass-distribution at every point.
The substrate is faithful to whatever has been input to it, at whatever resolution the input is given. The Parthenon and its individual pillars can both have fibers, if both have been named in the data; if only the Parthenon has been named, the pillars never get fibers. Granularity is set by inputs, not by the substrate’s primitives. The substrate does not pre-decide what counts as a thing.
Four tiers, reframed
The substrate is one object; the four tiers are four kinds of structure visible in it, not four stacked modules. Each tier handles a different scope of what the substrate carries.
Tier 1, Euclidean local. Short-range compositional structure within a worldline. The work the embedding space has been doing since word2vec. The Bronstein et al. geometric-deep-learning framework (Bronstein et al. 2021) is the right tool here, and it stays.
Tier 2, hyperbolic hierarchical. Hierarchy is where flat space starts to fail. The volume of a Euclidean ball grows polynomially with radius; trees branch exponentially. Embedding a deep hierarchy in Euclidean space runs out of room, and the resulting representation crowds together leaves it should be separating. Hyperbolic geometry, with exponential volume growth, is the geometry that emergent category-clusters take their natural shape in. Categories are not separate nodes; they are dense regions of fiber-space, and hyperbolic curvature is what those regions look like when the underlying instance-relations are tree-shaped. The literature is substantial: Nickel and Kiela’s Poincaré-ball and Lorentz-model embeddings (Nickel and Kiela 2017, 2018), Ganea and colleagues’ operational toolkit (Ganea et al. 2018), MERU contrastive vision-language (Desai et al. 2023), the HypLoRA finding that trained language models already host tree-shaped structure that flat-space training has been forcing through the wrong-shaped doorway (Yang et al. 2025), and the HELM family of fully hyperbolic billion-parameter models (He et al. 2025). Variable curvature sharpens this: meaning-dense regions of the substrate curve more sharply than meaning-sparse ones, the way mass-energy curves spacetime more sharply where there is more of it.
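The volume contrast the tier rests on is elementary to compute: the area of a radius-\(r\) disc is \(\pi r^2\) in the Euclidean plane and \(2\pi(\cosh r - 1)\) in the hyperbolic plane, polynomial against exponential growth.

```python
import math

# Disc area as a function of radius in the two geometries (curvature -1
# for the hyperbolic case).
def euclid_area(r):
    return math.pi * r ** 2

def hyper_area(r):
    return 2 * math.pi * (math.cosh(r) - 1)

for r in (1, 5, 10):
    print(r, round(euclid_area(r), 1), round(hyper_area(r), 1))
# At r = 10 the hyperbolic disc is already hundreds of times larger,
# and the ratio keeps growing exponentially with r.
```

A tree with branching factor \(b\) has roughly \(b^d\) leaves at depth \(d\); the exponential column is the one that can give each leaf its own room.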
Tier 3, topological correspondence. The tier shifts from “how are instances related?” to “what survives when we map between regions of fiber-space?” Analogy. Cross-domain transfer. The recognition that two structurally similar things in different fields are doing the same work. Topology, in the persistent-homology and topological-data-analysis tradition (Carlsson 2009; Giusti et al. 2015; Reimann et al. 2017), is the right home. Persistent homology is also the rigorous formalisation of “granularity is set by inputs”: features (clusters, holes, voids) are tracked across scales, with their birth and death across resolution determining what counts as a stable structural unit. The engineering here is less mature than the hyperbolic tier, roughly where hyperbolic networks were in 2018: the components exist, integration into large systems is still being worked out.
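The birth/death framing can be sketched in the simplest case, zeroth homology of points on a line, where component deaths are just the sorted gaps between neighbours. A cartoon of persistent homology, not a TDA library.

```python
# Zeroth-persistent-homology sketch for 1-d point sets: every point is
# born as its own component at scale 0, and a component dies when the
# growing scale bridges the gap to its neighbour (single linkage).
def h0_deaths(points):
    pts = sorted(points)
    return sorted(b - a for a, b in zip(pts, pts[1:]))

# Two well-separated clusters: four merges happen early, one very late,
# so one extra component persists across a wide range of scales.
deaths = h0_deaths([0, 1, 2, 50, 51, 52])
print(deaths)  # [1, 1, 1, 1, 48]
```

The long-lived death value (48) is the formal content of “granularity is set by inputs”: the two clusters are stable structural units precisely because they survive across many scales, while the within-cluster distinctions die almost immediately.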
Tier 4, categorical relational. The first three tiers handle features within a single realised configuration of the substrate. The fourth handles relations between different ways the substrate can be carved, which is what different observer specifications produce. Different fine-tunings, different priors, different observer states, different language conventions: each is a configuration of the same substrate, and the structural relations between configurations are themselves data the substrate has to carry. Category theory is the natural language for this layer. Tier 4 is where the architecture’s I/O machinery, described next, structurally lives.
The closest published architecture is PHyCLIP, which takes hierarchy seriously and builds a Cartesian product of hyperbolic factors joined by an \(\ell_1\) product metric (Yoshikawa and Matsubara 2026). PHyCLIP multiplies within the hyperbolic regime; the proposal here decomposes across qualitatively different regimes and lets variable curvature respond to instance-mass within each. Forcing all three regimes into a hyperbolic product would do better than forcing them into flat space, but it would still be forcing. The substrate-faithful move is to let each regime have the geometry it wants.
Observers: how language enters and exits the substrate
The substrate is language-free and modality-free. Concrete instances enter it as fibers; abstractions emerge from cluster density; English, French, Rust, Python are nowhere in the substrate itself. Surface language is observer-side, not substrate-side.
The architecture handles this through a bounded division of labour. An encoder translates input (a string of English text, a code snippet, a perceptual signal) into a substrate-physics intervention: a mass-distribution that biases observer trajectories, an initial fiber, a deformation that the substrate’s dynamics then propagate. The encoder can be initialised from existing pretrained models and fine-tuned to produce substrate-distributions rather than discrete tokens. It does not add English to the substrate; it converts English into substrate-physics.
An observer is a substrate-native entity, a fiber/geodesic with its own mass, that traverses the substrate driven by the encoder’s mass-distribution and reports what it experiences. Different observers are trained for different output languages or modalities. An English observer expresses the deformation patterns it experienced as English tokens. A French observer expresses the same deformation patterns as French. A code observer expresses them as Rust, Python, or any other programming language it has been trained on.
The structural payoff is that the substrate is a universal interlingua. The same substrate state, interrogated by different observers, produces semantically equivalent outputs in different surface forms. Translation by interlingua is what target-fidelity predicts: surface-string to substrate-state to surface-string, with the substrate carrying the meaning-preserving structural content. Cross-language code translation, multimodality, and domain specialisation all fall out of the same architectural pattern: more observers, same substrate.
The design also bounds a real concern about the substrate-faithful direction. A substrate that does no recognition on its own would be a “non-Euclidean database engine” with the actual cognitive labour displaced into the I/O. The observer architecture splits that worry into two pieces: the substrate carries universal structural patterns (whatever cognitive activity has produced fibers for, in the geometry those fibers deformed); each observer does one bounded job, translating between one language or modality and substrate state. The “real cognition” is the substrate plus the encoder; observers are translators, not minds. Encoder-decoder with shared latent representations is standard practice (Bronstein et al. 2021); what is novel here is that the latent is a substrate-faithful geometric object and the observers are substrate-native entities, with gauge-equivariant operations on the fiber bundle structure as the natural mathematical machinery.
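The division of labour can be made explicit as an interface sketch. Everything here is hypothetical scaffolding: the class names, the vector stand-in for a substrate state, and the trivial encode/express bodies are placeholders for the geometric machinery described above.

```python
# Toy interface for the encoder/observer split (all names hypothetical).
# The "substrate state" is a plain vector here; the real proposal uses
# the fiber/geodesic geometry described in the text.

class Encoder:
    """Translates surface input into a substrate-side intervention."""
    def encode(self, text: str) -> list[float]:
        # Stand-in statistic, not a real model: a real encoder would
        # produce a mass-distribution over the substrate.
        return [sum(text.encode()) / max(len(text), 1), float(len(text))]

class Observer:
    """Substrate-native translator: one bounded job, one surface form."""
    def __init__(self, surface_form: str):
        self.surface_form = surface_form

    def express(self, state: list[float]) -> str:
        # Stand-in: report the deformation pattern in this surface form.
        return f"[{self.surface_form}] state={state}"

enc = Encoder()
state = enc.encode("the cat sat on the mat")
english, french = Observer("en"), Observer("fr")
# Same substrate state, two surface forms: the interlingua pattern.
print(english.express(state))
print(french.express(state))
```

The point of the sketch is architectural, not computational: observers never see each other’s surface forms, only the shared state, which is what makes the cross-observer equivalence test in the falsification list well-posed.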
Dynamics: always-on plasticity over a single substrate
Static placement is a snapshot; the host this paper takes seriously has dynamics, not just placement. Fibers update because they are embedded geometry, and embedded geometry responds to whatever is embedded in it; the substrate’s curvature is constituted by the current fibers. Plasticity is therefore the substrate’s only mode of operation; the train/inference distinction goes. Inputs deform the local geometry the way mass-energy curves spacetime, whether the system is “training” or “running.”
Several active research lines have already begun to bend the static-weights assumption. Hypernetworks let one network generate the weights of another, so the inner network’s geometry is produced dynamically rather than fixed (Ha et al. 2017). Liquid neural networks make weights continuous functions of state and time, evolving during inference rather than after training (Hasani et al. 2021). Test-time training keeps weights updating at deployment via self-supervised objectives on the test inputs themselves (Sun et al. 2020, 2024). Ramsauer and colleagues showed that the attention mechanism in modern transformers is, mathematically, a single update step of a dense Hopfield network (Ramsauer et al. 2021; Hopfield 1982; Krotov and Hopfield 2016); attention is one iteration of attractor dynamics with the rest of the iterations cut for tractability. None of these is a finished substrate-faithful architecture; they are partial relaxations of the static-weights assumption, all moving in the direction the geometric reading predicts.
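The Hopfield-attention identity is compact enough to show directly. One dense-Hopfield update is \(\xi \leftarrow X^{\top}\,\mathrm{softmax}(\beta X \xi)\), which is exactly the softmax-attention computation; iterating it runs the attractor dynamics that attention truncates after one step. A minimal sketch, with toy stored patterns:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def hopfield_step(patterns, query, beta=4.0):
    """One update of a dense (modern) Hopfield network:
    query <- patterns^T . softmax(beta * patterns . query).
    This single step is the softmax-attention computation."""
    scores = [beta * sum(p_i * q_i for p_i, q_i in zip(p, query))
              for p in patterns]
    weights = softmax(scores)
    return [sum(w * p[i] for w, p in zip(weights, patterns))
            for i in range(len(query))]

# Stored patterns; a noisy query nearer the first one.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
q = [0.8, 0.3, 0.0]
for _ in range(3):          # attention runs this loop exactly once
    q = hopfield_step(X, q)
print(q)                    # drawn toward the stored pattern [1, 0, 0]
```

With more iterations the query settles into the nearest attractor; transformers keep only the first step, which is the “cut for tractability” the text describes.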
A second feature of the dynamics is harder to see from inside the static-weights frame, and worth marking. The host pairs gravitational clustering at one scale with a dispersing pressure at another: galaxies cluster, voids hold open, and the structure observed at cosmological scale is the joint product of the two opposing dynamics. A meaning architecture with only an attractive pull will collapse over enough iterations into one mega-cluster, every concept dragged toward every other concept until the representation becomes a single point. The substrate-faithful pattern is attraction-plus-repulsion: a slow ambient dispersion that prevents the representation from collapsing into monoculture, paired with the like-attracts-like pull that builds local structure. A small literature already does versions of this with contrastive objectives and repulsive losses. The substrate-faithful framing is that those terms are not regularisation tricks; they are the second half of a two-sign dynamic the host has at the cosmological scale and the architecture should respect at the representational one.
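The collapse claim is easy to demonstrate numerically. The toy below is a 1-D stand-in for the two-sign dynamic, not the substrate’s actual equations: each point is pulled toward the centroid (attraction) and pushed away from its nearest neighbour (repulsion). With repulsion off, the spread collapses toward a point; with a weak repulsive term, structure survives.

```python
def step(points, attract=0.1, repel=0.0):
    """One toy update: pull toward the centroid, push away from the
    nearest neighbour. A stand-in for attraction-plus-repulsion,
    not the substrate's dynamics."""
    centroid = sum(points) / len(points)
    new = []
    for i, p in enumerate(points):
        nearest = min((q for j, q in enumerate(points) if j != i),
                      key=lambda q: abs(q - p))
        pull = attract * (centroid - p)
        gap = p - nearest
        push = repel * (gap / (abs(gap) + 1e-6))   # unit-strength repulsion
        new.append(p + pull + push)
    return new

def spread(points):
    return max(points) - min(points)

pts = [-2.0, -1.0, 1.0, 2.0]
only_attract, with_repel = pts[:], pts[:]
for _ in range(200):
    only_attract = step(only_attract, attract=0.1, repel=0.0)
    with_repel = step(with_repel, attract=0.1, repel=0.05)
print(spread(only_attract), spread(with_repel))  # collapsed vs held open
```

The qualitative outcome is the point: attraction alone produces the mega-cluster; the second sign holds the representation open, which is the role the contrastive and repulsive-loss literature plays in existing systems.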
Energy-minimisation framings are an artefact of Riemannian intuition. Lorentzian signature does not give a bounded-below energy: the time term comes in with the opposite sign, and the fully Lorentzian version of the dynamics is action-principle-shaped rather than energy-minimisation-shaped (the way general relativity’s geodesics arise from extremising the Einstein-Hilbert action rather than rolling downhill in a potential well). For the framework-level sketch here, the Hopfield reading is the right intuition; the implementation paper takes the Lorentzian version on.
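The contrast can be stated in two lines. The Hopfield reading rolls downhill in a bounded-below energy; the Lorentzian reading extremises an action whose time term enters with the opposite sign, so there is no well of that kind to roll into:

```latex
% Energy-minimisation picture (Riemannian/Hopfield intuition):
E(\xi) = -\tfrac{1}{2}\,\xi^{\top} W \xi, \qquad \dot{\xi} = -\nabla_{\xi} E(\xi)

% Action-principle picture (Lorentzian): geodesics extremise an action,
% and the metric signature makes the analogue of E unbounded below.
S[x] = \int \sqrt{-\,g_{\mu\nu}\,\dot{x}^{\mu}\dot{x}^{\nu}}\;d\lambda, \qquad \delta S = 0
```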
Safety and stability under always-on plasticity come from system-level mechanisms layered on top: consolidation cycles, replay, snapshots that let a deployment roll back to a known geometry, differential plasticity rates across regions, and rate limits on how fast curvature can update under any single input. The framework’s load-bearing claim is that the train/inference split is not a feature the substrate-faithful architecture should preserve.
Falsification and the phased path
A research direction that cannot be falsified is hand-waving. The criteria below are concrete enough for the implementation papers to run.
- Out-of-distribution depth generalisation. Train on shallow hierarchies, test on deeper ones. Flat space physically runs out of volume as a hierarchy deepens; the substrate does not. A parameter-matched flat baseline trained on the same data should fail at depths where the substrate continues to succeed.
- Spontaneous cluster formation. Do hierarchical clusters emerge from instance fibers under the substrate’s dynamics, even when no category nodes are explicitly trained? The original phrasing of this test was “can you reproduce a given hierarchy”; the substrate-faithful version is stronger: the substrate should produce hierarchical structure from instance density alone, without category supervision.
- Gromov \(\delta\)-hyperbolicity of hidden states. A quantitative tree-likeness measure applied to the substrate’s internal geometry. The substrate-faithful architecture should produce hidden states with low \(\delta\); flat-space architectures should not. This is a mechanistic test, not just an outcome test.
- Cross-observer semantic equivalence. Train two observers on the same substrate (English and French, or two programming languages). Probe the substrate at states for which no parallel data has been seen. Their outputs should be semantically equivalent. This is the sharpest prediction the architecture makes; flat-space approaches with shared multilingual embeddings can produce something like this only after seeing parallel data, while the substrate-faithful version predicts it for unseen substrate states because the substrate carries the structural work.
- Structural-versus-positional dissociation. Construct a dataset where structural similarity and spatial proximity dissociate, such as twins separated across continents. The substrate should keep them close on the structural axis (their fibers are similar) while letting them drift apart on the spatial axis (their geodesics diverge). A flat baseline that conflates structural and positional similarity into one coordinate cannot do this.
- Time-dilation of high-entropy concepts. If meaning-content curves the substrate the way mass-energy curves spacetime, high-entropy concepts (regions where many distinctions are crowded together) should slow geodesic flow through their neighbourhood. This is a Lorentzian signature that flat-space and Riemannian baselines should not produce.
- Data-efficiency at low data scales. Substrate-faithful geometry should produce robust patterns from less data than flat-space, because the geometry is not fighting the structure of the data. Train substrate and flat-space baselines on progressively smaller datasets and find the point where each fails to produce stable cluster geometry.
- Parameter-matched classical baselines. Every comparison should be against a flat-space classifier with the same parameter count, trained on the same data, on the same compute budget. Without parameter-matching, geometric and non-geometric explanations of any apparent advantage cannot be separated.
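Of the criteria above, the Gromov \(\delta\) test is the most self-contained, and the measure itself fits in a few lines. The sketch below computes the smallest \(\delta\) satisfying the four-point condition over a finite pairwise-distance matrix; applying it to hidden states would mean feeding in their pairwise distances. This is the textbook definition, not an optimised implementation.

```python
from itertools import product

def gromov_product(d, x, y, w):
    """Gromov product (x|y)_w = (d(x,w) + d(y,w) - d(x,y)) / 2."""
    return (d[x][w] + d[y][w] - d[x][y]) / 2

def gromov_delta(d):
    """Smallest delta satisfying the four-point condition
    (x|z)_w >= min((x|y)_w, (y|z)_w) - delta  for all x, y, z, w.
    d is a symmetric pairwise-distance matrix; 0 means exactly tree-like."""
    n = len(d)
    delta = 0.0
    for x, y, z, w in product(range(n), repeat=4):
        slack = min(gromov_product(d, x, y, w),
                    gromov_product(d, y, z, w)) - gromov_product(d, x, z, w)
        delta = max(delta, slack)
    return delta

# A path 0-1-2-3 is a tree: delta = 0. A 4-cycle is not: delta = 1.
path  = [[abs(i - j) for j in range(4)] for i in range(4)]
cycle = [[min(abs(i - j), 4 - abs(i - j)) for j in range(4)] for i in range(4)]
print(gromov_delta(path), gromov_delta(cycle))
```

The brute-force loop is \(O(n^4)\), so real evaluations sample quadruples; the prediction stays the same: low \(\delta\) for substrate hidden states, higher \(\delta\) for flat-space baselines on the same hierarchical data.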
The shape of the prediction is regime-specific advantage: gains on hierarchical, analogical, and cross-domain regimes, matched performance on flat-space-friendly ones. Outperforming nowhere falsifies the principle; outperforming everywhere by similar margins implicates something non-geometric.
The full architecture is research-grade ambitious. The implementation path is phased to make it shippable as a research program rather than a monolith.
- Phase 1: hyperbolic-only substrate, single observer, encoder/decoder built on top of the existing hyperbolic-network literature. Tests out-of-distribution depth generalisation, spontaneous cluster formation, and Gromov \(\delta\)-hyperbolicity. If Phase 1 fails to beat parameter-matched flat-space baselines, the substrate-faithful claim’s hyperbolic prediction does not pay off and the project pivots.
- Phase 2: add variable curvature with mass-driven local deformation. Tests structural-versus-positional dissociation, time-dilation of high-entropy concepts. If variable curvature does not outperform constant-curvature on these tests, the architecture ships at constant curvature.
- Phase 3: add a second observer in a different natural language or programming language. Tests cross-observer semantic equivalence on substrate states the system has never seen parallel data for. The sharpest test of whether the substrate carries the structural work or whether observers are doing it.
- Phase 4: full architecture with persistent-homology I/O, gauge-equivariant operations, multimodal observers. Tests data-efficiency at low data scales and any remaining criteria from the falsification list.
Each phase tests one piece of the substrate-faithful claim. Each phase failing teaches the project something specific without killing the whole research direction.
The argument’s reach
The grounded part of the argument is what is forced by physics and confirmed by experiment. The geometry of spacetime has the hyperbolic structure the literature has established for a hundred years. We are observers calibrated to a regime where the curvature does not show. Three contemporary research programs have, independently, been forced into high-dimensional geometric structure by what they study. None of these moves is speculative.
The move from convergence to target-fidelity, the claim that “meaning” is our word for our cognitive modelling of the universe and that a model of meaning is therefore aimed at the universe and should take its shape, is a philosophical proposal. Forced by neither the physics nor the empirical convergence, and refuted by neither. A reader can decline it. If they take it, the architecture sketch follows as the natural engineering implication. If they decline it, the diagnosis of where flat-space architectures will run out of road still has independent purchase, because the features-to-homes argument rests on the features themselves rather than on the target-fidelity reading of why they are forced.
The architecture sketch is a proposal at the framework level. It commits to a Lorentzian substrate, concrete instances as fiber/geodesic objects, abstractions as emergent cluster density, variable curvature responsive to instance mass, observer-mediated translation between substrate state and surface output, always-on plasticity, and an explicit list of falsification criteria with a phased empirical path. It does not commit to specific equations, specific Lagrangians, or specific implementations of the tiers. The implementation papers that test the criteria phase by phase are the next artifact.
A few things this paper deliberately leaves alone: the metaphysics of existence (the structural picture is consistent with several resolutions of the Tegmark vs. structural-realist vs. relationalist disputes), a specific reading of quantum mechanics (the Hilbert-space arena’s role in the convergence holds across many-worlds, Copenhagen, pilot-wave, and consistent-histories alike), and any specific quantum-substrate theory of consciousness (the convergence is robust to the more speculative consciousness programs turning out wrong).
What is on the table is a research direction. The honest reading of the convergence is containment rather than analogy. The engineering implication is a substrate-faithful representation that spans the regimes the host actually presents, with concrete instances as primitives, abstractions as emergent geometry, observers as the bounded translation layer, and a falsification list concrete enough to test phase by phase.
The work of the next decade in machine learning will not, I think, be more transformers. It will be figuring out what shape the universe is, since that is what our models are aimed at, and building systems that respect that shape. The first architecture that gives up the flat-space defaults and earns its keep on a task flat space does badly will tell us whether the picture in this paper was the right reading or only a suggestive one. I do not know which way that experiment will fall. I think it is the experiment worth running.