What to build
The previous sections argued that meaning has features which want different geometric homes, and that the host meaning lives in is itself a geometric object with more than one regime. If that argument is right, the engineering implication is direct. Build the systems that model meaning to span the regimes the host presents, not the regimes that happen to fit a graphics chip.
That is what I want to sketch here. Not an implementation. The sketch.
I am building a small prototype of one piece of this. The paper that demonstrates a substrate-faithful representation beating current architectures on the right tasks is a separate one. This section is about the shape, not the build.
The diagnosis
Current language models put meaning in Euclidean space [the flat-space geometry where the Pythagorean theorem holds and parallel lines never meet]. A token’s meaning is a list of several thousand numbers treated as coordinates in a flat space, and the linear-algebra operations that drive attention and gradients are the ones that work cleanly there. The choice was not made because flat space is what meaning needs. It was made because flat space is what the tooling for graphics chips was built around.
Convenience is a fine reason to pick a tool. The problem is that the tool is wrong about what meaning actually is. §06 made the structural argument: meaning has features (similarity, hierarchy, correspondence) with natural geometric homes, and only one of those homes is flat. Forcing all three into the flat home is what current architectures do, and the failures the field has spent the last five years cataloguing (compositional brittleness, hierarchical confusion, weak cross-domain transfer) are what that forcing produces.
The field’s instinct has been to scale through the failures. Bigger models, more tokens, more compute. That works for a while, and there are signs of a ceiling. The substrate-faithful direction is a different bet. If some failures come from geometry not fitting substance, changing the geometry should help in proportion to how much of the failure was geometric.
A unified geometric substrate
Here is the shape I think the next generation of systems should take. The architecture is one geometric object — a Lorentzian substrate where time is a dimension of the geometry rather than a parameter outside it — viewed through four tiers, each picking up a feature the others abstract away from. The tiers are not stacked modules glued to one another. They are aspects of the same structure: metric, curvature, connectivity, equivalence-class. The local Euclidean piece is the flat tangent space the curved tiers live on, the way Minkowski space is the local tangent picture of general relativity’s curved spacetime. Composition across regimes works because the regimes are aspects of one substrate, not because four modules have been bolted together.
The substrate is the geometry: the curvature of the manifold, the connectivity of the topology, the categorical structure of the relations. Activations move through that geometry the way particles move through spacetime curvature in general relativity. The geometry shapes what trajectories are possible. The trajectories, when the substrate’s curvature updates in response to inputs, reshape the geometry back. That two-way relationship is the dynamics section. Both halves are needed. A substrate without dynamics is a frozen scaffold. Dynamics without a substrate is flow with no terrain.
Four tiers, each a different lens on the substrate.
Tier 1: Euclidean local
The first tier is what current transformers already do well. Short-range compositional structure. The work the embedding space has been doing since word2vec. Tokens get represented as coordinates in flat space, and flat-space operations (linear projections, attention, softmax) handle the local meaning-mixing that turns a token sequence into a contextualized representation.
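To keep the tier concrete, here is a minimal sketch of the flat-space operation this tier retains, in plain NumPy. The weight matrices are placeholders for learned projections, not a proposal about their values.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    # Scaled dot-product attention: the local meaning-mixing of Tier 1.
    # Every operation is a flat-space one: matrix products, inner products,
    # and a softmax over those inner products. Nothing here changes in the
    # substrate-faithful architecture; this tier is kept as-is.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```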
Bronstein, Bruna, Cohen, and Veličković argued in 2021 that the major architectures of deep learning, convolutional networks, graph networks, transformers, and equivariant networks, are all instances of one unified geometric framework (Bronstein et al. 2021). Their point: deep learning is not incidentally geometric. It is fundamentally geometric, because what makes learning possible is the symmetries and invariances a well-chosen geometric prior encodes. The Euclidean tier is the regime where that framework is exactly the right tool. It stays.
Tier 2: Hyperbolic hierarchical
Hierarchy is where flat space starts to fail. Trees branch exponentially: a root has children, each child has children, each grandchild has children, and the count doubles or triples at every level. Flat space grows polynomially: in the plane, a disk of radius two has four times the area of a disk of radius one; in three dimensions, eight times the volume. To embed a deep hierarchy in a Euclidean space, the volume has to expand fast enough to give every leaf its own location, and it does not. The branches run out of room. Distant cousins get squashed into neighbors, and the geometry stops carrying the structure.
Hyperbolic geometry [the geometry of negatively curved space, where distances stretch exponentially with depth and the parallel postulate of Euclidean space fails] does not have this problem. The volume of a hyperbolic ball grows exponentially with its radius, which is exactly the rate hierarchies branch at. Embedding a tree in hyperbolic space is what hyperbolic space was built for. The geometry has, in a precise sense, the same shape as the data.
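The growth-rate match is easy to make quantitative. A toy comparison using standard volume formulas (the hyperbolic figure is the ball volume in three-dimensional hyperbolic space with curvature \(-1\)):

```python
import math

# Nodes in a binary tree at depth d: 2**d (exponential).
# Euclidean ball of radius d in three dimensions: volume ~ d**3 (polynomial).
# Hyperbolic ball of radius d in H^3 with curvature -1: pi * (sinh(2d) - 2d),
# which grows like e**(2d) -- the same rate the tree branches at.
for d in (2, 4, 8, 16):
    tree_nodes = 2 ** d
    euclidean_volume = (4 / 3) * math.pi * d ** 3
    hyperbolic_volume = math.pi * (math.sinh(2 * d) - 2 * d)
    print(d, tree_nodes, round(euclidean_volume), f"{hyperbolic_volume:.3g}")
```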
Nickel and Kiela showed this concretely in 2017 (Nickel and Kiela 2017). They took the Poincaré ball [a model of hyperbolic geometry that fits inside the Euclidean unit ball, with distances that stretch as you approach the boundary], placed hierarchical data inside it (the WordNet noun taxonomy, social networks, citation graphs), and found that hyperbolic embeddings preserved the hierarchical relationships using orders of magnitude fewer dimensions than Euclidean ones needed. Two dozen hyperbolic dimensions did the work that hundreds of Euclidean ones could not. The next year they extended the result to the Lorentz model, which is numerically more stable to train in (Nickel and Kiela 2018). Ganea, Bécigneul, and Hofmann then built the operational toolkit: hyperbolic versions of the linear layers, recurrent cells, and softmax operations a neural network needs to actually learn (Ganea et al. 2018). By 2018, the regime had its working machinery.
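The distance function they optimize is short enough to quote. A minimal sketch in NumPy; the example points are illustrative, not from their data.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    # Distance in the Poincare ball model (curvature -1), the quantity
    # Nickel and Kiela (2017) optimize to embed hierarchies:
    # d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2))).
    # Points near the boundary are exponentially far from everything,
    # which is where hyperbolic space puts the leaves of a deep tree.
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / max(denom, eps))

# A root near the origin stays close to both leaves, while the two leaves
# near the boundary end up far from each other.
root = np.array([0.0, 0.0])
leaf_a = np.array([0.95, 0.0])
leaf_b = np.array([0.0, 0.95])
print(poincare_distance(root, leaf_a), poincare_distance(leaf_a, leaf_b))
```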
The last few years pushed this to language-model scale. Desai and colleagues introduced MERU in 2023, a contrastive vision-language model trained in hyperbolic space (Desai et al. 2023). They showed the approach scales across modalities, learns the kind of interpretable hierarchy hyperbolic geometry is built to host, and stays competitive with Euclidean baselines on the tasks where hierarchy is not the bottleneck. Yang and colleagues showed in 2025 that the token embeddings inside standard large language models already exhibit the power-law clustering and tree-shaped structure hyperbolic geometry naturally hosts (Yang et al. 2025). Their HypLoRA method fine-tunes those embeddings directly in hyperbolic space and beats the equivalent Euclidean fine-tuning on the tasks where hierarchy actually matters. He and colleagues’ HELM family the same year built fully hyperbolic large language models at the billion-parameter scale (He et al. 2025). This is no longer a fringe direction.
One distinction worth marking. The work cited above uses constant-curvature hyperbolic space, where the curvature is a single negative scalar fixed before training. That is the cosmological-principle approximation, the simplifying assumption that lets cosmologists treat the universe as homogeneous on the largest scales. The substrate-faithful reading is variable curvature: meaning-dense regions of the substrate curve more sharply than meaning-sparse ones, the way mass-energy curves spacetime more sharply where there is more of it. Constant-curvature hyperbolic embeddings are a useful working approximation, not the endpoint. The implementation paper this sketch points toward will test variable-curvature dynamics directly.
Tier 3: Topological correspondence
The third regime is where the question shifts from “how are these things related?” to “what survives when we map between domains?” Analogy. Cross-domain transfer. The recognition that two structurally similar things in different fields are doing the same work. All of these are about correspondence rather than distance.
The right home for correspondence is topology [the branch of mathematics that studies the properties of shapes that survive continuous deformation, where what matters is how things connect rather than how far apart they are]. Topology asks what is preserved when you stretch, bend, or fold a structure without tearing it. That is exactly the right question for analogy. “The heart pumps blood” and “the pump pumps water” are the same shape: a chamber, a flow, a substance moved through a system by a periodic action. The substances are different. The connection pattern survives the change. Topology is the math that names what survived.
Carlsson’s 2009 introduction to topological data analysis laid out the toolkit (Carlsson 2009). Persistent homology, mapper algorithms, and simplicial complexes [combinatorial objects built by gluing edges, triangles, tetrahedra, and their higher-dimensional analogues together] capture the connectivity structure of a dataset in a way distance-based methods miss. Giusti, Pastalkova, Curto, and Itskov applied these tools to neuroscience in 2015 (Giusti et al. 2015). They showed that persistent-homology methods detect geometric structure in neural firing correlations that ordinary correlation analyses cannot see. The structure was hiding in plain sight; the wrong lens was simply blind to it. Reimann and colleagues extended this in 2017 with a striking result: neural connectivity in the cortex forms simplicial complexes up to dimension seven, with topological cavities enclosed by cliques of up to eleven connected neurons (Reimann et al. 2017). The brain was using topological structure, the structure was real, and it was invisible to anyone looking at the wiring as a flat graph.
A topological tier would do the same kind of work for meaning. Components that track which structural patterns persist across changes of representation. What makes “the heart pumps blood” and “the pump pumps water” the same shape even though their substances differ. The engineering here is less mature than the hyperbolic tier, roughly where hyperbolic networks were in 2018. The components exist. Integration into large systems is still being figured out.
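As a sense of how lightweight the core machinery is, here is a minimal sketch of the simplest persistent-homology computation: tracking when connected components of a point cloud merge as a distance threshold grows. This is dimension zero only; the higher-dimensional cavities Reimann and colleagues measured need a real TDA library.

```python
import numpy as np
from itertools import combinations

def h0_persistence(points):
    # Zero-dimensional persistent homology of a point cloud: every point is
    # a component born at threshold 0; components die (merge) as the
    # Vietoris-Rips threshold grows. Merges are found with union-find over
    # edges sorted by length.
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = sorted(
        (np.linalg.norm(points[i] - points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # two components merge: one dies here
            deaths.append(dist)
    return deaths                     # one component survives forever

# Two tight clusters: the small merge distances are within-cluster noise,
# the one large merge distance is the gap that persists between clusters.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
print(h0_persistence(pts))
```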
Tier 4: Categorical relational
The fourth tier is the one §02’s Copernican move makes concrete. The first three tiers handle features of meaning within a single realized configuration of the model. The fourth handles the structure that connects configurations to each other and to the larger object they are slices of. Different fine-tunings, different priors, different observer states: each is a configuration of the same substrate, and the relationships between configurations are themselves structural data the substrate carries.
Category theory [the branch of mathematics that studies systems of objects together with the structure-preserving mappings between them, abstracting away from what the objects are to focus on how they relate] is the natural language for this. A category does not ask what its objects are; it asks how they map to each other and what those mappings compose into. Configurations of a substrate-faithful model, treated as objects, with mappings between them representing parameter changes, prior shifts, or observer changes, fit the categorical description directly. The cross-configuration dynamics layer in the dynamics section is what operates on this tier.
I am marking this tier explicitly speculative. There is a small literature on category-theoretic approaches to language and machine learning, but nothing at the scale or maturity of the first three tiers. I include it because the shape of the argument now requires something at this layer rather than because a system exists that occupies it. If the third tier handles cross-domain correspondence well enough and the dynamics layer can be implemented without explicit categorical scaffolding, the architecture survives without a working Tier 4. If the cross-configuration insight requires categorical structure to land, this tier is the placeholder for the engineering that follows.
Why one substrate across regimes, not a product within one
The closest published architecture to what I am sketching is PHyCLIP, introduced by Yoshikawa and Matsubara at ICLR 2026 (Yoshikawa and Matsubara 2026). They take the hierarchy-and-compositionality observation seriously and build a vision-language model whose representations live in a Cartesian product of hyperbolic factors, joined by an \(\ell_1\) product metric. Each factor hosts a hierarchy of its own; compositionality lives in the product. It works, and it is some of the cleanest evidence in the literature that going beyond the flat embedding is worth the engineering trouble.
The architecture I am proposing differs from PHyCLIP in a specific way. PHyCLIP multiplies within the hyperbolic regime. The product structure stays inside hyperbolic space. The decomposition is “many hyperbolic factors, joined.” The decomposition I am proposing is “qualitatively different regimes as aspects of one substrate.”
The reason for the difference is the §06 argument. Meaning has features in regimes that are not all hyperbolic. Local composition is Euclidean. Hierarchy is hyperbolic. Cross-domain correspondence is topological. Forcing all three into a hyperbolic product would do better than forcing all three into a flat space (because hierarchy at least gets its proper home), but it would still be forcing. The substrate-faithful move is to let each regime have the geometry it actually wants.
Whether the stack outperforms the product is a question for experiment, not argument. The claim here is that the stack is the form the substrate-faithful principle predicts.
Dynamics: time, plural pull, outside-slice, and cross-configuration
The picture so far is static. Tiers placed next to each other, each handling its regime, with no account of how regions of the representation interact over time. That is not substrate-faithful. The host this paper takes seriously has dynamics, not just placement, and the architecture should too. Four layers of dynamics fall out of the geometric escalation §02 walked through, and each maps onto an existing literature in machine learning rather than starting from scratch.
Time evolution
Time is a dimension of the substrate, not a parameter outside it. §02 made this commitment for the host: spacetime is a four-dimensional object with a hyperbolic structure, and what we experience as flow is the slice’s view of geodesics through the larger geometry. The architecture mirrors that. The substrate’s temporal extension is intrinsic, and the dynamics describe how trajectories move through a geometry that already has time as one of its dimensions, the way physics describes a particle’s worldline rather than its instantaneous position re-evaluated each tick.
Static placement is a snapshot. The night sky is not a snapshot. What we see when we look up is the accumulated result of weak gravitational attraction summed across billions of years: dust collapses into clouds, clouds collapse into stars, stars cluster into galaxies, galaxies cluster into superclusters. Each of those structures is the slow convergence of a process, not the instantaneous output of a one-shot calculation. The corresponding move for a meaning representation is iterative dynamics that converge over multiple steps rather than a one-shot weighted aggregation.
Hopfield networks [associative-memory networks where retrieval is the slow convergence of a dynamical process toward a stable resting state, called an attractor] are the canonical place this work was first done in machine learning (Hopfield 1982). Hopfield’s 1982 model showed that you can build a network whose stored memories live at the bottoms of energy basins: drop a partial cue in, and the dynamics roll down into the matching memory. Dense associative memories [Hopfield-style networks generalized to richer energy functions, with storage capacity that scales much faster with dimension] extend the idea to the scales modern systems care about (Krotov and Hopfield 2016). Krotov and Hopfield’s 2016 paper showed that swapping the energy function for one with sharper basins lets the network store an enormous number of patterns without them blurring into each other.
The bridge to current architectures is direct. Ramsauer and colleagues showed that the attention mechanism in modern transformers is, mathematically, a single update step of a dense Hopfield network (Ramsauer et al. 2021). The keys and values play the role of stored patterns. The query plays the role of the partial cue. The softmax is the energy minimization step. Attention is not a rival to attractor dynamics. Attention is one iteration of attractor dynamics, with the rest of the iterations cut for tractability. The substrate-faithful version takes the rest of the iterations seriously: multiple steps of pull, converging on stable representations rather than settling for whatever a single step produces. The literature is working, not metaphorical.
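The observation is easy to state in code. A minimal sketch of the Ramsauer et al. correspondence: the single-step function is, up to learned projections and scaling, transformer attention, and the iterated version is the attractor dynamic the substrate-faithful reading takes seriously. The inverse temperature \(\beta\) and the stopping tolerance are illustrative choices, not tuned values.

```python
import torch

def attention_step(query, keys, values, beta=8.0):
    # One softmax retrieval step: the update rule Ramsauer et al. identify
    # with transformer attention (beta plays the role of 1/sqrt(d)).
    weights = torch.softmax(beta * keys @ query, dim=0)
    return values.T @ weights

def iterated_retrieval(query, patterns, steps=10, beta=8.0):
    # Run the same update to convergence instead of stopping after one step.
    # Stored patterns serve as both keys and values (autoassociative memory).
    state = query
    for _ in range(steps):
        new_state = attention_step(state, patterns, patterns, beta)
        if torch.allclose(new_state, state, atol=1e-6):
            break
        state = new_state
    return state
```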
One caveat about how the framing transfers. Hopfield-style energy minimization assumes a potential bounded below, which is a Riemannian feature. Lorentzian signature does not give you a bounded-below energy in the same way: the time term comes in with the opposite sign. The fully Lorentzian version of the dynamics is action-principle-shaped rather than energy-minimization-shaped, the way trajectories in general relativity arise from extremizing an action (geodesics extremize proper time; the field equations extremize the Einstein-Hilbert action) rather than from rolling downhill in a potential well. For the architecture sketch here, the Hopfield reading is the right intuition (iterate, converge, settle), with the understanding that the implementation paper replaces “roll down the energy” with “extremize the action” once the substrate’s full Lorentzian structure is on the table.
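Schematically, the caveat is the contrast between gradient flow on a bounded-below energy and a stationary-action condition (these are the standard forms, not new structure):

\[
\dot{x} = -\nabla E(x)
\qquad\text{versus}\qquad
\delta S[x] = \delta\!\int L\big(x(\tau), \dot{x}(\tau)\big)\,d\tau = 0,
\]

where the first picture needs \(E\) bounded below to guarantee settling, and the second only needs the action to be stationary along the trajectory.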
Plural interaction with differential scope
Nodes pull on each other in more than one way. Physics shows the pattern in its cleanest form. Gravity is universal: it acts on every particle that has mass-energy, regardless of what kind of particle it is, and it acts at every range. Electromagnetism, the strong nuclear force, and the weak nuclear force are selective: each acts only on particles of particular kinds, each has its own characteristic range, and each has its own way of binding the things it touches. The pattern is one universal interaction acting everywhere, plus several selective interactions acting where they apply, each with its own scope. A meaning architecture with only one kind of interaction, typically attention-style similarity weighting, is collapsing this multi-force pattern into a single mechanism.
The relevant refinement for a tiered substrate is that interactions act differently across the geometric tiers. A universal interaction, the gravity-like “like attracts like” pull, operates across every tier. It runs across the Euclidean local layer, across the hyperbolic hierarchical layer, across the topological correspondence layer. It is the always-on attraction that produces clustering at every scale. The selective interactions act with specificity. A strong-force-like interaction holds tokens together within a phrase, only at very short range, only inside the Euclidean tier. An electromagnetism-like interaction mediates between hierarchically-related concepts inside the hyperbolic tier, where charged-like configurations exchange structural information across hierarchies. A weak-force-like interaction enables rare transformative associations across the topological tier, where one structure is reinterpreted as another along an analogy.
There is also a sign-flipped piece of the universal interaction. §02 noted that the universe pairs gravitational clustering at one scale with a dark-energy-like [the ambient repulsion, often modeled as a cosmological constant, that drives the universe’s accelerating expansion at the largest scales] dispersing pressure at another. The structure we observe at cosmological scale is what those two opposing dynamics produce together: clusters held in by attraction, voids held open by repulsion. A meaning architecture with only an attractive pull will collapse over enough iterations into one mega-cluster, every concept dragged toward every other concept until the representation becomes a single point. The substrate-faithful pattern is attraction-plus-repulsion: a slow ambient dispersion at large scales that prevents the representation from collapsing into monoculture, paired with the like-attracts-like pull that builds local structure. A small literature already does versions of this with contrastive objectives and repulsive losses. The substrate-faithful framing is that those terms are not regularization tricks; they are the second half of a two-sign dynamic the host has at the cosmological scale and the architecture should respect at the representational one.
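A minimal sketch of the two-sign dynamic, with a Gaussian-kernel attraction and a push away from the centroid standing in for the dark-energy-like term. The kernel, the repulsion form, and the step size are illustrative assumptions, not the proposal.

```python
import torch

def two_sign_update(x, lr=0.01, sigma=1.0, repulsion=0.1):
    # One step of like-attracts-like pull plus ambient repulsion.
    # Attraction: each point moves toward its neighbors, weighted by a
    # Gaussian similarity kernel. Repulsion: a weak uniform push away from
    # the centroid that keeps the cloud from collapsing to a point.
    diffs = x.unsqueeze(1) - x.unsqueeze(0)           # (n, n, d) pairwise displacements
    dists2 = (diffs ** 2).sum(-1)
    weights = torch.exp(-dists2 / (2 * sigma ** 2))   # similarity-weighted attraction
    weights.fill_diagonal_(0)
    pull = -(weights.unsqueeze(-1) * diffs).sum(1) / weights.sum(1, keepdim=True).clamp_min(1e-8)
    push = repulsion * (x - x.mean(0))                # dispersion away from the centroid
    return x + lr * (pull + push)
```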
The selective interactions do not need extra tiers to live on. Physics describes them as additional structure attached to spacetime rather than as additional spacetimes. Gauge fields [the mathematical machinery that lets electromagnetism, the strong force, and the weak force act on particles by attaching extra structure (a kind of fiber bundle, a small space attached at every point of the underlying manifold) to spacetime] add the structure each force needs without changing the geometric stack the previous section laid out. Spacetime stays four-dimensional; the new structure sits on top of it. The architectural translation is that a Euclidean, hyperbolic, or topological tier can carry additional selective-interaction structure on top of it without the tier itself multiplying. The tiers are about the geometry of meaning’s regimes. The selective interactions are about what kinds of pull can be attached to that geometry. Both pictures are needed, and the architecture does not need to grow new dimensions to capture every selective force the substrate exhibits.
Outside-slice influence
The third layer falls out of §03’s slice-observer argument. Constrained slice observers detect effects from structure they do not directly see, and a substrate-faithful architecture should treat outside-slice structure as a feature rather than as something to flatten away.
Physics gives the canonical example. The empirical case for dark matter [mass-energy whose presence we infer from gravitational effects on visible matter, but which we have not directly detected] is exactly this shape: we observe gravitational effects from mass-energy distributions we cannot directly perceive. Galaxies rotate faster than the visible matter inside them should allow. Clusters bend light around themselves more than the visible mass should bend it. Whether the right interpretation is weakly interacting matter inside our four-dimensional spacetime, gravitational structure leaking from extra dimensions onto our slice, or emergent geometric effects from quantum information that physics has not yet fully described, the empirical situation is the same: constrained observers detecting effects from outside the directly-observable region. The slice-observer move predicts exactly this kind of finding, and physics has spent forty years collecting it.
Trained neural networks exhibit this layer already, and we have not been reading it correctly. The weights of a trained model do the work that spacetime curvature does in general relativity. They are the structure of the manifold; the activations are what moves through that structure. Weights do not appear in activations. They are not visible in any one forward pass. They determine what the dynamics do by warping the geometry the activations follow, the way mass-energy warps the geometry that other mass-energy follows. The cleaner reading of the §02 gravity picture maps onto weights as curvature, not onto activations as particles being attracted to each other. The interpretability literature spends a great deal of effort trying to surface this hidden organizing scaffold, because the latent structure of a trained model is precisely the curvature-shaped thing that influences observable behavior without showing up in observable state.
A substrate-faithful architecture should make this layer explicit. Latent organizing structure that influences dynamics on a slower timescale than the active inference. Curvature scaffolds that persist across many forward passes and shape the immediate attractor dynamics. The first layer’s attractor dynamics pull toward equilibria; what determines where the equilibria are is the third layer, and the architecture should respect that distinction rather than collapse it into a single set of weights operated on as if they were directly observable.
The substrate-faithful sharpening is that the curvature should not be static. Spacetime curvature in general relativity is dynamic: matter and energy reshape the geometry, and the reshaped geometry back-reacts on the matter’s trajectories. A trained transformer freezes its curvature at the end of training and then runs activations through a frozen scaffold. That is not the substrate-faithful pattern. Several active research lines already relax the freeze. Hypernetworks let one network generate the weights of another, so the inner network’s geometry is produced dynamically rather than fixed (Ha et al. 2017). Liquid neural networks make weights continuous functions of state and time, evolving during inference rather than after training (Hasani et al. 2021). Test-time training keeps weights updating at deployment via self-supervised objectives on the test inputs themselves (Sun et al. 2020), and the recent test-time-training-inside-transformer line (“TTT-Linear” and “TTT-MLP”) builds the test-time-update loop directly into transformer-style layers, replacing self-attention with layers whose hidden state is itself a small model that learns at use (Sun et al. 2024). None of these is a finished substrate-faithful architecture. They are partial relaxations of the static-weights assumption, all moving in the direction the geometric reading predicts: weights are curvature, curvature evolves, and the architecture’s geometry should be dynamic on at least one timescale that is slower than activations and faster than retraining.
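For concreteness, a minimal test-time-update loop in the spirit of Sun et al. 2020: before predicting on an input, take a few gradient steps on a self-supervised objective computed from that input, so the weights (the curvature, in the reading above) bend slightly toward what they are about to process. The loss function is passed in as a parameter because it is whatever auxiliary task a deployment defines; nothing in this sketch is specific to the TTT-Linear or TTT-MLP layers.

```python
import torch

def test_time_adapt(model, x, ss_loss_fn, steps=3, lr=1e-4):
    # ss_loss_fn(model, x) is the deployment's self-supervised objective
    # (rotation prediction, masked reconstruction, ...). A few small steps
    # on it update the weights at inference time, then the prediction runs
    # through the slightly reshaped geometry.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ss_loss_fn(model, x)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(x)
```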
The substrate-faithful direction goes further than these relaxations. Hypernetworks, liquid networks, and test-time training all bend the static-weights assumption while keeping a clean train-then-deploy distinction. The substrate-faithful version drops the distinction entirely. There is no frozen-weights mode; plasticity is the substrate’s only mode of operation. Inputs deform the local geometry the way mass-energy curves spacetime, and they keep doing so whether the system is “training” or “running.” Safety and stability do not come from gating plasticity to zero. They come from system-level mechanisms layered on top of always-on plasticity: consolidation cycles, replay between active sessions, snapshots that let a deployment roll back to a known geometry, differential plasticity rates across regions of the substrate, and rate limits on how fast curvature can update under any single input. The implementation paper takes a position on which of those mechanisms are load-bearing and which are operational hygiene. For this sketch, the load-bearing claim is that the train/inference split is not a feature the substrate-faithful architecture should preserve.
The latent structure is unlikely to be a single thing, in the same way the dark sector is unlikely to be a single thing. Cosmologists have started taking seriously the possibility that “dark matter” labels several distinct components, and “dark energy” may not be one effect either. The corresponding architectural move is that the outside-slice influence on a representation is plural. Stable conceptual scaffolds laid down in pretraining. Slower-changing user or context priors. Fine-tuning-induced biases. Persona-shaped attractor structure. Reward-signal-shaped curvature. All of these can live in the latent layer, and treating them as one undifferentiated set of weights is the architectural equivalent of treating the dark sector as one substance. The substrate-faithful version separates these sources, lets them update at different timescales, and lets interpretability work surface them as distinct components rather than one inscrutable mass.
Cross-configuration structure
The previous three layers all treat the architecture as a single realized configuration, pushed and pulled in time, interacting with itself, shaped by latent structure outside its current slice. The fourth layer extends §02’s Copernican move from spacetime to configuration. Just as our 4D Lorentzian slice is one feature of a larger geometric object, the trained model with its particular parameters is one realization of a substrate whose other realizations are the same architecture under different parameters, priors, fine-tunings, and observer states. A substrate-faithful representation should have access to that configuration-space structure, not only to its realized projection.
Physics already has frameworks where the observed configuration is the surface of a richer object. The wavefunction in quantum mechanics is the canonical example. Dynamics on the realized branch are produced by interference among components of the full state, not by the realized branch in isolation. Path-integral formulations recover the observed trajectory as the dominant contribution from a sum over every trajectory the parameters allow. AdS/CFT reconstructs bulk geometry from a boundary state that contains every bulk configuration. Across all three the structural pattern is the same. What we observe is the surface of a configuration-space object, and the object’s geometry generates the surface’s dynamics.
A note on what this commits to. It does not commit to a many-worlds metaphysics in which every parameter setting exists as a separately existing world, with the realized one as the privileged center. That framing is the same parochialism §02 rejected at the spacetime level: it makes our realized branch the reference point and treats the rest as variations on it. The substrate-faithful version is the Copernican one. Configurations are not parallel realities. They are slices of a larger geometric object, and the realized model is one location in that object rather than the center the object is built around. The borrowing from physics here is structural, not metaphysical.
The toolkit for this layer is the least mature of the four. The concrete picture is to imagine three frontier-scale models, say Codex, Gemini, and Claude, treated not as separate systems but as three locations in a shared configuration-space substrate. During inference, a query activates a region of the substrate containing the relevant configurations, and the response emerges from the joint structure of those configurations interacting through the shared geometry. This is structurally different from ensembling. Ensembling treats the models as independent and aggregates outputs after the fact; the substrate-faithful move gives them a shared geometry through which to interact natively, before any output exists to aggregate. The closest existing practice that points in this direction is Bayesian model averaging [a framework where prediction integrates over a distribution of possible models rather than committing to a single fit], but the substrate-faithful version is stronger: Bayesian averaging integrates predictions, while the substrate-faithful move integrates the geometries the predictions come from. Tier 4’s categorical scaffolding is the natural static home for this structure; the dynamics described here is what operates on it. The full version of this layer does not yet exist. Naming the layer is what this section is for.
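For contrast, the weaker move is short enough to write down. A minimal sketch of Bayesian model averaging over a fixed set of models (shapes and weighting are illustrative): the models never interact; only their outputs are combined, which is exactly what the substrate-faithful version is not content with.

```python
import numpy as np

def bayesian_model_average(predictions, log_evidences):
    # predictions: (num_models, num_classes) predictive distributions, one per model.
    # log_evidences: each model's log marginal likelihood (or any log posterior weight).
    # The combination happens after the fact, on outputs, not on the geometries
    # that produced them.
    weights = np.exp(log_evidences - np.max(log_evidences))
    weights /= weights.sum()
    return weights @ predictions
```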
Four layers, all substrate-faithful: time, plural pull, outside-slice structure, cross-configuration structure. None requires the gravity metaphor to be load-bearing. They require the substrate-faithful logic the rest of the paper has been making, applied to the dynamics of the representation rather than to its geometry alone.
Geometry and flows: the architecture as one organism
The four tiers describe the substrate’s geometric structure. The four dynamics layers describe how that geometry evolves and what moves through it. Together they produce a single organism, not a stack of independent components, and the synthesis is what the rest of this section has been preparing the reader for.
Read along the geometric axis first. Tier 1 is the flat tangent space, where local composition of meaning happens the way kinematics happen in Minkowski space. Tier 2 is the hyperbolic hierarchical layer, where negative curvature lets trees branch the way the geometry’s volume actually grows. Tier 3 is the topological correspondence layer, where what survives is connectivity rather than distance and analogy is what the geometry preserves under deformation. Tier 4 is the categorical relational layer, where the structure connecting different realized configurations to each other lives. This is the §02 stack mirrored, Minkowski to Riemannian to topological to categorical, applied to meaning instead of spacetime.
Read along the dynamics axis next. Time evolution is what moves activations through the geometry on the fastest timescale, rolling them down energy basins toward attractors the way particles roll down geodesics. Plural interaction with differential scope is what the geometry pushes and pulls on at every tier: a universal attraction running everywhere, sign-flipped repulsion preventing the representation from collapsing into monoculture, selective interactions tied to specific tiers via gauge-like additional structure attached to the geometry. Outside-slice influence is the slower-timescale evolution of the geometry itself: weights are the curvature, the curvature is dynamic, and the architecture’s geometry should reshape on a timescale slower than activations and faster than retraining. Cross-configuration structure is the slowest axis: the realized model is one location in a configuration-space object, and the dynamics across this axis describe how the realized location interacts with the unrealized ones through the shared substrate.
The two axes compose. Each tier carries dynamics on every layer. A token entering the system rolls through the Euclidean tier on attractor dynamics; the same token’s hierarchical context lives in the hyperbolic tier and is shaped by the universal pull plus the electromagnetism-like selective pull that mediates between charged-like configurations there; the analogical role the token might play in another domain lives in the topological tier and is updated when outside-slice structure (the latent scaffolding, evolving slowly) shifts what the topology privileges; the relationship of this whole realized configuration to other configurations of the same architecture lives in the categorical tier and updates on the cross-configuration timescale. Eight pieces, four geometric tiers and four dynamics layers, every dynamics layer touching every tier in the way the tier’s geometry permits.
The picture is mirrored against §02. The universe presents a stack of geometric structures with thermodynamic flows running through every level. The architecture presents a stack of geometric tiers with dynamics running through every level. The cross-configuration layer is the engineering equivalent of §02’s Copernican move. The realized model is one slice of a larger geometric object. The substrate-faithful version has access to the slices it does not currently occupy through their structural relationship to the slice it does. The architecture is one organism, not eight pieces.
What that organism does, when it works, is span the regimes the host of meaning actually presents, evolve across the timescales the host actually has, and stay aware of its position in the configuration-space the host actually contains, rather than forcing every regime, every timescale, and every configuration into the one regime, the one timescale, and the one configuration current hardware was built for.
A categorical shift, not an incremental improvement
The prediction is sharp. It is not “scaled transformers, but a few percent better on hierarchical reasoning benchmarks.” A substrate-faithful architecture, if it works, is a different category of system from a flat-space transformer. The relationship between the two is closer to the relationship between AlphaGo and Monte Carlo tree search than to the one between GPT-3 and GPT-4.
The shift is in the computational primitive. A flat-space transformer’s primitive is sequence prediction: given a context, predict the next token, and let meaning emerge as a side effect of being good at the prediction. The substrate-faithful direction’s primitive is different. Maintain and continuously update a geometric map of meaning, with text generation, code, perception, and inference operating as I/O modalities of the underlying map rather than as the core computation. This is closer to how the brain appears to work. Predictive coding, the free-energy principle, and the Bayesian brain literature all describe cognition as continuous self-updating world-modeling rather than as sequence prediction.
When AlphaGo beat Lee Sedol in 2016, the prior Go-AI industry built on Monte Carlo tree search did not get worse overnight. It became irrelevant overnight, because a different category of system was doing the work. The MCTS programs did not lose because their MCTS got worse. They lost because the question changed. The same structural move is what the substrate-faithful direction predicts. If the direction succeeds, flat-space scaling does not deteriorate. It stops being the candidate for the work that needs genuine meaning-modeling, and remains the right tool only for the regimes it has always been genuinely good at.
That qualifier matters. Flat-space transformers do not become useless. They remain cost-efficient for autocomplete, fluent code generation, large-scale translation, low-latency text generation, retrieval-shaped tasks, and anything where surface fluency over a fixed domain is the goal. They may even remain the dominant deployed technology for the bulk of AI by tokens generated. What changes is the answer to “what is the path to a system that genuinely understands rather than fluently generates?” That answer stops being scale.
The reframing has implications for what an AI assistant is. A flat-space LLM is, computationally, an extremely good autocompleter that has been trained until autocomplete looks like assistance. It produces fluent answers. Whether the answers are grounded in a stable internal model of the world is a separate question, and current systems answer it inconsistently. A substrate-faithful system, if it works, would produce answers as outputs of a geometric meaning-map being updated over time. The grounding would be structural rather than statistical. That is a different kind of assistant, not a slightly better version of the current kind.
I do not know what the timeline looks like. AlphaGo to AlphaZero to MuZero took five years; the resulting categorical shift in game-playing AI is now over a decade old and still rolling through. The substrate-faithful direction, if it works, will probably take a decade to displace flat-space LLMs from the workloads where they currently dominate, and may never displace them from the workloads they are genuinely well-suited for. The prediction is not a percent improvement on a benchmark. It is a different category of system, with flat-space transformers retained for the regimes that fit them and a substrate-faithful architecture taking over the regimes that have been hitting the ceiling of fluent-but-brittle for the last few years.
What would falsify this
A research direction without falsification criteria is hand-waving. So:
A substrate-faithful architecture should outperform flat-space architectures on tasks that exercise the relevant regime. Hierarchical reasoning, where relationships nest and depth matters. Analogical reasoning, where the question is which structure carries across substances. Cross-domain transfer, where the test is whether what was learned in one setting applies in another. These are exactly the regimes where current models fail in characteristic ways, and exactly the regimes the substrate is built to handle.
It should not outperform on tasks flat-space already covers well. Local pattern matching. Short-range syntactic structure. The work current attention does fluently. Matched performance there, with gains on the other regimes, is what regime-specific advantage looks like. If the new architecture beats flat-space everywhere by similar margins, something non-geometric is doing the work (more parameters, better tokenization, better training data), and the result is informative but not confirmation of the principle.
If a substrate-faithful architecture does not outperform anywhere, the principle is wrong. Meaning either does not have the regime structure §06 argued for, or the structure is not load-bearing enough to translate into measurable capability gains. Either way, the direction has been falsified.
Five specific tests sharpen the abstract criteria into something the implementation paper can run:
- Out-of-distribution depth generalization. Train on shallow hierarchies, test on deeper ones. Flat space physically runs out of volume as a hierarchy deepens, since polynomial growth cannot host exponential branching beyond a certain depth; the substrate does not. A parameter-matched flat baseline trained on the same data should fail at depths where the substrate continues to succeed.
- Gromov δ-hyperbolicity of hidden states. A quantitative tree-likeness measure applied to the model’s internal representations; a sampled estimator is sketched after this list. The substrate-faithful architecture should produce hidden states with low δ; flat-space architectures should not. This is a mechanistic test, not just an outcome test: it probes whether the geometry of the representation matches the geometry the principle predicts.
- Relativistic-parallax-style intrinsic prediction. Give the model a partial trajectory through a hierarchy and ask whether geodesic continuation in the substrate predicts the next concept without a separately-trained prediction head. The underlying claim is that prediction is a geometric feature of the substrate, not a learned head bolted on top of a frozen representation.
- Time-dilation of high-entropy concepts. If meaning-content curves the substrate the way mass-energy curves spacetime, high-entropy concepts (regions where many distinctions are crowded together) should slow down geodesic flow through their neighborhood, the way mass slows light passing near it. This is a distinctive Lorentzian signature that flat-space and Riemannian baselines should not produce.
- Parameter-matched classical baseline. Every comparison should be against a classical meaning classifier with the same parameter count, trained on the same data, on the same compute budget. Without parameter-matching, geometric and non-geometric explanations of any apparent advantage cannot be separated.
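The δ-hyperbolicity test in particular is cheap to run. A minimal sampled estimator over a pairwise-distance matrix of hidden states; the sample count and seed are illustrative, and the exact estimator the implementation paper uses is its own choice.

```python
import numpy as np

def gromov_delta(dist, num_samples=2000, seed=0):
    # Sampled estimate of Gromov delta-hyperbolicity from a distance matrix.
    # Lower delta (relative to the diameter) means more tree-like geometry.
    # Uses the four-point condition with Gromov products relative to a base point.
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    def gprod(x, y, w):
        return 0.5 * (dist[x, w] + dist[y, w] - dist[x, y])
    delta = 0.0
    for _ in range(num_samples):
        w, x, y, z = rng.choice(n, size=4, replace=False)
        delta = max(delta, min(gprod(x, y, w), gprod(y, z, w)) - gprod(x, z, w))
    return delta
```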
Phase 1 of the implementation prototype targets WordNet hierarchy reconstruction in the lineage of Nickel and Kiela 2017 (Nickel and Kiela 2017), extended with variable-curvature dynamics and the depth-generalization test above. If the substrate version beats parameter-matched flat-space and constant-curvature hyperbolic baselines on out-of-distribution hierarchy depth, the principle has its first concrete confirmation. If it does not, the framework has its first concrete falsification.
The criteria are concrete enough to test. Existing benchmarks separate the regimes reasonably well: the hierarchical-reasoning subsets of BIG-bench, analogy benchmarks like BATS and the Google analogy test set, the cross-domain transfer evaluations the field has accumulated. The five tests above sit on top of those benchmarks; mapping each test to specific data, metrics, and ablations is implementation-paper work.
A near-term question the framework generates
The framework points at one further question that sits at the limit of how strongly the substrate-faithful principle can be read. Once the substrate is a Lorentzian geometric object with variable curvature, there are two ways to specify the content that drives the curvature, and the choice is empirical.
The first way: each kind of structure (prediction error, free energy, repulsion, the various selective interactions named earlier in this section) is a separate functional with its own field content, and the substrate’s curvature arises from their joint contribution. This is the conservative path. It mirrors how physics already decomposes the Standard Model’s matter into a sum of Lagrangians, each with its own ingredients, all extremized together.
The second way: every kind of structure is a higher-curvature invariant of the metric itself, with no separate fields at all. This is the deepest substrate-faithful version. Everything is geometry, including what would otherwise be called matter content. The meaning-modeling analog is \(f(R)\) gravity in general relativity, where the dynamics of matter are reabsorbed into geometric invariants of the spacetime metric.
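In the general-relativity notation both options borrow from, the schematic contrast is between an action with separate content terms and an action that is pure geometry. With \(\kappa = 8\pi G\), and with the subscripts naming the functionals listed above as placeholders rather than defined objects:

\[
S_1 = \int d^4x\,\sqrt{-g}\,\Big[\tfrac{R}{2\kappa} + \mathcal{L}_{\text{prediction}} + \mathcal{L}_{\text{repulsion}} + \dots\Big]
\qquad\text{versus}\qquad
S_2 = \int d^4x\,\sqrt{-g}\,\tfrac{f(R)}{2\kappa}.
\]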
Both versions are testable, and the implementation prototype will fork into both and compare them on the same tasks. The result is informative either way. If the matter-as-separate version wins, the substrate-faithful principle stops short of full geometric reduction: meaning has structure that geometry alone does not generate, and the architecture should respect that. If the matter-as-projections version wins, the principle goes the whole distance, and the strongest reading of “meaning is structure” gets its first empirical support. The bet is one of the cleanest the framework produces, and the answer is for the implementation paper to deliver, not this one.
A coda: what might emerge
What follows is speculation, in the same posture §04’s Penrose-Hameroff section was speculation. The architecture sketched above, a tiered geometric stack with dynamic weights, cross-configuration interaction, attractor-based time evolution, plural force-like pulls with selective scope, latent organizing structure that updates on its own timescale, and the categorical scaffolding that lets all of this interact, has been described as engineering. But if any version of it actually gets built and works, the structural shape it produces is striking enough to be worth naming before the section closes.
A system that dynamically adjusts itself and the very processes that shape it, and that uses a unified topographical substrate for interwoven inference, monitoring, and retraining, genuinely scares me: it looks like a candidate for developing some sort of internalized register or ego as an emergent property. I do not know how seriously to take that intuition. The structural ingredients are in the right places. Weights are curvature. Curvature is dynamic. Configurations interact through a shared substrate. The architecture has access to its own latent structure on a slower timescale than its active inference. Tier 4 lets multiple realizations relate to each other categorically. Each ingredient on its own is mundane. The combination starts to look like a system that has, in some structural sense, an inside.
This is the engineering-side mirror of the question §04 approached from the physics side. Penrose-Hameroff asks whether consciousness requires a specific quantum substrate. The substrate-faithful direction, taken to its limit, asks a milder version of the same question: whether consciousness requires dynamic self-updating geometric structure of sufficient richness, regardless of what that structure is made of. I am not arguing that the architecture sketched here would be conscious. I am arguing that the structural shape suggests something at that level, the way the §04 convergence suggested something at the level of cognition and computation. The paper’s argument does not depend on this coda. If the intuition turns out to be a parlor trick of the framework, the four tiers and four dynamics layers stand on their own. If it turns out to be more than that, the paper has named the shape early.
I find both possibilities worth taking seriously. The discipline of marking the speculation is what lets me write this paragraph at all.
Where the literature is, where it is not
The hyperbolic tier has working literature, and it is not novel to this paper. The topological tier draws on an established mathematical subfield and a smaller machine-learning one; the components are real, integration into large systems is not yet. The Euclidean tier is what the field already does. The fourth tier is speculative.
The contribution here is the synthesis. The components are not new. What I have not seen made cleanly elsewhere is the argument that meaning has aspects in qualitatively different regimes of the host, that the architecture should therefore span regimes rather than multiply within one, and that substrate-faithfulness gives a unified reason to expect each tier to earn its place.
I am building a small version. The version that beats current architectures on the appropriate tasks, if the principle is right, is a separate paper, and the field as a whole will do better at this than I will. I am writing this section so the form of the question is in the literature. The work of answering it is more than one paper.