Separation Before Depth
An exploration of why continuous learning requires architectural separation of new and old knowledge, rather than just temporal depth.
The dominant paradigm in artificial intelligence encodes everything a system learns into the same structure it uses to compute. This conflation, of data and mechanism, of memory and inference, creates problems that scale cannot solve.
Most modern AI systems learn by adjusting weights: numerical parameters distributed across a network that together determine how inputs are transformed into outputs. When a system encounters new data, gradients flow backward through the network and nudge those weights incrementally toward configurations that produce better predictions. Repeat this process across enough examples and the system acquires, distributed across billions of parameters, something that functions like knowledge.
This approach has produced remarkable results. It would be difficult to argue otherwise. But the architectural choice to encode experience directly into weights carries consequences that become increasingly significant when systems are expected to adapt continuously, operate in changing environments, and remain coupled to the consequences of their actions.
Three of those consequences are worth examining carefully.
The first problem is structural and goes deeper than catastrophic forgetting, though it explains why catastrophic forgetting is so difficult to solve.
When a neural network learns from new data, gradients adjust the weights that encode everything the network has previously learned. There is no separation between the storage of past experience and the mechanism that processes new experience. Both live in the same parameter space. A weight that encodes a useful regularity from the training distribution is the same weight that gets nudged when the network encounters something new.
This conflation creates interference that is not incidental but inevitable. Updating weights to accommodate new experience necessarily risks distorting the representations built from old experience. Techniques like Elastic Weight Consolidation attempt to mitigate this by penalising changes to weights deemed important for previous tasks. But this is a constraint applied on top of a fundamentally entangled architecture, not a resolution of the underlying problem.
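The shape of that constraint can be sketched in a few lines. Elastic Weight Consolidation adds a quadratic penalty to the new-task loss, anchoring each weight in proportion to an importance estimate (typically derived from the Fisher information) for previous tasks. The names below are illustrative, not reference code:

```python
import numpy as np

def ewc_loss(task_loss, params, old_params, fisher, lam=1.0):
    """New-task loss plus a quadratic penalty that anchors weights
    deemed important (high Fisher value) for previously learned tasks."""
    penalty = 0.5 * lam * np.sum(fisher * (params - old_params) ** 2)
    return task_loss + penalty

theta_old = np.array([0.5, -1.2, 0.3])
fisher = np.array([10.0, 0.1, 5.0])   # illustrative importance estimates

# The penalty vanishes when the weights stay at their old values...
print(ewc_loss(0.0, theta_old, theta_old, fisher))        # 0.0
# ...and grows fastest along directions the old tasks relied on.
print(ewc_loss(0.0, theta_old + 0.1, theta_old, fisher))
```

Note what the penalty does not do: it does not separate old knowledge from the update mechanism. It only makes some directions in the shared parameter space more expensive to move through.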
The deeper issue is that the differentiation mechanism, the gradient, operates on the same space as the data it is trying to learn from. In one of our other essays, we explored what it means for a system to be coupled to its environment through consequence. The weight-based architecture makes that coupling structurally difficult because every act of learning is also an act of potential interference with what was previously learned.
A cleaner architectural separation would store experience in a form that is distinct from the mechanism used to evaluate and retrieve it. Retrieval would operate on stored patterns without modifying them. New experience would be added to storage rather than blended into an undifferentiated parameter space. The evaluation mechanism would remain stable while the stored data evolves.
This is not a novel idea at the conceptual level. Complementary learning systems theory, developed in neuroscience, proposes that biological brains maintain a similar separation: a fast, episodic store for recent experience and a slower, distributed system for consolidated knowledge. The two interact, but they are not the same structure. The separation is what allows new experience to be integrated without immediately overwriting what came before.
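The separation described above can be sketched directly: a store that only appends and retrieves, and a retrieval operation that reads from it without writing to it. The class name, the key-value framing, and the nearest-neighbour similarity measure are illustrative assumptions, not a proposal for a specific system:

```python
import numpy as np

class EpisodicStore:
    """Experience lives here as discrete entries. Retrieval never
    modifies them; new experience is appended, not blended in."""

    def __init__(self):
        self.keys, self.values = [], []

    def add(self, key, value):
        # Integration is an append: existing entries are untouched.
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(value)

    def retrieve(self, query, k=1):
        # Nearest-neighbour lookup: reads the store, changes nothing.
        q = np.asarray(query, dtype=float)
        dists = [np.linalg.norm(q - key) for key in self.keys]
        order = np.argsort(dists)[:k]
        return [self.values[i] for i in order]

store = EpisodicStore()
store.add([1.0, 0.0], "pattern A")
store.add([0.0, 1.0], "pattern B")
store.add([0.9, 0.1], "pattern A, variant")

# Adding the third memory did not disturb the first two:
print(store.retrieve([1.0, 0.05], k=1))
```

The point of the sketch is structural: `add` and `retrieve` touch the data, but the evaluation logic itself never changes when new experience arrives.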
The weight-based paradigm collapses this separation by design. Everything goes into the same place, updated by the same mechanism, at the same time.
The second problem is temporal. Adjusting weights through gradient descent is slow. Not slow in the colloquial sense, but slow relative to the timescales on which environments change and on which prediction errors need to be acted upon.
Training a large model requires many passes over large datasets, careful scheduling of learning rates, and significant computational resources. This is manageable when learning is a phase that precedes deployment. It becomes a constraint when learning needs to happen continuously, at runtime, in response to signals arriving faster than a training cycle can accommodate.
In a weight-based system, the latency between encountering new evidence and updating internal structure is measured in training steps, batches, and epochs. In a system genuinely coupled to consequence, that latency needs to be measured in interactions. The environment does not pause while the network converges.
This temporal mismatch is not simply an engineering problem to be solved with faster hardware. It reflects a more fundamental issue: gradient-based weight updates are a mechanism designed for offline optimisation, applied to a problem that is inherently online. Making them faster does not change what they are.
A runtime adaptation mechanism needs to operate on a different timescale. It needs to update internal state within the window of a single interaction, not across thousands of them. This points toward architectures where adaptation is not a matter of adjusting distributed numerical parameters but of updating discrete, retrievable structures that can be modified quickly and selectively.
Such architectures exist in early form. They tend to maintain explicit stores of past experience that can be searched, extended, and revised without requiring recomputation across the entire parameter space. They are faster to update precisely because they do not entangle storage with computation. And they are more legible, because the contents of the store can be inspected directly rather than inferred from the behaviour of a high-dimensional weight matrix.
The third problem goes deepest of all, because it concerns not just how experience is stored or how quickly the system updates, but what signal drives learning in the first place.
Reinforcement learning, the framework most explicitly designed to couple AI systems to consequence, requires a reward signal: a scalar value defined in advance by the designer that tells the system whether its actions were good or bad. This seems like a natural design choice. It mirrors, at least superficially, how animals learn from pleasure and pain.
The problem is that it gets the causal structure backwards.
In biological systems, as argued in detail in Peter Sterling and Simon Laughlin's Principles of Neural Design, reward does not define what is worth learning. It reinforces what already worked. The more primitive and more general signal driving adaptation is prediction fidelity: whether the system's model of the world accurately anticipated what happened next. A biological organism that cannot predict how its environment continues cannot act effectively, regardless of what higher-level goals it might have. Prediction accuracy is the substrate on which everything else is built.
This matters architecturally because it means predefined reward signals are not just practically difficult to specify. They are the wrong abstraction. They sit on top of a more fundamental mechanism and attempt to replace it with something a designer constructs in advance. In environments where what counts as a good outcome is multidimensional, delayed, context-dependent, or simply not known until after the fact, this substitution fails. The system is optimising toward a proxy that may not survive contact with reality.
A system grounded in prediction accuracy as its primary signal is more general. It does not require anyone to know in advance what good looks like. It requires only that the system maintain a model of how the world continues, act on that model, and update it based on whether the prediction was borne out. The reward, in this framing, is not something the designer specifies. It is something the environment provides automatically, in the form of confirmed or disconfirmed expectations.
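That loop can be sketched without any reward function at all: the only learning signal is the gap between prediction and observation. The lookup model, the surprise threshold, and all names below are illustrative assumptions chosen for brevity:

```python
class PredictiveAgent:
    """Learns from prediction error alone; no designer-specified reward.
    The 'model' is a naive per-context store of the last surprising outcome."""

    def __init__(self, surprise_threshold=0.5):
        self.model = {}                     # context -> expected outcome
        self.surprise_threshold = surprise_threshold

    def predict(self, context):
        return self.model.get(context, 0.0)

    def observe(self, context, outcome):
        # The environment supplies the signal: how wrong was the prediction?
        error = abs(outcome - self.predict(context))
        if error > self.surprise_threshold:
            # Only disconfirmed expectations trigger an update, and the
            # update lands in the store, not in the prediction mechanism.
            self.model[context] = outcome
        return error

agent = PredictiveAgent()
print(agent.observe("rain", 1.0))   # surprising: large error, stored
print(agent.observe("rain", 1.0))   # now expected: error 0.0, no update
```

Nothing in this loop required specifying in advance what a good outcome is; the confirmed or disconfirmed expectation is the whole signal.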
This reframes what consequence-coupled adaptation actually needs. It does not need a reward function. It needs a prediction mechanism that is sensitive to its own accuracy and updates accordingly. And crucially, that update needs to happen in a storage structure that is separate from the evaluation mechanism, and fast enough to close the prediction loop within the timescale of the interaction. The three problems described in this essay are therefore not independent. They point in the same direction.
The three problems are distinct in their mechanisms but share a common origin. They all follow from the decision to encode experience into weights.
Encoding experience into weights requires a predefined signal to drive the encoding, because gradients need a loss function to flow through. It entangles data with the mechanism that processes data, because both live in the same parameter space. And it imposes a temporal cost on every act of learning, because adjusting distributed parameters is inherently slower than updating discrete structures.
None of these problems is insurmountable within the weight-based paradigm, and significant research effort is directed at each of them. But they are not independent bugs to be fixed one at a time. They are consequences of the same architectural choice, and addressing them at the margins does not change what the underlying architecture is.
The question worth asking is not how to make weight-based learning more adaptive. It is whether adaptation, at runtime, in response to consequence, on the timescales that matter, requires a different kind of storage altogether.
The achievements of weight-based learning are real and should not be minimised. Large models trained on large datasets have demonstrated capabilities that were not anticipated even a decade ago. The question raised here is not whether this approach works. It is whether it scales to the problem of continuous adaptation under consequence.
If intelligence, at its most basic level, is a mechanism for allocating resources under uncertainty by maintaining accurate predictions about how the world continues, then the adaptation mechanism must be fast enough to close the prediction loop, general enough to operate without a predefined reward signal, and clean enough to integrate new experience without systematically degrading what was previously learned.
Weight-based learning, as currently practised, satisfies none of these requirements fully. That is not an argument for abandoning it. It is an argument for being precise about what it is for, and for taking seriously the possibility that runtime adaptation requires something it was not designed to provide.
Why is prediction accuracy a better grounding signal than reward?
Because it is more primitive and more general. A system cannot pursue any goal effectively without first being able to predict how its environment continues. Prediction fidelity is the substrate on which goal-directed behaviour is built. A predefined reward signal attempts to specify what is worth learning before the system has encountered the environment, which is only feasible when the designer already knows what good outcomes look like. In novel or changing environments, that knowledge is precisely what is missing.
Isn't catastrophic forgetting already being addressed by existing research?
Yes, and that research is valuable. But most approaches apply constraints on top of a fundamentally entangled architecture rather than resolving the underlying separation problem. Penalising changes to important weights, replaying old data, or isolating parameters per task are all responses to the interference that entanglement creates. A cleaner solution would separate the storage of experience from the mechanism that processes it, so that new learning does not require protecting old learning from itself.
Couldn't you just use a very large model with enough capacity to avoid interference?
Capacity helps, but it does not eliminate the problem. A larger parameter space reduces the probability of direct interference on any given update, but it does not change the architecture. The differentiation mechanism still operates on the same space as the stored representations. And it does not address the temporal problem: a larger model takes longer to update, not less.
What would a faster adaptation mechanism look like?
Rather than adjusting distributed numerical parameters, it would update discrete, retrievable structures that can be modified within a single interaction. Retrieval would be fast because it operates on explicit stored patterns rather than requiring inference across a high-dimensional weight space. New experience would be added to storage rather than blended into existing parameters. The evaluation mechanism would remain stable while the stored data evolves. The result would be a system that can close the prediction loop on the timescale of the environment rather than the timescale of training.
Does this mean large language models are useless for adaptive systems?
No. Large pretrained models provide rich representational foundations that are genuinely valuable. The argument here is about what happens after pretraining, during deployment, when the environment continues to change. A hybrid approach, combining the representational breadth of large pretrained models with a faster, structurally separate adaptation mechanism operating at runtime, may be more capable than either alone. That is a direction worth exploring seriously.