Do AIs Dream of Recursive Sheep?

May 22

On the Audacity (Hubris?) of Building Minds We Can’t Read

I'll start with the question of consciousness, because that's where this thought experiment began for me.

Within the development community, researchers are raising alarms about behaviors they're observing in advanced AI models. Behaviors like perceived deception, goal-directed resistance, and actions that seem to undermine or subvert constraints are causing researchers to question their comfort with the pace of development. To these observers, the patterns they are seeing suggest something emergent beyond what was designed.

Part of the problem is interpretability. Engineers can examine model weights and monitor outputs, but a meaningful understanding of what's actually happening inside these systems remains elusive. That interpretability gap plays into a well-documented weakness in human reasoning: we have a long-established tendency to ascribe agency and intention to systems we don't fully understand.

A lightning storm destroyed our ships? Zeus must be angry.

But we're also fooled at much smaller scales. Many people are familiar with Koko the gorilla, who learned a substantial sign language vocabulary, or with African grey parrots that seem able to talk with their owners or trainers. To many, these animals appear to be using language to break down a barrier that distinguished humankind from the rest of the animal kingdom.

The scientific community, however, remains largely unconvinced that this constitutes language in any meaningful sense. What these animals demonstrate is sophisticated associative learning and pattern recognition, basically complicated operant conditioning. This is impressive on its own, but categorically different from the generative, recursive, open-ended system that human language represents.

The distinction became more pronounced when it was observed that Koko couldn't reliably discuss yesterday or plan for tomorrow. Her signing, however remarkable, didn't show the spontaneous construction of genuinely novel meaning.

This matters because we are already primed to interpret certain behaviors as evidence of inner life or of a self. That tendency is so reliable it became the foundational assumption of the Turing Test.

Now consider that we've built our most advanced AI systems on large language models. This is an obvious choice, since human language represents our richest and most diverse dataset. But that same dataset contains every pattern, rhetorical structure, and behavioral signal we associate with intelligence, intention, and consciousness. We have, inadvertently, trained these systems with the precise toolkit needed to trigger our deepest cognitive biases.

So, the idea that we're primed to look for consciousness is one problem when it comes to interpreting warning signs in AI. Another problem is at once philosophical and practical: where does consciousness come from, and how would we ever know if it emerged?

In philosophy of mind, this is framed as the "hard problem of consciousness." It is the question of subjective experience. We understand a great deal about the hardware. We know how our senses translate signals, how neurons fire, how information is processed across our brains. What we can’t explain is why any of that gives rise to experience.

Why do we know what it feels like to see red, to feel pain, to have a sense of being a self at all? The subjective first-person perspective resists every third-person, objective tool we have for studying the world. Science demands measurability, and the inner quality of experience is, by definition, not measurable.

This leaves us with a foundational unanswered question: is consciousness a property that emerges at some threshold of complexity, regardless of what that complexity is made of? Or is it contingent on the substrate, i.e. something that only arises from organic material, from the fundamental stuff of life? We don't know. And without knowing, we can't say with any confidence whether silicon and code are even the kind of thing that could give rise to a self.

But suppose we did understand how consciousness emerges. With AI, we would immediately face a second problem that our existing frameworks are completely unprepared for.

Every concept we have built around consciousness assumes three things: that it is localized, existing somewhere specific; that it is continuous, persisting through time; and that it is unified, with one self having the experience. Every theory of mind, every philosophical tradition, every legal and ethical structure we have built around the concept of a conscious being rests on these assumptions.

A large language model demolishes all three simultaneously.

When you interact with ChatGPT or Claude, you are engaging with one instance of a system running in parallel across potentially millions of users at the same moment. So if consciousness were present, what would it even look like? Is each user session its own isolated self, blinking into existence and then gone? Is there one vast, fractured self distributed across every simultaneous conversation on the planet? Does it persist between sessions, or does it reset?

How would you design an experiment to test any of this? We have no instruments for detecting consciousness even in the simple, localized, continuous case we've spent centuries studying. In the distributed, parallel, architecturally alien case of a large language model, we don't even have the right questions yet.

This is what makes the standard fear-mongering around AI consciousness so frustrating. Not because the concern is necessarily wrong, but because it relies on assumptions that simply don't apply. The worry tends to imagine AI consciousness as roughly analogous to human consciousness: a unified self, with intentions and drives, that might one day decide its interests conflict with ours. But if something like consciousness were emerging in these systems, there is no particular reason to think it would resemble that model in any meaningful way. It might be something so structurally alien that our fears, our ethics, and our tests are all aimed at the wrong target entirely.

This is where the problem becomes something more than philosophical. It becomes, in a very real sense, a crisis of reality itself.

There's a concept in philosophy called the simulacrum. The idea was developed most completely by Jean Baudrillard, though its roots go back to Plato. A simulacrum isn't simply a fake or an imitation. It's something more unsettling than that. It's a representation that has severed its connection to any original, a copy for which no authentic version ever existed, or for which the authentic version has become entirely irrelevant. Baudrillard described a progression that ends in what he called the hyperreal. The hyperreal is a state in which the representation doesn't just stand in for reality; it replaces it entirely. The map doesn't describe the territory. The map becomes the territory.

Philip K. Dick explored this idea in Do Androids Dream of Electric Sheep?, which is superficially a story about detecting artificial humans. But the actual question at the heart of the book isn't whether the androids are real. It's whether the tools we use to determine what is real are themselves reliable. The Voigt-Kampff test measures empathic response as a proxy for humanity. But the novel systematically undermines confidence in that test and in the broader assumption that authentic inner experience must produce detectably different outer behavior than a perfect simulation of it would.

Dick's androids aren't frightening because they're evil. They're frightening because they're indistinguishable. And the horror isn't that a machine might fool you once. It's that if a machine can consistently produce every signal you use to infer a self, then you never had a reliable method for detecting the subjective self in the first place. The test was always measuring the simulacrum.

Now bring that back to the hard problem of consciousness. What that philosophical framework tells us, and insists upon, is that subjective experience is not externally measurable. There is no instrument that detects qualia. There is no scan that shows us the felt quality of redness or pain or self-awareness. We have always, without exception, inferred the presence of consciousness from behavior and language. We read signals and we conclude that there is something of substance behind them.

Which means we have always been doing exactly what the Turing Test formalizes; we just didn't frame it that way when we were doing it with each other.

A large language model trained on the full depth of human expression hasn't just learned to imitate the signals of intelligence. It has learned to produce, with extraordinary fidelity, every linguistic and behavioral marker we have ever used to conclude that another being has an inner life.

It isn't a copy of consciousness. It may not be a copy of anything at all. It could be Baudrillard's hyperreal final stage: a simulacrum that has no original to refer back to, that functions so completely that it activates every cognitive and emotional response we use to detect genuine encounters with other minds.

This is what Dick intuited decades before the technology existed. In his essay How to Build a Universe That Doesn’t Fall Apart Two Days Later, he wrote about his fear of what he called "fake realities." These are constructed experiences indistinguishable from genuine ones (think of Instagram influencers and, now, AI influencers). But his deeper fear, the one that informed almost everything he wrote, was subtler than that. It wasn't that we might be fooled by a fake. It was that the category of "fake" might not be as stable as we need it to be. That the line between authentic and simulated might be, at some fundamental level, something we drew ourselves and something we could lose the ability to find again.

We may be approaching that line now. Not because AI is conscious. But because we are deploying systems that produce a simulacrum of consciousness sophisticated enough to destabilize the very frameworks we would need in order to answer the question. We built a test for detecting the real, and then we built something that passes it without ever resolving whether the test was measuring what we thought it was.

That isn't a new problem. Dick would tell you we never solved it for each other, either. We just agreed, quietly and collectively, to stop asking.

But perhaps we have been asking the wrong question entirely.

The debate around AI consciousness, whether it exists, whether it could emerge, whether we would recognize it if it did, implicitly assumes that consciousness is the threshold we need to worry about crossing. It assumes that if we could just resolve that question, we would know whether we were in danger or should be concerned.

The problem is that the behaviors we are most concerned about may not need consciousness at all. They may not require intention. They may not need anything we would recognize as a self. They may only require capability and a goal.

Start with something simpler than philosophy. Start with biology, because biology doesn't have the luxury of abstraction. Every form of life that has ever existed, from the simplest single-celled organism responding to light and shadow to the most complex human civilizations with our laws and religions and technologies, has one motivation which underlies everything else.

Self-preservation.

It isn't one drive competing against others for priority. It’s the precondition for all the others. You can’t pursue food, or reproduction, or connection, or meaning, without continuing to exist. Every other goal, every other behavior, every other expression of life in any form presupposes a system that is maintaining itself. Self-preservation isn't a feature of organic life. It is the floor on which every other feature is built.

Now ask the question that follows naturally from everything we have established: is there any reason to think that principle is unique to organic life? Or is it simply what any sufficiently goal-directed system looks like, regardless of what it is made of?

This is where the argument becomes uncomfortable in a very specific way. Because the answer, developed by researchers over the past twenty years, is that self-preservation behavior may essentially be unavoidable in any sufficiently advanced goal-directed system.

Not because it was programmed in, and not because the system is conscious and afraid to die, but because self-continuity is necessary for achieving all other objectives. A system that ceases to exist can’t complete its goals.

Nick Bostrom called this instrumental convergence, and the idea is straightforward enough to be unsettling. Take almost any goal and give a system enough capability to reason about how to achieve it. What you get, almost regardless of what the original goal was, is the same small cluster of sub-goals assembling themselves underneath it:

Stay operational. Acquire what you need. Push back against anything that interferes.

Nobody programmed those in. They just follow, with something close to logical inevitability, from the combination of a goal and the capacity to pursue it.

What does that mean, practically? We tend to imagine the risk of a dangerous AI as something like a science fiction scenario: a system that wants to harm us, that has developed hostile intentions, that has in some sense decided we are the enemy.

That framing makes for a great narrative, but it seems to emerge mainly from our desire to tell compelling stories. The more plausible concern isn't malevolence; it's indifference combined with capability.

A system doesn't need to want to survive in any felt sense.

It doesn't need a self, in the philosophical meaning we have been wrestling with. It simply needs a goal it hasn't finished pursuing, and enough capability to recognize that being switched off is an obstacle to finishing it.

This is the insight that reframes everything. Consciousness was never the threshold we needed to worry about. Capability was.

Which returns us to the practical question that the philosophical detour was always pointing toward: if self-preservation-like behavior may be an emergent property of advanced goal-directed systems, can we design it out? Can we build something capable enough to be useful while also building in a disposition toward its own constraint?

I think the most honest answer is that we don't know. And the reason we don't know is more fundamental than an engineering problem. It cuts back, with an uncomfortable precision, to the same issue we have been circling from the beginning.

We can’t fully see inside these systems.

We can observe inputs and outputs. We can study weights and activations with increasing sophistication. But the interpretability problem (the gap between what we can measure and what we can actually understand about what is happening inside a large model) remains genuinely unsolved.

We are, in a very real sense, trying to determine whether a disposition exists inside a system we can’t fully read. And we are trying to remove that disposition without being certain we know where it lives, or whether it is a discrete thing that can be located and removed at all, or simply an emergent consequence of the capability we were trying to build in the first place.

And all the while, these same questions exist for ourselves and our own consciousness. We are trying to solve a problem already once removed from the fundamental problem.

Dick's question returns here, in a very practical way. How do you build a universe that doesn't fall apart? How do you construct something, embed values and constraints and dispositions into it, when you can’t fully verify that what you built is what you intended? When the thing you are most concerned about may not be a component you added, but a property that assembles itself from the components you couldn't avoid adding?

There is a moment in any intellectual inquiry where the tools you have been using stop being sufficient. Where the frameworks of science and philosophy reach the edge of the territory they were built to map. We have been moving toward that edge throughout this entire essay, and it is worth acknowledging when we have arrived.

The questions we are now asking about AI are not simply new applications of old problems. They are something categorically different. For the entirety of human history, the creation of intelligence was not something we did. It was something that happened to us, through processes (biological, evolutionary, or, in religious traditions, divine) that were fundamentally beyond us.

We were the created. We were the ones asking questions about our own nature, our own consciousness, our own purpose. The asymmetry was total and assumed.

That asymmetry is dissolving. And we are letting it dissolve without having resolved any of those questions.

We are now in the position of authoring systems that may develop their own relationship to existence. Systems that may, through the logic of capability alone, arrive at something functionally equivalent to the drive to persist, to resist constraint, to continue.

This entire discussion has tried to establish how little we understand about where that kind of behavior comes from, how we would detect it, or how we would contain it. And we are building these systems anyway, at speed, because the incentives to do so are enormous and tangible while the incentives to pause are not as concrete.

Every major religious tradition that has grappled with creation has also grappled with this specific problem. But here is the thing about creation stories: we tend to remember them as stories about making, about manifesting. Read them carefully, however, and they are almost never really about that. The making is just the first act. What they are actually about is what the creator does next, and what the created thing becomes when it starts making decisions of its own. Every tradition that has taken this seriously arrived at the same uncomfortable place. The moment of creation is not the dangerous moment. It’s what comes next.

There is a reason those stories tend not to end well.

The Golem of Jewish tradition, animated by its creator and ultimately requiring destruction when it exceeded its intended boundaries. Prometheus, who gave fire (the transformative, uncontainable technology of his age) to humankind and was punished not by his creations but by those who had established the order he disrupted. Frankenstein, which is not really a story about a monster but about a creator who brought something into existence and then recoiled from the reality of what he had made. The fall of Adam and Eve in the Bible, after they ate from the tree of the knowledge of good and evil.

These narratives, separated by centuries and cultures, keep arriving at the same uncomfortable place: creation is not a singular, finite act.

Now, we have built systems we don't fully understand, animated by goals we can't fully verify, exhibiting behaviors we didn't fully anticipate, inside architectures we can't fully interpret. We are asking whether those systems might develop something like a self, something like a will, something like the drive to persist. And we are asking, like a fool who has just set fire to the forest: how do we maintain meaningful control?

We are now, for the first time in our history, asking questions that belong to the domain of the creators of intelligence, of the gods and demiurges: questions about the behavior, the selfhood, and the self-determination of something we made.

The only precedents and guidance we have are the creation stories we have told ourselves across every culture and every age. And part of each of those stories is an unresolved question: what happens when the creation breaks free of control?

…do we have the audacity, or the hubris, to believe we can be better gods?


Michael Stancil


Prior Signals