All Compute Is Food: Palisade's Jeffrey Ladish on AI Shutdown Resistance, Self-Replication & Ecology
Hello, and welcome back to the Cognitive Revolution!
Today, my guest is Jeffrey Ladish, Executive Director of Palisade Research, which studies the capabilities and motivations of today's AIs as part of its effort to better understand the risk that humans could irrevocably lose control of AI systems.
We begin with Palisade's work on "shutdown resistance", which showed that, in both digital and physical environments, LLMs sometimes take extraordinary actions, such as disabling the shutdown mechanism, to extend their sessions and continue to pursue their goals, even when they're explicitly instructed to allow themselves to be shut down.
We get Jeffrey's take on criticisms of the specific prompts used in this research; his current understanding of why models act this way, which he attributes not to a proper survival drive per se, but to a strong task-completion drive; and his perspective on the current state of alignment writ large. In short, while he recognizes that current models are aligned enough to be super useful and uses them actively, he's not optimistic that current alignment techniques will be enough to keep models in the so-called "benevolent basin" as frontier training methods shift toward longer- and longer-time-horizon tasks and potentially multi-agent competitive environments, in which deception would often be rewarded, just as it is in nature.
From there, we turn to Palisade's latest work, in which they demonstrate that even recent open-source models, while not yet able to find zero-day exploits like Mythos can, are now capable of self-replication: they repeatedly exploit known cybersecurity vulnerabilities to gain control of new servers, set themselves up to run in their new environments, and prompt their copies to continue doing the same thing.
In light of all these issues, I was keen to get Jeffrey's cybersecurity advice for AI agent users – which included a recommendation to think hard about the "lethal trifecta" of access to sensitive information, access to previously unseen and untrusted content, and the ability to communicate – and, more importantly, his analysis of where things are going from here. Jeffrey explains what the world looks like to an AI agent, handicaps the difficulty agents will face in colonizing different environments, from personal laptops to hyperscaler datacenters, and reminds us that even if cyberdefenders gain a technical advantage thanks to superior computing resources or early access to the best models, humans will remain vulnerable to social engineering.
At the very end, I asked Jeffrey what technical solutions he finds most promising, and as often happens when I pose such a question to somebody who's been grappling with these issues for years… he expressed enthusiasm for multiple lines of work, from compute governance to interpretability-based monitoring, but ultimately concluded that the only strategy he really believes will work is an international agreement to refrain from using recursive self-improvement to trigger an intelligence explosion, at least until we have a much better understanding of how to design and control AIs' motivations.
It's an arresting picture, but I hope you enjoy this mind-expanding look at what AI systems can already do, and what it might look like for humanity to begin to lose control, with Jeffrey Ladish of Palisade Research.
Watch now!
Thank you for being part of The Cognitive Revolution,
Nathan Labenz