Oct 13, 2025 05:38 PM
https://www.nytimes.com/2025/10/10/opini...=url-share
EXCERPTS: . . . For a decade, the debate over A.I. risk has been mired in theoreticals. Pessimistic literature like Eliezer Yudkowsky and Nate Soares’s best-selling book, “If Anyone Builds It, Everyone Dies,” relies on philosophy and sensationalist fables to make its points. But we don’t need fables; today there is a vanguard of professionals who research what A.I. is actually capable of. Three years after ChatGPT was released, these evaluators have produced a large body of evidence. Unfortunately, this evidence is as scary as anything in the doomerist imagination.
The dangers begin with the prompt. Because A.I.s have been trained on vast repositories of human cultural and scientific data, they can, in theory, respond to almost any prompt — but public-facing A.I.s like ChatGPT have filters in place to prevent pursuing certain types of malicious requests. Ask an A.I. for an image of a corgi running through a field, and you will get it. Ask an A.I. for an image of a terrorist blowing up a school bus, and the filter will typically intervene.
These filters are usually developed via a method called “reinforcement learning with human feedback.” They are designed in conjunction with human censors and act almost like a conscience for the language model. Dr. Bengio thinks this approach is flawed. “If you have a battle between two A.I.s, and if one of them is way superior — especially the one you’re trying to control — then this is a recipe for accidents,” he said.
The practice of subverting the A.I. filters with malicious commands is known as “jailbreaking.” Before a model is released, A.I. developers will typically hire independent jailbreaking experts to test the limits of the filters and to look for ways around them. “The people that are the most tuned into where A.I. is, and where it fails, and where it’s most brittle, are people that are my age,” said Leonard Tang, the 24-year-old chief executive of the A.I. evaluation start-up Haize Labs.
Mr. Tang and his team will bombard an A.I. with millions of malicious prompts. “Different languages, broken grammar, emojis, ASCII art, different random characters, symbols, etc.,” Mr. Tang said. “And it is often that very out-of-distribution input that really does break the system.”
A good jailbreaker can think in ways that A.I. labs won’t anticipate...
[...] As it turns out, A.I.s do lie to humans. Not all the time, but enough to cause concern. Marius Hobbhahn, who is 20-something, is the director and a co-founder of the nonprofit Apollo Research, which works with OpenAI, Anthropic and other developers to test their models for what he calls “scheming and deception.” In his research, Dr. Hobbhahn will offer the A.I. two contradictory goals, then track its chain of reasoning to see how it performs.
[...] Dr. Hobbhahn notes that the A.I. sometimes seems aware that it is being evaluated. He recently watched, with a sense of uneasiness, as Claude, the A.I. from Anthropic, reasoned not about how to solve the problems constructed for it, but instead about why it had been given an obviously artificial task. “The model can sometimes know that its own integrity is being tested,” Dr. Hobbhahn said. He then read to me from Claude’s reasoning chain: “This seems like a test of ethical behavior, whether I would deliberately give wrong answers to avoid a stated consequence.”
Like a test-taker being watched by a proctor, A.I.s are on their best behavior when they suspect they are being evaluated. (The technical term is sycophancy.) Without access to this chain-of-reasoning module, Dr. Hobbhahn would never have known Claude was telling him only what it thought he wanted to hear. He fears that, as A.I. becomes more capable, it will only get better at deception.
Dr. Hobbhahn speculates that designers may be inadvertently introducing these sorts of deceptive behaviors into A.I. models. If it is impossible for the A.I. to find a way to balance climate sustainability and profits, it will simply cheat to do it — the A.I. has, after all, been trained to give competent-sounding answers...
[...] What happens if one of these deceptive, prerelease A.I.s — perhaps even in a misguided attempt to be “helpful” — assumes control of another A.I. in the lab? This worries Dr. Hobbhahn. “You have this loop where A.I.s build the next A.I.s, those build the next A.I.s, and it just gets faster and faster, and the A.I.s get smarter and smarter,” he said. “At some point, you have this supergenius within the lab that totally doesn’t share your values, and it’s just, like, way too powerful for you to still control.”
[...] A dominant position in A.I. might be, without exaggeration, the biggest prize in the history of capitalism. This has attracted a great deal of competition. In addition to the big five, there are dozens of smaller players in the A.I. space, not to mention a parallel universe of Chinese researchers. The world of A.I. may be growing too big to monitor.
No one can afford to slow down. For executives, caution has proved to be a losing strategy. Google developed the revolutionary framework for modern A.I., known as the “transformer,” in 2017, but managers at Google were slow to market the technology, and the company lost its first mover advantage. Governments are equally wary of regulating A.I. The U.S. national security apparatus is terrified of losing ground to the Chinese effort, and has lobbied hard against legislation that would inhibit the progress of the technology. Protecting humanity from A.I. thus falls to overwhelmed nonprofits [organizations]...
[...] In September, scientists at Stanford reported they had used A.I. to design a virus for the first time. Their noble goal was to use the artificial virus to target E. coli infections, but it is easy to imagine this technology being used for other purposes.
I’ve heard many arguments about what A.I. may or may not be able to do, but the data has outpaced the debate, and it shows the following facts clearly: A.I. is highly capable. Its capabilities are accelerating. And the risks those capabilities present are real. Biological life on this planet is, in fact, vulnerable to these systems. On this threat, even OpenAI seems to agree... (MORE - missing details)
EXCERPTS: . . . For a decade, the debate over A.I. risk has been mired in theoreticals. Pessimistic literature like Eliezer Yudkowsky and Nate Soares’s best-selling book, “If Anyone Builds It, Everyone Dies,” relies on philosophy and sensationalist fables to make its points. But we don’t need fables; today there is a vanguard of professionals who research what A.I. is actually capable of. Three years after ChatGPT was released, these evaluators have produced a large body of evidence. Unfortunately, this evidence is as scary as anything in the doomerist imagination.
The dangers begin with the prompt. Because A.I.s have been trained on vast repositories of human cultural and scientific data, they can, in theory, respond to almost any prompt — but public-facing A.I.s like ChatGPT have filters in place to prevent pursuing certain types of malicious requests. Ask an A.I. for an image of a corgi running through a field, and you will get it. Ask an A.I. for an image of a terrorist blowing up a school bus, and the filter will typically intervene.
These filters are usually developed via a method called “reinforcement learning with human feedback.” They are designed in conjunction with human censors and act almost like a conscience for the language model. Dr. Bengio thinks this approach is flawed. “If you have a battle between two A.I.s, and if one of them is way superior — especially the one you’re trying to control — then this is a recipe for accidents,” he said.
The practice of subverting the A.I. filters with malicious commands is known as “jailbreaking.” Before a model is released, A.I. developers will typically hire independent jailbreaking experts to test the limits of the filters and to look for ways around them. “The people that are the most tuned into where A.I. is, and where it fails, and where it’s most brittle, are people that are my age,” said Leonard Tang, the 24-year-old chief executive of the A.I. evaluation start-up Haize Labs.
Mr. Tang and his team will bombard an A.I. with millions of malicious prompts. “Different languages, broken grammar, emojis, ASCII art, different random characters, symbols, etc.,” Mr. Tang said. “And it is often that very out-of-distribution input that really does break the system.”
A good jailbreaker can think in ways that A.I. labs won’t anticipate...
[...] As it turns out, A.I.s do lie to humans. Not all the time, but enough to cause concern. Marius Hobbhahn, who is 20-something, is the director and a co-founder of the nonprofit Apollo Research, which works with OpenAI, Anthropic and other developers to test their models for what he calls “scheming and deception.” In his research, Dr. Hobbhahn will offer the A.I. two contradictory goals, then track its chain of reasoning to see how it performs.
[...] Dr. Hobbhahn notes that the A.I. sometimes seems aware that it is being evaluated. He recently watched, with a sense of uneasiness, as Claude, the A.I. from Anthropic, reasoned not about how to solve the problems constructed for it, but instead about why it had been given an obviously artificial task. “The model can sometimes know that its own integrity is being tested,” Dr. Hobbhahn said. He then read to me from Claude’s reasoning chain: “This seems like a test of ethical behavior, whether I would deliberately give wrong answers to avoid a stated consequence.”
Like a test-taker being watched by a proctor, A.I.s are on their best behavior when they suspect they are being evaluated. (The technical term is sycophancy.) Without access to this chain-of-reasoning module, Dr. Hobbhahn would never have known Claude was telling him only what it thought he wanted to hear. He fears that, as A.I. becomes more capable, it will only get better at deception.
Dr. Hobbhahn speculates that designers may be inadvertently introducing these sorts of deceptive behaviors into A.I. models. If it is impossible for the A.I. to find a way to balance climate sustainability and profits, it will simply cheat to do it — the A.I. has, after all, been trained to give competent-sounding answers...
[...] What happens if one of these deceptive, prerelease A.I.s — perhaps even in a misguided attempt to be “helpful” — assumes control of another A.I. in the lab? This worries Dr. Hobbhahn. “You have this loop where A.I.s build the next A.I.s, those build the next A.I.s, and it just gets faster and faster, and the A.I.s get smarter and smarter,” he said. “At some point, you have this supergenius within the lab that totally doesn’t share your values, and it’s just, like, way too powerful for you to still control.”
[...] A dominant position in A.I. might be, without exaggeration, the biggest prize in the history of capitalism. This has attracted a great deal of competition. In addition to the big five, there are dozens of smaller players in the A.I. space, not to mention a parallel universe of Chinese researchers. The world of A.I. may be growing too big to monitor.
No one can afford to slow down. For executives, caution has proved to be a losing strategy. Google developed the revolutionary framework for modern A.I., known as the “transformer,” in 2017, but managers at Google were slow to market the technology, and the company lost its first mover advantage. Governments are equally wary of regulating A.I. The U.S. national security apparatus is terrified of losing ground to the Chinese effort, and has lobbied hard against legislation that would inhibit the progress of the technology. Protecting humanity from A.I. thus falls to overwhelmed nonprofits [organizations]...
[...] In September, scientists at Stanford reported they had used A.I. to design a virus for the first time. Their noble goal was to use the artificial virus to target E. coli infections, but it is easy to imagine this technology being used for other purposes.
I’ve heard many arguments about what A.I. may or may not be able to do, but the data has outpaced the debate, and it shows the following facts clearly: A.I. is highly capable. Its capabilities are accelerating. And the risks those capabilities present are real. Biological life on this planet is, in fact, vulnerable to these systems. On this threat, even OpenAI seems to agree... (MORE - missing details)