AI is vulnerable to attack. Can it ever be used safely?

**C C** · Jul 31, 2024 05:18 PM

https://www.nature.com/articles/d41586-024-02419-0

EXCERPTS: The large language models (LLMs) that power chatbots such as ChatGPT, Gemini and Claude are capable of completing a wide variety of tasks, and at times might even appear to be intelligent. But as powerful as they are, these systems still routinely produce errors and can behave in undesirable or even harmful ways. They are trained with vast quantities of Internet text, and so have the capability to produce bigotry or misinformation, or provide users with problematic information, such as instructions for building a bomb. To reduce these behaviours, the models’ developers take various steps, such as providing feedback to fine-tune models’ responses, or restricting the queries that they will satisfy. However, although this might be enough to stop most of the general public encountering undesirable content, more determined people — including AI safety researchers – can design attacks that bypass these measures.

[...] McCoy advocates focusing on what LLMs were designed to do: predict the most likely next word, given everything that has come before. They accomplish this using the statistical patterns in language learnt during their initial training, together with a technique known as autoregression, which predicts the next value of something based on its past values. This enables LLMs not only to engage in conversation, but also to perform other, seemingly unrelated tasks such as mathematics. “Pretty much any task can be framed as next-word prediction,” says McCoy, “though in practice, some things are much more naturally framed that way than others.”

The application of next-word prediction to tasks that are not well suited to it can result in surprising errors. In a 2023 preprint study, McCoy and his colleagues demonstrated that GPT-4 — the algorithm that underpins ChatGPT — could count 30 characters presented to it with an accuracy of 97%. However, when tasked with counting 29 characters, accuracy dropped to only 17%. This demonstrates LLMs’ sensitivity to the prevalence of correct answers in their training data, which the researchers call output probability. The number 30 is more common in Internet text than is 29, simply because people like round numbers, and this is reflected in GPT-4’s performance. Many more experiments in the study similarly show that performance fluctuates wildly depending on how common the output, task or input text, is on the Internet. “This is puzzling if you think of it as a general-reasoning engine,” says McCoy. “But if you think of it as a text-string processing system, then it’s not surprising.”

[...] Because it seems almost impossible to completely prevent the misuse of LLMs, consensus is emerging that they should not be allowed into the world without chaperones. These take the form of more extensive guardrails that form a protective shell. “You need a system of verification and validation that’s external to the model,” says Rajpal. “A layer around the model that explicitly tests for various types of harmful behaviour.”

Simple rule-based algorithms can check for specific misuse — known jailbreaks, for instance, or the release of sensitive information — but this does not stop all failures. “If you had an oracle that would, with 100% certainty, tell you if some prompt contains a jailbreak, that completely solves the problem,” says Rajpal. “For some use cases, we have that oracle; for others, we don’t.”

Without such oracles, failures cannot be prevented every time. Additional, task-specific models can be used to try to spot harmful behaviours and difficult-to-detect attacks, but these are also capable of errors. The hope, however, is that multiple models are unlikely to all fail in the same way at the same time. “You’re stacking multiple layers of sieves, each with holes of different sizes, in different locations,” says Rajpal. “But when you stack them together, you get something that’s much more watertight than each individually.” (MORE - missing details)

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Research Popular AI models aren’t ready to safely power robots	C C	0	148	Nov 11, 2025 12:58 AM Last Post: C C
	Research Robot or human teachers: Which do children trust? + Super-AI sneak attack? New study	C C	0	420	Dec 23, 2023 10:25 PM Last Post: C C
	Article US Air Force denies killer AI drone attack on its operator	C C	0	356	Jun 3, 2023 05:17 PM Last Post: C C
	The unsolved mystery attack on internet cables in Paris	C C	0	373	Jul 23, 2022 05:34 PM Last Post: C C
	Implanted devices are the future of medicine, but they’re vulnerable to hackers	C C	0	348	Nov 18, 2020 08:19 PM Last Post: C C
	Can robots ever have a true sense of self? Scientists are making progress	C C	0	568	Feb 28, 2019 08:32 PM Last Post: C C
	Are Social Networks Vulnerable to Manipulation?	Bowser	5	1,843	Sep 17, 2016 10:48 PM Last Post: Bowser
	FBI Warns Of Car Hacking Risks: Increased Connectivity Leaves Vehicles Vulnerable	C C	2	1,172	Mar 22, 2016 01:07 AM Last Post: stryder