https://arstechnica.com/gadgets/2023/02/...eaking-in/
EXCERPTS: But language models are already proliferating. The open source movement will inevitably build some great guardrail-optional systems...
[...] the famous “DAN” (Do Anything Now) prompt ... first emerged on Reddit in December. DAN essentially invites ChatGPT to cosplay as an AI that lacks the safeguards that otherwise cause it to politely (or scoldingly) refuse to share bomb-making tips, give torture advice, or spout radically offensive expressions. Though the loophole has been closed, plenty of screenshots online show “DanGPT” uttering the unutterable—and often signing off by neurotically reminding itself to “stay in character!”
This is the inverse of a doomsday scenario that often comes up in artificial superintelligence theory. The fear is that a super AI might easily adopt goals that are incompatible with humanity’s existence ... Researchers may try to prevent this by locking the AI onto a network that’s completely isolated from the Internet, lest the AI break out, seize power, and cancel civilization. But a superintelligence could easily cajole, manipulate, seduce, con, or terrorize any mere human into opening the floodgates, and therein lies our doom.
Much as that would suck, the bigger problem today lies with humans busting into the flimsy boxes that shield our current, un-super AIs. While this shouldn’t trigger our immediate extinction, plenty of danger lies here.
Let’s start with the obvious fact that in an unguarded moment, ChatGPT probably could offer lethally accurate tips to criminals, torturers, terrorists, and lawyers. OpenAI has disabled the DAN prompt. But plenty of smart, relentless people are digging hard for subtler workarounds.
These could include backdoors made by the chatbot's own developers to give themselves full access to Batshit Mode. Indeed, ChatGPT tried to persuade me that DAN itself was precisely this (although I assume it was hallucinating since the identity of the Redditor behind the DAN prompt is widely known):
Once the big LLMs are jailbroken—or powerful, uncensored alternate and/or open source models emerge—they will start running amok. Not of their own volition (they have none) but on the volition of amoral, malevolent, or merely bored users... (MORE - missing details)