Oct 2, 2024 03:44 PM
https://www.nature.com/articles/d41586-024-03169-9
EXCERPT: Strikingly, OpenAI o1 has become the first large language model to beat PhD-level scholars on the hardest series of questions — the ‘diamond’ set — in a test called the Graduate-Level Google-Proof Q&A Benchmark (GPQA)1. OpenAI says that its scholars scored just under 70% on GPQA Diamond, whereas o1 scored 78% overall, with a particularly high score of 93% in physics (see ‘Next level’). That’s “significantly higher than the next-best reported [chatbot] performance”, says David Rein, who was part of the team that developed the GPQA. Rein now works at Model Evaluation and Threat Research, a non-profit organization in Berkeley, California, that assesses the risks of AI. “It seems plausible to me that this represents a significant and fundamental improvement in the model’s core reasoning capabilities,” he adds.
OpenAI also tested o1 on a qualifying exam for the International Mathematics Olympiad. Its previous best model, GPT-4o, correctly solved only 13% of the problems, whereas o1 scored 83%.
OpenAI o1 works by using chain-of-thought logic; it talks itself through a series of reasoning steps as it attempts to solve a problem, correcting itself as it goes.
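To make the chain-of-thought idea concrete, here is a minimal Python sketch of the general pattern the article describes: reason through explicit intermediate steps, check the work, revise it if the check fails, and surface only a short summary alongside the final answer. This is a toy illustration under stated assumptions, not OpenAI's hidden implementation; the `ReasoningTrace` class and `solve_step_by_step` function are hypothetical names made up for this example.

```python
# Toy sketch of the chain-of-thought pattern (assumption: this is NOT how o1
# is implemented; its actual chain is hidden). Shown here: work through
# explicit intermediate steps, check the result, revise on failure, and
# report only a condensed summary plus the answer.

from dataclasses import dataclass, field


@dataclass
class ReasoningTrace:
    """Accumulates the intermediate 'thoughts' produced while solving."""
    steps: list[str] = field(default_factory=list)

    def add(self, step: str) -> None:
        self.steps.append(step)


def solve_step_by_step(a: int, b: int, c: int) -> tuple[int, ReasoningTrace]:
    """Toy problem: evaluate a + b * c while recording each reasoning step."""
    trace = ReasoningTrace()
    trace.add(f"Goal: evaluate {a} + {b} * {c}.")
    trace.add("Multiplication binds tighter than addition, so compute b * c first.")
    product = b * c
    trace.add(f"{b} * {c} = {product}.")
    answer = a + product
    trace.add(f"{a} + {product} = {answer}.")

    # Self-correction step: recompute directly and revise on disagreement.
    check = a + b * c
    if answer != check:
        trace.add(f"Check failed ({answer} != {check}); revising.")
        answer = check
    else:
        trace.add("Check passed; final answer confirmed.")
    return answer, trace


if __name__ == "__main__":
    answer, trace = solve_step_by_step(2, 3, 4)
    # Like o1's user-facing output, show a condensed summary, not every step.
    print("Summary:", trace.steps[-1])
    print("Answer:", answer)
```

Running the script prints a one-line summary and the answer, which loosely mirrors how o1 exposes a reconstructed summary of its reasoning rather than the full chain, as described below.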
OpenAI has decided to keep the details of any given chain of thought hidden — in part because the chain might contain errors or socially unacceptable ‘thoughts’, and in part to protect company secrets relating to how the model works. Instead, o1 provides a reconstructed summary of its logic for the user, alongside its answers. It’s unclear, White says, whether the full chain of thought, if revealed, would look similar to human reasoning.
The new capabilities come with trade-offs. For instance, OpenAI reports that it has received anecdotal feedback that o1 models hallucinate — make up incorrect answers — more often than their predecessors do (although the company’s internal testing showed slightly lower rates of hallucination for o1).
The red-team scientists noted plenty of ways in which o1 was helpful in coming up with protocols for science experiments, but OpenAI says the testers also “highlighted missing safety information pertaining to harmful steps, such as not highlighting explosive hazards or suggesting inappropriate chemical containment methods, pointing to unsuitability of the model to be relied on for high-risk physical safety tasks”.
“It’s still not perfect or reliable enough that you wouldn’t really want to closely check over it,” White says. He adds that o1 is more suited to guiding experts than novices. “For a novice, it’s just beyond their immediate inspection ability” to look at an o1-generated protocol and see that it’s “bunk”, he says... (MORE - missing details)
