Nov 7, 2024 10:07 PM
(This post was last modified: Nov 7, 2024 10:18 PM by C C.)
RELATED (scivillage): Despite impressive output, generative AI doesn’t have coherent understanding of world
- - - - - - - - - - - - - -
Why “humanity’s last exam” will ultimately fail humanity
https://bigthink.com/starts-with-a-bang/...exam-fail/
EXCERPTS: . . . To assess the capabilities of LLMs, the Center for AI Safety, led by Dan Hendrycks, is currently crowdsourcing questions for review to construct what they’re calling Humanity’s Last Exam. Seeking exceptionally hard, PhD-level questions (potentially with obscure, niche answers) from experts around the world, they believe that such an exam would pose the ultimate challenge for AI.
However, the very premise may be fundamentally flawed.
[...] If you want to understand how AIs fail, it’s paramount that you first understand how AIs, and in particular, LLMs, work. In a traditional computer program, the user gives the computer an input (or a series of inputs), the computer performs computations according to a pre-programmed algorithm, and when it’s finished, it returns an output (or a series of outputs) to the user. The big difference between a traditional computer program and the particular form of AI that’s leveraged in machine learning applications (which includes LLMs) is that instead of following a pre-programmed algorithm for turning inputs into outputs, it’s the machine learning program itself that’s responsible for figuring out the underlying algorithm.
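To make that distinction concrete, here is a minimal sketch (mine, not from the article): a hand-coded conversion rule on one side, and on the other a tiny "learner" that is handed only example input/output pairs and has to recover the mapping itself. The task and data are invented purely for illustration.

```python
# Traditional program: the input-to-output mapping is written by hand.
def fahrenheit_to_celsius(f):
    return (f - 32) * 5.0 / 9.0

# Machine-learning style: the program is given example (input, output) pairs
# and must figure out the mapping itself -- here, by fitting a line with
# ordinary least squares.
def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# "Training data": a handful of example conversions.
train_f = [32, 50, 68, 86, 104]
train_c = [fahrenheit_to_celsius(f) for f in train_f]

learned = fit_linear(train_f, train_c)
print(fahrenheit_to_celsius(212))  # 100.0 -- the hand-coded algorithm
print(learned(212))                # ~100.0 -- the rule recovered from examples
```

The learner here only ever sees the examples; whatever "algorithm" it ends up with is entirely a product of that training data, which is the article's next point.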
How good is that algorithm going to be?
In principle, even the best algorithms that a machine is capable of coming up with are still going to be fundamentally limited by an inescapable factor: the quality, size, and comprehensiveness of the initial training data set that it uses to figure out the underlying algorithm. As Anil Ananthaswamy, author of Why Machines Learn: The Elegant Math Behind Modern AI, cautioned me about LLMs in a conversation back in July,
“While these algorithms can be extremely powerful and even outdo humans on the narrow tasks they are trained on, they don’t generalize to questions about data that falls outside the training data distribution. In that sense, they are not intelligent in the way humans are considered intelligent.”
In other words, if you want your LLM to perform better at tasks it currently doesn’t perform well at, the solution is to enhance your training data set so that it includes better, more relevant, and more comprehensive examples of the queries it’s going to be receiving.
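A toy illustration of that limitation (again mine, not the article's): a model that only interpolates its training data can look excellent inside the training distribution and still be badly wrong outside it. The nearest-neighbour "memorizer" below is a stand-in for that failure mode, not a model of how LLMs actually work.

```python
import math

def nearest_neighbor(train_x, train_y):
    """Predict by returning the output of the closest training input."""
    def predict(x):
        i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[i]
    return predict

# Training data: sin(x) densely sampled on [0, 2] only.
train_x = [i * 0.01 for i in range(201)]
train_y = [math.sin(x) for x in train_x]
model = nearest_neighbor(train_x, train_y)

print(abs(model(1.234) - math.sin(1.234)))  # tiny error: query lies inside the training range
print(abs(model(6.0) - math.sin(6.0)))      # large error: query lies outside anything it has seen
```

Inside the sampled interval the memorizer is nearly perfect; one step outside it, the error is enormous, no matter how densely the original interval was sampled.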
Which brings us back to Humanity’s Last Exam, and why it’s such an absurd notion to begin with. I was contacted by the Director of the Center for AI Safety, Dan Hendrycks, with the following message:
“OpenAI’s recent o1 model performs similarly to physics PhD students on various benchmarks, yet it’s unclear whether AI systems are truly approaching expert-level performance, or if they’re merely mimicking without genuine understanding. To answer this question, The Center for AI Safety and Scale AI are developing a benchmark titled “Humanity’s Last Exam” consisting of difficult, post-doctoral level questions meant to be at humanity’s knowledge frontier.”
Sounds reasonable, right?
The problem is, they don’t want actual questions that probe for deep knowledge. They don’t want a question that requires a nuanced answer and a deep understanding of all the factors at play. They don’t want the types of questions you might ask of a seminar speaker, of a student defending their dissertation, or of a researcher who’s staked out a contrarian position on a scientific matter. What they want are multiple choice questions that, supposedly, would be answerable only by someone who is competent in the field.
[...] If we wanted to design a test that truly determined whether an AI demonstrated genuine generative intelligence, it wouldn’t be a test that could be defeated with an arbitrarily large, comprehensive set of training data. The notion that “infinite memorization” would equate to genuine AGI is absurd on its face, as the unique mark of intelligence is the ability to reason and infer in the face of incomplete information. If you can supply the subject of your test with complete information with respect to the test itself, then you have no hope of measuring how it performs on measures of intelligence at all; you’ve simply measured how well your subject performed on the test.
It’s for these essential reasons that the attempt to create Humanity’s Last Exam with a series of multiple choice questions that are, at their core, knowledge-based instead of reasoning-based is destined to fail... (MORE - missing details)