Nov 7, 2024 10:07 PM
(This post was last modified: Nov 7, 2024 10:18 PM by C C.)
RELATED (scivillage): Despite impressive output, generative AI doesn’t have coherent understanding of world
- - - - - - - - - - - - - -
Why “humanity’s last exam” will ultimately fail humanity
https://bigthink.com/starts-with-a-bang/...exam-fail/
EXCERPTS: . . . To assess the capabilities of LLMs, the Center for AI Safety, led by Dan Hendrycks, is currently crowdsourcing questions for review to construct what they’re calling Humanity’s Last Exam. Seeking exceptionally hard, PhD-level questions (potentially with obscure, niche answers) from experts around the world, they believe that such an exam would pose the ultimate challenge for AI.
However, the very premise may be fundamentally flawed.
[...] If you want to understand how AIs fail, it’s paramount that you first understand how AIs, and in particular, LLMs, work. In a traditional computer program, the user gives the computer an input (or a series of inputs), the computer performs computations according to a pre-programmed algorithm, and when it’s finished, it returns an output (or a series of outputs) to the user. The big difference between a traditional computer program and the particular form of AI that’s leveraged in machine learning applications (which includes LLMs) is that instead of following a pre-programmed algorithm for turning inputs into outputs, it’s the machine learning program itself that’s responsible for figuring out the underlying algorithm.
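To make that distinction concrete, here is a minimal sketch (mine, not from the article): a hand-coded conversion rule on one side, and on the other a tiny "learner" that is handed only example input/output pairs and has to recover the mapping itself. The task and data are invented purely for illustration.

```python
# Traditional program: the input-to-output mapping is written by hand.
def fahrenheit_to_celsius(f):
    return (f - 32) * 5.0 / 9.0

# Machine-learning style: the program is given example (input, output) pairs
# and must figure out the mapping itself -- here, by fitting a line with
# ordinary least squares.
def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# "Training data": a handful of example conversions.
train_f = [32, 50, 68, 86, 104]
train_c = [fahrenheit_to_celsius(f) for f in train_f]

learned = fit_linear(train_f, train_c)
print(fahrenheit_to_celsius(212))  # 100.0 -- the hand-coded algorithm
print(learned(212))                # ~100.0 -- the rule recovered from examples
```

The learner here only ever sees the examples; whatever "algorithm" it ends up with is entirely a product of that training data, which is the article's next point.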
How good is that algorithm going to be?
In principle, even the best algorithms that a machine is capable of coming up with are still going to be fundamentally limited by an inescapable factor: the quality, size, and comprehensiveness of the initial training data set that it uses to figure out the underlying algorithm. As Anil Ananthaswamy, author of Why Machines Learn: The Elegant Math Behind Modern AI, cautioned me about LLMs in a conversation back in July,
“While these algorithms can be extremely powerful and even outdo humans on the narrow tasks they are trained on, they don’t generalize to questions about data that falls outside the training data distribution. In that sense, they are not intelligent in the way humans are considered intelligent.”
In other words, if you want your LLM to perform better at tasks it currently doesn’t perform well at, the solution is to enhance your training data set so that it includes better, more relevant, and more comprehensive examples of the queries it’s going to be receiving.
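A toy illustration of that limitation (again mine, not the article's): a model that only interpolates its training data can look excellent inside the training distribution and still be badly wrong outside it. The nearest-neighbour "memorizer" below is a stand-in for that failure mode, not a model of how LLMs actually work.

```python
import math

def nearest_neighbor(train_x, train_y):
    """Predict by returning the output of the closest training input."""
    def predict(x):
        i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[i]
    return predict

# Training data: sin(x) densely sampled on [0, 2] only.
train_x = [i * 0.01 for i in range(201)]
train_y = [math.sin(x) for x in train_x]
model = nearest_neighbor(train_x, train_y)

print(abs(model(1.234) - math.sin(1.234)))  # tiny error: query lies inside the training range
print(abs(model(6.0) - math.sin(6.0)))      # large error: query lies outside anything it has seen
```

Inside the sampled interval the memorizer is nearly perfect; one step outside it, the error is enormous, no matter how densely the original interval was sampled.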
Which brings us back to Humanity’s Last Exam, and why it’s such an absurd notion to begin with. I was contacted by the Director of the Center for AI Safety, Dan Hendrycks, with the following message:
“OpenAI’s recent o1 model performs similarly to physics PhD students on various benchmarks, yet it’s unclear whether AI systems are truly approaching expert-level performance, or if they’re merely mimicking without genuine understanding. To answer this question, The Center for AI Safety and Scale AI are developing a benchmark titled “Humanity’s Last Exam” consisting of difficult, post-doctoral level questions meant to be at humanity’s knowledge frontier.”
Sounds reasonable, right?
The problem is, they don’t want actual questions that probe for deep knowledge. They don’t want a question that requires a nuanced answer and a deep understanding of all the factors at play. They don’t want the types of questions you might ask of a seminar speaker, of a student defending their dissertation, or of a researcher who’s staked out a contrarian position on a scientific matter. What they want are multiple choice questions that, supposedly, would be answerable only by someone who is competent in the field.
[...] If we wanted to design a test that truly determined whether an AI demonstrated genuine generative intelligence, it wouldn’t be a test that could be defeated with an arbitrarily large, comprehensive set of training data. The notion that “infinite memorization” would equate to genuine AGI is absurd on its face, as the unique mark of intelligence is the ability to reason and infer in the face of incomplete information. If you can supply the subject of your test with complete information with respect to the test itself, then you have no hope of measuring how it performs on measures of intelligence at all; you’ve simply measured how well your subject performed on the test.
It’s for these essential reasons that the attempt to create Humanity’s Last Exam with a series of multiple choice questions that are, at their core, knowledge-based instead of reasoning-based is destined to fail... (MORE - missing details)