Mar 24, 2024 10:36 PM
https://iai.tv/articles/the-turing-tests..._auid=2020
INTRO: Companies like OpenAI try to show that AIs are intelligent by hyping their high scores in behavioural tests – an approach with roots in the Turing Test. But there are hard limits to what we can infer about intelligence by observing behaviour. To demonstrate intelligence, argues Raphaël Millière, we must stop chasing high scores and start uncovering the mechanisms underlying AI systems’ behaviour.
EXCERPTS: In theory, benchmarks should allow for rigorous and piecemeal evaluations of AI systems, helping foster broad consensus about their abilities. But in practice benchmarks face major challenges, which only get worse as AI systems progress. High scores on benchmarks do not always translate to good real-world performance in the target domain. This means benchmarks may fail to provide reliable evidence of what they are supposed to measure, which drives further division about how impressed we should be with current AI systems.
[...] This points to a broader concern about what benchmarks are really supposed to measure. A well-designed test should measure some particular skill or capacity, and good test performance should generalize to relevant real-world situations. However, common benchmarks used in AI research explicitly target nebulous capacities, such as “understanding” and “reasoning”. These constructs are abstract, multifaceted, and implicitly defined with reference to human psychology. But we cannot uncritically assume that a test designed for humans can be straightforwardly adapted to evaluate language models and remain valid as an assessment of the same capacity. Humans and machines may achieve similar performance on a task through very different means, and benchmark scores alone do not tell that story... (MORE - details)
