
https://www.quantamagazine.org/chatbot-s...-20250131/
EXCERPT: . . . Einstein’s riddle requires composing a larger solution from solutions to subproblems, which researchers call a compositional task. Dziri’s team showed that LLMs that have only been trained to predict the next word in a sequence — which is most of them — are fundamentally limited in their ability to solve compositional reasoning tasks.
Other researchers have shown that transformers, the neural network architecture used by most LLMs, have hard mathematical bounds when it comes to solving such problems. Scientists have had some successes pushing transformers past these limits, but those increasingly look like short-term fixes. If so, it means there are fundamental computational caps on the abilities of these forms of artificial intelligence — which may mean it’s time to consider other approaches.
“The work is really motivated to help the community make this decision about whether transformers are really the architecture we want to embrace for universal learning,” said Andrew Wilson, a machine learning expert at New York University who was not involved with this study.
Ironically, LLMs have only themselves to blame for this discovery of one of their limits. “The reason why we all got curious about whether they do real reasoning is because of their amazing capabilities,” Dziri said. They dazzled on tasks involving natural language, despite the seeming simplicity of their training. During the training phase, an LLM is shown a fragment of a sentence with the last word obscured (though technically the unit of prediction is a token, which isn’t always a whole word). The model predicts the missing information and then “learns” from its mistakes.
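That objective is simple to state in code. Below is a minimal, hypothetical sketch of next-token prediction training in PyTorch; the tiny recurrent model, vocabulary size, and random data are stand-ins for illustration only (real LLMs use much larger transformer architectures trained on huge corpora), but the predict-the-obscured-token loss and the learn-from-mistakes update are the same idea.

```python
# Minimal sketch of next-token (next-word) prediction training.
# Toy model and data; real LLMs use transformers and vast corpora,
# but the training objective shown here is the same in spirit.
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)      # logits for the next token at each position

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A "sentence" as token ids: inputs are every token but the last,
# targets are the same sequence shifted by one (the obscured words).
sequence = torch.randint(0, vocab_size, (1, 16))
inputs, targets = sequence[:, :-1], sequence[:, 1:]

for step in range(100):
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                   # the model "learns from its mistakes"
    optimizer.step()
```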
The largest LLMs — OpenAI’s o1 and GPT-4, Google’s Gemini, Anthropic’s Claude — train on almost all the available data on the internet. As a result, the LLMs end up learning the syntax of, and much of the semantic knowledge in, written language. Such “pre-trained” models can be further trained, or fine-tuned, to complete sophisticated tasks far beyond simple sentence completion, such as summarizing a complex document or generating code to play a computer game. The results were so powerful that the models seemed, at times, capable of reasoning.
Yet they also failed in ways both obvious and surprising. “On certain tasks, they perform amazingly well,” Dziri said. “On others, they’re shockingly stupid.”
(Image caption: Nouha Dziri and her team helped show the difficulty current AI systems have with certain kinds of reasoning tasks.)
Take basic multiplication. Standard LLMs, such as ChatGPT and GPT-4, fail badly at it. In early 2023, when Dziri’s team asked GPT-4 to multiply two three-digit numbers, it initially succeeded only 59% of the time. When it multiplied two four-digit numbers, accuracy fell to just 4%... (MORE - missing details)
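To make the kind of measurement being described concrete, here is a hypothetical sketch of such an evaluation: generate random n-digit multiplication problems, ask a model, and score exact matches against real arithmetic. The `ask_model` function is a made-up placeholder for whatever chat or completion API the evaluator uses; this is not the harness Dziri’s team actually ran, only an illustration of the setup behind accuracy figures like 59% and 4%.

```python
# Illustrative sketch: estimate an LLM's accuracy on n-digit multiplication.
# `ask_model` is a hypothetical stand-in for an actual model API call.
import random
import re

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

def multiplication_accuracy(n_digits: int, trials: int = 100) -> float:
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    correct = 0
    for _ in range(trials):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        reply = ask_model(f"What is {a} * {b}? Answer with the number only.")
        digits = re.sub(r"[^0-9]", "", reply)   # strip everything but digits
        correct += digits == str(a * b)
    return correct / trials

# e.g. multiplication_accuracy(3) and multiplication_accuracy(4) would
# estimate the three-digit and four-digit success rates quoted above.
```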
RELATED (scivillage): The brain holds no exclusive rights on how to create intelligence