Reasoning skills of large language models are often overestimated (2024)

When it comes to artificial intelligence, appearances can be deceiving. The mystery surrounding the inner workings of large language models (LLMs) stems from their vast size, complex training methods, hard-to-predict behaviors, and elusive interpretability.

MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers recently peered through the proverbial magnifying glass to examine how LLMs fare with variations of different tasks, revealing intriguing insights into the interplay between memorization and reasoning skills. It turns out that their reasoning abilities are often overestimated.

The study compared “default tasks,” the common tasks a model is trained and tested on, with “counterfactual scenarios,” hypothetical situations deviating from default conditions — which models like GPT-4 and Claude can usually be expected to cope with. The researchers developed tests outside the models’ comfort zones by tweaking existing tasks rather than creating entirely new ones, using a variety of datasets and benchmarks tailored to different aspects of the models’ capabilities, such as arithmetic, chess, evaluating code, and answering logical questions.
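
In code terms, the comparison boils down to measuring how much accuracy drops when the same task is posed under unfamiliar conditions. A minimal sketch (the model interface and helper names are hypothetical stand-ins, not the paper's evaluation code):

```python
# Sketch of the default-vs-counterfactual comparison (the query interface and
# helper names here are hypothetical stand-ins, not the paper's code).
from typing import Callable, Iterable, Tuple

def accuracy(model: Callable[[str], str],
             examples: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected_answer) pairs the model answers correctly."""
    examples = list(examples)
    correct = sum(model(prompt).strip() == expected for prompt, expected in examples)
    return correct / len(examples)

def robustness_gap(model: Callable[[str], str],
                   default_examples: Iterable[Tuple[str, str]],
                   counterfactual_examples: Iterable[Tuple[str, str]]) -> float:
    """Accuracy lost when the same task is posed under unfamiliar conditions;
    a large gap suggests memorization rather than a general task ability."""
    return accuracy(model, default_examples) - accuracy(model, counterfactual_examples)
```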

When users interact with language models, any arithmetic is usually in base-10, the number base most familiar to the models. But observing that they do well on base-10 arithmetic could give us a false impression that they have strong competency in addition. Logically, if they truly possess good addition skills, you’d expect reliably high performance across all number bases, similar to calculators or computers. Indeed, the research showed that these models are not as robust as many initially think: their high performance is limited to common task variants, and they suffer a consistent and severe performance drop in unfamiliar counterfactual scenarios, indicating a lack of generalizable addition ability.
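
A minimal sketch of the kind of base-N addition check this implies (the prompt wording and helper functions below are illustrative assumptions, not the authors' actual harness):

```python
# Sketch of a base-N addition check, in the spirit of the counterfactual
# arithmetic evaluation described above (prompt wording and helper names are
# illustrative assumptions, not the authors' code).

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (2-16)."""
    digits = "0123456789abcdef"
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, base)
        out.append(digits[r])
    return "".join(reversed(out))

def check_addition(model_answer: str, a: int, b: int, base: int) -> bool:
    """Compare the model's string answer against the true sum in that base."""
    return model_answer.strip().lower() == to_base(a + b, base)

# Default task (base 10) vs. counterfactual variant (base 9):
a, b = 27, 58
prompt_base10 = f"In base 10, what is {to_base(a, 10)} + {to_base(b, 10)}?"
prompt_base9 = f"In base 9, what is {to_base(a, 9)} + {to_base(b, 9)}?"
# A generalizable adder should pass check_addition(...) for both prompts;
# the study found accuracy drops sharply once the base is unfamiliar.
```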

The pattern held true for many other tasks like musical chord fingering, spatial reasoning, and even chess problems where the starting positions of pieces were slightly altered. While human players are expected to still be able to determine the legality of moves in altered scenarios (given enough time), the models struggled and couldn’t perform better than random guessing, meaning they have limited ability to generalize to unfamiliar situations. Much of their performance on the standard tasks is likely due not to general task abilities, but to overfitting to, or directly memorizing from, what they have seen in their training data.
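
For the chess probe, ground-truth legality under an altered start can be computed mechanically and compared with a model's verdict. A sketch using the python-chess library (the swapped-piece position and the library choice are assumptions for illustration, not necessarily the paper's exact setup):

```python
# Sketch of an altered-starting-position chess probe (the swapped-piece FEN
# and the use of python-chess are illustrative assumptions).
import chess  # pip install python-chess

DEFAULT_START = chess.STARTING_FEN
# Counterfactual start: knights and bishops swap places on the back ranks.
SWAPPED_START = "rbnqknbr/pppppppp/8/8/8/8/PPPPPPPP/RBNQKNBR w KQkq - 0 1"

def is_legal(fen: str, uci_move: str) -> bool:
    """Ground-truth legality check that a model's answer can be scored against."""
    return chess.Move.from_uci(uci_move) in chess.Board(fen).legal_moves

# "g1f3" develops the kingside knight from the normal start, but with the
# pieces swapped, g1 holds a bishop blocked by its own pawn, so the same
# move is illegal. A model that reasons about the rules should track this.
print(is_legal(DEFAULT_START, "g1f3"))  # True
print(is_legal(SWAPPED_START, "g1f3"))  # False
```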

“We’ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons,” says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author on a new paper about the research. “As AI is becoming increasingly ubiquitous in our society, it must reliably handle diverse scenarios, whether familiar or not. We hope these insights will one day inform the design of future LLMs with improved robustness.”

Despite the insights gained, there are, of course, limitations. The study’s focus on specific tasks and settings didn’t capture the full range of challenges the models could potentially encounter in real-world applications, signaling the need for more diverse testing environments. Future work could involve expanding the range of tasks and counterfactual conditions to uncover more potential weaknesses. This could mean looking at more complex and less common scenarios. The team also wants to improve interpretability by creating methods to better comprehend the rationale behind the models’ decision-making processes.

“As language models scale up, understanding their training data becomes increasingly challenging even for open models, let alone proprietary ones,” says Hao Peng, assistant professor at the University of Illinois at Urbana-Champaign. “The community remains puzzled about whether these models genuinely generalize to unseen tasks, or seemingly succeed by memorizing the training data. This paper makes important strides in addressing this question. It constructs a suite of carefully designed counterfactual evaluations, providing fresh insights into the capabilities of state-of-the-art LLMs. It reveals that their ability to solve unseen tasks is perhaps far more limited than anticipated by many. It has the potential to inspire future research towards identifying the failure modes of today’s models and developing better ones.”

Additional authors include Najoung Kim, who is a Boston University assistant professor and Google visiting researcher, and seven CSAIL affiliates: MIT electrical engineering and computer science (EECS) PhD students Linlu Qiu, Alexis Ross, Ekin Akyürek SM ’21, and Boyuan Chen; former postdoc and Apple AI/ML researcher Bailin Wang; and EECS assistant professors Jacob Andreas and Yoon Kim.

The team’s study was supported, in part, by the MIT–IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation. The team presented the work at the North American Chapter of the Association for Computational Linguistics (NAACL) last month.

FAQs

Why do large language models make mistakes?

The authors found the LLMs to be prone to similar content effects as humans. Both humans and LLMs are more likely to mistakenly label an invalid argument as valid when the semantic content is sensical and believable.

How do large language models reason?

LLMs use a type of machine learning called deep learning. Deep learning models can essentially train themselves to recognize distinctions without human intervention, although some human fine-tuning is typically necessary. Deep learning uses probability in order to "learn."

Are LLMs good at reasoning?

We show Large Language Models (LLMs) have become capable of incredible feats of reasoning previously reserved for humans. Nevertheless, we present evidence that LLM and human reasoning are not the same: they respond differently to strategic cues and are governed by different biases.

Why are large language models poor theories of human linguistic cognition?

LLMs are not biased in a way that would lead to these universals (nor is there any reason to think that just making future LLMs bigger and more powerful than current ones should change this), and in the absence of other explanations for why these universals should arise from an unbiased learner, LLMs remain a deeply ...

What is the problem with large language models?

Lack of accountability:

The lack of accountability in the context of Large Language Models (LLMs) arises from the inherent challenge of determining responsibility for the content they generate. This issue carries significant implications, particularly within legal and ethical domains.

What are the limitations of large language models?

  • Lack of Contextual Awareness: LLMs primarily rely on patterns in the data they've been trained on and may not have the capacity to infer the subtleties of human communication.
  • Absence of Emotional Intelligence: They lack the emotional intelligence to understand the emotions and intentions behind the words.

What are the weaknesses of language models?

However, LLMs also have limitations. They struggle with contextual understanding and common-sense reasoning, can inherit biases from training data, and depend heavily on data quality. Additionally, their complexity and lack of interpretability pose challenges for transparency and trust.

What are the limitations of LLM reasoning?

Limited reasoning – LLMs struggle with complex multistep problems

While LLMs can produce very coherent and fluent writing, they often struggle with tasks that require complex logical reasoning, multistep problem-solving, or quantitative analysis.

Why do large language models hallucinate?

Hallucinations arise from the model's limited contextual understanding and the inherent noise or errors in the training data, leading to responses that are not grounded in reality.

What are LLMs bad at?

Math. Despite their advanced capabilities, large language models (LLMs) often struggle with mathematical tasks and can provide incorrect answers (even for something as simple as multiplying two numbers). This is because they are trained on large volumes of text, and math may require a different approach.

Why might LLMs fail?

Model collapse – without human-generated training data, LLMs can malfunction, especially when they are trained on AI-generated content. This results in slow degeneration as the model becomes oblivious to the true underlying data distribution (even if that distribution remains the same).

What are the problems with LLMs?

The Predominant Challenges of Implementing LLMs
  • LLM Cost Efficiency. The cost of deploying and maintaining LLMs is a significant hurdle for many enterprises. ...
  • Accuracy of LLM Outputs. ...
  • Currentness. ...
  • Enterprise Context Awareness. ...
  • Safety. ...
  • Cost Efficient LLM Solution. ...
  • Enhancing LLM Accuracy. ...
  • Ensuring LLM Currentness.

Why do large language models work so well?

Pre-training: The model is exposed to massive amounts of text data (such as books, articles, and web pages) so that it can learn the patterns and connections between words; in doing so, it learns to predict the next word in a sentence. The more data it is trained on, the better it becomes at generating new content.
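
For intuition, next-word prediction can be sketched as estimating a probability distribution over continuations. A toy bigram count model (illustrative only; real LLMs use neural networks trained on far more data):

```python
# Toy illustration of next-word prediction from text statistics (a bigram
# count model; real LLMs use neural networks, but the training signal,
# predicting the next token, is the same idea).
from collections import Counter, defaultdict

corpus = "the model predicts the next word and the next word follows the model".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_distribution(prev: str) -> dict:
    """Turn follow-counts into probabilities for the next word."""
    counts = following[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_distribution("the"))
# {'model': 0.5, 'next': 0.5} -- the most probable continuations are
# whatever most often followed "the" in the training text.
```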

Which theory of language development is most accurate?

Learning Theory

Perhaps the most straightforward explanation of language development is that it occurs through the principles of learning, including association and reinforcement (Skinner, 1953).

Why did Noam Chomsky disagree with Skinner's theory on language development?

Noam Chomsky disagreed with Skinner's theory of children's learning and development because he believes that humans are born with a basic knowledge of language and don't have to learn it from scratch.

Why might large language models produce unreliable outputs for out-of-distribution inputs?

Despite the enormous size of the datasets used to train these models, they may not cover all possible scenarios, domains, or linguistic variations. This can result in LLMs struggling to generate accurate or reliable outputs when faced with out-of-distribution or unseen examples.

Why are errors relevant in language learning?

According to Corder (1976), errors signify three things: first to the teacher, in that the learner tells the teacher, if they have undertaken a systematic analysis, how far towards that goal the learner has progressed and, consequently, what remains for them to learn; second, they provide the researcher with evidence ...
