Large Language Models (LLMs) have taken the world by storm, captivating our collective imagination with their eloquence and linguistic prowess. From composing poetry to writing software code, they’ve become our digital companions (or AI assistants as we call them at IBM now). But can they do more? Can they transcend mere language generation and venture into the realm of reasoning and planning?

Over the weekend, I happened to stumble upon a paper by Subbarao Kambhampati, “Can Large Language Models Reason and Plan?” In it, Subbarao examines LLMs like GPT-3 and GPT-4 and investigates whether they can truly perform reasoning and planning tasks. In this blog, I want to summarize what I learned for my fellow AI enthusiasts!

The N-gram Enigma

The paper begins by highlighting a significant distinction between LLMs and traditional AI systems. LLMs excel at generating text through approximate retrieval, stitching together patterns memorized from their training data, but they lack the principled problem-solving skills we expect from AI planners. Early tests on standardized planning problems revealed that GPT-3 performed poorly, and later models like GPT-4 showed only modest improvements. Worse, when the action and object names were obfuscated, performance dropped drastically, indicating reliance on memorization rather than true planning.
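To make that obfuscation test concrete, here is a minimal Python sketch of the idea, assuming a Blocksworld-style prompt. The prompt text, the name list, and the obfuscate() helper are my own illustration, not the paper's benchmark code:

```python
import random
import string

# Illustrative only: a toy version of the name-obfuscation test. If the
# model truly plans, renaming actions and objects should not matter; if it
# is pattern-matching on familiar words, accuracy collapses.

PROMPT = (
    "Actions: pickup(x), putdown(x), stack(x, y), unstack(x, y). "
    "Goal: the blue block is on the red block."
)

def obfuscate(text: str, names: list[str]) -> str:
    """Swap meaningful action/object names for random tokens so a model
    cannot lean on memorized word associations, only on structure."""
    for name in sorted(names, key=len, reverse=True):  # longest first: 'unstack' before 'stack'
        token = "".join(random.choices(string.ascii_lowercase, k=6))
        text = text.replace(name, token)
    return text

print(obfuscate(PROMPT, ["pickup", "putdown", "unstack", "stack", "blue", "red"]))
```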

Fine-Tuning: A Mirage of Reasoning?

To rescue LLM performance, the author examines popular remedies such as fine-tuning on planning examples and iterative human prompting. However, fine-tuning merely strengthens the memory retrieval aspect of LLMs without addressing the core reasoning deficiencies. Iterative prompting, meanwhile, risks a Clever Hans effect: a human who already knows the answer may unwittingly steer the model toward it, so the apparent reasoning is really the prompter's.

LLM-Modulo: A Beacon of Hope

The author proposes what he calls the “LLM-Modulo” approach, which could effectively leverage the strengths of LLMs for tasks involving reasoning, planning, and modeling complex environments. The idea is to use LLMs to generate candidate solutions or ideas for a given problem, which are then tested and verified by external models or tools with sound problem-solving capabilities. By combining the generative power of LLMs with the precise reasoning of specialized tools (or humans), this hybrid architecture aims to maximize the potential of LLMs while acknowledging their limitations.
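Here is a minimal, runnable sketch of that generate-test loop on a toy block-stacking task. The fake LLM (a permutation generator), the verify() checker, and the goal encoding are all my placeholders, not the paper's implementation; in practice the generator would be a real model call and the verifier a sound external tool such as a plan validator:

```python
import itertools

# A runnable toy of the LLM-Modulo generate-test loop. The "LLM" is faked
# with a permutation generator; a real system would call a model and feed
# the verifier's critique back into the next prompt.

GOAL = [("A", "B"), ("B", "C")]  # on(A, B) and on(B, C): target stack A-B-C

def verify(stack):
    """External, sound checker: does the proposed stack (listed top to
    bottom) satisfy every on(x, y) goal? Returns (ok, critique)."""
    for top, bottom in GOAL:
        if stack.index(top) != stack.index(bottom) - 1:
            return False, f"{top} is not directly on {bottom}"
    return True, ""

def llm_modulo(blocks, max_rounds=10):
    candidates = itertools.permutations(blocks)  # stand-in for LLM sampling
    for _, candidate in zip(range(max_rounds), candidates):
        ok, critique = verify(list(candidate))
        if ok:
            return list(candidate)  # the verifier, not the LLM, certifies it
        # a real loop would feed `critique` back into the next LLM prompt
    return None  # no verified plan within the budget

print(llm_modulo(["C", "A", "B"]))  # -> ['A', 'B', 'C']
```

The key design point is the division of labor: correctness comes entirely from the sound verifier, while the LLM only supplies guesses, which is exactly how the paper argues LLM strengths should be harnessed for planning.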

The Commonsense Mirage

While several papers have claimed that LLMs possess planning capabilities, the author argues that these models are often just extracting general commonsense knowledge rather than engaging in true planning. Additionally, LLMs struggle to verify their own outputs, undermining the claim that they can self-critique and iteratively improve their performance.

The Verdict: Reasoning or Text Retrieval?

To sum it up, while LLMs excel at idea generation, principled reasoning is not their forte. So the next time you are amazed by an LLM's eloquence, remember that they are knowledge fountains, not architects of logic.

What are your thoughts on the potential of LLMs? Share your insights in the comments below!


If you are interested in reading the paper or learning more about the author, here is the link to Subbarao Kambhampati's paper on arxiv.org: Can Large Language Models Reason and Plan?