AI pioneer and educator Andrew Ng gave a compelling talk at Sequoia’s AI Ascent 2024 a couple of weeks ago. I took some notes and have added a few thoughts of my own. Highlights of the presentation appear at the top, and a fuller synopsis with Ng’s citations follows at the bottom.
Ng’s presentation offers a compelling analysis and argument for advancing research and development in AI agents. His categorization of agent behaviors provides clarity, countering vague claims about “AI agents.” In a field evolving as quickly as AI, these categories are invaluable for tracking progress and evaluating results.
Ng used AI code generation to showcase the power of agentic patterns, highlighting how planning, tool use (e.g., testing and running code), and reflection (e.g., AI-led code reviews) surpass zero-shot prompting. Code generation is especially suited for demonstrating these methodologies because of its objective benchmarks and studies showing how smaller models in agentic workflows outperform naive approaches with larger models.
Cited code-generation agents include:
But how representative is code generation of other tasks? Code is unique in being quasi-multimodal: it is text, but text that can be executed rather than merely read. Its execution and validation by interpreters make it an excellent test case, though these characteristics may not carry over to many other tasks. Still, programming’s unbounded complexity (“build me Google”) and objective evaluation make it an ideal area for showcasing agentic methodologies.
For non-coding tasks, Ng cited eight papers, two for each agentic behavior, showing significant improvements over zero-shot approaches. Tasks involving structured, step-wise reasoning or large-scale information analysis intuitively benefit from – or even require – agentic workflows. For example, synthesizing conclusions from hundreds of documents exceeds current context window sizes, even as they expand, and benefits from guided agentic strategies.
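One common agentic strategy for this kind of large-scale analysis is to summarize each document separately and then synthesize across the summaries. The sketch below is illustrative rather than drawn from Ng’s talk; `call_llm` is a hypothetical placeholder for whatever model API is in use.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model API call; replace with your provider's client."""
    raise NotImplementedError

def synthesize_documents(documents: list[str], question: str) -> str:
    """Summarize each document separately, then synthesize across the summaries."""
    # Map step: condense each document with respect to the question so that
    # no single prompt has to hold every document in one context window.
    summaries = [
        call_llm(f"Summarize this document as it relates to: {question}\n\n{doc}")
        for doc in documents
    ]
    # Reduce step: draw a conclusion from the per-document summaries.
    joined = "\n\n".join(summaries)
    return call_llm(f"Using these summaries, answer: {question}\n\n{joined}")
```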
Ng doesn’t delve into another important advantage of agentic behaviors: the paper trail they produce. When agents collaborate and use tools, they generate intermediate records that provide:
These intermediate outputs also create opportunities for future applications. Together, these features make agent collaboration an exciting frontier in AI development.
Today, most of us interact with large language models (LLMs) using a non-agentic workflow. This involves typing a prompt and generating an answer, akin to asking a person to write an essay without ever using the backspace key. Despite the inherent challenges, LLMs perform remarkably well in this setup.
In contrast, an agentic workflow is more iterative, and thus more similar to how humans work. For instance, an AI could first generate an essay outline, determine if it needs to conduct web research, write a draft, review its own draft, and revise accordingly. This iterative approach enables the model to think through its tasks and refine its output, often delivering far better results.
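A minimal sketch of that essay workflow, assuming hypothetical `call_llm` and `web_search` helpers (neither is from Ng’s talk), might look like this:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's API client."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Hypothetical search tool returning relevant snippets."""
    raise NotImplementedError

def write_essay(topic: str) -> str:
    # Plan: produce an outline before writing anything.
    outline = call_llm(f"Write a brief outline for an essay on: {topic}")

    # Tool use: let the model decide whether web research is needed.
    answer = call_llm(
        f"Outline:\n{outline}\n\nDoes this essay need web research? Answer yes or no."
    )
    notes = web_search(topic) if answer.strip().lower().startswith("yes") else ""

    # Draft, then reflect on the draft, then revise.
    draft = call_llm(
        f"Write an essay on {topic}.\nOutline:\n{outline}\nResearch notes:\n{notes}"
    )
    critique = call_llm(f"Review this draft and list concrete improvements:\n{draft}")
    return call_llm(
        f"Revise the draft to address the critique.\nDraft:\n{draft}\nCritique:\n{critique}"
    )
```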
In his own work, Ng has been surprised by how effective these agentic workflows can be. His team analyzed their performance on coding tasks using the HumanEval benchmark, originally released by OpenAI, which includes challenges like: “Given a non-empty list of integers, return the sum of all elements at even positions.”
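For reference, the cited challenge has a one-line Python solution, taking “even positions” to mean indices 0, 2, 4, and so on:

```python
def sum_even_positions(lst: list[int]) -> int:
    """Sum the elements at even indices (0, 2, 4, ...) of a non-empty list."""
    return sum(lst[::2])

assert sum_even_positions([3, 8, 7, 1, 5]) == 3 + 7 + 5
```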
With zero-shot prompting, GPT-3.5 solved 48% of the tasks correctly, while GPT-4 performed better, solving 67%. However, when GPT-3.5 was used within an agentic workflow, it surpassed even GPT-4’s zero-shot results. Wrapping an agentic process around GPT-4 improved results further still. This highlights the potential for agentic workflows to enhance performance across a wide range of applications.
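Ng did not publish the harness itself, but the general shape of such a wrapper (generate code, execute it against tests, feed any failures back for another attempt) can be sketched as follows; `call_llm` is again a hypothetical placeholder.

```python
import traceback

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's API client."""
    raise NotImplementedError

def solve_with_reflection(task: str, tests: str, max_rounds: int = 3) -> str:
    """Tool use + reflection: draft code, run the tests, retry on any failure."""
    feedback = ""
    code = ""
    for _ in range(max_rounds):
        code = call_llm(f"Write a Python function for this task:\n{task}\n{feedback}")
        try:
            # Tool use: execute the candidate against the tests.
            # (Running model-generated code is unsafe outside a sandbox.)
            namespace = {}
            exec(code + "\n" + tests, namespace)
            return code  # all tests passed
        except Exception:
            # Reflection: show the model its own failure and ask for a fix.
            feedback = (
                "Your previous attempt failed with:\n"
                + traceback.format_exc()
                + "\nPlease fix it."
            )
    return code
```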
The term “AI agents” is widely used, often in vague or aspirational contexts. Ng wants to concretize this concept by categorizing the broad design patterns emerging in the space. While the field is chaotic, with abundant research and open-source contributions, Ng identifies four main patterns: reflection, tool use, planning, and multi-agent collaboration.
Agentic workflows are transforming how we use LLMs. For example, multi-agent collaboration can include scenarios where different models debate or critique one another, leading to better results. Models like GPT-4 and Gemini are increasingly capable of supporting these workflows, though there is room for improvement in reliability.
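A bare-bones version of that critique pattern might pair a writer role with a reviewer role, each driven by its own prompt and possibly its own model; the roles and loop below are illustrative, not a specification from the talk.

```python
def call_llm(prompt: str, role: str) -> str:
    """Hypothetical LLM call; `role` might select different models or system prompts."""
    raise NotImplementedError

def collaborate(task: str, rounds: int = 2) -> str:
    """Writer drafts an answer; reviewer critiques it; writer revises. Repeat."""
    answer = call_llm(f"Solve this task:\n{task}", role="writer")
    for _ in range(rounds):
        critique = call_llm(f"Critique this answer to '{task}':\n{answer}", role="reviewer")
        answer = call_llm(
            f"Revise your answer to '{task}' using this critique:\n{critique}\n\n"
            f"Previous answer:\n{answer}",
            role="writer",
        )
    return answer
```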
One challenge with agentic workflows is user patience. Unlike traditional web searches, which provide near-instant results, agentic tasks may require minutes or even hours to complete. Learning to delegate tasks to AI and waiting for thoughtful results is a shift we’ll need to embrace. Fast token generation also becomes critical, as it allows iterative workflows to proceed efficiently.
Looking ahead, Ng expects agentic reasoning and design patterns to expand dramatically this year. While advanced models like GPT-5 or Gemini 2.0 promise enhanced zero-shot capabilities, integrating agentic workflows into current models can yield comparable performance in many applications. This shift represents a step forward on the long journey toward Artificial General Intelligence (AGI).