
Agentic Reasoning: Andrew Ng at Sequoia AI Ascent 2024

#agentic-ai #ai-reflection #ai-tool-use #ai-planning #ai-agent-teams

The following is a transcript of Andrew Ng's excellent presentation at AI Ascent on March 20, 2024, with his references for further reading presented inline. I will follow up with a briefer summary and notes.


Ng: I’m looking forward to sharing with all of you what I’m seeing with AI agents, which I think is the exciting trend that everyone building in AI should pay attention to. I’m also excited about all the other presentations.

So, AI agents. Today, the way most of us use large language models is like this: with a non-agentic workflow where you type a prompt, and it generates an answer.

That’s a bit like if you ask a person to write an essay on a topic and say, “Please sit down at the keyboard and just type the essay from start to finish without ever using the backspace key.” Despite how hard this is, LLMs do it remarkably well.

In contrast, an agentic workflow might look like this: have the LLM write an essay outline first. Does it need to do any web research? If so, do that. Then write the first draft, read the first draft, and think about which parts need revision. Then revise the draft, and keep going. This workflow is much more iterative: you might have the LLM do some thinking, revise the article, do some more thinking, and iterate through this process a number of times.

What not many people appreciate is that this delivers remarkably better results. I’ve actually been really surprised myself, working with these agent workflows, by how well they work. Let me share one case study. My team analyzed some data using a coding benchmark called HumanEval, released by OpenAI a few years ago. It includes coding problems like: “Given a non-empty list of integers, return the sum of all elements at even positions.” The answer is a short code snippet, like the one below.
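The snippet itself was on the slide; for the problem as stated, a minimal solution might look like this (assuming "even positions" means 0-based even indices):

```python
def sum_even_positions(lst: list[int]) -> int:
    """Return the sum of the elements at even (0-based) indices of a non-empty list."""
    return sum(lst[i] for i in range(0, len(lst), 2))

assert sum_even_positions([3, 7, 2, 9, 5]) == 3 + 2 + 5  # indices 0, 2, 4
```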

Today, a lot of us use zero-shot prompting, meaning we tell the AI, “Write the code,” and then run it on the first try. But who codes like that? No human codes like that—just typing out the code and running it. Maybe you do, but I can’t do that. It turns out that if you use GPT-3.5 with zero-shot prompting, it gets it 48% right. GPT-4 does way better, getting 67% right. But if you take an agentic workflow and wrap it around GPT-3.5, it actually does better than even GPT-4. And if you wrap this type of workflow around GPT-4, it also does very well. You’ll notice that GPT-3.5 with an agentic workflow actually outperforms GPT-4 in a zero-shot setup. I think this has significant consequences for how we all approach building applications.

The term “agents” is thrown around a lot. There are many consultant reports talking about agents as the future of AI, and so on. I want to be a bit concrete and share with you the broad design patterns I’m seeing in agents. It’s a very messy and chaotic space, with tons of research and open-source projects. There’s a lot going on, but I’ve tried to categorize it more concretely.

Four Design Patterns in AI Agents

[Let’s look at four design patterns: Reflection, Tool Use, Planning, and Multi-Agent systems. In my experience, these fall into two categories:]

Robust: Reflection is a tool that I think many of us should just use, because it works. It’s widely appreciated and performs quite well. I think of these as robust technologies: when I use them, I can almost always get them to work well. This is also true of Tool Use.

Emerging: Planning and Multi-Agent Collaboration are more emerging. Sometimes, when I use them, my mind is blown by how well they work, but at this moment I don’t feel I can always get them to work reliably.

Let me walk through these four design patterns in more detail.

Reflection

For reflection, here’s an example. Let’s say you ask a system, “Please write code for me for a given task.” You have a coder agent, just an LLM that you prompt to write code, for example to define a particular function. An example of self-reflection would be to then prompt the LLM with something like: “Here’s code intended for a task,” give it back the exact same code it just generated, and say, “Check the code carefully for correctness, soundness, efficiency, and good construction.” Just writing a prompt like that can often work.

It turns out the same LLM you prompted to write the code may spot problems, such as a bug on line five, and suggest fixes. If you now take its feedback and re-prompt it, the LLM may generate a version two of the code that works better than the first version. It’s not guaranteed, but it works often enough to be worth trying for a lot of applications.
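As a concrete illustration, here is a minimal sketch of that generate-reflect-revise loop, assuming the openai Python client; the prompts, task, and model choice are all illustrative, not Ng's exact setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    """One LLM call with a single user prompt."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

task = "Write a Python function that returns the sum of elements at even positions of a list."

# Version 1: prompt the LLM to write the code.
draft = ask(f"Write code for this task:\n{task}")

# Self-reflection: hand the model back its own code and ask for a careful critique.
critique = ask(
    f"Here's code intended for this task:\n{task}\n\n{draft}\n\n"
    "Check the code carefully for correctness, efficiency, and good construction. "
    "List any problems you find."
)

# Version 2: re-prompt with the feedback to get a revised version.
revised = ask(
    f"Task:\n{task}\n\nDraft code:\n{draft}\n\nReviewer feedback:\n{critique}\n\n"
    "Rewrite the code, addressing the feedback. Output only the code."
)
print(revised)
```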

To foreshadow, you can also let it run unit tests. If the code fails a unit test, you can ask the LLM why, and have a conversation to figure out the issue. It can then refine the code and generate a version three. For those of you interested in learning more about these technologies (I’m very excited about them), I’ve included a recommended reading list at the bottom of each of the four sections with more references.
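Returning to the unit-test idea, closing the loop might look like the following sketch, which reuses ask and revised from the previous snippet. The test and the function name solution are hypothetical, and a real harness would extract the code block from the model's reply and sandbox it before executing:

```python
def run_tests(code: str) -> str | None:
    """Run the generated code plus a unit test; return error text on failure, None on success."""
    test = "assert solution([3, 7, 2, 9, 5]) == 10"  # hypothetical test; assumes a function named `solution`
    try:
        namespace: dict = {}
        exec(code + "\n" + test, namespace)  # caution: exec'ing model output is unsafe outside a sandbox
        return None
    except Exception as e:
        return f"{type(e).__name__}: {e}"

code = revised  # version two from the reflection sketch
for attempt in range(3):  # version three, four, ... until the tests pass
    error = run_tests(code)
    if error is None:
        break
    # Feed the failure back: ask why it failed, then regenerate.
    code = ask(
        f"This code failed a unit test with:\n{error}\n\nCode:\n{code}\n\n"
        "Explain the likely cause, then output a corrected version of the code."
    )
```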

So far, I’ve described a single coder agent that you prompt to have a conversation with itself. A natural evolution of this idea is to have two agents: one a coder agent, the other a critic agent. These might use the same base LLM but are prompted in different ways; for example, one is prompted as an expert coder to write code, and the other as an expert code reviewer to review that code. This type of workflow is straightforward to implement, and it’s a general-purpose technology. For many workflows, it can provide a significant boost in performance.
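A sketch of the two-agent variant, with one base model behind two different system prompts (again illustrative, building on the client and task defined above):

```python
def agent(system: str, prompt: str) -> str:
    """One LLM call with a role-defining system prompt."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

CODER = "You are an expert coder. Write clean, correct code for the task you are given."
CRITIC = "You are an expert code reviewer. Point out bugs, inefficiencies, and style problems."

draft = agent(CODER, task)
review = agent(CRITIC, f"Review this code for the task '{task}':\n{draft}")
final = agent(CODER, f"Task: {task}\n\nYour draft:\n{draft}\n\nReview:\n{review}\n\nRevise the code.")
```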

Further reading:

- “Self-Refine: Iterative Refinement with Self-Feedback,” Madaan et al. (2023)
- “Reflexion: Language Agents with Verbal Reinforcement Learning,” Shinn et al. (2023)
- “CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing,” Gou et al. (2023)

Tool Use

Many LLMs today use tools. For example, on the left is a screenshot from Copilot, and on the right is something I extracted from GPT-4. If you ask an LLM today what the best coffee maker is, it might do a web search; for some problems, LLMs will generate code and run it. People are using a lot of different tools: for analysis, for gathering information, for taking action, and for personal productivity. Early work in tool use often originated in the computer vision community, where LLMs were used to manipulate images because they couldn’t natively process them. These techniques have since expanded what LLMs can do.
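One common mechanism behind this is function calling: you describe tools to the model, the model emits a structured request, and your code executes it and returns the result. A minimal sketch with the openai client follows; the web_search tool is a hypothetical stand-in for a real search API:

```python
import json
from openai import OpenAI

client = OpenAI()

# Describe a tool the model may call; the implementation below is ours, not the model's.
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    return f"(results for: {query})"  # hypothetical; plug in a real search API here

question = "What's the best coffee maker?"
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": question}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model decided a search would help
    call = msg.tool_calls[0]
    result = web_search(**json.loads(call.function.arguments))
    # Return the tool result so the model can compose its final answer.
    followup = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": question},
            msg,
            {"role": "tool", "tool_call_id": call.id, "content": result},
        ],
    )
    print(followup.choices[0].message.content)
```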

Further reading:

- “Gorilla: Large Language Model Connected with Massive APIs,” Patil et al. (2023)
- “MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action,” Yang et al. (2023)

Planning

Planning capabilities are transformative. For those of you who haven’t yet explored planning algorithms, working with them can be an AI agent “wow” moment. I’ve run live demos where an AI agent rerouted around failures, leaving me amazed. For instance, an AI agent can analyze an image of a girl reading a book, determine the pose, use other models to synthesize a new image with different characteristics, and generate text or speech descriptions, all autonomously. While finicky, these workflows are pretty amazing when they succeed. I already find myself using research agents in my work: when I want some research done but don’t want to do it myself, I send the task to a research agent and come back a few minutes later to see what it has come up with. It sometimes works and sometimes doesn’t, but it’s already part of my personal workflow.
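A toy sketch of the planning pattern in the spirit of that example: the LLM decomposes a goal into an ordered plan over a set of tools, and a controller executes the steps. Everything here (the tool names, the one-artifact chaining, the ask helper from the Reflection sketch) is illustrative:

```python
import json

# Hypothetical model-backed tools, keyed by name; each consumes and produces one artifact.
TOOLS = {
    "pose_detection": lambda image: "(pose extracted from image)",
    "pose_to_image": lambda pose: "(new image in the same pose)",
    "image_to_text": lambda image: "(text description of the image)",
    "text_to_speech": lambda text: "(audio of the text)",
}

goal = ("Given an image of a girl reading a book, generate a new image of a different "
        "character in the same pose, then describe it aloud.")

# Ask the LLM to plan: an ordered list of tool names drawn only from TOOLS.
plan_text = ask(
    f"Plan this task as an ordered JSON list of tool names chosen only from "
    f"{list(TOOLS)}.\nTask: {goal}\nAnswer with JSON only."
)
plan = json.loads(plan_text)  # a real system would validate and retry on bad output

artifact = "girl_reading.jpg"  # the input image
for tool_name in plan:
    artifact = TOOLS[tool_name](artifact)  # chain each step's output into the next step
```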

Further reading:

- “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” Wei et al. (2022)
- “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face,” Shen et al. (2023)

Multi-Agent Collaboration

This works much better than you might think. For example, the ChatDev project, which is open-source, allows you to prompt agents to act as a CEO, designer, or tester. These agents collaborate and iterate to develop software, producing surprisingly complex programs. While the technology isn’t perfect, it’s rapidly improving.
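In miniature, the pattern is a handful of role prompts over one base model, iterating on each other's output. A toy sketch follows (roles and prompts are illustrative, reusing the agent helper from the Reflection section), not ChatDev's actual architecture:

```python
ROLES = {
    "ceo": "You set product requirements. Be brief and concrete.",
    "programmer": "You write code that meets the requirements you are given.",
    "tester": "You review code against requirements and report defects.",
}

goal = "Build a command-line tic-tac-toe game in Python."

spec = agent(ROLES["ceo"], f"Write requirements for: {goal}")
code = agent(ROLES["programmer"], f"Requirements:\n{spec}\n\nWrite the program.")

# A couple of collaborate-and-iterate rounds: test, report, fix.
for _ in range(2):
    report = agent(ROLES["tester"], f"Requirements:\n{spec}\n\nCode:\n{code}\n\nReport defects.")
    code = agent(ROLES["programmer"],
                 f"Requirements:\n{spec}\n\nCode:\n{code}\n\nDefects:\n{report}\n\nFix the code.")
```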

Further reading:

- “Communicative Agents for Software Development,” Qian et al. (2023)
- “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation,” Wu et al. (2023)
- “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework,” Hong et al. (2023)

Final Thoughts

To summarize, agentic reasoning and workflows are an exciting trend that can significantly boost productivity. We’re entering a phase where the capabilities of AI systems will expand dramatically due to these workflows. However, it may require us to adjust expectations. For instance, rather than expecting instant responses, we might need to patiently delegate tasks to AI agents and wait for their thoughtful results.

Another important trend is the speed of token generation. Faster token output can enable more iterations within agentic loops, potentially delivering better results. Even a slightly lower-quality LLM that generates tokens faster might outperform a higher-quality LLM in these setups.

I’m looking forward to the next-generation models like GPT-5, Gemini 2.0, and others. While these advances promise better zero-shot capabilities, agentic reasoning with current models can already achieve remarkable results. The path to AGI feels like a journey rather than a destination, but these workflows represent a meaningful step forward.
