
The Scaling Debate in AI: Beyond Bigger Models
Arthur Dobelis
Tags: AI scaling, multi-agent systems, AI reasoning, AI architecture, LLMs, orchestration, Andrej Karpathy, Andrew Ng

Recent discussions in the AI community have raised concerns about the effectiveness of scaling large language models (LLMs). Reports from Platformer, TechCrunch, Time, and Bloomberg suggest that increasing model size and data volume is yielding diminishing returns. This has led to questions about the viability of scaling as a strategy for achieving advanced reasoning capabilities.

Disappointing? Maybe. Surprising? Maybe not.

While some may be disappointed that advanced reasoning and deduction are challenging to achieve with standalone LLMs, this outcome is not entirely unexpected. Neural networks excel at pattern recognition through nonlinear function approximation, a fundamentally statistical process. The remarkable achievements of LLMs — creating original, cogent responses to complex questions by predicting the next word in a sequence — demonstrate the power of pattern recognition. However, expecting this methodology to emulate complex reasoning or deduction may be unrealistic.
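
To make the underlying mechanism concrete, here is a minimal sketch of the autoregressive loop at the heart of an LLM, assuming a hypothetical `model` callable that maps a token sequence to a vector of logits over the vocabulary; real systems add tokenizers, attention caches, and far more elaborate sampling, but the statistical core looks like this.

```python
import numpy as np

def generate(model, prompt_tokens, max_new_tokens=50, temperature=1.0):
    """Autoregressive decoding: repeatedly sample the next token from the
    model's predicted distribution and append it to the sequence."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = np.asarray(model(tokens))   # stand-in: shape (vocab_size,)
        probs = np.exp(logits / temperature)
        probs /= probs.sum()                 # softmax over the vocabulary
        next_token = int(np.random.choice(len(probs), p=probs))
        tokens.append(next_token)
    return tokens
```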

It is interesting to hear Anthropic’s Dario Amodei explain his faith in scaling. His reasoning draws heavily on the concept of 1/f distributions (also known as pink noise), a statistical pattern common in natural systems, to suggest that scaling LLMs follows a similar principle of emergent efficiency, regularity, and scalability. He thus concludes that the continued growth of LLMs in size and complexity will yield consistent, meaningful gains in their capabilities, much as natural systems converge to efficient states. This analogy, and much of his reasoning besides, is essentially inductive, assuming that past scaling trends will continue to yield predictable efficiency gains, like a fractal seen at different scales.

There is something to this observation, and it’s certainly a nice hook for someone who wants to believe that models can improve indefinitely, but it says little about the cost-benefit tradeoffs. That is, perhaps there is no limit to how good a model can get, but there remains the question of whether any particular gain is worth its inputs, especially when model authors are throwing around projected training costs in the trillions. And the incremental effort (both human hours and joules) needed to wring additional benefits from this methodology has led many to question whether this architecture is the best approach to modeling a biological mechanism that can be trained on 12 years of food and oxygen and operates at a power of ~20 watts.
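
To see why “no hard limit” and “worth the inputs” can come apart, it helps to recall that the empirical scaling laws usually invoked here are power laws, roughly of the form popularized by the 2020 scaling-law papers (notation here is illustrative; the constants and exponents are empirical fits):

```latex
% Loss as a power law in parameter count N, dataset size D, and compute C:
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
% With exponents well below 1, each fixed reduction in loss demands a
% multiplicative increase in inputs: returns diminish per dollar and per
% joule even if the curve itself never flattens out.
```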

Foundation model builders like OpenAI and Anthropic and hardware providers like Nvidia, having built much of their success around scaling the power of LLMs, might find this suggestion challenging. Nonetheless, advanced reasoning doesn’t need to depend solely on scaling. It’s acceptable if not everything fits the paradigm of brute computational force.

A Shift Toward Agentic Systems and New Architectures

Thought leaders like Andrej Karpathy, Andrew Ng, and Harrison Chase, each presenting at a conference in March, have emphasized the need for more sophisticated systems that integrate multiple models and new architectures. Karpathy, while noting the potential of agentic systems to be the “app layer” of AI, with foundation models as the “OS”, also pointed to several signs that the focus on monolithic neural network-based models may be due for a rethink: 1) the massive compute now required to improve, and even run, these models; 2) the unbridged gap between transformer and diffusion models; and 3) the potential of new architectures, and possibly new hardware, beyond transformers and GPUs. He also lamented the underuse of reinforcement learning in LLMs, where it amounts to a bit of human-pleasing icing on the cake of self-supervised pattern mimicry, whereas human beings learn by thinking and doing, and AlphaGo mastered Go by playing the game against itself.

Andrew Ng, at the same conference, pitched the step-change in AI performance that can be achieved with agentic workflows. By orchestrating agents in various patterns, the AI is given the opportunity to think about a task rather than performing it in one shot. This is also closer to how humans operate, and it allows the system to take advantage of planning, reflection, and other helpful patterns that an LLM cannot model as well in a single-pass (“zero-shot”) computation. Harrison Chase, founder of LangChain, explored a variety of opportunities in agentic orchestration, emphasizing the importance of memory in empowering AI. These insights suggest a shift toward architectures that incorporate multiple models and tools, orchestrated with intermediate outputs saved to memory, akin to human cognitive processes. Indeed, one inefficiency Karpathy identified was the classic von Neumann architecture of the CPU, which separates memory from compute; he pointed to the possibility of new hardware that unites them, as in biological brains.
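
As a loose illustration of the plan-reflect-with-memory pattern Ng and Chase describe (not tied to LangChain or any other framework), the sketch below assumes a hypothetical `llm(prompt)` function that returns text from whichever model you have on hand:

```python
def llm(prompt: str) -> str:
    """Stand-in for a call to any chat/completion model."""
    raise NotImplementedError("wire this to your model provider of choice")

def agentic_answer(task: str, max_revisions: int = 2) -> str:
    """Plan -> draft -> reflect -> revise, with intermediate results kept
    in a simple memory list instead of a single zero-shot pass."""
    memory: list[str] = []

    plan = llm(f"Break this task into concrete steps:\n{task}")
    memory.append(f"PLAN:\n{plan}")

    draft = llm(f"Task: {task}\nFollow this plan:\n{plan}")
    memory.append(f"DRAFT:\n{draft}")

    for _ in range(max_revisions):
        critique = llm(
            "Critique the draft below against the task and plan. "
            "Reply 'OK' if no changes are needed.\n\n" + "\n\n".join(memory)
        )
        if critique.strip().upper().startswith("OK"):
            break
        memory.append(f"CRITIQUE:\n{critique}")
        draft = llm(
            f"Task: {task}\nRevise the draft using this critique:\n"
            f"{critique}\n\nDraft:\n{draft}"
        )
        memory.append(f"REVISED DRAFT:\n{draft}")

    return draft
```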

It is worth noting that work to incorporate memory into deep learning foundation models is already in progress. Researchers like Albert Gu are exploring how models such as his Mamba architecture can incorporate advanced memory mechanisms. Mamba compresses prior context into compact, usable summaries, improving efficiency and applicability. Architectures like Mamba represent a focus on smart design rather than sheer size, again taking inspiration from biology.
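
Very roughly, the idea is to fold the entire history into a fixed-size state rather than attending over every past token. The toy recurrence below is a heavy simplification of what Mamba actually does (which adds input-dependent, “selective” parameters and careful discretization), and the dimensions are made up for illustration:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy linear state-space recurrence: h_t = A @ h_{t-1} + B @ x_t,
    y_t = C @ h_t. The running state h compresses all prior inputs into a
    fixed-size summary, so memory cost does not grow with sequence length
    (unlike attention over the full context)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # x: (seq_len, d_in)
        h = A @ h + B @ x_t        # update the compressed summary
        ys.append(C @ h)           # read out from the state
    return np.stack(ys)

# Illustrative shapes only: 4-dim state, 2-dim inputs and outputs.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                # decay keeps the state bounded
B = rng.normal(size=(4, 2))
C = rng.normal(size=(2, 4))
y = ssm_scan(rng.normal(size=(16, 2)), A, B, C)
```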

Progress on the App Layer Continues Unabated

The “laws” of scale may falter, but progress continues. It is notable that while some big-tech tycoons are fretting over scaling, the startup and venture class is pointing to the enormous potential of agentic orchestration of existing models. Foundation models have moved ahead so quickly, with the power of LLMs and diffusion models so readily available, that those focused on the app layer will need more time to catch up, notwithstanding advances in AI-enabled coding that have themselves greatly facilitated application development. Agentic workflows and UX are still developing, a fact emphasized by Chase in the talk cited above. On the foundation model side, some low-hanging fruit for improvement includes faster inference, which is critical for real-world applications, especially in agent-based architectures that require rapid interaction among components. Smaller, more efficient models will enable new edge-computing use cases, and may progress independently of scaling issues. Base models are also likely to become more specialized, focusing on particular use cases and on efficient input and output in structured formats for tool use and orchestration.
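
As a toy example of what “efficient input and output in structured formats” can look like in practice (this is not any particular provider’s function-calling API), a model can be asked to emit a small JSON tool call that an orchestrator then validates and routes:

```python
import json

# Hypothetical tool registry: name -> (callable, required argument names).
TOOLS = {
    "get_weather": (lambda city: f"Sunny in {city}", ["city"]),
}

def dispatch_tool_call(model_output: str) -> str:
    """Parse a model's structured output, e.g.
    {"tool": "get_weather", "args": {"city": "Riga"}},
    and route it to the matching tool, rejecting anything malformed."""
    call = json.loads(model_output)
    name, args = call["tool"], call.get("args", {})
    func, required = TOOLS[name]
    missing = [a for a in required if a not in args]
    if missing:
        raise ValueError(f"missing arguments for {name}: {missing}")
    return func(**args)

print(dispatch_tool_call('{"tool": "get_weather", "args": {"city": "Riga"}}'))
```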

Conclusion: Beyond the Hammer

Scaling laws have brought us to where we are today, but they are not the only path forward. Instead of lamenting the possible limitations of single-pass transformers, we should celebrate the ongoing innovation in multi-agent systems, memory-based reasoning, and lighter and more specialized models. AI doesn’t have to be one giant model trying to do everything. As we evolve our approach, the future of AI may lie in the thoughtful collaboration of diverse tools, much like our human brains, and, indeed, society itself.
