Tech Talks

Making AI accessible with Andrej Karpathy and Stephanie Zhan


At Sequoia Capital’s AI Ascent on March 26, 2024, Andrej Karpathy spoke about creating a more accessible AI ecosystem. He highlighted the significance of open collaboration in AI development and shared insights from his experiences working with Elon Musk at Tesla. The full video is available here.


Transcript

[0:28] Sonya: I’m thrilled to introduce our next and final speaker, Andrej Karpathy. I think Andrej probably needs no introduction; most of us have probably watched his YouTube videos at length. He is renowned for his research in deep learning, designed the first deep learning class at Stanford, was part of the founding team at OpenAI, led the computer vision team at Tesla, and is now a mysterious man again after just leaving OpenAI. So we’re very lucky to have you here, Andrej. You’ve been such a dream speaker, and we’re excited to have you and Stephanie close out the day. Thank you.

[Applause]

[1:01] Stephanie: Andrej’s first reaction as we walked up here was, “Oh my God” at his picture. It’s quite intimidating! I don’t know what year that was taken, but he looks very impressive. Okay, amazing! Andrej, thank you so much for joining us today, and welcome back.

[1:20] Andrej: Yeah, thank you!

[1:34] Stephanie: Fun fact: Most people don’t actually know where OpenAI’s original office was. How many folks here know?

[1:48] Audience: [Responses]

[2:12] Stephanie: That’s amazing! I’m going to guess: right here. Right here, on the opposite side of our San Francisco office, where many of you were just in huddles. So this is fun for us because it brings us back to our roots, back when I first started at Sequoia and when Andrej first co-founded OpenAI. Andrej, in addition to living out the Willy Wonka dream of working atop a chocolate factory, what were some of your favorite moments working from here?

[2:20] Andrej: Yes, OpenAI was right there. This was the first office after, I guess, Greg’s apartment, which maybe doesn’t count. We spent maybe two years here, and the chocolate factory was just downstairs, so it always smelled really nice. The team was 10 to 20-plus people, and we had a few very fun episodes here. One of them was alluded to by Jensen at GTC just yesterday or two days ago: he was describing how he brought the first DGX and delivered it to OpenAI, and that happened right there. That’s where we all signed it; it’s in the room over there.

[2:58] Stephanie: Andrej needs no introduction, but I wanted to give a little backstory on his journey to date. As Sonya introduced, he was trained by Geoff Hinton and Fei-Fei Li. His first claim to fame was his deep learning course at Stanford. He co-founded OpenAI back in 2015, and in 2017 he was poached by Elon. I remember this very clearly. For folks who don’t remember the context, Elon had just rotated through six different autopilot leaders, each of whom lasted about six months. I remember when Andrej took that job, I thought, “Congratulations and good luck.” Not too long after that, he went back to OpenAI and has been there for the last year. Unlike all the rest of us today, he is basking in the ultimate glory of freedom, with all the time and none of the responsibility. We’re really excited to see what you have to share today.

[3:38] Stephanie: A few things that I appreciate most about Andrej: he is an incredibly fascinating futurist and thinker, a relentless optimist, and a very practical builder. I think he’ll share some of his insights around that today.

[3:55] Stephanie: To kick things off: seven years ago, AGI seemed like an impossible task to achieve even within our lifetimes. Now it seems within sight. What is your view of the future over the next few years?

[4:12] Andrej: Yes, I think you’re right. A few years ago I sort of felt like, with AGI, it wasn’t clear how it was going to happen. It felt very academic, and you would think about different approaches. Now I think it’s very clear: there’s a lot of space, everyone is trying to fill it, and there’s a lot of optimization.

[4:21] Andrej: Roughly speaking, the way things are happening is that everyone is trying to build what I refer to as kind of like this LLM operating system. I like to think of it as an operating system where you have to get a bunch of peripherals that you plug into this new CPU. The peripherals are, of course, text, images, audio, and all the modalities, and then you have a CPU, which is the LLM Transformer itself. It’s also connected to all the software 1.0 infrastructure that we’ve already built up for ourselves.

[4:42] Andrej: I think everyone is kind of trying to build something like that and then make it available as something customizable for all the different nooks and crannies of the economy. So I think that’s roughly what everyone is trying to build out, and it’s what we sort of also heard about earlier today. I think that’s roughly where it’s headed. We can bring up and down these relatively self-contained agents that we can give high-level tasks to, and they can specialize in various ways. So yeah, I think it’s going to be very interesting.
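To make the “LLM operating system” analogy more concrete, here is a minimal, hypothetical sketch in Python of the loop Karpathy is gesturing at: the LLM plays the role of the CPU and ordinary tools play the role of peripherals. Nothing here is from the talk; `call_llm` and the tool set are stand-ins for whatever model API and peripherals you actually have.

```python
# Hypothetical sketch of the "LLM OS" idea: the LLM is the CPU, tools are peripherals.
# `call_llm` is a placeholder for a real model API; it is not an actual library call.
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a hosted or local language model."""
    raise NotImplementedError

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}}, {})),  # toy only
    "read_file": lambda path: open(path, encoding="utf-8").read(),
}

def run_task(task: str, max_steps: int = 5) -> str:
    """Let the LLM 'CPU' decide at each step whether to call a peripheral or finish."""
    context = task
    for _ in range(max_steps):
        decision = json.loads(call_llm(
            "You can call a tool or finish.\n"
            f"Task so far:\n{context}\n"
            'Reply as JSON: {"action": "calculator|read_file|finish", "input": "...", "answer": "..."}'
        ))
        if decision["action"] == "finish":
            return decision["answer"]
        observation = TOOLS[decision["action"]](decision["input"])
        context += f"\n[{decision['action']} returned: {observation}]"
    return context
```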

[5:02] Stephanie: It’s exciting, and it’s not just one agent; it’s many agents. What does that look like? If that view of the future is true, how should we all be living our lives differently?

[5:18] Andrej: I don’t know. I guess we have to try to build it, influence it, and make sure it’s good. Just try to ensure it turns out well.

[5:31] Stephanie: So now that you’re a free, independent agent, I want to address the elephant in the room, which is that OpenAI is dominating the ecosystem. Most of our audience here today are founders who are trying to carve out a little niche, praying that OpenAI doesn’t take them out overnight. Where do you think opportunities exist for other players to build new independent companies, and what areas do you think OpenAI will continue to dominate as its ambitions grow?

[5:56] Andrej: Yes, so my high-level impression is basically that OpenAI is trying to build out these large models. As we heard earlier today, they are trying to develop a platform on top of which different companies and different verticals can position themselves.

[6:10] Now, I think the operating system analogy is also really interesting because when you look at something like Windows, those are also operating systems. They come with a few default apps, like a browser comes with Windows, right? You can use the Edge browser.

[6:24] So, I think in the same way, OpenAI or any of the other companies might come up with a few default apps, quote-unquote. But it doesn’t mean that you can’t have different browsers running on it, just like you can have different chat agents sort of running on that infrastructure.

[6:43] There will be a few default apps, but there will also be a potentially vibrant ecosystem of all kinds of apps that are fine-tuned to the various niches and needs of the economy. I really like the analogy of the early iPhone apps and what they looked like. They were all kind of like jokes initially, and it took time for that to develop.

[7:02] Absolutely, I agree that we’re going through the same thing right now. People are trying to figure out what this thing is good at, what it is not good at, how to work with it, how to program with it, how to debug it, and how to actually get it to perform real tasks.

[7:20] What kind of oversight do we need? Because it’s quite autonomous but not fully autonomous. So, what does the oversight look like? What does the evaluation look like? There are many things to think through just to understand the psychology of it.

[7:37] I think that’s what’s going to take some time to figure out exactly how to work with this infrastructure, so I think we’ll see that over the next few years.

[7:45] Stephanie: The race is on right now among large language models: OpenAI, Anthropic, Llama, Gemini, and a whole ecosystem of open-source models, down to a long tail of smaller models. How do you foresee the future of the ecosystem playing out?

[8:04] Andrej: So again, I think the operating systems analogy is interesting. We have, say, an oligopoly of a few proprietary systems like Windows, macOS, etc. Then we also have Linux, which has an infinity of distributions. I think it might look something like that.

[8:21] However, we have to be careful with the naming because a lot of the ones you listed, like Llama and others, I wouldn’t actually categorize as open-source. It’s kind of like handing over a binary for an operating system. You can work with it; it’s useful, but it’s not fully useful.

There are a number of models I would classify as fully open-source, such as the Pythia models, LLM360, and OLMo, which release the entire infrastructure needed to “compile” the operating system: training the model from the data, gathering the data, and so on.

[8:58] When you’re just given the weights, you can still do a lot: you can fine-tune the model, which is very useful. However, it’s subtle: you can’t fully fine-tune the model, because the more you fine-tune it, the more it starts regressing on everything else.

[9:16] So, what you really want to do, for example, if you want to add capability without regressing the existing ones, is to train on some kind of mixture of the previous data set distribution and the new data set distribution. You don’t want to regress the old distribution; you just want to add knowledge.

[9:38] If you’re only given the model weights, you can’t do that. You need the training loop, the data set, etc. You are constrained in how you can work with these models.
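As an aside for readers who want to see what “training on a mixture of the previous data distribution and the new one” can look like in practice, here is a small sketch. It assumes PyTorch-style datasets and is only an illustration of the idea, not a recipe from the talk.

```python
# Sketch: fine-tune on a mixture of old (pretraining-style) data and new domain data,
# so the model picks up the new capability without regressing on the old distribution.
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixture_loader(old_dataset, new_dataset, new_fraction=0.3, batch_size=8):
    """Each batch draws ~new_fraction of examples from the new data, the rest from the old."""
    combined = ConcatDataset([old_dataset, new_dataset])
    weights = torch.cat([
        torch.full((len(old_dataset),), (1.0 - new_fraction) / len(old_dataset)),
        torch.full((len(new_dataset),), new_fraction / len(new_dataset)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined), replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)

# usage sketch: for batch in mixture_loader(old_ds, new_ds): ... run your training step ...
```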

[9:50] Again, I think it’s definitely helpful, but we need slightly better language for it. There are open weights models, open-source models, and then proprietary models. That might be the future ecosystem.

[10:03] Hopefully, it will look very similar to what we have today, and I believe it will continue to help build some of that out.

[10:10] Stephanie: I’d love to address the other elephant in the room, which is scale. Simplistically, it seems like scale is all that matters—the scale of data, the scale of compute. Therefore, the large research labs and large tech giants have an immense advantage today. What is your view on that? Is that all that matters, and if not, what else does?

[10:34] Andrej: So, I would say scale is definitely number one. I do think there are details there to get right. A lot also goes into the data set preparation, making it very good and clean, etc. That matters a lot. These are all important factors.

The compute efficiency gains that you can get are significant. There’s the data, the algorithms, and then, of course, the training of the model to make it really large. I think scale will be the primary determining factor—like the first principal component of things, for sure. However, there are many other elements that you need to get right. It’s almost as if the scale sets some kind of a speed limit. You do need some of the other components, but if you don’t have the scale, then you fundamentally just can’t train some of these massive models.

[10:42] If you’re going to be training models, scale matters; if you’re just going to be doing fine-tuning and so on, then I think maybe less scale is necessary. But we haven’t really seen that fully play out just yet.

Stephanie: Can you share more about some of the ingredients that you think also matter, perhaps lower in priority behind scale?

[11:01] Andrej: The first thing is that you can’t just train these models if you’re given the money and the scale; it’s actually still really hard to build them. Part of it is that the infrastructure is so new and still being developed, so it’s not quite there. Training these models at scale is extremely complex and a very complicated distributed optimization problem.

[11:40] The talent for this is fairly scarce right now, and it turns into this insane scenario where you’re running on tens of thousands of GPUs, all of which are failing at random at different points in time. Instrumenting that and getting it to work is an extremely difficult challenge. GPUs were not intended for 10,000-GPU workloads until very recently, and I think a lot of the infrastructure is creaking under that pressure.

[12:19] Right now, if you give someone a ton of money or a ton of scale or GPUs, it’s not obvious to me that they can just produce one of these models. This is why it’s not just about scale; you actually need a ton of expertise in infrastructure, in algorithms, and in data, and you need to handle each of those carefully. Those are the major components.

[12:49] Stephanie: The ecosystem is moving so quickly. Even some of the challenges we thought existed a year ago are being solved today: hallucinations, context windows, multimodal capabilities, and inference are all getting better, faster, and cheaper. What are the LLM research challenges today that keep you up at night? What do you think are meaty enough problems, but still solvable problems, that we can continue to pursue?

[13:24] Andrej: On the algorithm side, one thing I’m reflecting on quite a bit is the distinct split between diffusion models and autoregressive models. They’re both ways of representing probability distributions, but different modalities are apparently a good fit for one or the other. There’s probably some space to unify them or connect them in some way and get the best of both worlds: figure out how to create a hybrid architecture and so on.

[13:58] It’s odd to me that we have these two separate endpoints in the space of models, both extremely effective, and it feels wrong that there’s nothing in between. I think we’ll see that carved out, and there are interesting problems there. Another point I want to highlight is that there’s still a massive gap in the energetic efficiency of running all this.

[14:24] My brain operates at roughly 20 watts; Jensen was just discussing at GTC the massive supercomputers that they’re going to be building, and those numbers are in megawatts. Maybe you don’t need all that to run like a brain. I don’t know how much you need exactly, but I think it’s safe to say we’re probably off by a factor of a thousand to a million in terms of the efficiency of running these models.

[14:59] Part of this is because the computers we’ve designed are just not a good fit for this workload. NVIDIA GPUs are a good step in that direction. You need extremely high parallelism; we don’t actually care about sequential computation that is somewhat data-dependent. We just need to blast the same algorithm across many different array elements.

[15:29] So I would say number one is adapting computer architecture to the new data workflows. Number two is pushing on a few things that we’re currently seeing improvements on.

[15:43] First, we have seen precision come down from what was originally 64-bit floating point to 4 or 5 bits, or even 1.58 bits, depending on which papers you read. Precision is a significant lever in getting a handle on this, and so is sparsity: your brain, for example, is not always fully activated.

[16:12] Sparsity is a significant lever, and the last lever I feel is regarding the Von Neumann architecture of computers. This setup, where data is shuttled in and out and significant data movement occurs between memory and the cores carrying out compute tasks, is fundamentally broken. It’s not how your brain works, which is why it operates so efficiently.
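To make the precision lever concrete, here is a toy illustration of quantizing weights to 4-bit integers and back. It is only a sketch of the basic idea; the methods behind the 4-bit and 1.58-bit numbers mentioned above are considerably more sophisticated.

```python
# Toy sketch of weight quantization: map float32 weights to 4-bit integers (range [-8, 7])
# with a single per-tensor scale, then dequantize. Real methods (per-channel scales,
# GPTQ/AWQ-style calibration, ternary "1.58-bit" schemes) go much further than this.
import numpy as np

def quantize_int4(w: np.ndarray):
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int4(w)
print("max abs error:", float(np.abs(w - dequantize(q, scale)).max()))
```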

[16:39] I think we are at a very exciting time in computer architecture. I’m not a computer architect, but we seem to be off by a factor of thousands, perhaps to millions, and there should be some exciting innovations on the horizon. There are at least a few builders in the audience working on this problem.

[15:06] Stephanie: Okay, switching gears a little bit. You’ve worked alongside many of the greats of our generation, such as Sam Altman from OpenAI and the rest of the OpenAI team, as well as Elon Musk. Who here knows the joke about the American rowing team versus the Japanese team?

[15:22] Okay, great! So this will be a good one. Elon shared this at Allen’s base camp, and I think it reflects a lot of his philosophy around how he builds cultures and teams. You have two teams: the Japanese team has four rowers and one steerer, while the American team has four steerers and one rower. Can anyone guess what happens when the American team loses?

[15:47] Exactly, they fire the rower. Elon shared this example as a reflection of how he thinks about hiring the right people and building the right teams at the right ratio. From working closely with these incredible leaders, what have you learned?

[16:06] Andrej: Yeah, so I would say definitely that Elon runs the company in an extremely unique style. I don’t actually think people appreciate how unique it is. You sort of read about it, but you don’t really understand it. It’s even hard to describe.

[16:23] I like to say that he runs the biggest startups, and I think it’s just—I don’t even know how to describe it. It almost feels like a longer sort of thing that I have to think through. But, number one, he likes very small, strong, highly technical teams. So that’s the first point.

[16:42] At big companies, teams tend to grow large. Elon has always been a force against that growth. I would have to expend effort just to hire people. I would basically have to plead to hire folks.

[16:55] The other thing at big companies is that it’s usually really hard to get rid of low performers. Elon is very friendly to the idea of getting rid of low performers by default. I actually had to fight to keep people on the team because he would want to remove them.

[17:12] So, he keeps a small, strong, highly technical team, with no non-technical middle management, for sure. That’s number one. The second thing involves the vibes: how everything runs and feels when he walks into the office.

[17:30] He wants it to be a vibrant place. People are walking around, pacing, and working on exciting stuff, coding, and so on. He doesn’t like stagnation; he doesn’t want it to look that way. He discourages large meetings and always encourages people to leave meetings if they aren’t being useful.

[17:51] You often see in large meetings that if you’re not contributing and learning, you should just walk out. This is fully encouraged, and I think it’s something that you don’t normally see. The culture he instills is that the vibe matters a lot.

[18:06] Another unique aspect is how connected he is to the team. Normally, a CEO of a company is a remote person, five layers up, who talks to their vice presidents, who talk to their directors, and eventually you talk to your manager.

[18:24] That’s not how it is in Elon’s companies. He will come to the office and talk directly to the engineers. Many of the meetings we had included 50 people in the room, including Elon, who wanted to engage directly with the engineers rather than just the vice presidents and directors.

[18:42] Normally, in a corporate setting, people might spend 99% of their time talking to the vice presidents, while he spends maybe 50% of his time directly communicating with the engineers.

[18:56] He believes if the team is small and strong, then the engineers and the code are the source of truth—not some manager. He wants to talk to them to understand the actual state of things and what should be done to improve it.

[19:12] I would say that the degree to which he’s connected with the team is unique. His leadership style also involves a willingness to exercise his authority within the organization.

[19:25] If he talks to the engineers and they mention a blockage, like not having enough GPUs to run their tasks, he is quick to act. If he hears this concern more than once, he views it as a problem.

[19:40] He will ask about the timeline and, when not satisfied with the answers, he wants to talk to the person in charge of the GPU cluster. Someone will dial the phone, and then he’ll simply say, “Okay, double the cluster right now. Let’s have a meeting tomorrow, and from now on, send me daily updates until the cluster is twice the size.”

[19:57] They kind of push back, saying, “Okay, well, we have this procurement setup, we have this timeline, and Nvidia says that we don’t…”. I think the extent to which [Elon] is extremely involved, removes bottlenecks, and applies his hammer, is also not appreciated. There are a lot of these kinds of aspects that are unique. I would say it is very interesting and, honestly, going to a normal company outside of that, you definitely miss aspects of that.

So, yeah, maybe that’s a long rant, but that’s just my take. I don’t think I hit all the points, but it is a very unique thing. It’s engaging and, yeah, I guess that’s my rant. Hopefully there are tactics in there that people here can employ.

Stephanie: [24:05] Taking a step back, you’ve helped build some of the most generational companies. You’ve also been such a key enabler for many people—many of whom are in the audience today—of getting into the field of AI. Knowing you, what you care most about is democratizing access to AI education tools, helping create more equality in the whole ecosystem at large so there are many more winners. As you think about the next chapter in your life, what gives you the most meaning?

Andrej: [24:33] Yeah, I think you’ve described it in the right way. My brain goes by default to, you know, I’ve worked for a few companies, but ultimately I care not about any one specific company. I care a lot more about the ecosystem. I want the ecosystem to be healthy; I want it to be thriving. I want it to feel like a coral reef full of cool, exciting startups in all the nooks and crannies of the economy. I want the whole thing to be like this boiling soup of cool stuff.

Genuinely, I dream about coral reefs! I want it to be a cool place, and I think that’s why I love startups and companies. I want there to be a vibrant ecosystem of them. I would say I’m a bit more hesitant about the idea of five megacorps taking over, especially with AGI being such a magnifier of power. That worries me, so I have to think that through more. But yeah, I love the ecosystem, and I want it to be healthy and vibrant.

Stephanie: [28:05] Amazing. We’d love to take some questions from the audience. Yes, Brian.

Brian: [28:09] Hi, I’m Brian Hallan. Would you recommend founders follow Elon’s management methods, or is it kind of unique to him, and you shouldn’t try to copy him?

Andrej: [28:18] Um, yeah, I think that’s a good question. I think it’s up to the DNA of the founder. You have to have that same kind of DNA, some kind of vibe. When you’re hiring your team, it’s really important to be clear upfront about this being the kind of company you have. When people sign up for it, they tend to be very happy to go along with it.

However, if you change it later, people might not be as happy, and that’s very messy. As long as you do it from the start and are consistent, I think you can run a company like that. But you know, it has its own pros and cons as well. So, it’s up to the individual. I think it’s a consistent model of company building and running.

Stephanie: [29:16] Yes, Alex.

Alex: [29:20] Hi, I’m curious if there are any types of model composability that you’re really excited about—maybe other than mixture of experts? I’m not sure what you think about model merges, Franken-merges, or anything else to make model development more composable.

Andrej: [29:34] Yeah, that’s a good question. I see papers in this area, but I don’t know that anything has really stuck. Maybe the composability… I don’t exactly know what you mean, but there’s a ton of work on parameter-efficient training and things like that. I don’t know if you would categorize that as composability in the sense I’m thinking about.

It certainly is the case that traditional code is very composable; I would say neural networks are a lot more fully connected and less composable by default. However, they do compose, and you can fine-tune them as part of a whole. For example, if you’re building a system that involves different inputs, it’s common to pre-train components and then plug them in and fine-tune the whole system.

So, there’s composability in those aspects where you can pre-train smaller pieces—and then compose them later. This is done through initialization and fine-tuning. So, those are my scattered thoughts on it, but I don’t know if I have anything very coherent.

Stephanie: [30:41] Otherwise? Yes, Nick.

Nick: [30:46] So, you know we’ve got these next-word prediction models. Do you think there’s a path toward building a physicist or a von Neumann-type model that has a self-consistent mental model of physics and can generate new ideas, like how do you do fusion? How do you get…?

[24:58] Andrej: I think there is a fundamentally different vector in terms of these AI model developments.

[25:00] I think it’s fundamentally different in one aspect. What you’re talking about may be just the capability question because the current models are simply not good enough. I think there are big rocks to be turned here, and I believe people still haven’t really seen what’s possible in this space at all.

[25:12] Roughly speaking, I think we’ve done step one of AlphaGo; this is the imitation learning part. There’s step two of AlphaGo, which is the reinforcement learning (RL), and people haven’t done that yet. I think it’s going to be fundamentally crucial. This is the part that actually made it work and created something superhuman.

[25:20] There are still big rocks in capability that need to be turned over. The details of that are potentially tricky, but we just haven’t done step two of AlphaGo—long story short. We have only completed imitation learning, and I don’t think people appreciate, for example, how terrible the data collection is for things like GPT.

[25:30] Imagine you have a problem—some prompt that presents a mathematical problem. A human comes in and gives the ideal solution to that problem. The issue is that human psychology is different from model psychology. What’s easy or hard for the human is different from what’s easy or hard for the model.

[25:44] Humans kind of fill out some sort of a trace that leads to the solution, but parts of that may be trivial for the model while other parts represent a massive leap that the model simply doesn’t understand. Consequently, you’re kind of losing critical data, and everything else later becomes polluted by that.

[26:00] Fundamentally, what you need is for the model to practice itself in solving these problems. It needs to figure out what works for it and what does not work. Maybe it’s not very good at four-digit addition, so it will fall back and use a calculator, but it needs to learn that for itself based on its capability and knowledge.

[26:15] So, that’s number one: that’s totally broken. I think it’s a good initializer, though, for something agent-like.

[26:20] The other aspect is that we’re doing reinforcement learning from human feedback, but that’s like a super weak form of reinforcement learning. It doesn’t even count as real reinforcement learning, I think. What is the equivalent in AlphaGo for RL from human feedback?

[26:38] It’s a vibe check, if you will. Imagine if you wanted to train an AlphaGo model using human feedback; you would give two people two boards and ask, “Which one do you prefer?” You would take those labels and train a model against that.

[26:45] But what are the issues with that? Firstly, it’s just based on vibes—the aesthetics of the board—that’s what you’re training against. Secondly, if it’s a reward model using a neural net, it’s very easy to overfit to that reward model for the model you’re optimizing over. It’s going to find all these unintended hacks.
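For readers who want to see what “take those labels and train a model against that” amounts to, here is a minimal sketch of the standard pairwise reward-model loss used in RLHF. This is a common textbook formulation rather than anything specified in the talk, and it assumes PyTorch plus some reward model that maps a (prompt, response) pair to a scalar score.

```python
# Sketch of the pairwise (Bradley-Terry style) loss used to train an RLHF reward model:
# push the scalar score of the human-preferred response above the rejected one.
import torch
import torch.nn.functional as F

def reward_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_preferred - r_rejected), averaged over the batch
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# usage sketch: scores would come from reward_model(prompt, response) -> scalar
preferred = torch.tensor([1.3, 0.2])
rejected = torch.tensor([0.9, 0.5])
print(reward_loss(preferred, rejected))  # the policy is later optimized against this learned reward
```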

[27:00] AlphaGo gets around these problems because it has a very clear objective function you can train against. RL from human feedback is nowhere near that. I would say RLHF is a bit silly, and imitation learning is also a bit silly; RLHF is a nice improvement over pure imitation, but it’s still lacking.

[27:20] I think people need to look for better ways to train these models so that they can be in the loop with themselves and their own psychology. I believe there will probably be unlocks in that direction.

Stephanie: It’s sort of like graduate school for AI models; it needs to sit in a room with a book and quietly question itself for a decade.

[27:38] Andrej: Yes, I think that would be part of it. When you are learning stuff and going through textbooks, there are exercises in the textbook. What are those? They are prompts that encourage you to exercise the material.

[27:45] When you’re learning something, you’re not just reading left to right. You’re exercising, and maybe you’re taking notes, rephrasing, or reframing concepts. You’re doing a lot of manipulation of this knowledge in a way that helps you learn it. However, we haven’t seen an equivalent of that at all in large language models (LLMs), so it’s super early days.

[28:10] Audience: It’s cool to be optimal and practical at the same time, so I would ask how you balance the priority between cost reduction and revenue generation versus finding better-quality models with enhanced reasoning capabilities. How do you align those?

[28:20] Andrej: I think what I see a lot of people do is start out with the most capable model, regardless of the cost. They use GPT-4, they create super prompts, they use retrieval-augmented generation (RAG), etc. They are just trying to get their product to work, so they go after accuracy first, and then they make concessions later.

[28:30] For example, you might check if you can fall back to GPT-3.5 for certain types of queries, and you make it cheaper later. So, I would suggest going after performance first, and then finding ways to reduce costs later.

[28:40] This has been the paradigm that I’ve seen work for a few people I’ve spoken with about it. It’s not just about a single product; think about the various ways in which you can optimize your approach.

[29:57] Andrej: I would say, first make sure it can work at all. If you can just get it to work, say you make 10 prompts or 20 prompts and pick the best one, or you have some model debate, or whatever kind of crazy flow you come up with, just get your thing to work really well. Because if you have a thing that works really well, then one other thing you can do is distill it.

[30:15] So, you can take a large distribution of possible problem types, run your super expensive thing on it to get your labels, and then fine-tune a smaller, cheaper model on those labels. So I would say: always go after getting it to work as well as possible, no matter what, first, and then make it cheaper later. That’s the approach I would suggest.
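A rough sketch of the “get it working, then distill” workflow described above, with hypothetical stand-in functions (`expensive_flow`, `finetune`, and `small_model` are placeholders, not real APIs):

```python
# Hedged sketch of distillation: run the expensive, high-accuracy flow over many inputs,
# keep its outputs as labels, then fine-tune a smaller, cheaper model on those pairs.
# All three callables/objects below are hypothetical placeholders.
def distill(problems, expensive_flow, small_model, finetune):
    labeled = [(p, expensive_flow(p)) for p in problems]  # labels from the costly pipeline
    return finetune(small_model, labeled)                 # cheap model learns to imitate them
```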

[30:34] Stephanie: Yes, Sam.

Sam: Hi, I have one question. This past year, we saw a lot of impressive results from the open-source ecosystem. I’m curious about your opinion on how that will or won’t keep pace with closed-source development as the models continue to improve in scale.

[30:56] Andrej: Yeah, I think that’s a very good question. I don’t really know. Fundamentally, these models are so capital-intensive. One thing that is really interesting is that you have, for example, Facebook and Meta, among others, who can afford to train these models at scale. However, it’s not their core product, and their money printer is unrelated to it.

[31:19] They actually have an incentive to potentially release some of these models so that they empower the ecosystem as a whole. They can actually borrow all the best ideas. So to me, that makes sense. But so far, I would say they’ve only just done the open weights model. I think they should actually go further, and that’s what I would hope to see. I think it would be better for everyone.

[31:43] I think they might be potentially squeamish about some aspects of it eventually, with respect to data and so on. I don’t know how to overcome that. Maybe they should just try to find data sources that they think are very easy to use or something like that and try to constrain themselves to those. So I would say that those are kind of our champions potentially.

[32:09] I would like to see more transparency also coming from them. I think Meta and Facebook are doing pretty well; they released papers and published a logbook, and so on. So they’re doing well. However, they could do much better in terms of fostering the ecosystem, and I think maybe that’s coming. We’ll see.

[32:34] Peter: Yeah, maybe this is like an obvious answer given the previous question, but what do you think would make the AI ecosystem cooler and more vibrant? What’s holding it back? Is it openness, or do you think there are other things that are also a big factor that you’d want to work on?

[32:58] Andrej: Yeah, I certainly think one big aspect of it is just the stuff that’s available. I had a tweet recently about like, number one, build the thing; number two, build the ramp. I would say there are a lot of people building things, but there’s a lot less happening regarding building ramps so that people can actually understand all this stuff.

[33:20] I think we’re all new to this; we’re all trying to understand how it works. We all need to ramp up and collaborate to some extent to even figure out how to use this effectively. So, I would love for people to be a lot more open with respect to what they’ve learned, how they’ve trained all this, what works, and what doesn’t work for them, etc.

[33:46] Yes, it’s critical for us to learn a lot more from each other—that’s number one. Then, I also think there is quite a bit of momentum in the open ecosystems as well, so I think that’s good to see. Maybe there are some opportunities for improvement that I have already talked about.

[34:06] Michael: To get to the next big performance leap from models, do you think it’s sufficient to modify the Transformer architecture with, say, thought tokens or activation beacons, or do we need to throw that out entirely and come up with a new fundamental building block to take us to the next big step forward, or AGI?

[34:28] Andrej: Yeah, I think that’s a good question. The first thing I would say is that the Transformer is amazing. It’s just so incredible. I don’t think I would have seen that coming for sure. For a while before the Transformer arrived, I thought there would be an insane diversification of neural networks, and that was not the case. It’s the complete opposite, actually.

[34:51] It’s like a complete, unified model. It’s incredible to me that we have that. I don’t know if it’s the final neural network. I think there will definitely be—I would say it’s really hard to say that, given the history of the field. I’ve been in it for a while, and it’s tough to say that this is the end of it.

[35:10] Absolutely, it’s not. I feel very optimistic that someone will be able to find a significant change in how we do things today. On the front of autoregressive versus diffusion, which is kind of like the modeling and the loss setup, there’s definitely some fruit there.

But also on the Transformer side, as I mentioned, there are the levers of precision and sparsity, driven together with the co-design of the hardware and how that might evolve, and just making the network architectures a lot more well-tuned to those constraints and how all of that works. To some extent, I would say the Transformer is kind of designed for the GPU, by the way. That was the big leap, I would say, in the Transformer paper, and that’s where they were coming from: we wanted an architecture that is fundamentally extremely parallelizable. The recurrent neural network has sequential dependencies, which is terrible for the GPU. The Transformer basically broke that through attention, and that was the major insight there.

There are some predecessor insights, like the neural GPU and other papers at Google that were sort of thinking about this. But that is a way of targeting the algorithm to the hardware that you have available. I would say that’s kind of like in that same spirit. Long story short, I think it’s very likely we’ll see changes to it still, but it’s been proven remarkably resilient, I have to say. It came out many years ago now—something like six or seven years ago.

So, you know, the original Transformer and what we’re using today are not super different.

Stephanie: [35:03] As a parting message to all the founders and builders in the audience, what advice would you give them as they dedicate the rest of their lives to helping shape the future of AI?

Andrej: [35:15] I don’t usually have super generic advice. I think maybe the thing that’s top of my mind is that founders, of course, care a lot about their startup. I also want to think about how we can build a vibrant ecosystem of startups. How do startups continue to win, especially with respect to big tech? How can the ecosystem become healthier? What can you do?

Stephanie: Sounds like you should become an investor. Amazing! Thank you so much for joining us, Andrej, for this and also for the whole day.

[Applause]
