# Stanford MS&E435 Economics of the AI Supercycle | Spring 2026 | Infrastructure, Capstone Case

https://www.youtube.com/watch?v=4k53z3Ysjg0

[00:09] Hello everybody.
[00:12] It's good to see you again.
[00:16] Another round of the Economics of the AI Supercycle.
[00:19] This time with Professor Katti.
[00:21] Yes.
[00:21] Welcome back.
[00:22] Welcome back to Stanford.
[00:22] Thank you.
[00:25] Thank you.
[00:27] As I thought about introducing Professor Katti, I could not think of a better person: someone who has seen the whole stack, soup to nuts, from electrons and the entire substrate all the way to agents.
[00:40] You obviously started a networking startup.
[00:43] You were the CTO and head of AI at Intel.
[00:46] You now run compute, industrial compute at OpenAI.
[00:49] Thank you.
[00:51] Thank you for joining us.
[00:51] Thank you.
[00:53] It's coming back home for me.
[00:55] So, welcome.
[00:55] Welcome.
[00:57] You know, I thought we'd start with a fun segment.
[00:59] Intel spent about a decade trying to convince everybody that they're an AI company.
[01:02] It finally happened.
[01:04] They finally got there in the last like two weeks.
[01:07] What happened?
[01:08] I told you.
[01:08] So,
[01:13] That was my job.
[01:15] As April was mentioning, I was Intel's CTO and was also running its AI business until I left for OpenAI in November.
[01:20] So yeah, a little bit of a lag, but that's the bane of people who have to forecast. But no, I think Intel's story is turning.
[01:32] I'd say there are two factors that are big tailwinds for Intel.
[01:38] One is the world is heavily supply constrained.
[01:42] Mhm.
[01:42] And so any company that has serious manufacturing chops in the space and can build, not just design, is going to have tailwinds, right?
[01:53] And Intel obviously is pretty much the only leading-edge American company left that can still manufacture.
[02:01] The other, of course, is that CPUs are making a comeback with how we are beginning to use AI with agents, and we can get into that a little bit later.
[02:10] So both of those are very good things
[02:13] obviously for Intel.
[02:15] A lot of execution still to be done.
[02:17] I mean, the market is always ahead of the story, but fingers crossed.
[02:24] Lip-Bu is a great CEO.
[02:26] Loved working with him.
[02:29] So I think there are good things to look forward to.
[02:30] Amazing.
[02:32] I'm sure your departure had nothing to do with the stock chart.
[02:33] They're not correlated.
[02:34] I still kept my stock.
[02:37] So don't worry.
[02:37] Well done.
[02:37] Well done.
[02:38] Well, we'll get more into the role of CPUs and all the different parts of the compute supply chain.
[02:44] The second thing I thought we'd spend some time on is this chart that OpenAI put out at the start of the year.
[02:50] Sarah Friar, the OpenAI CFO, wrote this article about OpenAI's compute ambitions.
[02:55] On the left you'll see OpenAI's compute capacity over the last three years.
[03:00] This is OpenAI's target by the end of the decade.
[03:03] And, as if by magic, the compute capacity seems to be highly correlated with our revenue.
[03:13] No prizes for guessing what this might
[03:14] be if this gets there.
[03:17] Yeah, talk about this chart for a second.
[03:19] What's going on here, and how should we process this?
[03:23] Yeah, so at OpenAI I lead industrial compute.
[03:26] Just some context before I answer the question.
[03:27] My job and my team's job is delivering the compute that OpenAI needs across everything, across training and inference.
[03:35] So this chart is what I live and breathe every day: making the numbers go up.
[03:43] And yes, my job is to make it go up and to the right.
[03:45] So that's the job description.
[03:49] But kidding aside, as you pointed out, revenue is basically a lagging indicator for frontier lab companies.
[04:00] What I mean by that is that it is basically a very simple calculation of how much compute we have and how well utilized that compute is.
[04:10] Right. And so the last three years have borne that out.
[04:10] Every year we have
[04:14] tripled compute year-over-year.
[04:16] Right.
[04:17] And revenue has tripled.
[04:18] Mhm. We don't see any end in sight to the correlation yet.
[04:22] With 5.5 just coming out and the uptake, I mean, Codex has probably seen meaningful double-digit growth in just the two weeks since 5.5 came out.
[04:34] I think people are using it for not just coding anymore.
[04:40] Codex is being used for general-purpose knowledge work.
[04:42] So token usage is growing, with more and more complex tasks being consumed.
[04:47] So we are essentially tracking how much compute we have available, the number of users, and the number of tokens, and the revenue is basically tracking that.
[05:02] I'd say that as we think about the future, OpenAI is still a research lab, and the reason I say that is it's very much not just about how much revenue
[05:17] we can maximize. It's much more:
[05:20] how do we make the maximum amount of compute available for research, right, so that researchers are unconstrained in exploring new ideas, new models, and new ways of pushing the frontier on intelligence?
[05:34] And so the 30 gigawatt number here, which is an aspirational goal, is split across research and products.
[05:42] But we definitely don't see a world where we don't utilize it, given the current trends that you're seeing.
[05:50] And maybe just a quick follow-up before we move on.
[05:51] What's the rough split between training and inference?
[05:53] How has that trended over time, and how do you expect it to go over time?
[06:01] I think the scaling laws, right, so obviously everyone initially assumed scaling laws applied to pre-training only.
[06:07] What has shifted is that scaling laws have evolved to cover the entire life cycle of compute, and what I mean by that is pre-training,
[06:17] post-training with RL
[06:19] which is primarily an inference workload.
[06:21] Mhm.
[06:23] Synthetic data, because we have run out of real-world data to train models on.
[06:27] So we are generating data to train models on, and that is primarily an inference workload.
[06:32] Mhm.
[06:32] And then of course the actual products themselves, everyone using ChatGPT and Codex, and that is an inference workload.
[06:39] So more and more it is shifting to inference.
[06:42] I mean inference is already the majority just to be clear.
[06:46] But even inference should not be taken to mean just products.
[06:48] A big chunk of research a big chunk of training the next level of intelligence is also inference.
[06:57] Mhm.
[06:58] And so that's why our prediction is that a supermajority, like 80% plus, will essentially be inference compute in the future.
[07:07] Building on your response, Sachin, is it also true that if the relative ratio of how much gets used for inference goes up
[07:18] over time, the dollar density, meaning dollars per gigawatt, might also go up,
[07:24] because inference is basically what you can monetize?
[07:28] Well, we are hoping it goes down, dollars per gigawatt, because this stuff is expensive. Every gigawatt is roughly...
[07:36] I meant monetization, sorry, monetization.
[07:38] Yes. Yes. I think yes, for sure, right? As more tokens get consumed, that should lead to a corresponding increase in revenue.
[07:49] At the same time, I think our mission is to make tokens cheaper, right, and it's along a few different dimensions.
[07:54] Make every token cheaper, make every token more intelligent, and make every task require fewer tokens to perform.
[08:03] Right? So we push on three dimensions. We keep improving hardware and software to generate tokens more cheaply.
[08:13] We keep pushing on the capabilities of models to make sure every token is more intelligent.
[08:18] Mhm.
[08:18] Right. And we keep pushing the harness,
[08:21] like Codex, to make it such that we need fewer tokens to perform any given task.
[08:26] Right.
[08:28] And that is a very fundamental principle in how the company operates.
[08:34] And the reason, of course, is: how do we make sure that all of this intelligence is as widely accessible as possible?
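One way to read the three levers described here: the cost of a task is roughly the price per token times the number of tokens the task needs, while model capability changes what a task can accomplish rather than its unit cost. The following is a minimal sketch of that arithmetic only. Every number in it is hypothetical and chosen for illustration; none comes from the talk.

```python
def cost_per_task(price_per_million_tokens: float, tokens_per_task: int) -> float:
    """Dollar cost of one task: unit token price times tokens consumed."""
    return price_per_million_tokens * tokens_per_task / 1_000_000


# Hypothetical baseline: $10 per million tokens, 200k tokens per task.
baseline = cost_per_task(10.0, 200_000)

# Lever 1 (cheaper tokens via hardware/software), assumed here to be 2x cheaper.
cheaper_tokens = cost_per_task(5.0, 200_000)

# Lever 3 (a better harness needs fewer tokens), assumed here to be 4x fewer.
fewer_tokens = cost_per_task(10.0, 50_000)

# The two cost levers compound multiplicatively: 2x * 4x = 8x cheaper per task.
both = cost_per_task(5.0, 50_000)

print(baseline, cheaper_tokens, fewer_tokens, both)
```

Because the levers multiply, modest gains on each can add up to a large drop in cost per task, which is the accessibility argument being made.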
[08:43] Now, your job, as you said, is to get the numbers to go up and to the right. Honestly, I don't envy anybody who's forecasting tripling year-over-year at that scale; it seems like a hard job. Forecasting is an easier job than actually making it happen.
[09:01] What?
[09:02] You think so?
[09:04] What? Tell us about your job a little bit.
[09:06] What is the hardest part of it?
[09:08] Is it sourcing the compute?
[09:11] Is it securing it?
[09:13] And how are you securing compute right now?
[09:15] It seems like a fist fight.
[09:17] And where's the bottleneck?
[09:19] Is it power?
[09:19] Is it energy?
[09:19] Is it chips?
[09:19] Is it land?
[09:22] Is it all of the above?
[09:23] It is. I think if you think about the life cycle of compute, right?
[09:29] So one is obviously sourcing compute, and compute is a very broad term.
[09:37] When you think about compute for AI, you really have to think chips, memory, networking, power, cooling, data center buildings, power generation, power distribution, and of course land.
[09:56] All of that is equal to compute, right?
[09:59] All of that needs to come together to build compute at gigawatt scale, right?
[10:04] And so when we think about sourcing, we are not sourcing compute.
[10:08] We are literally sourcing that entire supply chain,
[10:11] and making sure at this scale that we have visibility into where each component of that supply chain will come from.
[10:20] So that is one big piece. The second
[10:22] piece is how we orchestrate that supply chain to all land and align at the same time to make this compute operational.
[10:29] Right?
[10:31] So a gigawatt is roughly half a million GPUs.
[10:36] Right?
[10:39] And so when we're talking about whatever number it is, six or 10 gigawatts, you're talking about quite a large number of chips being made to work together, being powered, being cooled, being kept up and alive, and making sure that everything else that needs to come together is there.
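The rule of thumb quoted here (a gigawatt is roughly half a million GPUs) can be turned into a quick back-of-the-envelope calculation. The sketch below takes only that ratio from the talk. Reading it as an "all-in" power budget per GPU that includes cooling, networking, and other facility overhead is an assumption made here, not something the speaker states.

```python
# Speaker's rough figure: one gigawatt of capacity is about half a million GPUs.
GPUS_PER_GIGAWATT = 500_000


def gpus_for(gigawatts: float) -> int:
    """Approximate GPU count supported by a given capacity in gigawatts."""
    return round(gigawatts * GPUS_PER_GIGAWATT)


def watts_per_gpu_all_in() -> float:
    """Implied power budget per GPU, assuming the ratio covers all overheads."""
    return 1e9 / GPUS_PER_GIGAWATT


for gw in (1, 6, 10, 30):
    print(f"{gw:>2} GW -> about {gpus_for(gw):,} GPUs")

print(f"implied all-in budget per GPU: {watts_per_gpu_all_in():.0f} W")
```

Under that ratio, the six to 10 gigawatts mentioned corresponds to roughly 3 to 5 million GPUs, and the 30 gigawatt aspiration mentioned earlier to about 15 million.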
[10:57] So a big chunk of the work really starts after you sign the contracts.
[10:59] Like, how do we make sure that our suppliers are actually going to deliver what they said they will?
[11:09] How do we make sure that we engineer these systems so that it all works together at this scale?
[11:13] And how do we make sure that it is operationally usable, that it is up and running and runs at the highest performance we can run these chips at?
[11:23] And these chips are very brittle today.
[11:26] They are very sensitive to cooling and power fluctuations.
[11:29] And they can quickly throttle back how much compute, how many FLOPs, you have.
[11:35] So that's really the job.
[11:37] The fun part is the contract signing.
[11:40] The hard part is everything after.
[11:42] Yeah.
[11:42] Yeah.
[11:42] Yeah.
[11:43] Getting the autographs.
[11:45] Yes.
[11:45] It's a very consequential time right now, and I imagine a lot of the decisions you're making will impact us and the rest of compute users, which is billions of people, for years, if not decades, to come.
[12:01] What are some of the biggest trade-offs you're making?
[12:03] What are some of the biggest decisions you're making, ones that will make case studies at some point in the future, that you can talk about?
[12:12] I mean, I think there are a lot of societal-level implications of these decisions, right? To pick an example, if you put a gigawatt data center in a
[12:24] place like Georgia or Michigan, for example,
[12:28] it's a pretty big consumer on the grid, right, at that amount of power. And when you run a big training job, these things are synchronized jobs, right? They go up and down in sync in intensity.
[12:43] Mhm.
[12:43] So you can see energy fluctuations on the grid that can be hundreds of megawatts very quickly.
[12:49] Mhm.
[12:50] And our infrastructure was never designed for it.
[12:53] Mhm.
[12:53] A grid could basically fall apart and an entire state could have a blackout depending on how these data centers behave.
[13:00] Right. So we spend a lot of time thinking about how to make sure we can design these systems to not have all this collateral damage on the rest of the country's infrastructure.
[13:14] Mhm.
[13:15] Uh so that's an example of the kinds of things that are being redesigned.
[13:19] Right.
[13:20] We obviously are spending a lot of time thinking about how to de-risk supply
[13:26] chains, right?
[13:27] So how do we move fabs?
[13:29] How do we move memory factories to other parts of the world?
[13:35] How do we decouple from grid energy and use natural gas and, increasingly, nuclear in the future?
[13:42] Mhm.
[13:43] So I think this is going to lead to infrastructure investments and innovations that the rest of society will benefit from beyond AI, because these are things that otherwise did not have an impetus to happen.
[13:53] Mhm.
[13:56] Then I'd say, obviously, there are all the implications of AI itself and compute at this scale.
[14:01] I mean, 30 GW is a lot, right?
[14:05] But I'd say our vision is, and Sam has been talking about this for a while:
[14:13] we've all taken it for granted that every one of us should have a mobile phone, and we upgrade it every year or every two years.
[14:22] It's not that crazy to think every one of us should have a GPU.
[14:25] Mhm.
[14:26] Right. And a GPU is what, a kilowatt to 2 kilowatts now?
[14:30] 7 billion humans out there.
[14:33] That's 7 terawatts of compute, right?
[14:37] And so that is two orders of magnitude more than what we are talking about here.
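The per-person arithmetic here is easy to check. This is a minimal sketch using only the figures quoted in the talk (7 billion people, 1 to 2 kilowatts per GPU, and the 30 gigawatt target); the function and constant names are illustrative.

```python
import math

HUMANS = 7e9                   # "7 billion humans out there"
GPU_KW_LOW, GPU_KW_HIGH = 1.0, 2.0  # "a kilowatt to 2 kilowatts"
TARGET_GW = 30.0               # the aspirational target discussed earlier


def total_terawatts(people: float, kw_per_gpu: float) -> float:
    """Total power if every person had one GPU (1 TW = 1e9 kW)."""
    return people * kw_per_gpu / 1e9


low_tw = total_terawatts(HUMANS, GPU_KW_LOW)
high_tw = total_terawatts(HUMANS, GPU_KW_HIGH)

# Ratio of the one-GPU-per-person world to the 30 GW target (1 TW = 1000 GW).
ratio_low = low_tw * 1000 / TARGET_GW
orders_of_magnitude = math.log10(ratio_low)

print(f"{low_tw:.0f} to {high_tw:.0f} TW, about {ratio_low:.0f}x the target")
print(f"that is roughly {orders_of_magnitude:.1f} orders of magnitude")
```

At 1 kilowatt per GPU the total is 7 terawatts, a bit more than 200 times the 30 gigawatt figure, which matches the "two orders of magnitude" statement.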
[14:43] So if you really believe in that world, then we still have a long ways to go.
[14:48] Right. And maybe just to put this in perspective, Sachin, how much energy does America consume compared to 30 gigawatts?
[14:56] I don't have the number off the top of my head, but I think the US, if you add up all the hyperscalers, is planning to build around 100 gigawatts of compute.
[15:06] Got it.
[15:08] Beyond OpenAI. So 30 gigawatts for us, plus whatever everyone else builds.
[15:12] You've seen Google's numbers, Amazon's numbers.
[15:15] 100 gigawatts is probably already a fifth or more of the grid.
[15:20] So this will be consuming a double-digit percentage of US capacity.
[15:25] Wow.
[15:27] Making the market.
[15:29] It will change the market, right?
[15:31] I think the way we think about energy, as being purely for human consumption, is no longer true.
[15:37] Yeah.
[15:37] You know, one of the rumors that's been going around is that OpenAI has a significant compute advantage compared to the other labs. The class here loves both OpenAI and Anthropic equally, sort of.
[15:53] Are we polling?
[15:55] We did that, and we'll save you the answer.
[15:59] I'll fill you in after.
[16:01] But talk about that compute advantage to the extent that you can share with us. Assuming forecasting was perfect, what does that afford us to do and deliver to consumers of OpenAI?
[16:18] I mean, you're seeing it, right? So 5.5 is a big model. It's expensive to serve.
[16:24] But there are no limits.
[16:26] Right, so everyone's able to go and use
[16:29] it without token limits. We are much more generous on how many tokens you get for your subscription.
[16:38] We often, almost every day or every week, reset the limits so that people can play with it a lot more.
[16:46] And that's the compute advantage showing up in day-to-day usage.
[16:49] Right. And so that comes back to that earlier point which is making sure that we have enough compute to distribute this intelligence at scale.
[16:58] Not just build the intelligence.
[17:01] It's no good if you build the intelligence but you can't really deliver it at scale.
[17:05] So really we spend a lot of time in making sure that it's not just about training.
[17:11] It's actually usable compute that we can deliver to everyone at scale without putting artificial limits.
[17:19] 100%. One of the impacts that the class has already felt is we asked two labs for a Codex and an unnamed product subscription.
[17:25] The Codex team gave us
[17:30] that pretty quickly.
[17:33] So I now understand why that was the case.
[17:36] We'll switch it up a little bit.
[17:38] Sachin, Codex and Codex-like instruments have a lot of different things that need to come together:
[17:44] the GPU and all sorts of ASICs, the CPU, the memory, the networking, all the things that you outlined for us.
[17:51] Maybe start with the workload in question.
[17:53] What does the modern agentic workload look like?
[17:58] How has that evolved over time?
[18:01] Maybe just to frame the answer, right?
[18:05] So ChatGPT was obviously a big inflection moment.
[18:08] But if you think about ChatGPT when it started, it really was one-shot inference, right?
[18:16] You ask a question, it gives you an instant answer, and you're done and you go to the next thing.
[18:23] I think the big innovation and the breakthrough in 2024 was reasoning, right?
[18:28] And so not just for inference but also
[18:31] for training, right?
[18:33] So the model being able to introspect and think and therefore generate better answers.
[18:39] And that again increased the intensity of inference, right? So there's more and more inference happening.
[18:45] But they are still passive things, right? You're asking a question, they're giving you an answer, they don't take any action for you, right?
[18:53] So the word agent kind of encodes what we mean. It has agency. It has agency to do things, right?
[18:59] And so what I mean by that is when we think about agents, it's really about closing the loop: not just thinking and suggesting, but also trying it, looking at the output, iterating, and then trying a refined answer to do a task, right?
[19:18] so whether it's coding or any other form of knowledge work
[19:21] so it's really closing the loop right
[19:23] And it's actually delivering the full value of what we expect AI to deliver to you, right? Not just be an assistant,
[19:31] but actually be an agent that can close the loop and do work for you.
[19:35] Right?
[19:36] And so implicit in that statement is obviously inference and thinking, but, as I said, also trying, right?
[19:42] So it's going to go look for relevant data.
[19:45] It's going to go search.
[19:48] It's going to go spin up a VM to run a test if it has generated some code.
[19:53] It's going to spin up Excel or PowerPoint to try out some slides and see how they look,
[20:01] right?
[20:01] And it's going to look at the output and reason about it and iterate on this, right?
[20:05] And so the compute graph is a lot more complex now,
[20:10] right?
[20:10] Putting my computer science hat back on, if I think about the chatbot world, it's a very simple compute graph.
[20:17] There's a user, there's one node, which is the inference call, and there's an answer, right?
[20:22] Reasoning was multiple nodes of inference calls.
[20:26] And now we have much more of a directed acyclic graph, if you will, to use the more precise technical term. You have
[20:33] an inference call, you might have a tool
[20:34] call, you might have a database or a
[20:36] search query, you might have an RL VM
[20:39] environment spun up, then back to an
[20:41] inference call, and so on and so on.
[20:43] >> so the compute graph is now a lot more
[20:46] complex
[20:47] >> that you're executing right
[20:49] >> and so that naturally leads to a much
[20:52] more sophisticated compute
[20:54] infrastructure that's needed to execute
[20:56] that compute graph. A lot more
[20:58] intelligence is needed in how you
[21:00] distribute that compute graph and where
[21:02] you run each part of the graph.
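The compute graph described here can be sketched as a small directed acyclic graph whose nodes are inference calls, search queries, and sandboxed test runs. The example below is illustrative only: the node names and dependencies are hypothetical and do not describe any real agent framework. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to find an order in which the steps can run.

```python
from graphlib import TopologicalSorter

# Hypothetical agent steps and the kind of work each one represents.
NODE_KINDS = {
    "plan": "inference",
    "search": "search_query",
    "write_code": "inference",
    "run_tests": "vm_sandbox",
    "review_output": "inference",
}

# Each node maps to the set of nodes that must finish before it can start.
DEPENDENCIES = {
    "plan": set(),
    "search": {"plan"},
    "write_code": {"plan", "search"},
    "run_tests": {"write_code"},
    "review_output": {"run_tests"},
}


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which the graph's nodes can be executed."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
for step in order:
    print(f"{step:<14} {NODE_KINDS[step]}")
```

An agent that iterates (try, inspect, retry) would unroll into repeated copies of these nodes, one per iteration, which is how a loop can still be represented as an acyclic graph. The chatbot case from earlier in the answer is the degenerate graph with a single inference node.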
[21:05] >> And so both the compute but more
[21:08] importantly the workload evolving in
[21:10] this direction is going to change the
[21:13] shape of how we think about compute
[21:16] infrastructure.
[21:17] >> Fascinating. You outlined a bunch of
[21:19] different steps along the way.
[21:23] I can imagine some parts of that being
[21:25] more relevant for different machines,
[21:27] like a GPU, other parts for CPUs and ASICs.
[21:31] Are there maybe emerging clusters of
[21:34] workloads that are particularly suited
[21:36] to a certain kind of chip? You might say,
[21:37] hey, the Nvidia GPU is best for that.
[21:40] You might say the Cerebras chips are
[21:42] best for this kind of workload, because,
[21:43] you know, agents come in all different
[21:44] shapes and sizes. You've got customer
[21:46] service chatbots, where latency is a
[21:48] prime requirement, as opposed to a
[21:52] deep research query, where it's not latency
[21:54] but accuracy and broad search.
[21:57] >> Yeah.
[21:57] >> Um are there clusters forming in in your
[22:00] view?
[22:01] >> Definitely. And maybe to use a slide.
[22:04] Yeah. This is what I was talking about
[22:06] earlier, right? So this is kind of a way
[22:08] to visualize what's happening right
[22:10] >> in a typical agent call.
[22:12] >> Mhm. I guess this was a
[22:13] tongue-in-cheek slide that I had made.
[22:15] Today, if you look at agents, right, you
[22:19] give it a task, it goes off and tries to do
[22:22] it, thinks for a while, tries a bunch
[22:25] of tools, and by then you have context
[22:28] switched.
[22:28] >> Mhm.
[22:29] >> You're going off doing something else
[22:30] because it's taking minutes to maybe
[22:32] even hours to do it. Right. You've spaced
[22:34] out.
[22:35] >> You've spaced out, and then when it comes
[22:37] back and asks you for a steer or a
[22:38] decision, you have to page all that
[22:40] context back in, and then you do whatever you
[22:42] do, right?
[22:43] >> and so our vision is we want to get to a
[22:46] world where the human is the bottleneck
[22:49] >> right today the AI is the bottleneck
[22:52] given how long it takes to execute all
[22:54] this, right? Really, we will have succeeded from
[22:56] a compute perspective
[22:58] >> when we have built the systems and the
[23:00] infrastructure such that the human
[23:01] becomes the bottleneck when the AI is
[23:03] finishing these things so quickly.
[23:05] >> Mhm.
[23:06] >> That you are constantly being asked for
[23:08] what's the next step.
[23:10] >> Mhm.
[23:10] >> And that is a tongue-in-cheek point, but
[23:12] the better way to say that is how do we
[23:13] make sure the human is in flow
[23:16] >> when they're doing this work with AI,
[23:18] right? And there's this feeling of flow
[23:20] when like everything's so quick and
[23:21] interactive and it's like it knows
[23:24] exactly what you need and it does it
[23:26] quickly.
[23:27] >> That's that's a user experience we'd
[23:29] love to deliver, right? And so as we
[23:32] think about this future, we do need
[23:34] heterogeneous compute. You can't actually
[23:38] >> deliver this kind of experience
[23:40] economically on pure GPU-based compute.
[23:44] >> Okay.
[23:44] >> So you need a much more heterogeneous
[23:46] infrastructure that's not just GPUs and
[23:48] CPUs but also different kinds of
[23:50] accelerators.
[23:51] >> So Cerebras is an example that is for
[23:53] very fast inference. Mhm.
[23:55] >> Uh you might have other accelerators
[23:57] that are built for very long context
[23:59] like they hold a lot of state in memory.
[24:02] >> So they can remember your entire task
[24:04] and don't have to page it back in and
[24:06] out.
[24:07] >> So
[24:08] >> for example, that could be useful
[24:09] >> for coding for sure, right? They have to
[24:10] hold your entire GitHub project in
[24:12] context and be able to pull that very
[24:15] quickly. So you are going to see a lot
[24:17] more flexibility and heterogeneity in the underlying
[24:21] infrastructure because the user
[24:23] experience is going to push us
[24:25] >> towards optimizing every part of this
[24:28] agentic graph.
[24:30] >> And what we as people who have to build
[24:32] compute have to do is make sure we can
[24:34] match the right part of the workload
[24:36] >> to the right kind of compute
[24:38] >> to optimize on both efficiency as well
[24:40] as performance.
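Matching each part of the workload to a kind of compute can be pictured as a placement function. The sketch below is purely illustrative: the hardware class names, the `Step` fields, and the token threshold are all assumptions invented for this example, loosely following the categories mentioned in the talk (CPUs for tool and VM work, fast-inference accelerators, long-context accelerators, and general GPUs). It is not a description of how OpenAI schedules work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One node of an agentic workload (hypothetical schema)."""
    name: str
    kind: str                    # e.g. "inference", "tool_call", "vm_sandbox"
    context_tokens: int = 0      # size of the context the step must hold
    latency_sensitive: bool = False


# Hypothetical cutoff above which a step counts as "long context".
LONG_CONTEXT_THRESHOLD = 200_000


def place(step: Step) -> str:
    """Pick a hardware class for a step using simple, assumed rules."""
    if step.kind != "inference":
        return "cpu"                          # tools, searches, sandboxes
    if step.context_tokens >= LONG_CONTEXT_THRESHOLD:
        return "long_context_accelerator"     # holds a lot of state in memory
    if step.latency_sensitive:
        return "fast_inference_accelerator"   # optimized for token speed
    return "gpu"                              # general-purpose default


steps = [
    Step("run_tests", "vm_sandbox"),
    Step("read_repo", "inference", context_tokens=400_000),
    Step("chat_reply", "inference", context_tokens=2_000, latency_sensitive=True),
    Step("batch_summary", "inference", context_tokens=20_000),
]
placements = {s.name: place(s) for s in steps}
print(placements)
```

A real scheduler would also weigh cost, availability, and where the relevant state already lives; the point here is only that placement becomes a per-node decision rather than a single choice of chip.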
[24:42] >> Right. Fascinating. So this is going
[24:44] off script for a second, off-roading. You
[24:46] know, yesterday was a big day: four earnings
[24:48] calls. A lot of hyperscalers talking
[24:51] about their accelerator programs.
[24:56] Amazon notably at roughly $50 billion of
[24:58] run-rate revenue on their Trainium chips.
[25:02] I forget the Alphabet number,
[25:03] but that's a bigger number. Yes.
[25:05] >> And then obviously you've got the big
[25:06] guy, Nvidia. If you were to draw
[25:09] a
[25:10] market share chart, it looks heavily in
[25:14] favor of Nvidia right now. I'm
[25:16] sure there are all sorts of other ASICs that
[25:18] have not even seen the light of day
[25:20] yet. Obviously, you
[25:22] know, the guidance from Nvidia is we're
[25:23] going to do everything. The guidance
[25:25] from the others is similar. How do
[25:27] you expect this to trend? Is there one
[25:29] or two that you're a particular fan of,
[25:31] outside of obviously the
[25:34] main workhorse?
[25:35] >> You know, I'm not going to answer that,
[25:36] right? But kidding aside, no,
[25:41] I think the world needs a much more
[25:46] resilient compute supply chain.
[25:48] >> Mhm.
[25:49] >> I think it is dangerous for the world
[25:52] to be single-threaded on any one
[25:54] component.
[25:55] >> Right. And so I think that is what
[25:59] the market is reflecting. Right. So we
[26:01] are going to see quite a few
[26:04] choices.
[26:05] >> Mhm. And the workload is also going to
[26:06] push it there because the workload is
[26:08] getting a lot more complex than a pure
[26:11] inference or training job on a GPU,
[26:13] right? And so that is going to lead to
[26:15] flexibility.
[26:16] >> I'd say the other underappreciated part,
[26:18] which I don't know whether everyone
[26:22] will appreciate:
[26:25] >> Mhm.
[26:26] >> the way TSMC allocates wafers
[26:28] >> Mhm.
[26:29] >> will mean that there have to be multiple
[26:33] GPUs and accelerators.
[26:35] >> Say more about that.
[26:35] >> I think TSMC has been extremely
[26:39] successful because they try to make sure
[26:41] that
[26:43] multiple customers are successful. Okay.
[26:45] And it is in their business interest to
[26:47] do so, right, because they don't want to
[26:49] be single-threaded on any one big
[26:51] customer. Right.
[26:51] >> Right. And so I think that is a
[26:55] single choke point in the supply chain
[26:57] >> and so the way those wafers get
[26:59] allocated there will be multiple people
[27:02] multiple companies which will get wafers
[27:03] there and by definition therefore
[27:05] there'll be multiple varieties of chips.
[27:07] >> Mhm. And so for the scale we are talking
[27:10] about, for the scale any one of us is
[27:11] talking about, Google, Amazon, us, whoever,
[27:15] by definition we have to learn how to
[27:17] use all of these chips because we don't
[27:19] have a choice.
[27:20] >> Right.
[27:20] >> Right. And so that's why I think the
[27:22] world will look a lot richer in the
[27:24] future.
[27:25] >> Fascinating, fascinating. You know,
[27:28] maybe the other
[27:29] dimension, Sachin, is training. The shape of
[27:33] the training workload, as you said, is
[27:34] fairly synchronous. It's typically
[27:37] coordinated, you need a coherent cluster,
[27:38] it goes up all at the same time.
[27:41] Inference, on the other hand, does not
[27:42] seem that way. It's likely much more
[27:44] spiky, a lot harder to forecast maybe.
[27:47] And as that changes, you
[27:51] might even want more compute closer to
[27:53] the edge to minimize latency for
[27:54] inference. Talk about that for a second.
[27:56] How do you manage the shape of
[28:00] your compute capacity knowing that
[28:01] you're moving towards being inference-heavy?
[28:03] Does that mean more distributed,
[28:05] almost Cloudflare-like mini clusters
[28:08] closer to the edge, or is a giant one in
[28:11] Texas or Virginia good enough?
[28:13] >> It will get there, but not yet, and for
[28:16] two reasons. One is that there are still
[28:19] significant benefits to scale in
[28:23] building this compute. So
[28:27] building 50 megawatts of compute is far
[28:30] more expensive per megawatt than
[28:32] building a gigawatt of compute at one
[28:34] location.
[28:34] >> Fascinating.
[28:35] >> Uh and
[28:36] >> on a per unit basis
[28:37] >> on a per megawatt basis. Got it.
[28:39] >> Right. And that's for many reasons.
[28:40] Right. So labor is a big bottleneck
[28:42] around the world today, especially in
[28:44] the US. We just don't have enough people
[28:46] to build these things. So given the
[28:49] kind of critical human mass you need to
[28:51] build,
[28:52] >> you would much rather do it at a bigger
[28:54] scale than for little bits of 50
[28:57] megawatts spread around the country.
[28:59] >> Mhm.
[29:00] >> So that I think is going to drive the
[29:01] economics. The other technical reason is
[29:04] the way these models work and especially
[29:06] for agentic workloads. The time to
[29:11] first token is still on the order of 400
[29:13] to 500 milliseconds because they have to
[29:15] page all of this context in before they
[29:18] generate the first token.
[29:20] >> And so 400 to 500 milliseconds is far
[29:23] larger than any latency benefits you get
[29:26] by putting compute closer to the user.
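The comparison being made here can be put in numbers. In the sketch below, only the 400 to 500 millisecond time to first token comes from the talk. The network round-trip times (60 ms to a distant region, 10 ms to a nearby edge site) are hypothetical values chosen for illustration.

```python
def edge_saving_fraction(ttft_ms: float, far_rtt_ms: float, near_rtt_ms: float) -> float:
    """Fraction of time-to-first-answer removed by moving compute closer.

    ttft_ms:     model-side time to first token (prefill and queuing)
    far_rtt_ms:  assumed network round trip to a distant, centralized site
    near_rtt_ms: assumed network round trip to a nearby edge site
    """
    before = ttft_ms + far_rtt_ms
    after = ttft_ms + near_rtt_ms
    return (before - after) / before


# Mid-range of the quoted 400-500 ms, with hypothetical round-trip times.
saving = edge_saving_fraction(ttft_ms=450.0, far_rtt_ms=60.0, near_rtt_ms=10.0)
print(f"edge placement removes about {saving:.0%} of the wait")

# For contrast: if the model side took only 20 ms, the same move would matter a lot.
saving_small_model = edge_saving_fraction(ttft_ms=20.0, far_rtt_ms=60.0, near_rtt_ms=10.0)
print(f"with a 20 ms model-side time: about {saving_small_model:.0%}")
```

Under these assumptions, moving compute to the edge trims roughly a tenth of the wait, so the model-side time dominates. The second case shows why this could change once smaller, faster models shift the balance.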
[29:28] >> Got it.
[29:28] >> Right. And so to me that will also mean
[29:32] that this will push us towards more
[29:34] concentrated clusters of compute for
[29:36] inference
[29:37] >> still for some time. This will change
[29:40] as we figure out how to distill
[29:43] very intelligent models to be small
[29:45] >> and potentially run closer to you.
[29:47] >> Got it.
[29:48] >> But at this point the economics don't
[29:51] favor it.
[29:52] >> Got it. A follow-up on that. Could you
[29:54] break down the 500 milliseconds into
[29:57] the different components of
[29:59] that call, from the time that, you know, we
[30:01] press the enter button on
[30:04] ChatGPT? If you were to allocate that
[30:06] 500 milliseconds, who's using that up? How
[30:08] much budget is each part of the stack
[30:10] allocated?
[30:11] >> I'd say the 500 milliseconds didn't even
[30:13] include some of the other components
[30:16] that you were talking about. But even, for
[30:18] example,
[30:19] >> you ask a query on Codex, it's
[30:22] running off a project, right? It is
[30:25] going to take that prompt and combine it
[30:27] with your codebase. Right? That's the
[30:29] entire context for that
[30:32] model.
[30:33] >> Mhm.
[30:34] >> I mean there's to get technical for a
[30:36] minute the prefill phase of running the
[30:39] inference. It basically has to run that
[30:42] entire context which could be
[30:44] >> hundreds of megabytes.
[30:46] >> Mhm.
[30:46] >> Uh like our Codex models now are 400k
[30:49] context, right? So there's 400k tokens.
[30:51] >> Mhm. 400k tokens have to be computed
[30:55] through the attention mechanism before
[30:57] the first output token is generated.
[30:59] >> Right?
[31:00] >> And so that is basically the model
[31:03] paging in all the context relevant to
[31:06] that task before it spits out even the
[31:08] first output token.
[31:09] >> Right?
[31:10] >> And so that's that several hundred
[31:12] milliseconds of latency.
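A rough way to see why prefill gets expensive at long context: under standard causal self-attention, each token attends to itself and to every earlier token, so the number of query-key pairs grows quadratically with context length. The sketch below only counts those pairs. The function name is illustrative, and it assumes dense attention with no sparsity and no reuse of previously cached context, which a real serving stack may not match.

```python
def causal_attention_pairs(context_tokens: int) -> int:
    """Count query-key pairs for dense causal self-attention (per layer, per head)."""
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    # Token i attends to tokens 1..i, so the total is 1 + 2 + ... + n.
    return context_tokens * (context_tokens + 1) // 2


# 400k is the context size mentioned in the talk; 4k is an arbitrary comparison point.
short_context = causal_attention_pairs(4_000)
long_context = causal_attention_pairs(400_000)
growth = long_context / short_context

print(f"pairs at 4k tokens:   {short_context:,}")
print(f"pairs at 400k tokens: {long_context:,}")
print(f"growth factor:        {growth:,.0f}x")
```

A context that is 100 times longer produces roughly 10,000 times as many attention pairs, which is one reason the first output token can take hundreds of milliseconds.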
[31:13] >> After that you can add other stuff,
[31:15] right?
[31:15] >> Prefill the first part is
[31:16] >> prefill. This is prefill. And so after
[31:19] that you can add the other sources of
[31:20] latency that could be like it usually
[31:22] could just be your app
[31:24] >> turning your prompt into tokens that are
[31:27] sent to the cloud
[31:28] >> and load balanced into the appropriate
[31:31] GPU to run the model. All of that is
[31:33] going to add maybe tens of milliseconds
[31:35] of latency.
[31:36] >> So that's where I was saying that that
[31:38] first token generation latency is higher
[31:42] than all the other sources of latency.
[31:43] But an interesting side effect um when
[31:47] we brought Cerebras in and we rolled out
[31:49] Cerebras earlier this year uh it started
[31:52] generating tokens so much faster
[31:55] >> that all of these other latencies that
[31:57] we had in the system in the app in the
[31:59] way our API works started to become
[32:01] prominent.
[32:02] >> Mhm.
[32:02] >> And so when we improved one layer of the
[32:06] stack it forced us it actually showed up
[32:08] all the inefficiencies that we had in
[32:10] the rest of the stack. And so we had to
[32:12] do a lot of engineering to fix those
[32:14] latencies. We literally published a blog
[32:16] post on this yesterday.
[32:18] >> So we had to change OpenAI's API
[32:20] infrastructure
[32:21] >> to actually keep pace with Cerebras. And
[32:24] so, if someone's interested, there's a lot
[32:26] of very neat software engineering that
[32:29] has gone into how do we shave off
[32:32] latency at every layer of the stack.
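The effect described here, where speeding up one layer makes the remaining layers prominent, can be sketched with a simple latency budget. Only the orders of magnitude come from the talk (several hundred milliseconds on the model side, tens of milliseconds elsewhere). The component names, the individual millisecond values, and the 10x speedup are all illustrative assumptions, not measurements.

```python
def latency_shares(budget_ms: dict[str, float]) -> dict[str, float]:
    """Return each component's fraction of the total latency."""
    total = sum(budget_ms.values())
    return {name: ms / total for name, ms in budget_ms.items()}


def speed_up(budget_ms: dict[str, float], component: str, factor: float) -> dict[str, float]:
    """Return a copy of the budget with one component made `factor` times faster."""
    faster = dict(budget_ms)
    faster[component] = faster[component] / factor
    return faster


def overhead_share(budget_ms: dict[str, float], dominant: str) -> float:
    """Fraction of total latency spent outside the dominant component."""
    return 1.0 - latency_shares(budget_ms)[dominant]


# Hypothetical budget: model-side time to first token plus smaller overheads.
baseline_ms = {
    "model_first_token": 450.0,  # assumed, within the 400 to 500 ms range cited
    "app_tokenization": 10.0,    # assumed
    "network": 20.0,             # assumed
    "load_balancing": 10.0,      # assumed
}
accelerated_ms = speed_up(baseline_ms, "model_first_token", 10.0)  # assumed 10x

before = overhead_share(baseline_ms, "model_first_token")
after = overhead_share(accelerated_ms, "model_first_token")
print(f"non-model share of latency before: {before:.0%}")
print(f"non-model share of latency after:  {after:.0%}")
```

With these assumed numbers the non-model overhead goes from under a tenth of the total to nearly half, even though none of those components got slower.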
[32:33] >> What's the name of this blog?
[32:35] >> The OpenAI blog.
[32:36] >> OpenAI blog. Great. Great. Great.
[32:37] Great. So it's like a whack-a-mole
[32:39] problem, you know, similar to how folks
[32:41] were optimizing page load times. Yes. On
[32:43] the internet.
[32:44] >> Yes. I mean, I think latency is going to
[32:46] be a very important dimension we will
[32:48] focus on,
[32:49] >> right?
[32:49] >> Uh I think the trope is true that every
[32:52] 30 or 50 milliseconds of latency you can
[32:54] shave,
[32:55] >> leads to higher engagement, leads to
[32:57] higher revenue, leads to higher
[32:58] retention. For sure that is true, right?
[33:00] And I think that is going to be a
[33:02] dimension on which all of us are going
[33:03] to compete.
[33:04] >> Fascinating. That makes a lot of sense.
[33:06] And particularly given the attention uh
[33:09] of the human brain is only going one
[33:10] way.
[33:11] >> Yes.
[33:12] >> Not not expanding.
[33:13] >> Yes.
[33:13] >> Here's a fun question for you. You know,
[33:15] every guest we've had so far has has
[33:17] mentioned that compute is the biggest
[33:19] bottleneck as an ingredient for their
[33:22] business. Probably true. Um what is the
[33:25] consensus that the AI community has has
[33:27] maybe wrong or not not right enough that
[33:30] you you have reason to believe um is
[33:32] misunderstood? What about AI
[33:34] infrastructure is most misunderstood?
[33:36] Right now
[33:39] >> I guess the
[33:42] the biggest shift that is happening that
[33:45] is underappreciated
[33:47] is we have very simplistic systems
[33:51] today. Right. And what I mean by that is
[33:54] >> we have these big compute units attached
[33:57] to one layer of memory which is high
[33:59] bandwidth memory.
[34:00] >> Mhm.
[34:00] >> Right. And I think we went through this
[34:04] in general purpose computing. CPUs
[34:06] started similarly, right? And then they
[34:08] added multiple layers of caching.
[34:11] >> They added flash, they added hard drive
[34:13] storage, all kinds of stuff, right?
[34:15] >> And so I think we are very early days in
[34:19] how systems infrastructure is going to
[34:21] evolve
[34:22] >> for AI compute.
[34:23] >> Uh we've gone from very simplistic ways
[34:26] of programming these things to more
[34:28] sophisticated ways. I'd say the other
[34:30] big shift that's happening underneath is
[34:34] AI is generating the next generation AI
[34:38] infrastructure.
[34:39] >> And so what I mean by that is we are
[34:41] increasingly using our latest models to
[34:44] design the next chip.
[34:46] >> Mhm.
[34:47] >> And the next set of low-level software
[34:49] needed to run the next model.
[34:51] >> Mh.
[34:52] >> So recursion, if you will, right? So how
[34:54] can AI basically figure out what is the
[34:57] right kind of chip system and software
[35:00] it needs to run most efficiently
[35:03] >> rather than this decoupled world today
[35:05] where we train a model someone else is
[35:07] designing a chip independently and
[35:09] delivering to us and we figure out how
[35:10] to make it work. So how do we uh quicken
[35:14] that pace where basically the next model
[35:18] while it is being trained is also
[35:20] figuring out what should be the chip and
[35:23] system design it wants to run most
[35:25] efficiently. We are not that far from
[35:27] that world.
[35:28] >> Fascinating. That's recursion, and
[35:32] recursive algorithms are one of the most
[35:33] powerful algorithms. So this seems like
[35:35] a brave future. It is I think but it is
[35:38] also probably the only feasible way to
[35:42] bend the curve on the compute time right
[35:45] so cycle time compute cycle like how
[35:48] quickly can we get the right kind of
[35:50] compute designed and operational for the
[35:53] next generation
[35:54] >> and and so because otherwise we won't be
[35:56] able to keep pace if humans are
[35:59] going to try and interpret and then
[36:01] design and then do it
[36:02] >> a typical chip design cycle is 3 years
[36:05] >> like from inception of idea ideating on
[36:08] what a chip should be to actually
[36:10] getting it in production is 3 years and
[36:12] that's too long given how quickly things
[36:14] are changing.
[36:15] >> Yeah. 3 years is right around when
[36:17] ChatGPT was launched.
[36:18] >> Yes.
[36:19] >> So yes,
[36:20] >> that feels like forever ago.
[36:21] >> It's an eternity.
[36:22] >> You know, one of the questions we ask a
[36:24] lot of um our speakers uh is this chart
[36:27] here. Um you know, we talk about the
[36:30] five layer cake of AI as Jensen
[36:32] describes it. energy chips, infra
[36:36] models, apps, you play across all five
[36:38] of them. We're waiting for chips to show
[36:41] up soon from Broadcom and others. Um, if
[36:44] you were to uh guide us based on
[36:48] everything you know, which part of the
[36:49] stack is most likely to accrue value in
[36:51] the long term, what would you point to?
[36:54] Obviously, all of the money right now is
[36:55] in the bottom half of this layer cake.
[36:58] >> It changes, right? So, I mean, I think
[37:00] uh history rhymes. So if you look at the
[37:03] mobile revolution, initially a lot of
[37:07] the money was made by the telcos and the
[37:11] people building the infrastructure.
[37:12] >> Yeah.
[37:13] >> Uh then it moved up uh into the
[37:16] application layer, the people building
[37:18] the apps.
[37:19] >> And then it moved up into the cloud
[37:22] services layer.
[37:24] >> Uh I don't see any reason why this cycle
[37:27] will be different. We are right now in
[37:28] the world where the infra layer is where
[37:30] the profits are. Mhm.
[37:31] >> But over time it'll move to the
[37:33] platforms and the apps.
[37:35] >> Mhm.
[37:35] >> Uh and so that is uh that is I guess the
[37:38] inevitable cycle,
[37:39] >> right?
[37:40] >> Oh, we hope so.
[37:41] >> It seems that every app is getting
[37:42] engulfed by uh openthropic.
[37:45] >> Uh certainly right this second. So, so,
[37:47] so we were eagerly waiting for that.
[37:50] >> Rapid fire question for you Sachin
[37:51] before we uh open it up. long short.
[37:54] Pick a business. Pick a startup that
[37:55] you're very excited about that you'd go
[37:57] long. And the same on the other side, a
[37:59] counterfactual, a business idea startup
[38:01] that you're uh bearish about.
[38:05] >> I'm long OpenAI. I'm voting with my
[38:08] feet, but keeping that aside. Uh no, I think
[38:11] uh I'd say that
[38:14] the thing maybe for this audience that
[38:17] is underappreciated uh the I would go
[38:20] long on the lowest layer of the stack.
[38:24] Uh because
[38:27] at least in the US we have forgotten
[38:31] how to build very foundational
[38:33] infrastructure.
[38:34] >> Mhm.
[38:34] >> Right. And that's from everything like
[38:37] how do we build transformers at scale?
[38:39] How do we build batteries at scale?
[38:41] >> How do we build generation and
[38:43] distribution? How do we build cooling?
[38:46] How do we build components that go into
[38:49] all of these systems?
[38:50] >> Uh that is
[38:53] an underserved layer of the
[38:56] infrastructure.
[38:56] >> Fascinating.
[38:57] >> Uh that is also one where
[39:00] differentiation is sustainable because
[39:03] it's both technical as well as scale. if
[39:06] you build it, it's very hard for other
[39:08] people to replicate it.
[39:10] >> Uh so I'd say kind of a corollary bet on
[39:14] if AI is going to have that
[39:16] transformation that we think it will.
[39:18] >> Really, all of this layer
[39:21] has to change from how it's done. Mhm.
[39:24] >> And so for people in this audience
[39:26] especially early in their careers,
[39:28] >> I think I was a faculty here as
[39:32] some of you know for 15 years both in EE
[39:34] and CS and I saw dwindling enrollments
[39:36] in EE.
[39:38] >> Uh especially on the lower layers of the
[39:40] stack, right? Especially around how do
[39:41] you do transistors, how do you do
[39:43] materials, how do you do that kind of
[39:44] stuff.
[39:45] >> That stuff is what will move the needle
[39:47] here.
[39:48] >> And so I'd strongly encourage going long
[39:51] that layer of the stack. Great. Great.
[39:54] We have some EE students in the
[39:56] class. Uh what are you short Sachin?
[39:59] What are you skeptical about? What are
[40:00] you cautious about? We'll lower the
[40:02] stakes.
[40:03] >> In general, obviously,
[40:05] I'm short anything that is a model
[40:09] wrapper.
[40:10] >> Mhm.
[40:11] >> Uh and that's a bit of an easy answer,
[40:14] but it is also true because the pace at
[40:16] which this thing is changing.
[40:18] >> Mhm. uh and how quickly these models are
[40:22] able to introspect and figure out how to
[40:26] deliver an outcome.
[40:28] >> Uh I'd say that it's very very very hard
[40:32] >> to to just be a wrapper on top.
[40:35] >> Um so that is not a statement that OpenAI
[40:38] or even Anthropic, for that matter, I
[40:40] would say the same
[40:41] >> that we just want to build all the apps.
[40:43] I I think this whole notion of apps
[40:46] >> probably to me is the one that I'd be
[40:48] short of
[40:49] >> like is that going to be the user
[40:51] interface of the future unclear right
[40:53] like is it really going to be apps if we
[40:55] are going to
[40:56] >> interact with computing in the form of
[40:58] outcomes this is the outcome I want go
[41:00] figure it out
[41:01] >> today apps are a crutch
[41:03] >> right
[41:03] >> to get to an outcome
[41:05] >> right right
[41:05] >> and so that would be the notion I'd be
[41:07] short of
[41:08] >> fascinating that makes a ton of sense um
[41:11] first company to 10 trillion in market
[41:13] cap if you were to pick one.
[41:17] >> The easy answer is Nvidia, right?
[41:18] >> Yeah.
[41:19] >> Yeah. Yeah. Okay. I thought for a second
[41:21] you were going to say OpenAI. Yeah.
[41:22] >> The first one you said. I hope I will
[41:24] get there for sure. But I think if we
[41:27] are getting there for sure, Nvidia is
[41:28] getting there.
[41:29] >> Good, good, good, good, good, good. Um,
[41:31] biggest unsolved problem in
[41:32] infrastructure right now.
[41:35] >> Oh, you you name it. Right. So I think
[41:37] uh so many uh
[41:39] >> I'd say the single structural issue is
[41:43] enough fab capacity across logic and
[41:47] memory.
[41:48] >> Uh
[41:49] >> this is at the TSMC layer.
[41:50] >> TSMC, Samsung, Intel and then Micron,
[41:53] SK Hynix, Samsung. Uh this it's a very very
[41:57] concentrated market.
[41:58] >> Mhm. uh this whole thing is kind of
[42:02] single threaded on a very small number
[42:04] of companies
[42:05] >> and probably if you dig down even deeper
[42:06] it's ASML
[42:08] >> right
[42:08] >> right and that's at this for all of
[42:10] these right you need ASML machines
[42:12] >> so to me that that is the single choke
[42:15] point
[42:16] >> right
[42:16] >> of the whole supply chain
[42:17] >> right makes a ton of sense you already
[42:20] answered my last question which was
[42:21] advice for students if you have anything
[42:24] to add we'll take it otherwise we'll
[42:25] open it up for questions for a couple
[42:27] minutes go ahead
[42:28] Thank you for being here. Um my question
[42:30] is as we go through and you mentioned
[42:33] compute is a very broad term where do
[42:35] you think the next is going to come
[42:37] from? Is it going to be the hardware
[42:38] like memory networking
[42:40] software?
[42:45] >> Short to medium term it's probably in
[42:48] the orchestration software the harness
[42:51] and the models getting more token
[42:53] efficient.
[42:55] uh medium to long-term I'd say new
[42:58] memory architectures
[43:00] uh because
[43:02] I think the compute unit, unless
[43:04] the transformer gets reinvented like
[43:06] something replaces the transformer
[43:09] you know what the compute unit shape is
[43:10] right it's really kind of what is the
[43:13] memory architecture around the compute
[43:15] unit that's changing all the time so
[43:17] that would be my medium to long-term
[43:19] answer
[43:23] >> go ahead that you walked away from
[43:25] Stargate EK. goes through some of those
[43:28] positions through the process for
[43:32] >> uh I think for us Stargate is basically
[43:37] Stargate is my job right so it's how do
[43:40] we deliver all of this compute, and the
[43:43] way we look at it is given the size we're
[43:46] talking about uh like a gigawatt is $70
[43:51] billion in spend so these are massive
[43:54] numbers and it's also operationally a
[43:58] big challenge right it's as I said a
[44:00] gigawatt is half a million GPUs to
[44:03] manage build up staff and all that so I
[44:06] think fundamentally the way I look at
[44:08] this is how do I make sure that it's not
[44:11] just the absolute number it also lands
[44:14] on time as quickly as possible and so a
[44:18] big part in our kind of approach is now
[44:21] time to compute rather than amount of
[44:24] compute and so that's dictating kind of
[44:27] where we double down invest and that's
[44:29] why the earlier question too we we
[44:32] prefer bigger chunks of concentrated
[44:34] compute for that reason because
[44:36] otherwise operationally it's very hard
[44:38] for us to get that compute online if
[44:40] it's lots of little chunks spread
[44:42] everywhere
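The two figures quoted here, about $70 billion of spend per gigawatt and about half a million GPUs per gigawatt, imply some per-GPU averages. The sketch below just does that division. The variable names are illustrative, and the results are site-level averages implied by the quoted numbers, not chip prices or chip power ratings.

```python
SPEND_PER_GIGAWATT_USD = 70e9   # "a gigawatt is $70 billion in spend"
GPUS_PER_GIGAWATT = 500_000     # "a gigawatt is half a million GPUs"
WATTS_PER_GIGAWATT = 1e9

# Average total spend attributable to each GPU slot (everything included, by assumption).
spend_per_gpu_usd = SPEND_PER_GIGAWATT_USD / GPUS_PER_GIGAWATT

# Average site power budget per GPU, covering whatever else shares the facility.
watts_per_gpu = WATTS_PER_GIGAWATT / GPUS_PER_GIGAWATT

print(f"implied spend per GPU: ${spend_per_gpu_usd:,.0f}")
print(f"implied power per GPU: {watts_per_gpu:,.0f} W")
```

Those averages work out to $140,000 and 2 kW per GPU, which gives a sense of why operational concentration matters at this scale.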
[44:44] >> go ahead
[44:45] >> I want to ask about open weight models
[44:48] models are proving first part of the
[44:50] question is how you think that that's
[44:53] second part is key Obviously open-weight
[44:56] models usually have fewer parameters
[44:58] require less compute.
[45:03] >> Yeah, I mean I think uh obviously open
[45:06] source models have a role to to play uh
[45:09] in the ecosystem.
[45:13] Frontier model intelligence is going
[45:17] to require orders of magnitude more
[45:20] compute right uh I think that we don't
[45:23] see that changing, so the scaling laws
[45:25] continuing to hold
[45:26] >> so we will continue to invest on that
[45:29] frontier model intelligence obviously
[45:31] open weight models will play catchup and
[45:35] try and distill that intelligence to
[45:37] deliver it in more compact form
[45:41] factors and we don't see that as an
[45:43] issue but a six-month lead in
[45:47] intelligence is an enormous lead
[45:49] >> right
[45:50] >> uh and so we we we don't see any reason
[45:54] to back off on continuing to invest on
[45:57] frontier intelligence.
[46:00] >> Awesome folks. We'll wrap it here.
