# How GPT, Claude, and Gemini are actually trained and served – Reiner Pope

https://www.youtube.com/watch?v=xmkSf5IS-zw
Translation: zh-CN

[00:06] Today I'm interviewing Reiner Pope, who is CEO of MatX, which is a new chip startup.
  今天我采访的是 Reiner Pope，他是 MatX 的首席执行官，MatX 是一家新的芯片初创公司。

[00:07] Previously he was doing TPU architecture and many other things at Google.
  此前，他在谷歌从事 TPU 架构以及许多其他工作。

[00:10] This is a very different format from my usual interviews.
  这与我通常的采访形式截然不同。

[00:11] This is going to be a Blackboard lecture we're going to get up in a second.
  这将是一场黑板讲座，我们马上就开始。

[00:14] We in fact built this whole new studio with specifically this format in mind.
  事实上，我们建造了这个全新的工作室，专门考虑到了这种形式。

[00:16] Um and so it's a pleasure to get to inaugurate it with you.
  嗯，所以很高兴能与您一起启用它。

[00:20] We're going to be talking about model architecture, ML infra, many other things.
  我们将讨论模型架构、ML 基础设施以及许多其他事情。

[00:24] And the reason I think it's an important topic is that once you actually understand how training and inference work in a cluster, as we'll see, a lot of things start making sense: why AI is the way it is, why AI architectures are the way they are, why API prices are the way they are, and fundamentally why AI progress is the way it is. You need to understand the details to get there, and you need a blackboard to understand the details.
  嗯，我认为这是一个重要话题的原因是，一旦你真正理解了训练和推理在集群中是如何工作的，正如我们将看到的，关于人工智能为何是现在这样，人工智能架构为何是现在这样，嗯，API 定价为何是现在这样，以及根本上人工智能的进步为何是现在这样，你就会开始理解，并且你需要了解细节才能做到这一点，而你需要一块黑板来理解细节。

[00:48] So Reiner, thank you so much for doing this.
  所以 Reiner，非常感谢您做这件事。

[00:50] Yeah very happy to be here. Okay.
  是的，很高兴来到这里。好的。

[00:52] Uh, full disclosure, I am an angel investor in MatX, but that's unrelated to this podcast.
  呃，坦白说，我是 MatX 的天使投资人，但这与本播客无关。

[00:56] Um, Reiner, maybe to kick us off, I'll ask this question.
  嗯，Reiner，也许为了开始，我会问这个问题。

[00:59] So, we have
  所以，我们有

[01:02] a couple of companies like Claude and Codex and Cursor are offering something like uh, fast mode where for 6x the price, they'll give you streamy tokens at 2.5x the speed.
  Claude、Codex 和 Cursor 等几家公司提供类似“快速模式”的服务，以 6 倍的价格，提供速度快 2.5 倍的流式令牌。

[01:10] Mechanically, I'm curious what's going on here.
  从机制上讲，我很想知道这里发生了什么。

[01:12] Like, why is it the case that you can pay more to get faster latency?
  比如，为什么你支付更多费用就能获得更快的延迟？

[01:16] And two, could you keep going?
  第二，你能继续下去吗？

[01:19] Could you pay 100x more and somehow get even faster speeds or much much faster speeds?
  你能支付 100 倍的费用，并以某种方式获得更快的速度，或者快得多的速度吗？

[01:24] Um, and three, could you go the other way?
  嗯，第三，你能反过来吗？

[01:26] Could you have something like a Claude Code slow mode where, if you are willing to wait for minutes on end, you could get even cheaper prices?
  你能拥有类似 Claude Code“慢速模式”的东西吗？如果你愿意等待很长时间，你可以获得更便宜的价格。

[01:35] So maybe this will help motivate the kind of analysis that you'll be doing through the lecture.
  所以，这也许有助于引出你将在这场讲座中进行的分析。

[01:40] Great.
  太好了。

[01:40] I mean to jump to a little bit to jump to the conclusion the big effect is batch size but what we're going to do now is quantify exactly what that looks like and what its implications are on latency and cost.
  我的意思是，跳到结论，最大的影响是批处理大小，但我们要做的就是量化它的确切样子以及它对延迟和成本的影响。

[01:48] Uh there's going to be another effect which is um you can call it speculative decoding or multi-token prediction.
  嗯，还将有另一个影响，你可以称之为推测性解码或多令牌预测。

[01:54] We can maybe come back to that later but I think the first thing that we'll talk through is batch size.
  我们也许可以稍后再谈，但我认为我们要讲的第一件事是批处理大小。

[01:57] So what I'd like to introduce is um sort
  所以，我想介绍的是，嗯，大致是

[02:02] of the two principles of analysis.
  两种分析原理。

[02:04] Firstly, we're going to look at a roofline analysis of how I run a transformer model on a cluster of chips.
  首先，我们将对如何在芯片集群上运行 Transformer 模型进行屋顶线（roofline）分析。

[02:10] We'll take, let's say, a Blackwell NVL72 cluster, so a rack of 72 GPUs. The roofline analysis means we look at memory bandwidth and compute performance, and then the other side of that is that we're going to look at just two simple factors of the model, which are the time to operate on the weights and the time to operate on the context, the KV cache.
  嗯，我们将采用一个 Blackwell NVL72 集群，也就是一个包含 72 个 GPU 的机架。屋顶线分析意味着我们关注内存带宽和计算性能，然后另一方面，我们将关注模型的两个简单因素，即操作权重的时间以及操作上下文（KV 缓存）的时间。

[02:35] So let's jump in.
  那么，让我们开始吧。

[02:37] What we're going to try and do is we're going to try and estimate the time that it takes uh to to run an inference of a certain shape.
  我们将尝试估计运行特定形状的推理所需的时间。

[02:44] Now, we're not perfect here.
  现在，我们并不完美。

[02:47] We can't uh exactly predict the time.
  我们无法准确预测时间。

[02:49] And so, instead, we're going to approximate.
  所以，取而代之的是，我们将进行近似。

[02:50] And so, we're going to say that the time must be greater than or equal to a certain quantity.
  因此，我们将说时间必须大于或等于某个数量。

[02:54] And so, we're going to consider two different um aspects.
  因此，我们将考虑两个不同的方面。

[02:56] We're going to look at the time for uh it takes to uh do the memory fetches uh and then the
  我们将查看内存获取所需的时间，然后是

[03:03] time it takes to do the compute.
  完成计算所需的时间。

[03:07] And it'll turn out that this actually gives us a very strong predictive power even with a simple model.
  事实证明，即使使用简单的模型，这实际上也为我们提供了非常强大的预测能力。

[03:10] So one by one, what is the time that it takes to do the compute?
  所以一个接一个，完成计算需要多长时间？

[03:19] So there are really two things I need to do in the compute.
  所以计算中我需要做的实际上是两件事。

[03:20] I need to um multiply by all of the active parameters um and then I need to do some work on the attention.
  我需要乘以所有活动的参数，然后我需要对注意力做一些工作。

[03:26] So multiplying by all the active parameters.
  所以乘以所有活动的参数。

[03:29] I have a certain batch size that I'm running and then I've got a number of uh active parameters in my model
  我有一个正在运行的特定批次大小，然后我的模型中有一定数量的活动参数

[03:37] and then um and then I'm just going to divide this by the compute throughput which is uh the flops of the chip.
  然后我将把它除以计算吞吐量，也就是芯片的浮点运算次数。

[03:45] So this is hardware constant.
  所以这是硬件常量。

[03:48] So this this actually accounts for all of the compute time for all of the weight matrix multiplies.
  所以这实际上占了所有权重矩阵乘法的计算时间。

[03:55] Um there's a little caveat here.
  这里有一个小小的注意事项。

[03:55] we we've sort of ignored the time to do any of the attention computation but that in general can be will be quite small in comparison to this.
  我们已经忽略了进行任何注意力计算的时间，但总的来说，与这个相比，它会很小。

[04:01] So so we'll ignore this.
  所以我们会忽略它。
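
Written out, the compute bound described above looks roughly like this. The symbols are labels chosen here for illustration, not notation from the blackboard: $B$ is the batch size, $N_{\text{active}}$ is the number of active parameters, and $C$ is the chip's compute throughput in multiplies per second. Attention compute is left out, as in the discussion.

```latex
t_{\text{compute}} \;\ge\; \frac{B \cdot N_{\text{active}}}{C}
```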

[04:03] Maybe I'll just interject from time to time to ask some very naive questions or to clarify some basic points. But just for the audience: you're not serving one user at a time.
  也许我会不时地插话，问一些非常天真的问题，或者澄清一些基本的观点，但只是为了观众，你不是一次服务一个用户。

[04:11] The batch refers to the fact that you're serving many different users at the same time.
  批处理指的是你同时服务许多不同用户的事实。

[04:15] Yeah.
  是的。

[04:15] Um and that's a whole batch.
  嗯，那就是一整批。

[04:17] Yeah.
  是的。

[04:17] So I can motivate the batch at least a little bit.
  所以至少可以稍微激励一下批处理。

[04:18] So, I mean, we will see exactly why batch is such a favorable optimization, but what will turn out to be the case is that if you do not batch together many users, the cost and the economics you get can be like a thousand times worse than if you do batch many users together, and we'll be able to see that quite explicitly.
  所以，嗯，我的意思是，我们将确切地看到为什么批处理是一个如此有利的优化，但事实证明，如果你不将许多用户一起批处理，你获得的成本和经济效益可能会比你将许多用户一起批处理差一千倍，我们将能够非常清楚地看到这一点。

[04:38] And then the number of active parameters: this is saying, if I look at, for example, a DeepSeek model, the DeepSeek V3 model has about 37 billion active parameters and 700 billion total parameters.
  然后，嗯，活跃参数的数量，比如说，如果我以 DeepSeek 模型为例，DeepSeek V3 模型大约有 370 亿活跃参数和 7000 亿总参数。

[04:51] So this is we're focusing on just the ones that are active for a single token.
  所以这是我们只关注单个标记活跃的那些。

[04:53] Okay, so we've modeled compute performance.
  好的，所以我们模拟了计算性能。

[04:56] I'm going to keep writing equals, but in all of these cases, you can think of this time as being at least this much and and maybe there will be some terms we
  我将继续写等于号，但在所有这些情况下，你可以认为这段时间至少是这么长，而且可能还会有一些我们

[05:04] ignored.
  忽略。

[05:08] Um on the memory side, um what do we need to do uh with memory?
  嗯，在内存方面，嗯，我们需要对内存做什么？

[05:10] We we need to fetch um we need to fetch all of the weights.
  我们需要获取，我们需要获取所有的权重。

[05:12] And so there is some time to fetch all of the the total number of parameters not just the active parameters.
  所以需要一些时间来获取参数的总数，而不仅仅是活动的参数。

[05:21] So there's weight fetch time, and then in addition there's a KV cache fetch time.
  嗯，所以有权重获取时间，然后此外，还有 KV 缓存获取时间。

[05:26] So there is um this actually depends on batch size.
  所以，嗯，这实际上取决于批次大小。

[05:28] Uh so for every element of the batch we have to fetch uh an entire context length worth of tokens and then there's a size per token.
  呃，所以对于批次中的每个元素，我们必须获取整个上下文长度的令牌，然后每个令牌都有一个大小。

[05:39] So uh um like bytes bytes per for for one token.
  所以，呃，嗯，比如一个令牌的字节数。

[05:46] Um and so this is a model parameter.
  嗯，所以这是一个模型参数。

[05:47] And maybe just backing up, let's explain what the KV cache is real quick.
  也许我们再回顾一下，快速解释一下 KV 缓存是什么。

[05:52] Yeah.
  是的。

[05:52] So when I do a forward pass uh let me draw actually a um how the autoregressive inference works.
  所以当我做一个前向传播时，呃，让我实际画一下，嗯，自回归推理是如何工作的。

[05:57] So this is during decode.
  所以这是在解码期间。

[05:58] Um so if I think I have a bunch of tokens uh of text I'm growing a
  嗯，所以如果我认为我有一堆文本令牌，我正在生成一个

[06:05] tensor because uh ultimately the tokens are represented as some like tensor of uh in some embedding dimension and then in this direction I have the sequence length.
  张量，因为最终这些 token 被表示为某种嵌入维度中的张量，然后在这个方向上，我有序列长度。

[06:18] Um the work of running a decode is I I have to run each token through a um uh through a whole bunch of matrix multiplies over a bunch of different layers.
  嗯，运行解码的工作是，我必须将每个 token 通过大量的矩阵乘法，跨越许多不同的层。

[06:27] Um, and I have I have in general I'm going to have to do that work over uh all of these uh tokens.
  嗯，而且我通常必须对所有这些 token 完成这项工作。

[06:35] But then one step of decode is actually to produce just this one additional token path here.
  但解码的一个步骤实际上是只生成这一个额外的 token 路径。

[06:40] Yep.
  是的。

[06:43] And so what I'm going to do there is I'm going to run a full forwards pass of uh multiplying by all of the weight matrices in the entire model.
  所以，我将要做的是，我将执行一次完整的正向传播，乘以整个模型中的所有权重矩阵。

[06:48] Um but then I've got this attention mechanism where this token sort of it's it's like looking at all of the past tokens.
  嗯，但然后我有了这个注意力机制，这个 token 就像在查看所有过去的 token。

[06:55] um in this way and what is it looking at specifically?
  以这种方式，它具体在看什么？

[06:59] It is looking at some internal representation that the model has produced of the tokens, and we call that the KV cache.
  它正在查看模型为这些 token 生成的内部表示，我们称之为 KV 缓存。

[07:04] So
  所以

[07:08] this process of attending this this single token attending to all of the history of tokens um that's attention.
  这个过程，即这个单一 token 注意所有 token 的历史，嗯，那就是注意力。

[07:13] It is mostly dominated by memory fetches rather than um than matrix multiplies.
  它主要由内存获取主导，而不是嗯，而不是矩阵乘法。

[07:18] Mhm.
  嗯哼。

[07:18] So we've got the amount of memory that we're fetching shown over here and then this is of course just then divided by the uh memory bandwidth.
  所以我们在这里展示了我们正在获取的内存量，然后这当然只是除以呃内存带宽。

[07:25] Um so uh so the memory bytes per second.
  嗯，所以，所以每秒内存字节数。
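
The two bounds described so far can be combined into one small function. This is a minimal sketch under stated assumptions: the parameter names are invented here, `compute_rate` is taken to mean multiplies per second, and `bytes_per_param` is an added knob for the weight precision. The numbers in the example are toy values picked to make the arithmetic easy, not real hardware or model figures.

```python
def decode_step_time(batch, ctx_len, n_active, n_total,
                     kv_bytes_per_token, bytes_per_param,
                     compute_rate, mem_bw):
    """Roofline lower bound (seconds) on one decode step.

    compute_rate: multiplies per second (assumed hardware constant)
    mem_bw:       memory bandwidth in bytes per second
    """
    # Weight matmuls: every sequence in the batch touches every active parameter.
    t_compute = batch * n_active / compute_rate
    # Weights are fetched once per step, regardless of batch size.
    weight_bytes = n_total * bytes_per_param
    # Each sequence fetches its own KV cache: one entry per context token.
    kv_bytes = batch * ctx_len * kv_bytes_per_token
    t_memory = (weight_bytes + kv_bytes) / mem_bw
    # The step can be no faster than the slower of the two.
    return max(t_compute, t_memory)


# Toy numbers only: memory bound at batch 1, compute bound at batch 200.
small = decode_step_time(batch=1, ctx_len=0, n_active=10, n_total=100,
                         kv_bytes_per_token=0, bytes_per_param=1,
                         compute_rate=1000, mem_bw=100)
large = decode_step_time(batch=200, ctx_len=0, n_active=10, n_total=100,
                         kv_bytes_per_token=0, bytes_per_param=1,
                         compute_rate=1000, mem_bw=100)
print(small, large)
```

With the KV term set to zero, the small batch is limited by the weight fetch and the large batch by compute, which is the crossover the latency plot is about to show.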

[07:35] So in fact these equations here are actually enough for us to now draw some fit lines.
  所以事实上，这些方程在这里实际上足以让我们现在画一些拟合线。

[07:40] And so the things that we'd like to look at are sensitivity to batch and then also, which we'll draw separately, to context length.
  所以我们想看的是对批次的敏感度，然后还有对上下文长度的敏感度，我们将单独绘制。

[07:51] So we said that the big big effects you can get is like some some trade-off in latency versus uh versus cost um in in batch size.
  所以我们说，你可以获得的巨大影响就像在批次大小的延迟与呃，与成本之间进行一些权衡。

[07:58] So so let's draw them out.
  所以，让我们把它们画出来。

[08:00] I think there's just really two graphs we want to draw.
  我认为我们实际上只想画两个图。

[08:01] Um we'll first just draw um batch size versus uh time here.
  嗯，我们首先在这里画出嗯，批次大小与呃时间的关系。

[08:11] So when we look at the shape of this, we've got a maximum of a sum and then and then another term.
  所以当我们看这个形状时，我们有一个总和的最大值，然后，然后是另一个项。

[08:19] Um so let's look at these terms one by one and how they scale uh uh the time for compute and and memory uh and how they show up.
  嗯，所以让我们逐一看看这些项，以及它们如何缩放计算和内存的时间，以及它们如何显示出来。

[08:28] So let's first look at this compute time.
  所以让我们先看看这个计算时间。

[08:31] Uh this is just purely linear linear in batch size with no um no offset.
  呃，这只是纯粹的线性，在批次大小上是线性的，没有呃，没有偏移。

[08:37] So it is some curve like this.
  所以它是一些像这样的曲线。

[08:37] This is this is t compute.
  这是计算时间。

[08:43] Um and then on the memory side we've got some portion here that uh that is just this constant that um that is you know constant in some base offset here which is the uh weight fetch weight fetch.
  嗯，然后在内存方面，我们这里有一些部分，呃，它只是这个常数，嗯，你知道，在这个基础偏移量上是常数，这是呃，权重获取，权重获取。

[09:00] And then finally we have this term here, which is the KV fetch, which we're going to draw as, this is the KV fetch,
  最后，我们有这里的这个项，它是 KV 获取，我们将把它画成，这是 KV 获取，

[09:13] which is linear in batch, and so it looks like that.
  它与批次大小成线性关系，所以看起来是这样的。

[09:19] So the sum of this plus this maxed with this.
  所以这个的和加上这个与这个的最大值。

[09:21] So let's at least first to draw the sum.
  所以我们至少先画出这个和。

[09:28] Um so the two memory times in in conjunction end up looking on this curved slope like this.
  嗯，所以这两个内存时间结合起来，最终看起来是这样的弯曲斜率。

[09:33] Mhm.
  嗯哼。

[09:34] And then we get a um the overall maximum is I'll draw a little figure here.
  然后我们得到一个嗯，总的最大值是，我在这里画一个小图。

[09:38] It is the maximum of these two curves.
  它是这两条曲线的最大值。

[09:41] Make sense?
  有意义吗？

[09:43] Okay.
  好的。

[09:43] So so so what does what does this mean actually?
  那么，这到底意味着什么呢？

[09:46] So this is a latency um plot.
  所以这是一个延迟嗯图。

[09:53] Um so if I grow my batch size I um I get initially some not very strong dependence on batch size and so there's some lower bound on latency here.
  嗯，所以如果我增加我的批次大小，我嗯，最初会得到一些不是很强的批次大小依赖性，所以这里有一个延迟的下限。

[10:02] Um latency lower bound lower bound.
  嗯，延迟下限，下限。

[10:10] Um so this already partially answers the question for a given hardware
  嗯，所以这已经部分回答了对于给定的硬件的问题

[10:14] configuration and then we can talk about varying hardware configuration but for a given hardware configuration there is a lower bound on latency which is simply the I need to read all of my total parameters um from uh from memory into the into the chips and that takes a certain amount of time.
  配置，然后我们可以讨论可变的硬件配置，但对于给定的硬件配置，延迟有一个下限，那就是我需要将我的所有总参数从内存读取到芯片中，这需要一定的时间。

[10:32] Uh if if I use all of my memory bandwidth I can't do any better than that.
  呃，如果我使用了所有的内存带宽，我就不能做得更好了。

[10:35] It seems like, from the way you've drawn the slopes for compute time and for how the KV grows and what implication the KV has on memory time... what if this were above or below? Is that necessarily the case? Because if this is always true, then as batch size grows, compute always dominates
  呃，看起来，从你画的计算时间斜率，以及 KV 如何增长、KV 对内存时间有什么影响来看……如果这条线在上面或下面会怎样？这是必然的吗？因为如果这总是成立，那么随着批次大小的增长，计算总是会超过

[10:57] KV, which suggests that if you have a big enough batch size, maybe memory is never an issue.
  KV，这表明如果批次大小足够大，内存可能永远不是问题。

[11:01] Yeah, this is really sensitive to the context length.
  是的，这真的对上下文长度很敏感。

[11:05] Um, so I think we should come back and explore this.
  嗯，所以我想我们应该回来探讨一下。

[11:06] As you vary the context length, the KV fetch time will go up and up, and so that'll cause a transition from compute limited to
  随着你改变上下文长度，KV 获取时间会越来越长，所以这会导致从计算受限到

[11:14] memory limited.
  内存受限。

[11:15] And is there something especially significant about the slope being exactly the slope of the compute time?
  斜率恰好是计算时间的斜率，这有什么特别重要的吗？

[11:25] Yeah, whenever we have balance points, it kind of says that you're getting it exactly right.
  是的，每当我们有平衡点时，这有点说明你做得恰到好处。

[11:27] And so for the particular context length where the slopes match, that says I am equally memory bound and compute bound,
  嗯，所以对于斜率匹配的特定上下文长度，这说明我同等程度地受内存和计算限制，

[11:36] which is a really desirable place to be. But suppose, and this is a very simple algebra problem, that the optimal is 100k context length
  这是一个非常理想的位置。但假设，这是一个非常简单的代数问题，假设最佳上下文长度是 100k，

[11:44] and you go to 200k context length. Does your MFU go down to like 50%? Does it have a humongous impact on MFU to be slightly outside of the optimal context length range, the Goldilocks zone?
  然后你增加到 200k 上下文长度，你的 MFU 会下降到 50% 吗？它会对 MFU 产生巨大的影响，使其略微超出上下文长度的最佳范围，即金发女郎区域吗？

[12:01] That's right.
  没错。

[12:01] So that is true as modeled here.
  所以正如这里所建模的那样，这是真的。

[12:03] Um there's a key point here that I'm modeling this context length as uh or I'm modeling the memory fetch as linear in context length.
  嗯，这里有一个关键点，我将这个上下文长度建模为，或者说我将内存获取建模为与上下文长度成线性关系。

[12:11] that actually depends on model architecture.
  这实际上取决于模型架构。

[12:13] It is true for many of the
  对于许多模型来说，这是真的。

[12:16] or all of the model architectures with dense attention.
  或者所有具有密集注意力的模型架构。

[12:17] Yeah.
  是的。

[12:20] Sparse attention actually scales much better than that.
  嗯，稀疏注意力实际上比那扩展得更好。

[12:22] Got it.
  明白了。

[12:24] And is sparse attention what everybody uses in practice?
  那么稀疏注意力是大家在实践中都在使用的吗？

[12:25] I'm pretty excited about sparse attention.
  我对稀疏注意力相当兴奋。

[12:27] uh it's hard to know what the labs are using.
  呃，很难知道实验室在用什么。

[12:28] DeepSeek has published a sparse attention mechanism.
  DeepSeek 发布了一种稀疏注意力机制。

[12:31] I'll just put in a plug for sparse attention.
  我就给稀疏注意力打个广告。

[12:33] Some of the DeepSeek papers that have published sparse attention end up putting a square root in this term.
  DeepSeek 发表的一些关于稀疏注意力的论文最终在这个项中加入了一个平方根。

[12:39] Okay.
  好的。

[12:39] So, so far we've done we've looked at the latency.
  所以，到目前为止，我们已经研究了延迟。

[12:41] Um it's kind of hard to read off cost from this.
  嗯，很难从中读出成本。

[12:43] Uh so if I think what does cost mean um I'm going to like to run this inference I'm going to use the GPU for a certain number of seconds like 1 millisecond or 20 milliseconds or something like that.
  呃，所以如果我思考成本意味着什么，我将运行这个推理，我将使用 GPU 一定的秒数，比如 1 毫秒或 20 毫秒之类的。

[12:56] Um, and I have to pay the rental time for for that for that time.
  嗯，我必须为那个时间支付租用时间。

[12:58] So like it's $2 an hour per GPU or something like that.
  所以就像每小时每 GPU 2 美元或者类似的东西。

[13:02] Um, so so that's the cost of this inference, but how much value have how many tokens have I processed during that inference?
  嗯，所以这就是这个推理的成本，但在那个推理过程中我处理了多少价值，多少个 token？

[13:09] That is the batch size.
  那就是批次大小。

[13:12] And so what we actually want to plot is going to be the cost versus batch
  所以我们实际想要绘制的是成本与批次

[13:16] size, which is like T over B versus batch size.
  大小的关系，也就是 T 除以 B 与批次大小的关系。

[13:23] Uh, this is the cost per token.
  呃，这是每个 token 的成本。

[13:27] Um so like we have to imagine dividing each of these three curves by by b.
  嗯，所以我们必须想象将这三条曲线中的每一条除以 b。

[13:34] So multiplying by this um reciprocal um and so what we end up with there is the the compute curve is going to um it was linear.
  所以乘以这个嗯倒数嗯，我们最终得到的是计算曲线将是嗯，它是线性的。

[13:44] We divide by b that makes that uh a constant here and this is t compute.
  我们除以 b，这使得这里的 uh 成为一个常数，这就是 t 计算。

[13:52] The um the kv fetch was linear now it becomes a constant as well um uh uh kv fetch.
  嗯，kv 获取是线性的，现在它也变成了一个常数，嗯，uh kv 获取。

[14:07] And then the weight fetch was constant, and now we've divided by b, and so it becomes this hyperbola.
  然后，权重获取原本是常数，现在我们除以 b，所以它变成了这条双曲线。

[14:21] And so again, we're going to compute the the max of the sum.
  因此，我们再次计算总和的最大值。

[14:28] So the sum of these two terms shifts the hyperbola up: the sum of the KV fetch and the weight fetch gives us a sort of higher hyperbola that's like this.
  嗯，所以这两个项的和将双曲线向上移动，KV 获取和权重获取的总和给我们一条类似这样的更高的双曲线。

[14:41] Mhm.
  嗯哼。

[14:41] And then we're going to take the max with the compute uh here.
  然后我们将与这里的计算进行比较取最大值。

[14:45] So we end up with this this being the overall shape that that we care about.
  所以我们最终得到这个我们关心的整体形状。

[14:52] So again, so like we see some limiting behavior.
  所以再次，我们看到一些限制行为。

[14:54] The cost initially starts very high at a batch size of one. Actually, it almost goes to infinity, because we've got so many weight fetches which are not amortized over a large batch size.
  成本最初在批次大小为一时非常高，实际上几乎趋于无穷大，因为我们有太多的权重获取，而这些权重获取并未分摊到大批次上。

[15:05] But then as we increase the batch size, the weight fetches become amortized over so many different batch elements that their cost grows very small, and eventually the compute time ends up driving the cost.
  但是，随着我们增加批次大小，权重获取被分摊到如此多的不同批次元素上，以至于它们的成本变得非常小，最终计算时间成为成本的主要驱动因素。

[15:17] Mhm.
  嗯哼。

[15:18] So there is a limiting um like lower
  所以有一个限制性的，嗯，像较低的

[15:22] bound lower bound on cost.
  成本的下界。

[15:29] um which is this one here.
  嗯，就是这个。
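
The cost-per-token curve can be sketched by dividing each term of the step-time bound by the batch size. This is an illustration only: the function and parameter names are made up here, `usd_per_second` stands in for whatever the hardware rental rate works out to, and the numbers are toy values rather than real model or GPU figures.

```python
def cost_per_token(batch, ctx_len, n_active, n_total,
                   kv_bytes_per_token, bytes_per_param,
                   compute_rate, mem_bw, usd_per_second):
    """Roofline lower bound on dollars per generated token."""
    compute = n_active / compute_rate                            # flat in batch
    kv_fetch = ctx_len * kv_bytes_per_token / mem_bw             # flat in batch
    weight_fetch = n_total * bytes_per_param / (mem_bw * batch)  # falls as 1 / batch
    seconds_per_token = max(compute, kv_fetch + weight_fetch)
    return seconds_per_token * usd_per_second


# Toy sweep: cost falls as the weight fetch is amortized, then hits the compute floor.
toy = dict(ctx_len=0, n_active=10, n_total=100, kv_bytes_per_token=0,
           bytes_per_param=1, compute_rate=1000, mem_bw=100, usd_per_second=1.0)
costs = {b: cost_per_token(batch=b, **toy) for b in (1, 10, 100, 1000, 10000)}
for b, c in costs.items():
    print(b, c)
```

Past the crossover the curve is flat, which is why waiting longer for a bigger batch stops buying a lower price in this model.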

[15:30] Yeah.
  是的。

[15:31] So Claude Code slow or Codex slow or whatever would just live on this line, and it wouldn't help much, because you're not able to amortize the KV values over a much bigger batch.
  嗯，所以 Claude Code 慢速模式或 Codex 慢速模式之类的，都会落在这条线上，而且帮助不大，因为你无法在更大的批次上摊销 KV 值。

[15:43] Yeah. Yeah. They're unique per batch.
  是的。是的。它们每个批次都是唯一的。

[15:45] The compute is also unique per batch.
  计算量每个批次也是唯一的。

[15:46] And so what is the minimum work you can do per batch after amortizing everything else away?
  那么，在摊销掉所有其他东西之后，你每个批次能做的最小工作量是多少？

[15:50] Um so at this point where you are no longer um memory bandwidth bound,
  嗯，那么在这一点上，当你不再是内存带宽瓶颈时，

[15:59] what practically how big a batch do you need to like how yeah how big are the batches practically for frontier models?
  实际上需要多大的批次才能像，是的，对于前沿模型来说，批次实际有多大？

[16:07] Um you can you can just solve for that actually.
  嗯，你实际上可以为它求解。

[16:09] Um and it's not even particularly sensitive to model architecture.
  嗯，而且它甚至对模型架构也不是特别敏感。

[16:12] So um let let's go ahead and do that.
  那么，嗯，让我们继续做吧。

[16:15] So what we're talking about is we're going to say when the memory time is equal to the compute time.
  所以我们说的是，我们将说内存时间等于计算时间的时候。

[16:18] That's that that's what that question is.
  这就是那个问题。

[16:20] Um
  嗯

[16:23] For now I'm going to discard the, um... because we're focused on what the batch size is, and really there's a question of when the weights are amortized over the multiplies.
  现在我将忽略它，因为我们关注的是批次大小是什么，以及当权重分摊到乘法上时，确实存在一个问题。

[16:34] I'm going to focus on comparing the weight fetch time to the weight multiply time.
  我将专注于比较权重获取时间和权重乘法时间。

[16:37] I'm going to disregard the KV fetch term, just to simplify the analysis so we can get a kind of clean answer out.
  我将忽略 KV 获取项，只是为了简化分析，以便我们能够得到一个清晰的答案。

[16:45] Um so we're going to equate uh this portion with this with these two terms.
  因此，我们将这个部分与这两个术语相等。

[16:50] Yeah.
  是的。

[16:57] So writing that out, we get that the number of total parameters over memory bandwidth is equal to batch size times number of active parameters divided by the compute performance.
  所以写出来，我们得到总参数数量除以内存带宽等于批次大小乘以活动参数数量除以计算性能。

[17:22] So looking over here, everything on the
  所以看看这里，上面的一切

[17:24] top, these are model parameters.
  顶部，这些是模型参数。

[17:26] Everything on the bottom, these are hardware parameters.
  底部所有这些都是硬件参数。

[17:27] Um it it turns out to be nice to rearrange them such that we have the hardware parameters on one side.
  嗯，事实证明，将它们重新排列成硬件参数在一侧会很好。

[17:32] So So let's this is equivalent to
  所以，让我们，这相当于

[17:37] FLOPs over memory bandwidth being equal to batch size times number of active parameters
  嗯，FLOPs 除以内存带宽等于批次大小乘以活动参数的数量

[17:52] divided by the number of total parameters.
  除以总参数的数量。

[17:55] So, so this is a hardware parameter.
  所以，所以这是一个硬件参数。

[17:57] Um, actually the this actually ends up being a dimensionless constant.
  嗯，实际上，这实际上最终是一个无量纲常数。

[17:59] Uh, if you look in terms of flops, what are the dimensions of this?
  呃，如果你从浮点运算次数来看，它的维度是什么？

[18:04] This is um multiplies per second.
  这是嗯每秒乘法次数。

[18:05] This is bytes per second.
  这是每秒字节数。

[18:07] So, that's not quite dimensionless.
  所以，那不是完全无量纲的。

[18:08] But what you do is you say like multiplies per second times let's say I'm doing FP4.
  但你所做的是你说每秒乘法次数乘以，比如说我正在做 FP4。

[18:13] Um, so I do, like, how many FP4 multiplies per second, times the fact that each FP4 is half a byte.
  嗯，所以我我做多少个 FP4 每秒乘法次数乘以这样一个事实，呃每一个 FP4 是半个字节。

[18:22] Um, and so I can
  嗯，所以我可以

[18:24] actually make this end ending up being dimensionless.
  实际上使它最终变得无量纲。

[18:29] Um, and and this ends up being on most GPUs um around 300 somewhere around 300.
  嗯，这在大多数 GPU 上大约是 300，大约是 300。
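
The unit conversion described here can be checked with a few lines. The chip figures below are hypothetical round numbers chosen so the ratio lands on the quoted value of about 300; they are not a spec for any real GPU, and the function name is made up for this sketch.

```python
def hardware_ratio(multiplies_per_s, mem_bytes_per_s, bytes_per_operand):
    """Dimensionless compute-to-bandwidth ratio.

    multiplies/s * bytes/operand gives bytes/s of operands the chip can
    consume, which is then divided by the bytes/s that memory can deliver.
    """
    return multiplies_per_s * bytes_per_operand / mem_bytes_per_s


# Hypothetical chip: 4.8e15 FP4 multiplies/s, 8e12 bytes/s, 0.5 bytes per FP4 value.
ratio = hardware_ratio(multiplies_per_s=4.8e15,
                       mem_bytes_per_s=8.0e12,
                       bytes_per_operand=0.5)
print(ratio)
```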

[18:36] And sorry, has that ratio changed over time as we've gone from generation to generation, where the FLOPs keep increasing?
  抱歉，随着我们一代又一代地迭代，FLOPs 不断增加，这个比例是否随时间发生了变化？

[18:41] So there's a hardware parameter um to what extent has the hardware changed?
  所以有一个硬件参数，硬件在多大程度上发生了变化？

[18:45] So from like A100 to H100 to B100, the FLOPs have increased substantially.
  所以从 A100 到 H100 再到 B100，浮点运算能力已大幅提升。

[18:51] The memory bandwidth has also increased substantially, and the ratio has remained reasonably stable.
  内存带宽也大幅增加，这个比例保持相对稳定。

[18:56] Yeah.
  是的。

[18:56] And we can we can express this one as well.
  我们也可以表达这一点。

[18:57] This is a sparsity parameter.
  这是一个稀疏度参数。

[18:59] Um and I I might even phrase it slightly different.
  嗯，我甚至可能表达得略有不同。

[19:01] Let's solve for batch size in total.
  让我们总共计算批次大小。

[19:03] Um we end up with and so we're just moving this back over to the other side.
  我们最终得到，所以我们只是把它移到另一边。

[19:07] We end up with: batch size needs to be bigger than approximately 300 times sparsity.
  我们最终得到批次大小需要大于大约 300 倍的稀疏度。

[19:15] So for example, in DeepSeek I activate 32 out of 256 experts.
  所以例如，在 DeepSeek 中我激活 256 个专家中的 32 个。

[19:21] So this would be like eight for deepseek.
  所以这对于 deepseek 来说就像是八。
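
The balance-point algebra from the last few minutes can be written out in one line. The symbols are labels chosen here, not blackboard notation: $B$ is batch size, $C$ is compute throughput, and $W$ is memory bandwidth expressed in parameters per second (bytes per second divided by bytes per parameter), so that $C/W$ is the dimensionless hardware ratio of roughly 300. The final step plugs in the sparsity of 8 quoted in the conversation.

```latex
\frac{N_{\text{total}}}{W} = \frac{B \cdot N_{\text{active}}}{C}
\quad\Longrightarrow\quad
B = \frac{C}{W} \cdot \frac{N_{\text{total}}}{N_{\text{active}}}
\approx 300 \times 8 = 2400
```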

[19:24] Got it.
  明白了。

[19:24] Okay.
  好的。

[19:24] So so
  所以，所以

[19:27] this actually gives you a ballpark which is like remarkably accurate to practice.
  这实际上给了你一个大概的范围，这在实践中非常准确。

[19:31] Generally people will go a little bit larger than this.
  一般来说，人们会比这稍微大一点。

[19:33] They don't really want to be exactly at the balance point, because real-world efficiencies aren't as good as a roofline analysis would say.
  他们并不想正好处于平衡点，因为现实世界的效率不像屋顶线分析所说的那样好。

[19:40] Um but like take this and maybe double it or triple it.
  嗯，但就像拿这个然后可能翻倍或三倍。

[19:44] Okay.
  好的。

[19:44] So basically it's like 2 to 3,000 tokens per batch.
  所以基本上是每批 2000 到 3000 个 token。

[19:47] But then if you included the KV cache, yes, the implication would be that the optimal batch size should grow larger.
  但如果你包含了 KV 缓存，是的，这意味着最佳批次大小应该会增长。

[19:58] So this is get like we we solve for the equivalence between when um compute time is equal to memory time.
  所以这是这样的，我们解决了计算时间等于内存时间时的等价问题。

[20:06] If I add in more memory bandwidth like something that consumes more memory bandwidth then I have less available for the the weight loads and so I need to grow the uh the memory bandwidth more and therefore the batch size more.
  如果我增加更多的内存带宽，比如消耗更多内存带宽的东西，那么我用于权重加载的内存就更少了，所以我需要增加内存带宽，因此也增加批次大小。

[20:17] This seems incredibly small like a batch this would be like less than one sequence, right?
  这似乎非常小，像一个批次，这将不到一个序列，对吧？

[20:21] Yeah.
  是的。

[20:21] Okay.
  好的。

[20:21] So, so I guess this is um keep in mind that I'm talking about the number of tokens that I'm generating one
  所以，所以我想这是请记住，我正在谈论我正在生成的一个 token 的数量

[20:28] more token for. So it's actually 2,000 unique sequences in...

[20:33] Got it. Okay. We're just talking about a single forward pass on these sequences. Do you think of the batch as the number of sequences rather than like...

[20:41] That's right. Okay. Cool.

[20:42] Yeah.

[20:43] When I'm prepping for interviews, I often talk to experts in the field. So, for Reiner, I chatted with two of Jane Street's engineers, Clark and Axel.

[20:48] Clark, who works on low-latency trading systems, walked me through why Jane Street uses FPGAs to make sure that they have predictable nanosecond latencies.

[20:55] "You can just build these giant grids of compute very easily that do exactly what you need to touch 100 megabytes of SRAM and then get your response back in tens of nanoseconds very easily, and that's basically impossible on..."

[21:08] He then went on to explain why CPUs just wouldn't work for this kind of thing.

[21:12] "And so if you have a clock that's going every 3 nanoseconds, you actually have several bytes of information at a time to make your decision. That's as opposed to a CPU, where you'll just collect up a whole packet, let's say a 1500-byte packet, and you say, 'Okay, this packet is ready. Here you go, CPU. You can start thinking about it now.'"

[21:28] FPGAs allow you to react to the earliest part of the packet as it arrives rather than having to wait for the full thing.

[21:33] We also talked about liquid cooling, network design, and many other things.

[21:37] If you're interested in this stuff, Jane Street is hiring. You can check out their open roles at janestreet.com/bcash.

[21:46] And if you want to watch the full prep conversation, we posted it there, too.

[21:49] If you've got a frontier model and you are actually doing inference, surely they must have more than 2,000 concurrent users.

[21:58] Yeah.

[21:58] Is there any added latency from the fact that you need to have the whole batch fill up? Or is it that, if you have a reasonable number of users, it's very unlikely that it would take you 100 milliseconds to fill up the next 2,000 slots?

[22:12] Yeah, the way to think about this, I guess we think of it as "when does the train depart" as a model. So let's say I've picked a batch size that I'm going to run at. Maybe I pick, you know, this batch size.

[22:22] And by the way, this intersection point is the same intersection point here.

[22:30] So I pick this batch size. I know that it's going to take, for example, maybe something like 20 milliseconds; that is a common place this ends up landing.

[22:37] What I'm going to produce is, so this is a timeline of what is running on the GPU. It's going to start a new batch every 20 milliseconds regardless, and so each of these is 20, this is 40.

[22:56] You can think of this as a schedule for the train. A new train departs every 20 milliseconds. Any passengers who are waiting board the train. If the train is full, then they wait for the next train. If the train is not full, the train's going to go anyway.

[23:07] And so in terms of what that means for queuing latency, it means that the worst case is that a request arrives just after the train departed. It has to wait for the next train, so that's up to 20 milliseconds, and then it has to wait for that train to complete. And so the worst-case latency is 40.
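
The train-schedule picture can be turned into a tiny latency calculation. This is a sketch under assumptions added here: batches depart at exact multiples of the step time, the next batch always has room, and only a single forward pass is counted. The function name is hypothetical.

```python
import math


def step_latency_ms(arrival_ms, step_ms=20):
    """Queueing delay plus one forward pass for a request arriving at arrival_ms.

    Assumes a batch departs at every multiple of step_ms and that the
    next batch has room for this request.
    """
    next_departure = math.ceil(arrival_ms / step_ms) * step_ms
    wait = next_departure - arrival_ms
    return wait + step_ms


on_time = step_latency_ms(20)      # arrives exactly as a train departs
just_missed = step_latency_ms(21)  # arrives 1 ms after a departure
print(on_time, just_missed)
```

A request that arrives right at a departure waits only for the forward pass, while one that just missed the train approaches the 40 millisecond worst case.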

[23:27] So how is the 20 milliseconds derived?

[23:27] I mean, it's a rule of thumb, but where it comes from is not fully explained yet. So far we've focused on memory bandwidth and compute time. When we look at memory, the other consideration is that we want to use all of the memory capacity we have. And so generally we're going to use all of that memory capacity to store the weights or the KVs. And so, in the time of doing a forward pass, maybe we want to read all of the memory capacity into the chip. And so that is capacity divided by bandwidth; that tends to be 20 milliseconds on many different generations of HBM.

[24:07] >> The units make sense. You would have

[24:11] bytes divided by bytes per second.

[24:12] >> Yeah. So for example, on, I

[24:15] think, the Rubin generation it is

[24:17] something like 288 GB divided by 20

[24:21] terabytes per second, and

[24:27] this looks like it comes out to about 15

[24:29] milliseconds.
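
The capacity-over-bandwidth arithmetic can be checked directly. A small sketch, using the 288 GB and 20 TB/s figures quoted above and assuming decimal units (1 GB = 1e9 bytes, 1 TB = 1e12 bytes); the function name is hypothetical.

```python
def hbm_sweep_time_ms(capacity_gb: float, bandwidth_tb_per_s: float) -> float:
    """Time to read the whole HBM once: capacity divided by bandwidth."""
    capacity_bytes = capacity_gb * 1e9
    bandwidth_bytes_per_s = bandwidth_tb_per_s * 1e12
    seconds = capacity_bytes / bandwidth_bytes_per_s
    return seconds * 1e3


# Figures quoted in the conversation for the Rubin generation.
sweep_ms = hbm_sweep_time_ms(capacity_gb=288, bandwidth_tb_per_s=20)
print(f"{sweep_ms:.1f} ms to read all of HBM once")
```

This gives roughly 14.4 ms, consistent with the "about 15 milliseconds" figure.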

[24:32] Yeah. Let me make sure I understand what

[24:34] it's saying. I mean, I understand why

[24:35] the units cancel, the sort of unit

[24:37] analysis, but what it is saying is

[24:43] we can evacuate and replace the HBM in

[24:48] this amount of time. And so we don't

[24:51] want to be in a situation where

[24:53] the HBM is not big enough that we're

[24:56] not, you know, actually able to

[25:00] write everything we want to it or

[25:01] take everything out of it, or we don't

[25:03] want to be in a situation where our

[25:04] ability to write back and forth is so

[25:06] big or so small compared. Yeah, there's

[25:08] sort of two scenarios. Why don't we pick

[25:10] a latency that is bigger than 15

[25:11] milliseconds? If I think about what

[25:14] that means, it means I actually have

[25:16] time to read the HBM like twice.

[25:18] >> By the way, most HBM accesses are

[25:21] reads, not writes. It's almost all

[25:22] reads, because the weight matrices are

[25:24] read-only and almost all of the KV

[25:26] cache accesses are reads. So,

[25:29] let's say I run 30 milliseconds: I can

[25:31] read all of HBM twice, but what's the

[25:34] point of that? I don't want to

[25:35] read the weight matrices twice. I

[25:37] don't want to read the KVs twice.

[25:38] >> Yeah, it makes sense. Makes a ton of

[25:39] sense. Okay, so a couple of

[25:41] quick questions. One, if it is the case

[25:44] that the optimal batch size is something

[25:45] like 2,000, and is that actually true? It's

[25:49] totally dependent on the sparsity. It's

[25:50] not dependent on the model size or

[25:52] anything.

[25:52] >> I mean, sparsity shows up in model size,

[25:54] but beyond that, it only depends on

[25:56] sparsity, not on scale. But that's a very

[25:58] interesting result and that seems to

[25:59] imply that you can

[26:02] one question is how much of a push

[26:05] towards centralization is it that you

[26:07] would have these economies of scale from

[26:08] inference from batching.

[26:10] >> Yeah.

[26:10] >> But it seems like it's not that big a

[26:12] deal like I don't know is 2,000 users at

[26:13] the same time a lot. It doesn't seem

[26:14] like a lot.

[26:15] >> We can do a bit of analysis on this

[26:17] which would be actually it's like you

[26:18] can think of it in terms of number of

[26:19] users but maybe a more productive way to

[26:21] think of it is in terms of number of

[26:23] tokens per second.

[26:24] >> Mhm. So what does this batch size

[26:26] mean in terms of tokens per second

[26:28] of the system? So

[26:31] tokens per second is going to

[26:33] be equal to the batch size over t. We run a

[26:35] batch's worth of tokens, and then we do that

[26:38] every t,

[26:40] so every time interval, which is, let's

[26:42] say, this thing that is

[26:44] equal to the 15 milliseconds, 20

[26:46] milliseconds number. So this ends up

[26:49] being batch size itself times

[26:53] about 60. So like 64 * B. And so

[26:58] this ends up being around

[27:01] 2,000 * 64. So like 128k

[27:07] tokens per second.

[27:09] Sort of in more digestible units,

[27:11] it's hard to reason about concurrent

[27:13] users, but what is the global traffic

[27:15] for a system?

[27:19] When you look at some of the

[27:21] announcements, sometimes the API

[27:24] providers will brag about how much

[27:25] traffic they have. The

[27:28] numbers that I remember from some

[27:29] announcements of Gemini last year were

[27:31] in the hundreds of millions of tokens

[27:33] per second worldwide. So

[27:37] this is about one thousandth of that.
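
The throughput estimate can be written out as a short calculation. This sketch uses the batch size of 2,000 and the rounded 1/64 second step time from the discussion; the worldwide traffic figure is a hypothetical placeholder standing in for "hundreds of millions of tokens per second", not a published number.

```python
def tokens_per_second(batch_size: int, step_time_s: float) -> float:
    """One token per sequence in the batch is produced every step."""
    return batch_size / step_time_s


# Rounded figures from the discussion: batch of 2,000, about 64 steps per second.
system_tps = tokens_per_second(batch_size=2000, step_time_s=1 / 64)

# Hypothetical placeholder for "hundreds of millions of tokens per second".
ASSUMED_GLOBAL_TPS = 150e6

share = system_tps / ASSUMED_GLOBAL_TPS
print(f"{system_tps:,.0f} tokens/s, about 1/{1 / share:,.0f} of the assumed global traffic")
```

With any global figure in the low hundreds of millions, one such system lands on the order of one thousandth of the total.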

[27:39] >> Yeah. But I mean, Gemini is big,

[27:41] right? One thousandth of Gemini

[27:43] is actually a lot to, like,

[27:46] >> To be competitive at scale, you need to

[27:48] be able to serve at least one thousandth of

[27:50] Gemini. Yeah.

[27:50] >> that's interesting.

[27:51] >> Um, cool. Um, okay. So,

[27:57] the more sparsity you have, the less

[28:00] compute you need.

[28:04] And it does seem that as batch sizes get

[28:06] bigger, compute ends up being the

[28:08] bottleneck. Mhm.

[28:10] >> According to this analysis. So then the

[28:11] question is how far can you take

[28:13] sparsity? That is to say, as the sparsity

[28:16] ratio increases, as you have fewer and

[28:18] fewer active parameters relative to

[28:19] total parameters, how much is performance

[28:22] of the model degrading, and is it

[28:24] degrading faster than you're saving

[28:28] compute by increasing the sparsity

[28:30] factor?

[28:31] >> Yeah. So performance as in quality of

[28:33] the model, rather than speed of the

[28:34] model. Yeah. Yeah.

[28:35] >> So unfortunately we're not able to

[28:37] answer that analytically. That's um

[28:40] >> that is an empirical question of model

[28:41] quality.

[28:43] >> Best I can do is pull up a paper and

[28:45] answer that empirically.

[28:46] >> Yeah.

[28:46] >> Okay. Should we pull up the paper now,

[28:48] if that makes sense?

[28:49] >> Yeah. So this paper is "Unified Scaling

[28:51] Laws for Routed Language Models." It's a

[28:53] somewhat old paper by this stage, but one

[28:55] of the things that they did is look at,

[28:57] if I keep increasing sparsity, what is the

[28:59] model quality impact? This answer is

[29:01] very sensitive to the actual choice of

[29:04] mixture of experts. Mixture of experts

[29:05] has been around for a really long time.

[29:07] I think it was maybe even back in 2017.

[29:10] Um

[29:11] but the techniques have changed a

[29:13] lot. DeepSeek mixture of experts was

[29:15] a big change in how it worked. There

[29:17] have been older papers, which are GShard and

[29:19] Switch Transformer. So the actual

[29:22] empirical results are going to depend on

[29:23] all of that. But on one of the older

[29:25] techniques that is shown here, you can

[29:27] see if I hold constant the number of

[29:30] active parameters at a certain size and

[29:32] then I increase the sparsity, which they

[29:33] call expert count here, the quality keeps

[29:36] increasing. And then if you imagine

[29:37] drawing a horizontal line from 1.3B

[29:40] dense

[29:42] >> uh across you end up seeing that for

[29:44] example in this case the 64 expert 370

[29:47] million activated parameters model is as

[29:50] good as a dense 1.3 billion model. So in

[29:52] some sense it's actually not amazing

[29:54] returns where you need to increase total

[29:57] parameters 100fold to get the equivalent

[30:00] of 10x as many active parameters.
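
Taking the figures just read off the plot (a 64-expert model with 370 million activated parameters matching a 1.3 billion parameter dense model), the exchange rate can be worked out directly. This is a rough sketch; the per-doubling figure assumes the gain compounds evenly across each doubling of expert count, which the source does not claim.

```python
import math

# Figures quoted from the plot discussed above.
dense_params = 1.3e9     # dense model matched in quality
active_params = 370e6    # activated parameters of the routed model
expert_count = 64        # experts in the routed model

# How many times larger a dense model must be to match the routed model.
effective_gain = dense_params / active_params

# Assumption: the gain compounds evenly across each doubling of expert count.
doublings = math.log2(expert_count)
gain_per_doubling = effective_gain ** (1 / doublings)

print(f"{expert_count}x the experts buys about {effective_gain:.1f}x effective parameters")
print(f"roughly {gain_per_doubling:.2f}x per doubling of expert count")
```

On these numbers the return is about 3.5x for a 64x increase in expert count, a little under the rounded "4x" used in the conversation.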

[30:03] >> Yeah I mean actually even more so yeah

[30:05] it's a huge increase in parameter count

[30:07] for a modest increase in

[30:09] >> Yeah, so in this case actually it's,

[30:10] what is it, 4x?

[30:11] >> 64x for 4x. Yeah. So

[30:15] while it is true, I guess,

[30:18] that you get this benefit of

[30:22] being able to economize on your compute

[30:25] time if you increase sparsity,

[30:28] naively it would seem like, oh, that's

[30:30] a trade-off worth making. But here

[30:33] you're decreasing this by 2x and

[30:37] then having this go up by 8x every time

[30:40] you double

[30:41] >> sparsity. So is that good or bad?

[30:43] Actually, even from a memory point of

[30:45] view, keep in mind you are doubling

[30:48] this portion of the memory fetches, which

[30:51] is amortized by batch, and so just

[30:53] keep running a larger batch size.

[30:56] From the point of view of the analysis

[30:58] we've done here, this is pure win. Keep

[31:00] doing it. Keep doing it until you

[31:03] run out of available users, basically.

[31:05] >> Mhm. So there's actually this

[31:08] equivalence: if I want to go

[31:12] sparse, or if I have a lot of users, I

[31:14] can go to a much sparser model. So from

[31:16] that point of view, it's a

[31:17] reasonable trade-off. The other

[31:19] trade-off that shows up here is that

[31:21] it also consumes memory capacity.

[31:23] We've only reasoned about memory

[31:24] bandwidth here, but it also consumes

[31:26] memory capacity.

[31:26] >> So let me just make sure I understood.

[31:29] You're saying

[31:31] we want to spend

[31:35] less time computing, therefore we do more

[31:39] sparsity. To make that work, we need bigger

[31:41] batch sizes,

[31:42] >> which means we need more memory capacity

[31:45] um

[31:46] >> yeah so

[31:47] >> to have more sparsity.

[31:48] >> Yeah. So I mean maybe this would be a

[31:49] good point to actually um talk about how

[31:53] a mixture of experts layer is typically

[31:55] laid out on, like, a rack of GPUs

[31:57] or something like that.

[31:58] >> Yeah. Yeah, makes sense.

[32:00] >> Yeah. Where were we?

[32:01] >> Sparse mixture of experts. Maybe

[32:04] how we lay that out on a GPU.

[32:06] >> Yeah.

[32:07] >> So let's zoom in on the mixture of

[32:09] experts layer first and sort of

[32:11] draw what that looks like.

[32:13] >> So we typically will have some

[32:17] kind of a router layer

[32:19] >> which is making the decision of where

[32:21] we route the tokens

[32:23] to. So we get tokens coming in here.

[32:26] They go through a router layer, and then

[32:27] we have a bunch of different experts.

[32:32] I'll draw a few more to line

[32:36] some up. And then the router will

[32:38] make a decision: which experts am I

[32:41] going to route to? And it'll be a small

[32:42] fraction of them, maybe one in 32. So

[32:45] maybe it'll make a decision to route to

[32:47] this one, maybe this one, and maybe

[32:52] this one.

[32:52] >> Mhm.

[32:54] Each expert

[32:57] itself is a normal MLP. It has an up

[33:00] projection and then a down projection

[33:02] with a nonlinearity in between. And

[33:04] then finally we sort of do the inverse

[33:06] operation. So where we were broadcasting

[33:08] things out here, we're going to bring

[33:10] them back in and sum them up. So

[33:15] bringing them in like this.

[33:18] And then finally we have our residual

[33:21] connection, so the token is also

[33:23] passed through here and it gets added

[33:26] to the result of the MoE layer. So this

[33:28] is a normal MoE layer. What I want to

[33:32] talk through is how this is mapped to

[33:34] a GPU rack and what this means

[33:37] for communication, because I think

[33:40] this will start to show some of

[33:41] the limits of how fast we can go.
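
The layer drawn above (router, a few chosen experts, each an up projection, nonlinearity, and down projection, then a sum and a residual add) can be sketched for a single token in plain Python. This is an illustrative toy, not any particular model's implementation: the sizes, the ReLU nonlinearity, the random weights, and the softmax weighting of the chosen experts are all assumptions.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def expert_mlp(x, w_up, w_down):
    """One expert: up projection, nonlinearity (ReLU here), down projection."""
    hidden = [max(0.0, h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)


def moe_layer(x, router_w, experts, top_k):
    # Router: one score per expert for this token; keep the top_k experts.
    scores = matvec(router_w, x)
    chosen = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:top_k]
    # Assumption: softmax over the chosen scores gives the mixing weights.
    peak = max(scores[e] for e in chosen)
    exp_scores = [math.exp(scores[e] - peak) for e in chosen]
    gates = [s / sum(exp_scores) for s in exp_scores]
    # Dispatch to the chosen experts, then bring the results back and sum them.
    combined = [0.0] * len(x)
    for gate, e in zip(gates, chosen):
        w_up, w_down = experts[e]
        out = expert_mlp(x, w_up, w_down)
        combined = [c + gate * o for c, o in zip(combined, out)]
    # Residual connection: the token is passed through and added to the result.
    return [xi + ci for xi, ci in zip(x, combined)], chosen


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden, n_experts, top_k = 4, 8, 32, 3
router_w = random_matrix(n_experts, d_model, rng)
experts = [
    (random_matrix(d_hidden, d_model, rng), random_matrix(d_model, d_hidden, rng))
    for _ in range(n_experts)
]
token = [1.0, -0.5, 0.25, 2.0]
output, chosen = moe_layer(token, router_w, experts, top_k)
print("experts used:", chosen)
```

Only the chosen experts do any work for this token, which is why active parameters, not total parameters, set the compute cost.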

[33:43] >> Yeah.

[33:44] >> So the standard practice here, and it

[33:47] is the best solution, is to use

[33:50] expert parallelism. That means

[33:51] different experts go on different GPUs.

[33:54] So if we take something like a DeepSeek

[33:55] model, they have 256 experts.

[34:00] Let's say we want to run that on a

[34:02] Blackwell rack, so there are 72 GPUs.

[34:06] We have a divisibility problem. This

[34:09] is not a power of two. So we'll just

[34:11] simplify and say we're only going

[34:13] to use 64 of them. Just ignore the

[34:16] other eight. It's not a big deal. And

[34:18] so we have four experts per GPU.

[34:21] Very simple. For the sake of the

[34:24] diagram, I'll actually just say

[34:26] let's say we have two experts per GPU. So

[34:28] these

[34:33] are the GPU boundaries. Every pair of

[34:34] experts is on its own GPU.

[34:38] And then we can look at the

[34:39] communication cost. We had

[34:41] some tokens stored centrally here.

[34:44] They get routed to all of these experts,

[34:46] and so there's some communication

[34:49] cost paid here. There's the same

[34:51] communication cost paid on the output.

[34:53] And then the hope is that this

[34:56] does not become communication limited.

[34:59] Now what is the traffic pattern here?

[35:02] The traffic pattern here is that any

[35:05] GPU in fact will be talking to any other

[35:07] GPU, depending on the decisions

[35:09] made by the model. So this is an

[35:12] all-to-all traffic pattern.
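
The expert placement and the resulting traffic pattern can be simulated in a few lines. This sketch uses the 256 experts and 64 GPUs from the example above; the contiguous placement, the uniformly random routing, and the `top_k` of 8 are illustrative assumptions standing in for a trained router.

```python
import random

NUM_EXPERTS = 256
GPUS_USED = 64
EXPERTS_PER_GPU = NUM_EXPERTS // GPUS_USED  # four experts per GPU


def gpu_for_expert(expert_id: int) -> int:
    """Assumed contiguous placement: experts 0-3 on GPU 0, 4-7 on GPU 1, ..."""
    return expert_id // EXPERTS_PER_GPU


def count_traffic(num_tokens: int, top_k: int, rng: random.Random) -> dict:
    """Count token sends per (source GPU, destination GPU) pair."""
    traffic = {}
    for _ in range(num_tokens):
        src = rng.randrange(GPUS_USED)  # GPU holding the token before routing
        # Assumption: uniformly random routing in place of a trained router.
        for expert in rng.sample(range(NUM_EXPERTS), top_k):
            pair = (src, gpu_for_expert(expert))
            traffic[pair] = traffic.get(pair, 0) + 1
    return traffic


traffic = count_traffic(num_tokens=20000, top_k=8, rng=random.Random(0))
distinct_pairs = len(traffic)
print(f"{distinct_pairs} of {GPUS_USED ** 2} possible GPU pairs carried traffic")
```

With enough tokens, essentially every source and destination pair shows up, which is what makes the pattern all-to-all.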

[35:14] >> So when you say any GPU in the pretense,

[35:17] >> yeah,

[35:18] >> the router is more than one GPU. Yeah,

[35:19] the router. So I drew this as one

[35:21] router. In reality, you would

[35:23] actually have many copies of the router,

[35:25] and so you would have as many

[35:27] routers as GPUs, in fact.

[35:29] >> As the incoming traffic.

[35:32] >> Yeah. So these

[35:35] are 64 GPUs. These are 64 GPUs. It's

[35:37] actually the same GPUs. We just

[35:39] draw them as separate because they're

[35:40] serving different purposes.

[35:42] >> So at this point any GPU can be sending

[35:44] to any other GPU. So for this

[35:47] all-to-all pattern of communication that

[35:49] shows up, how the Blackwell racks are

[35:52] configured is a perfect fit for

[35:56] the communication pattern that

[35:58] the MoE actually wants to do. However, if

[36:02] you think maybe

[36:04] one rack is too slow and I want to do

[36:06] two racks, then I have this challenge

[36:09] that maybe I've got some sort of

[36:11] rack boundary drawn outside here like

[36:14] this,

[36:17] and I no longer in fact have all-to-all

[36:19] communication between all the GPUs in

[36:22] two racks. And so the rack-to-rack

[36:25] communication ends up being a

[36:26] substantial bottleneck. So

[36:29] the fundamental thing here

[36:30] is that one rack actually bounds

[36:33] the size of an expert layer you can do.

[36:35] And so this has been part of what's

[36:37] been driving towards larger and

[36:40] larger interconnect domains.

[36:41] >> Yeah. Before we go on, it may be worth you

[36:44] explaining what exactly a rack is,

[36:47] >> the differences in bandwidth between

[36:48] racks

[36:50] >> and within a rack,

[36:52] >> and the all-to-all versus not all-to-all nature of

[36:54] communication within versus outside.

[36:56] >> Yeah. And this is a place where it

[36:57] starts to be very different in fact

[36:58] between Nvidia, for example, and Google

[37:00] and then others including us. So

[37:04] generally a rack is

[37:09] a physical structure. It's a

[37:12] few meters tall, a meter or two wide,

[37:14] depending on configuration. And it

[37:17] stores some number of GPUs or XPUs,

[37:21] which is typically about 64.

[37:24] What constrains it to a certain

[37:26] size is power delivery, weight, and

[37:29] cooling ability. It ends up being

[37:32] about this size in many cases because

[37:34] of these physical constraints. So

[37:37] then when I deploy a data center,

[37:39] a data center may have

[37:41] thousands of these racks. So I've got

[37:42] one of these tall racks. It's got a

[37:43] bunch of GPUs in it, and so on.

[37:46] And then I put another rack next to it.

[37:50] >> you make it sound so easy.

[37:51] >> Yeah. Right. I just drop them in.

[37:54] In Nvidia's case, the

[37:58] communication topology is,

[38:02] actually, they put the GPUs on

[38:05] the outside of the rack and then they

[38:06] put these switches on the inside of the

[38:08] rack. So what this ends up being is that

[38:11] there's a set of switches in here.

[38:13] These are the NVSwitches.

[38:15] >> Mhm.

[38:17] >> And then they run a bunch of cables.

[38:19] Every single GPU has cables

[38:23] going to the switches in the middle.

[38:29] So every GPU goes to the switches in

[38:32] the middle, and then the switches have

[38:33] connections to all the GPUs. So all of

[38:35] the GPUs can talk to all the other GPUs

[38:37] in just two hops: going to the

[38:39] switch, going to the other GPU. Now when

[38:42] I want to leave the rack, I end up going

[38:44] via a different path. The GPUs

[38:48] also have a much slower connectivity,

[38:51] which is typically about eight times

[38:53] slower. So the green

[38:56] that I drew here, in the GPU case, is

[38:59] NVLink. More generally it's called the

[39:00] scale-up network. This is the

[39:03] scale-up network.

[39:06] You will typically also have a scale-out

[39:09] network, which allows you to connect

[39:11] to some data center switch. So

[39:14] data center switch.

[39:18] And then all of the GPUs will have some

[39:20] connectivity up to some data center

[39:22] switch somewhere.

[39:23] >> But this is the

[39:25] scale-out network,

[39:31] and it tends to be about eight

[39:33] times slower

[39:35] in bandwidth.

[39:38] So the challenge, if you want to, for

[39:40] example, lay out a mixture of experts

[39:42] layer across two racks, is that

[39:46] half of the GPUs here are going to be

[39:48] wanting to talk to the GPUs

[39:50] here. And so

[39:54] just on average, when I look at

[39:56] where the tokens on these GPUs want

[39:58] to go, half of the tokens want to go

[39:59] inside the rack. That's great, they can

[40:01] use the fast scale-up network. But

[40:03] half the tokens are going to want to

[40:05] leave the rack and go to the other rack,

[40:06] and that's not as good. They're going to

[40:08] need to use a much slower network. And

[40:09] so that becomes the bottleneck on

[40:12] the all-to-all pattern. A

[40:15] different choice would be, well, why don't

[40:16] I have a big switch here and

[40:18] connect everything to

[40:21] some much bigger

[40:25] switch that actually combines the two

[40:26] racks together? There are many ideas in

[40:28] this direction, but in general

[40:30] the reason you have this sort of

[40:32] hierarchy of switches rather than one

[40:33] big switch is to manage the cabling

[40:36] congestion. You just need to run a

[40:38] large number of cables.
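
The two-rack argument above (half the tokens leave the rack and travel on a network about eight times slower) can be turned into a rough estimate. This sketch assumes uniform routing across racks, the same traffic volume per GPU in both layouts, and that scale-up and scale-out transfers overlap so the slower one sets the time; the bandwidth units are arbitrary.

```python
def cross_rack_fraction(num_racks: int) -> float:
    """Share of a GPU's traffic that leaves its rack under uniform routing."""
    return (num_racks - 1) / num_racks


def all_to_all_time(volume: float, scale_up_bw: float, scale_out_bw: float,
                    num_racks: int) -> float:
    """Per-GPU all-to-all time, limited by whichever network finishes last."""
    leaving = cross_rack_fraction(num_racks)
    intra_rack_time = volume * (1 - leaving) / scale_up_bw
    inter_rack_time = volume * leaving / scale_out_bw
    # Assumption: the two networks are used concurrently.
    return max(intra_rack_time, inter_rack_time)


# Arbitrary units: scale-up is 8x faster than scale-out, as quoted above.
one_rack = all_to_all_time(volume=1.0, scale_up_bw=8.0, scale_out_bw=1.0, num_racks=1)
two_racks = all_to_all_time(volume=1.0, scale_up_bw=8.0, scale_out_bw=1.0, num_racks=2)
slowdown = two_racks / one_rack
print(f"two racks take {slowdown:.0f}x as long per all-to-all under these assumptions")
```

Under these assumptions the cross-rack half of the traffic dominates, so spreading one expert layer over two racks makes each all-to-all slower rather than faster.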

[40:39] >> Sorry. Is the question you just

[40:41] asked basically, why isn't it a bigger

[40:43] scale-up?

[40:43] >> Yeah, exactly.

[40:44] >> Why not just have like a

[40:46] million chips in scale-up?

[40:48] >> What has changed that has allowed Nvidia

[40:49] to go from Hopper, which was eight, then

[40:53] Blackwell is 72, and now Rubin will be,

[40:58] is it 500 or something?

[40:59] >> 500 and something. Yeah.

[41:00] >> What has allowed that to happen?

[41:02] From Hopper to Blackwell is

[41:05] mostly just the decision to switch

[41:07] from trays as the form factor,

[41:11] or one of these as a tray, to

[41:13] racks as the form factor. That's a

[41:15] product decision. There wasn't a

[41:17] substantial technical barrier there.

[41:21] >> Switching from, like, 64

[41:26] to 500 or so, there's a bit of

[41:29] Jensen math there, but there is at

[41:32] least a genuine 4x increase, which is

[41:35] coming from a much more complicated

[41:37] and difficult rack design. And so

[41:39] that is actually new physical

[41:40] design to run more cables.

[41:42] >> And the cable complication is just the

[41:45] cost of figuring out which cable

[41:48] hops to which or like which signal goes

[41:50] from.

[41:51] >> Let's sort of zoom in on this and look

[41:53] at the wire density.

[41:57] I'll draw this diagram just once more,

[41:58] so we have a bit of a cleaner version

[42:00] to work with and a larger version.

[42:03] Let's say I have some switches in the

[42:04] middle. Yep.

[42:05] >> And let's say

[42:07] initially I'm going to start with just

[42:08] two GPUs on each side, or two trays of

[42:10] GPUs on each side. And let's say

[42:13] maybe each tray wants to have two

[42:15] cables coming out of it.

[42:18] So I physically run

[42:22] vertical cables that look like this

[42:23] running to the switches. Now if I

[42:26] want to double the number of GPUs in a

[42:28] rack,

[42:31] I need to run literally twice

[42:33] the density of cables. So I need to

[42:35] run

[42:38] these as well.

[42:42] >> Extremely naive question, but if you look at a

[42:44] physical data center Mhm.

[42:46] Seems like there's a lot of space within

[42:48] a rack. I don't know. Just like the

[42:50] cables are like really big and

[42:51] >> Yeah. So there is space outside the

[42:54] rack. Inside the rack,

[42:56] I mean, as they become more

[42:58] optimized, these racks are very tight. So

[43:00] there's connector density going

[43:03] from the tray into the

[43:07] rack and the rack's backplane. And

[43:09] then the backplane itself has

[43:11] a really high density. There are

[43:13] other physical constraints, including

[43:14] bend radius of cables: you

[43:17] don't want to snap them, and so on.

[43:18] >> Yeah.

[43:19] >> It's literally the physical space to put

[43:20] a cable that's constraining it.

[43:22] >> Yeah.

[43:22] >> I had no idea. Interesting.

[43:24] >> Uh that seems surprising that like

[43:26] >> the rack is so big and it's just

[43:29] like we can't just stuff more cables in

[43:30] there.

[43:31] >> Yeah. So I mean, rack design is not my

[43:32] expertise, but when I talk to

[43:33] folks about what constraints

[43:34] they're up against, it's a

[43:36] combination of, so what are the big

[43:39] physical things you're optimizing for?

[43:41] Space; weight of the rack, it's

[43:45] actually really heavy, and so you

[43:47] need enough metal to not sag and fall,

[43:50] but then you add more metal and it's

[43:51] heavier; and then power and cooling.

[43:53] And so all of those are competing.

[43:55] Modern racks are pushing all of

[43:58] those to very extreme physical limits.

[44:00] >> Deep work is by its nature quite

[44:01] aversive. So even things which seem like

[44:04] work like Slack and email can be easy

[44:06] ways to distract yourself. So, I often

[44:08] wish that I could just turn the internet

[44:09] off, but if I'm prepping for an

[44:12] interview, even if I have the papers and

[44:13] books on hand, it's still super useful

[44:15] to be able to do a back and forth with an

[44:17] LLM so I can break down concepts and

[44:19] research follow-ups. Google's new Gemma

[44:21] 4 is the first open model that allows me

[44:23] to have this kind of fully disconnected

[44:26] focus machine. It's small enough to run

[44:27] on my laptop, but good enough to

[44:29] actually be useful. So, to prep for this

[44:30] episode, I downloaded Reiner's scaling book

[44:32] and shut off the internet. I was able to

[44:33] have Gemma help me understand the

[44:35] material and answer my questions. If you

[44:36] want an LLM that you can run locally on

[44:38] your laptop or even your phone, you

[44:40] should check out Gemma 4.

[44:45] >> When was GPT-4 released again? It was 2022

[44:47] or 2023.

[44:48] >> Three. Okay. And it was rumored to be

[44:50] over one trillion parameters

[44:53] >> and it seems like only now and within

[44:55] the last 6 months have models been

[44:57] getting released that are significantly

[44:59] more parameters than a model released 3

[45:00] years ago.

[45:01] >> Yeah. when supposedly there should have

[45:02] been this scaling in the meantime.

[45:07] Is the reason that we were just waiting

[45:08] for racks with enough memory to hold a

[45:13] five trillion parameter model along with

[45:15] its KV cache for enough, you know, users,

[45:18] for a lot of sequences? Or

[45:22] RL, if you're doing RL, is kind of a similar

[45:24] consideration of actually holding the KV

[45:25] cache for all

[45:28] >> the batch of problems you're

[45:29] trying to solve. So if you look at

[45:32] Hopper, you had eight Hoppers, and I

[45:34] think the

[45:35] >> that's 640 gigabytes as of 2022.

[45:38] Yeah.

[45:38] >> Um

[45:39] >> with Blackwell finally, which was

[45:41] deployed, what, 2020

[45:42] >> very recently, maybe last year.

[45:43] >> Last year

[45:44] >> you finally have a scale-up with on the

[45:46] order of 10, 20 terabytes.

[45:48] >> Mhm.

[45:48] >> Which is enough for like a 5T model plus

[45:51] KV cache.
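
The capacity argument can be checked with back-of-the-envelope numbers. In this sketch the 8-GPU and 72-GPU domain sizes and the 640 GB total come from the conversation; the 192 GB per Blackwell GPU and the one byte per parameter (an 8-bit weight format) are assumptions chosen for illustration.

```python
HOPPER_GPUS = 8
HOPPER_HBM_GB = 80        # per GPU; 8 x 80 matches the 640 GB quoted above
BLACKWELL_GPUS = 72
BLACKWELL_HBM_GB = 192    # assumption: per-GPU HBM capacity
MODEL_PARAMS = 5e12       # the hypothetical 5T-parameter model
BYTES_PER_PARAM = 1.0     # assumption: 8-bit weights


def domain_capacity_gb(num_gpus: int, hbm_gb_per_gpu: int) -> int:
    """Total HBM across one scale-up domain."""
    return num_gpus * hbm_gb_per_gpu


def weights_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights."""
    return params * bytes_per_param / 1e9


hopper_gb = domain_capacity_gb(HOPPER_GPUS, HOPPER_HBM_GB)
blackwell_gb = domain_capacity_gb(BLACKWELL_GPUS, BLACKWELL_HBM_GB)
model_gb = weights_gb(MODEL_PARAMS, BYTES_PER_PARAM)

print(f"Hopper domain:    {hopper_gb:>6} GB")
print(f"Blackwell domain: {blackwell_gb:>6} GB")
print(f"5T weights:       {model_gb:>6.0f} GB, leaving "
      f"{blackwell_gb - model_gb:.0f} GB for KV cache on the Blackwell rack")
```

Under these assumptions the weights alone overflow an 8-GPU Hopper domain several times over, while a 72-GPU rack holds them with room left for KV cache.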

[45:52] >> Yeah. Deploying in larger scale-up

[45:55] domains is a huge unlock. Yeah, I

[45:57] mean

[45:58] >> I've drawn here the sort of Nvidia

[46:00] Blackwell deployment. The Google

[46:02] deployment has actually had very

[46:04] large scale-up domains for

[46:05] >> And does that also explain why Gemini

[46:07] seemed to be ahead? Like Gemini 2.5 was

[46:10] a success, or it just seems like

[46:11] Gemini has had successful pre-training for

[46:13] longer than some of the other labs. I

[46:15] not having been there at the time, I'm

[46:16] not sure how much is coming from

[46:18] successfully deploying higher sparsity

[46:20] ratios, which could be it. It

[46:22] could also be, I mean, there's a whole bunch of

[46:24] actual modeling things, like

[46:27] specifically how do you do the mixture

[46:29] of experts. We've seen

[46:33] the DeepSeek mixture of experts

[46:35] said: actually activate more experts,

[46:37] but finer-grained experts. That was a big

[46:38] innovation.

[46:39] >> I'm sure that there are many other

[46:41] innovations on the model architecture as

[46:43] well as on the training data. It's kind of

[46:44] hard to disentangle all of them. But

[46:47] what shows up in terms of the limits of

[46:49] what you can do um

[46:52] the active parameters, as we saw,

[46:55] are limited by the compute cost, and

[46:58] then the total parameters are limited by

[47:00] the scale-up size.

[47:01] >> Yep. When you're operating within a

[47:04] single scale-up domain, is that a

[47:07] consideration specifically for either

[47:09] forward or backward

[47:12] or specifically for prefill versus

[47:15] decode

[47:17] or is it is it preferred to always be

[47:20] within a scale up

[47:23] >> whatever kind of workload you have

[47:25] whether you're doing a pre-training run

[47:27] or whether you're doing RL generation or

[47:30] whether you're doing inference for

[47:31] users. Yeah, really interesting. Um so

[47:36] okay, so to answer that question we're

[47:38] going to need to talk about the

[47:39] communication patterns. So we've

[47:41] talked about the mixture of experts

[47:43] communication pattern, that is this

[47:44] all-to-all.

[47:51] All-to-all very strongly

[47:54] favors full connectivity, which is

[47:57] what we've kind of just shown here, and

[47:59] it favors being within one rack. There

[48:03] are other kinds of parallelism besides

[48:05] expert parallelism, which we

[48:07] just showed here. In the literature there is

[48:09] tensor parallelism. With a

[48:13] trend towards smaller experts, this has

[48:15] become much less relevant, so we can

[48:16] ignore that. But the other two things

[48:18] that we have available are data

[48:20] parallelism and pipeline parallelism.

[48:22] And they can be a

[48:25] much better fit for using multiple

[48:27] racks. So let's focus on pipeline

[48:29] parallelism specifically. This is

[48:32] one layer. I'm going to have like

[48:35] 100 more layers up above. I could

[48:39] decide at this point, for example, to

[48:43] move to a different rack, change rack.

[48:50] Now, is that going to become a

[48:51] communication bottleneck? So, we can

[48:55] actually just solve for when this

[48:56] becomes a communication bottleneck.

[48:58] But before we do that algebraically,

[48:59] let's just sort of visualize it and

[49:01] sketch the path. So we're going to have

[49:02] a bunch: this is another layer, and we're

[49:04] going to have another layer here, and so

[49:06] on. So let's say I change rack

[49:10] here, and then some number of layers

[49:12] later I change rack here as well.

[49:19] So the methodology that we're

[49:22] going to use to determine whether we

[49:23] have a communication bottleneck

[49:25] at this point where we change rack

[49:27] is we're going to compare the,

[49:30] this is the scale-out,

[49:35] scale-out bandwidth requirements to the scale-up

[49:38] bandwidth requirements.

[49:39] >> Mhm.

[49:42] So let's try this. And I mean, the

[49:44] hint is going to be that there's a

[49:47] lot more sends here. We're sending

[49:49] many things here, whereas we're only

[49:51] sending one thing here, and then we're

[49:52] also maybe doing it many times.

[49:54] So that's going to be what

[49:57] makes the difference.

[49:58] >> Can I try to guess, just out of

[49:59] curiosity to see if I'm actually

[50:01] understanding. Um it seems like you're

[50:03] sending like

[50:04] >> batch size into the rack

[50:07] >> in here.

[50:08] >> Yes. But the communication within

[50:11] the rack is sort of batch size

[50:14] times number of GPUs.

[50:17] >> Yeah. So number of activated GPUs,

[50:20] right? So like I I don't send to this

[50:22] GPU at all, right? So there's an

[50:23] explosion from one to like it's three

[50:26] times larger here in in this diagram.

[50:28] >> Yeah.

[50:29] >> Um the key thing is that I I didn't even

[50:31] need to send to this GPU at all. And so

[50:32] that's a big saving.

[50:33] >> I see. Yeah.

[50:35] >> Okay. So we're going to talk through

[50:37] sort of, what is the

[50:41] slowdown, to what extent is scale-up

[50:44] a bottleneck over

[50:47] scale-out. So we will directly jump to the

[50:50] ratio of the time spent on scale-up,

[50:57] time on scale-up,

[51:00] over the time spent on scale-out. So

[51:04] this is the quantity we're talking

[51:05] about.

[51:08] And the first consideration is that

[51:11] scale-up is

[51:15] eight times faster than scale-out,

[51:16] generally. And so at a baseline, if

[51:19] the data volumes were the same, we would

[51:20] have this one over eight, which is

[51:23] coming from

[51:25] bandwidth.

[51:28] But then we have some amount of

[51:31] expansion in how much data we're

[51:33] sending. So if one token comes in here,

[51:36] then this one token gets routed, in

[51:40] the DeepSeek case,

[51:42] to maybe 32 experts or 16 experts; it gets

[51:45] routed to some number of experts. So

[51:48] this is the number of activated experts.

[51:50] Number of activated

[51:54] experts

[51:58] And then

[52:03] the same thing applies on multiple

[52:05] different layers. So maybe I'm going to

[52:06] run two layers. So there's also a

[52:09] multiplier:

[52:11] times the number of layers

[52:16] per stage.

[52:19] >> And you need to multiply the whole thing

[52:20] by two for the... >> Yes, yes,

[52:23] there's a factor of two. Thank you.

[52:28] So what we would like is for the

[52:31] scale-up time to be greater than the

[52:33] scale-out time, because the

[52:35] scale-up time is the more important and

[52:36] precious resource. So

[52:38] we would like this number

[52:40] to be greater than or equal to one.

[52:42] And this really doesn't seem hard:

[52:44] there's just a factor of eight that

[52:45] we need to overcome. So we need the

[52:47] product of these three things to be

[52:48] bigger than eight. Typically we have

[52:51] a fairly large number of activated

[52:52] experts, which could be eight by itself,

[52:55] and then we can increase the number of

[52:56] layers per stage a lot until until we we

[52:59] satisfy this.
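
The ratio built up on the board can be sketched as a small calculation. This is only an illustration of the arithmetic described above: the function and parameter names are made up for this sketch, the 8x bandwidth advantage is the rough figure quoted in the discussion, and the factor of two is the "multiply the whole thing by two" correction, whose exact origin is not spelled out here.

```python
def scale_up_to_scale_out_time_ratio(
    activated_experts: int,
    layers_per_stage: int,
    bandwidth_advantage: float = 8.0,  # assumption: scale-up is ~8x faster than scale-out
    factor_of_two: int = 2,            # the "multiply by two" correction from the discussion
) -> float:
    """Time spent on scale-up traffic divided by time spent on scale-out traffic."""
    # Each token that crosses the rack boundary once is fanned out to every
    # activated expert, on every layer in the stage.
    traffic_expansion = activated_experts * layers_per_stage * factor_of_two
    return traffic_expansion / bandwidth_advantage


ratio_one_layer = scale_up_to_scale_out_time_ratio(activated_experts=8, layers_per_stage=1)
ratio_four_layers = scale_up_to_scale_out_time_ratio(activated_experts=8, layers_per_stage=4)
print(ratio_one_layer, ratio_four_layers)  # 2.0 8.0
```

A ratio of at least one means the rack-to-rack link is not the bottleneck; raising the number of layers per stage pushes the ratio up further.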

[52:59] >> I see.

[53:00] >> Um so what this ends up looking like is

[53:02] that I can in fact have an entire

[53:04] pipeline of racks where one rack does

[53:06] one layer and then I move on to the next

[53:07] rack and I do another layer and then I

[53:09] move on to the next rack. I can do

[53:10] another layer. >> It's interesting to me

[53:12] that the best parallelism

[53:15] strategy in practice ends up being

[53:18] one which physically resembles the

[53:22] actual architecture. It's not some

[53:24] galaxy brain thing, you know, it's like,

[53:25] oh, we have experts, we're going to put

[53:26] them on different GPUs. Oh, we have

[53:28] different layers. We're just going to

[53:28] put them on different racks.

[53:30] I feel that's interesting, that the

[53:32] physical and...

[53:33] >> The model architecture matches,

[53:35] or rather the cutting matches the model

[53:36] architecture.

[53:37] >> Yeah, exactly.

[53:37] >> Yeah. I mean it could have been

[53:38] something wackier with tensor

[53:39] parallelism and whatever.

[53:41] >> Yeah. So I think

[53:43] the galaxy-brain

[53:46] way to think of it is: what are

[53:49] all the different dimensions in which a

[53:51] model is scaled up?

[53:54] It is scaled up by layers. It is

[53:55] scaled up by the model

[53:57] dimension. It is scaled up by the d_ff

[53:59] dimension. It is scaled up by the number

[54:00] of experts. Every single one of those

[54:03] numbers you can choose to cut along,

[54:06] and if those numbers are big enough, it

[54:07] eventually becomes profitable to cut

[54:09] there.

[54:10] >> And we have selected two of them. The

[54:12] other two, given the way models are

[54:15] typically sized, are not profitable.

[54:18] >> So there's a talk by Ilya where he says

[54:21] today we know not to do pipeline

[54:22] parallelism,

[54:24] and Horace gave my friends and me...

[54:28] I hate that it sounds like a Dr.

[54:30] quote [laughter]

[54:33] but he gave us a lecture on these

[54:35] different kinds of parallelism, and he

[54:36] said the problem with pipeline

[54:37] parallelism is that, other than the

[54:39] bubbles, it creates these

[54:41] architectural constraints. >> Yes.

[54:42] >> Kimi, for example, has these

[54:47] residuals where attention attends to

[54:49] the...

[54:50] >> A few back or something.

[54:51] >> Yeah, layers a few back, and so that

[54:53] becomes hard to implement in this way.

[54:55] >> yeah um so and I guess we didn't really

[54:57] fully articulate even what is the

[54:59] benefit that we're getting from

[55:00] pipelining.

[55:01] >> Yeah.

[55:01] >> Um

[55:03] >> So these complexities are real.

[55:06] Pipelining is a massive hassle,

[55:08] but it does give you some

[55:10] benefits,

[55:13] >> and you can then decide

[55:16] whether those benefits are

[55:17] worth the costs.

[55:20] It has some

[55:23] benefits in inference, maybe bigger

[55:24] benefits in training. In inference,

[55:27] what are we saving on? Are we saving on

[55:29] memory time or compute time? Not

[55:32] really. We're just moving the memory

[55:33] time from one chip to another chip, or

[55:36] one rack to a different rack. There's no

[55:38] actual benefit in runtime.

[55:41] However, what we are saving on is

[55:43] memory capacity: the amount of

[55:47] memory used per rack. If we think that

[55:49] the memory in a rack is a bottleneck,

[55:51] then there's a constraint on how fast we

[55:53] can go. Pipelining allows us to

[55:56] massively reduce that bottleneck.

[55:58] >> I guess there's an

[56:00] opposite connotation to this.

[56:03] Actually,

[56:05] before this interview I was chatting

[56:06] with Axel, who's a GPU performance

[56:09] engineer at Jane Street, and he was

[56:11] explaining that to do pipelining you have

[56:13] to do microbatches rather than full

[56:14] batches.

[56:15] >> Mhm. >> And if you do microbatches, then

[56:18] you're by definition not able to

[56:21] amortize

[56:22] loading the weights >> That's

[56:24] right. >> across

[56:25] all the users or all the sequences. And

[56:28] so the positive connotation of that is

[56:31] you don't have to use as much memory.

[56:32] The negative connotation of that

[56:33] is that we can't amortize loading the

[56:36] weights across all those users. Maybe

[56:37] it's worth explaining why you have to do

[56:39] microbatches, because you can't...

[56:40] >> So we draw the pipeline bubble.

[56:42] Yeah.

[56:43] >> Okay. So

[56:46] what is this microbatching that shows up

[56:48] in pipeline parallelism?

[56:51] I'll focus on inference first.

[56:54] It's a slightly simpler problem.

[56:58] I'm going to draw this: this is

[57:00] time, and then this is which

[57:04] rack we're on. The idea is

[57:08] that maybe I'll have four racks. So

[57:10] I've got an inference that is

[57:13] going to step through these four

[57:14] racks over some time, like this.

[57:18] So this is inference number zero.

[57:22] It runs at a certain batch size,

[57:25] and it steps through all the

[57:26] pipeline stages like this. Now if we

[57:28] were to say, well, we're going to run

[57:29] inference number one here, this is

[57:32] clearly a massive waste, right?

[57:34] Three quarters of the time,

[57:37] each of the racks is doing nothing. So

[57:40] we don't actually run inference

[57:41] one here. We run it as soon as we

[57:43] can, which is immediately after

[57:46] inference zero finishes on that rack, like this.

[57:50] And then we keep going. If we hadn't

[57:53] filled this in, we would call this the

[57:54] pipeline bubble. When I've drawn it

[57:56] in this inference context, where we're

[57:57] only doing a forwards pass, it's

[57:59] obvious: why would you do the stupid

[58:01] thing?

[58:02] >> But in a training context uh it's maybe

[58:04] less obvious. But in the inference

[58:06] context, it's it's sort of really

[58:07] natural to to make this change.
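
The time-versus-rack diagram can be mimicked with a tiny simulation. This is a rough sketch under simplifying assumptions that are not part of the discussion: every stage takes exactly one time step, hand-off between racks is free, and the function name and parameters are invented for illustration.

```python
def rack_utilization(num_stages: int, start_gap: int, num_inferences: int) -> float:
    """Fraction of (rack, time-step) slots that are busy.

    Inference i enters rack 0 at step i * start_gap and then moves to the
    next rack on every following step.
    """
    total_steps = (num_inferences - 1) * start_gap + num_stages
    busy = [[False] * total_steps for _ in range(num_stages)]
    for i in range(num_inferences):
        for stage in range(num_stages):
            t = i * start_gap + stage
            # A rack can only work on one inference at a time.
            assert not busy[stage][t], "two inferences on one rack at once"
            busy[stage][t] = True
    busy_slots = sum(cell for row in busy for cell in row)
    return busy_slots / (num_stages * total_steps)


# Start the next inference only after the previous one has left the last rack.
serial = rack_utilization(num_stages=4, start_gap=4, num_inferences=100)
# Start the next inference as soon as rack 0 is free again.
pipelined = rack_utilization(num_stages=4, start_gap=1, num_inferences=100)
print(serial, round(pipelined, 3))  # 0.25 0.971
```

The serial case reproduces the "three quarters of the time each rack is doing nothing" observation; the pipelined case only loses the short fill and drain periods at the ends.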

[58:09] >> Oh, interesting. So, this is sort of

[58:12] obvious, but

[58:14] the difference between microbatch and

[58:16] batch doesn't matter at all in inference,

[58:18] because you can just call it whatever you

[58:20] want.

[58:21] >> Yeah.

[58:21] >> It only matters in training because

[58:25] there is an optimal batch size.

[58:27] >> Yes. >> And before you do the backward

[58:29] step, or rather,

[58:33] before you do a full backward step, you

[58:34] want to have accumulated all the

[58:36] sequences in that batch. And if you want

[58:38] to do pipelining in training, in order to

[58:41] avoid that bubble, you need to...

[58:43] >> Should we draw the training diagram?

[58:45] >> Yeah, let's do that.

[58:47] >> so so this is the inference diagram and

[58:49] I'll call this four just so we don't

[58:51] have the wrong thing showing up there.

[58:52] Um, so let's do the same thing for

[58:54] training. Now we've got a forwards pass,

[58:56] but at some stage we're going to have to

[58:57] transition to a backwards pass. So we'll

[59:01] we'll do some number of uh batches in

[59:03] the forwards pass

[59:11] and then we're going to transition to

[59:12] the backwards pass for everyone all in

[59:13] one go.

[59:24] So the inference part is the same

[59:26] here, but then we do a hard stop at this

[59:28] point and then transition everyone to the

[59:29] backwards pass, with similar numbering, like

[59:32] this. >> It may be worth clarifying: the

[59:34] reason there is that hard stop is

[59:36] because you want to do a whole batch at

[59:38] once for the backward step,

[59:40] and then there is an optimal size for

[59:42] how big that batch should be.
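
The idle time created by that hard stop can be estimated with a standard back-of-the-envelope formula. The formula is not stated in the conversation; it is the usual accounting for an "all forwards, then all backwards" schedule, assuming every microbatch takes the same time on every stage. The function name is made up for this sketch.

```python
from fractions import Fraction


def bubble_fraction(num_stages: int, num_microbatches: int) -> Fraction:
    """Idle share of each rack's time for an all-forward-then-all-backward schedule.

    With p stages and m microbatches, each rack is busy for m slots out of
    m + p - 1 in the forward phase, and the same again in the backward phase.
    """
    return Fraction(num_stages - 1, num_microbatches + num_stages - 1)


few_microbatches = bubble_fraction(num_stages=4, num_microbatches=4)
many_microbatches = bubble_fraction(num_stages=4, num_microbatches=32)
print(few_microbatches, many_microbatches)  # 3/7 3/35
```

More microbatches per optimizer step shrink the bubble, which is one reason the systems side pulls toward larger batches.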

[59:44] >> Yeah. I mean, smaller is always better,

[59:46] actually, is a way to put it.

[59:48] From an ML convergence

[59:51] rate perspective, smaller is always

[59:52] better, because basically you're getting

[59:54] the freshest information from

[59:56] the gradient descent.

[59:57] >> But from a total training time perspective...

[59:58] >> From a total training time perspective,

[01:00:00] smaller is worse, from a systems

[01:00:01] perspective, and so the optimum is the

[01:00:03] trade-off between those two.

[01:00:05] >> So you pick a batch size,

[01:00:09] and then for that batch size you

[01:00:11] do some amount forwards and then

[01:00:12] some amount backwards. >> Yep.

[01:00:14] >> You asked why there is even a hard

[01:00:16] stop there. In pipeline parallelism, because

[01:00:18] of the fact that you've

[01:00:21] got this idle time here, which is the

[01:00:23] bubble, there are so many techniques

[01:00:26] in the literature for how to lay this

[01:00:29] out differently and avoid that. There

[01:00:31] are more complicated schemes called

[01:00:33] zero bubble or one-forward-one-backward,

[01:00:35] which sort of interleave the forwards and

[01:00:37] the backwards in complicated ways, but...

[01:00:40] >> You can mine Bitcoin in that.

[01:00:41] >> Right, right. More usefully, you can do the

[01:00:44] weight gradient step.

[01:00:46] So in inference,

[01:00:50] the effect of pipelining

[01:00:54] on anything you care about, like batch

[01:00:56] size or latency, is actually neutral:

[01:00:58] it doesn't improve it, it doesn't make it

[01:00:59] worse. If you look at the latency of

[01:01:01] this inference if it were

[01:01:02] pipelined versus if it were all on one

[01:01:04] rack: if it were all on one rack, we would

[01:01:06] just slide all of the boxes

[01:01:08] down and still put them in a row, and

[01:01:10] the latency would be the same. So

[01:01:13] pipelining is neither better nor worse

[01:01:15] for latency. But it does mean that

[01:01:18] you use less memory per rack,

[01:01:22] meaning memory capacity, because now instead

[01:01:24] of needing the whole model you only need

[01:01:25] a quarter of the model.

[01:01:26] >> Makes a ton of sense. So basically no

[01:01:29] brainer to use pipelining during

[01:01:31] inference but there's this hardware

[01:01:33] trade-off during training.

[01:01:35] >> So even in inference, in fact, it is

[01:01:37] not used a ton. Say it reduces

[01:01:40] your memory capacity requirements.

[01:01:42] There's actually a huge surplus.

[01:01:45] I think you were saying that a

[01:01:48] rack of Blackwell has many, many

[01:01:50] terabytes, maybe tens of terabytes.

[01:01:52] That's much bigger than a

[01:01:55] trillion-parameter model. A

[01:01:56] trillion-parameter model only needs one

[01:01:57] terabyte. So it already fits, in

[01:02:00] fact. And so there's not a huge benefit

[01:02:01] from pipelining, because

[01:02:04] you're reducing a number that's already

[01:02:05] pretty small.

[01:02:07] >> But it does say that theoretically maybe

[01:02:08] you had too much memory, and maybe you

[01:02:11] could have

[01:02:14] built different hardware that has less

[01:02:15] memory. In fact,

[01:02:16] >> if you were designing your

[01:02:17] hardware and you said, I actually

[01:02:19] didn't need that much memory, because

[01:02:21] I don't need the weights to fit in one

[01:02:22] rack, I can fit the weights in eight

[01:02:23] racks, then I could have maybe

[01:02:27] built hardware that didn't have so

[01:02:28] much HBM per GPU.

[01:02:30] >> Last week, Horace was kind enough to

[01:02:32] give me and my friends a great lecture

[01:02:34] on large scale pre-training systems. And

[01:02:36] there were some concepts that I wanted

[01:02:38] to animate for a write up on my blog

[01:02:40] like how weights shard and gradients flow

[01:02:42] depending on the parallelism that you're

[01:02:44] using. So I gave Cursor my lecture notes

[01:02:47] and a sketch that I made during the

[01:02:48] lecture and I asked it to visualize a

[01:02:52] specific hierarchical collective that

[01:02:53] Horace had explained. The first version

[01:02:55] was already pretty good and then I was

[01:02:56] able to use design mode to select and

[01:02:59] tweak any specific components from

[01:03:00] there. I was able to do all of this

[01:03:02] without a clear end state in mind.

[01:03:03] Cursor's Composer 2 Fast model was

[01:03:05] quick enough that I was able to iterate

[01:03:07] almost instantaneously. I could try an

[01:03:09] idea, test the results in the built-in

[01:03:10] browser and immediately make any

[01:03:12] changes. I went through 10 different

[01:03:14] versions in under 20 minutes. If you

[01:03:15] want to check out this animation, I

[01:03:17] published it along with the lecture

[01:03:18] notes in a blog post. The link is in the

[01:03:20] description. And if you want to try out

[01:03:22] this kind of iterative design flow for

[01:03:23] yourself, go to cursor.com/larch

[01:03:26] to get started. So macro question,

[01:03:31] everybody's talking about the memory

[01:03:32] wall right now. Memory is getting super

[01:03:34] expensive. There's not enough memory.

[01:03:36] Smartphone volume will go down 30%

[01:03:38] because there's not enough memory.

[01:03:39] Hyperscalers are spending, this is

[01:03:41] shocking, I think Dylan said they're

[01:03:43] spending 50% of their capex this year

[01:03:46] >> on memory.

[01:03:47] >> On memory that's believable. Yeah.

[01:03:48] >> So like what is hyperscaler capex? It's

[01:03:51] like high hundreds of billions maybe a

[01:03:53] trillion and they're spending half of

[01:03:55] that on memory. Okay. So that that is a

[01:03:57] huge constraint. That's why we're not

[01:03:59] going to get new laptops and phones this

[01:04:00] year.

[01:04:00] >> Um

[01:04:02] >> But at the same time, we have too

[01:04:04] much memory. People are willing to

[01:04:05] put too much memory into these systems.

[01:04:06] >> Right. So this is...

[01:04:08] >> Like, why is Jensen shoving all

[01:04:10] this memory into these racks if...

[01:04:12] >> Yeah. If you don't need it.

[01:04:13] >> Yeah. So in the

[01:04:15] equations we had here before we erased

[01:04:16] them, we were doing memory time, so

[01:04:18] memory bandwidth, and compute

[01:04:19] bandwidth. Let's now start looking at

[01:04:21] memory capacity.

[01:04:22] >> Yeah.

[01:04:23] >> So we'll start off with just memory

[01:04:25] capacity, without even thinking about the

[01:04:27] parallelism scheme. So

[01:04:32] the capacity of memory, or

[01:04:35] the demand on memory, is the

[01:04:38] number of total parameters,

[01:04:41] plus... so this is what we need to

[01:04:44] fit the weights in whatever system we

[01:04:46] are using,

[01:04:47] and then we need to fit the KVs as

[01:04:50] well. So, KVs go as batch size times the

[01:04:53] length of the context times

[01:04:58] the bytes per token.

[01:05:04] okay so um

[01:05:07] What I was arguing in this context,

[01:05:09] and the case I was making for

[01:05:11] pipelining, is that

[01:05:13] there are some techniques that allow us

[01:05:15] to solve this, and other techniques that

[01:05:17] allow us to solve this. So let's

[01:05:19] consider:

[01:05:20] we're going to run this on some

[01:05:22] number of GPUs, and we're going to

[01:05:23] say we have one extent,

[01:05:25] E, which is going to be the

[01:05:29] expert parallelism.

[01:05:31] So when we had this sharding of the

[01:05:34] expert layer across many GPUs,

[01:05:37] to what extent do we do

[01:05:39] that? How many GPUs?

[01:05:42] We're going to say that this is, in fact,

[01:05:44] for example, 64,

[01:05:46] and then P is going to be the extent of

[01:05:48] pipeline

[01:05:49] pipelining.

[01:05:52] So this is the number of racks,

[01:05:54] which, who knows, maybe we'll pick

[01:05:56] four or something like that.

[01:05:59] What do we want to calculate? So this is

[01:06:00] the total

[01:06:03] memory requirement across the system.

[01:06:07] But now I'm going to calculate a

[01:06:11] memory requirement per GPU. So the

[01:06:15] per-GPU memory requirement:

[01:06:19] I guess I'll use

[01:06:21] a lowercase c_mem.

[01:06:25] Well, obviously we just take all

[01:06:27] these numbers and divide by E times P. Really

[01:06:29] easy. So it's this N_total plus

[01:06:35] the batch times length of context

[01:06:40] times bytes

[01:06:42] per token. All of this is divided by E

[01:06:46] * P.

[01:06:48] Okay. So why is it

[01:06:50] correct to divide it this way? Well,

[01:06:52] we're saying we knew that

[01:06:55] the parameters were perfectly divided

[01:06:56] amongst all the GPUs in a rack.

[01:06:59] And the layers are perfectly

[01:07:01] divided amongst the different

[01:07:03] racks. So that works here. And somehow

[01:07:06] (I'll handwave

[01:07:08] exactly how) we can arrange the

[01:07:10] same perfect sharding of the contexts

[01:07:12] across GPUs in a rack, and then

[01:07:15] based on layer across racks.

[01:07:18] >> And sorry, four is the number of racks?

[01:07:20] >> Yeah, for example.

[01:07:21] >> yeah um

[01:07:23] so um this is the place where we

[01:07:27] actually need to go back and analyze

[01:07:28] this batch size B and you were making

[01:07:30] this comment that there's micro batching

[01:07:32] versus global batching

[01:07:33] So um let's come back to this pipelining

[01:07:37] diagram here. Um we've got one batch

[01:07:39] going forward here and then as I drew

[01:07:42] it, it kind of just like disappeared.

[01:07:44] That's not really correct. If you think

[01:07:45] about um how decode is working, I have a

[01:07:49] bunch of tokens that I have generated

[01:07:51] already. I do one forwards pass where I

[01:07:54] generate a new token, and then

[01:07:58] I write that to my KV

[01:07:59] cache, and then I do another forwards

[01:08:01] pass that generates the next token.

[01:08:03] >> So I'm actually going to be running this

[01:08:05] batch zero in a loop. So

[01:08:08] >> in fact I go forwards once I finish I

[01:08:11] can start the next iteration of the loop

[01:08:12] up here. Yeah.

[01:08:17] So we'll just fill this in. We'll have a

[01:08:24] Oh. Uh, nice. Yes. [laughter]

[01:08:28] Um, yeah. So, we've got the two or three

[01:08:31] little two and three. Uh, two three. Uh,

[01:08:36] So let's split this batch. This batch

[01:08:38] will be the global batch size. So B is

[01:08:42] going to be the number of

[01:08:45] microbatches

[01:08:48] times the batch size

[01:08:52] per microbatch. So how many

[01:08:55] microbatches do we need? The number of

[01:08:56] microbatches in this diagram is four: 0, 1,

[01:08:58] 2, 3. And then the batch size per

[01:09:03] microbatch, the microbatch size, is still

[01:09:06] this 2,000-ish number. This is

[01:09:08] the one that is...

[01:09:11] >> Mhm.

[01:09:11] >> This is the 2,000 times sparsity,

[01:09:15] sorry, no, this is the 300 times

[01:09:18] sparsity.

[01:09:21] >> This is how big the train is

[01:09:22] that takes off every 20 milliseconds.

[01:09:24] >> Right. Yes. This is going to be the

[01:09:26] 20-millisecond train. So the

[01:09:30] global batch size is the number of

[01:09:31] microbatches times the local batch

[01:09:33] size. Local batch size is set by this

[01:09:34] hardware parameter. The number of

[01:09:36] microbatches,

[01:09:37] well, the number of microbatches is as

[01:09:39] small as possible such that we can

[01:09:42] wrap around and not leave any idle

[01:09:45] time when we wrap around. If

[01:09:47] we had fewer, we would have this

[01:09:49] idle time when we wrap around. And so you

[01:09:51] can just visually see that it is

[01:09:52] equal to the number of pipeline stages.

[01:09:54] I mean, sort of proof by visual here:

[01:09:57] it is four, and it's four this way as

[01:09:58] well, but you can look and

[01:10:00] see that it goes along here and then it

[01:10:02] wraps around. So: number of pipeline stages.

[01:10:04] >> Yeah. A very basic question: is this

[01:10:07] what is actually done?

[01:10:09] >> As in, a frontier model today

[01:10:11] will actually, during inference, have

[01:10:13] pipelining?

[01:10:14] >> For sure, during

[01:10:16] massive-scale training this is done.

[01:10:19] It can be done for inference. I'm

[01:10:21] actually going to make the case for why

[01:10:23] it is less attractive: it is useful for

[01:10:26] weights but not so useful for KVs. >> Yeah,

[01:10:28] yeah.

[01:10:29] >> The big challenge is... so let's

[01:10:33] fill this in. The number of microbatches here

[01:10:35] ends up being equal to the number of

[01:10:37] pipeline stages. >> Yep.

[01:10:40] >> When we go back and substitute

[01:10:42] all of that into here,

[01:10:48] we get

[01:10:52] number of pipeline stages times this

[01:10:55] little b

[01:10:57] showing up in here. And then when we

[01:11:00] factor this out, I'm going to split this

[01:11:01] sum into two terms.

[01:11:04] Um

[01:11:08] we get the full division by e * p over

[01:11:10] here.

[01:11:12] We still have division by E * P over

[01:11:14] here, but the P's cancel, this P and

[01:11:16] this P. Um,

[01:11:20] they cancel

[01:11:22] And so what we find is that if you increase the

[01:11:24] number of pipeline stages, the memory

[01:11:27] footprint for the weights

[01:11:28] keeps going down and down and down, but

[01:11:30] the memory footprint for the

[01:11:32] KVs stays constant. So

[01:11:35] it doesn't actually work.

[01:11:37] Once you do

[01:11:41] enough pipelining, and it's really not

[01:11:42] much, even two is often enough,

[01:11:46] this term becomes very small. This

[01:11:48] becomes the dominant term. The KV cache

[01:11:50] becomes the dominant term.
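
Written out, the substitution described above looks roughly as follows. The symbol names are chosen for this sketch ($c_{mem}$ for per-GPU memory, $N_{total}$ for the bytes of weights, $b$ for the microbatch size, $L_{ctx}$ for context length, $\beta_{tok}$ for KV bytes per token); the board notation may have differed.

```latex
\begin{aligned}
c_{mem} &= \frac{N_{total} + B \cdot L_{ctx} \cdot \beta_{tok}}{E \cdot P},
\qquad B = P \cdot b \\[4pt]
c_{mem} &= \frac{N_{total}}{E \cdot P} + \frac{P \cdot b \cdot L_{ctx} \cdot \beta_{tok}}{E \cdot P}
        = \underbrace{\frac{N_{total}}{E \cdot P}}_{\text{weights: shrinks with } P}
        + \underbrace{\frac{b \cdot L_{ctx} \cdot \beta_{tok}}{E}}_{\text{KV cache: independent of } P}
\end{aligned}
```

The factor of $P$ from the number of microbatches cancels the factor of $P$ from sharding layers across racks, so only the weight term benefits from more pipeline stages.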

[01:11:51] >> Yeah, I know this is wrong. I'm trying

[01:11:53] to think out why my logic here is wrong. If

[01:11:56] you're

[01:11:59] pipelining through many different

[01:12:00] stages, the KV values are not shared

[01:12:02] between layers. So why would it not help

[01:12:04] to be pipelining across multiple layers

[01:12:06] because then you don't have to store

[01:12:08] >> Yeah, you only need to store like one

[01:12:09] layer rather than two layers of KVs,

[01:12:11] right? So So it helps from that

[01:12:13] perspective. You're right. Um

[01:12:16] what's competing with that though is

[01:12:17] that you need to be keeping all of the

[01:12:19] racks usefully busy at a time. And so

[01:12:22] the number of sequences that are in

[01:12:24] flight simultaneously has gone up.

[01:12:25] >> Ah yeah yeah yeah makes sense makes

[01:12:27] sense makes sense.

[01:12:27] >> So those exactly cancel and and you end

[01:12:29] up not getting a saving per GPU.
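
The cancellation can be checked numerically. This is only a sketch: the function name is invented, the 1 TB of weights and E = 64 follow the figures mentioned in the discussion, the microbatch size of 2048 stands in for the "2,000-ish" number, and the context length and KV bytes per token are placeholder values picked purely for illustration.

```python
def per_gpu_memory_bytes(
    weight_bytes: int,
    microbatch_size: int,
    context_len: int,
    kv_bytes_per_token: int,
    expert_parallel: int,
    pipeline_stages: int,
) -> tuple[float, float]:
    """Return (weight bytes per GPU, KV-cache bytes per GPU)."""
    # Keeping every rack busy needs one microbatch in flight per pipeline stage.
    global_batch = pipeline_stages * microbatch_size
    gpus = expert_parallel * pipeline_stages
    weights = weight_bytes / gpus
    kv_cache = global_batch * context_len * kv_bytes_per_token / gpus
    return weights, kv_cache


common = dict(
    weight_bytes=10**12,        # ~1 TB of weights, as in the discussion
    microbatch_size=2048,       # stands in for the "2,000-ish" microbatch
    context_len=32_768,         # placeholder assumption
    kv_bytes_per_token=70_000,  # placeholder assumption
    expert_parallel=64,         # E = 64, as in the discussion
)
weights_p1, kv_p1 = per_gpu_memory_bytes(pipeline_stages=1, **common)
weights_p4, kv_p4 = per_gpu_memory_bytes(pipeline_stages=4, **common)
print(weights_p1 / 1e9, weights_p4 / 1e9)  # weight GB per GPU drops 4x
print(kv_p1 / 1e9, kv_p4 / 1e9)            # KV GB per GPU is unchanged
```

With these placeholder numbers the KV term is already larger than the weight term at P = 1, and adding stages only widens that gap.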

[01:12:31] >> Right. This is going back fundamentally

[01:12:33] to the point that you're not able

[01:12:34] to amortize across KV caches.

[01:12:37] >> Yeah. Well, first we said you can't

[01:12:39] amortize KV caches across batch size, and

[01:12:41] now we're saying you also can't shard

[01:12:43] them across pipeline stages.

[01:12:48] It sucks from both of those points of view.

[01:12:49] >> Yeah. Yeah. Interesting.

[01:12:50] >> Okay. So then what is done during

[01:12:51] inference?

[01:12:52] >> So the DeepSeek paper

[01:12:55] reports what they do, which is

[01:12:57] they just do a lot of expert

[01:12:58] parallelism. In effect, you

[01:13:01] should increase your expert parallelism

[01:13:03] up to your scale-up domain size,

[01:13:05] and then do very little pipelining:

[01:13:08] maybe none at all, maybe two, just

[01:13:11] enough to make the weight storage

[01:13:13] not too big of an issue. Those are

[01:13:16] the only two parallelisms that really

[01:13:17] make sense. In the past there was

[01:13:19] tensor parallelism, which was

[01:13:21] cutting up within an expert, but the

[01:13:25] experts are so small now that that

[01:13:26] is not a profitable optimization.

[01:13:29] >> So this goes back to the question: does

[01:13:30] that mean that frontier labs, when

[01:13:32] they're doing inference, are

[01:13:33] basically within a single scale-up domain?

[01:13:35] >> Yes. I mean, you can look at how

[01:13:37] it depends on model size. You

[01:13:41] could have a very large model,

[01:13:46] one that exceeds the memory of a

[01:13:48] rack, and there you should be

[01:13:50] doing a bit of pipelining. Maybe

[01:13:52] it's extremely sparse, for example, and

[01:13:54] that would be a reason to do it.

[01:13:55] >> So I guess this goes back to the

[01:13:57] question about, or rather, this goes back to

[01:13:58] the promise at the beginning of the

[01:13:59] lecture, which was that this will actually

[01:14:01] tell you about AI progress as well. To

[01:14:03] the extent it is the case that model

[01:14:05] size scaling has been slow until

[01:14:07] recently,

[01:14:08] because...

[01:14:10] let me make sure I understand the claim.

[01:14:12] The claim would not be you could have

[01:14:14] trained across more racks. It was

[01:14:17] just that it would not have made sense

[01:14:18] before, since we didn't have the ability

[01:14:20] to do inference for a bigger model

[01:14:23] easily.

[01:14:24] >> Actually, pipelining

[01:14:27] doesn't help with context length. It

[01:14:29] totally helps with model size. And so,

[01:14:31] because of the ability to do

[01:14:33] pipelining,

[01:14:35] at least a rack should not be a

[01:14:37] constraint on your ability to fit the

[01:14:38] model parameters. I guess the other

[01:14:40] consideration you're asking about is why

[01:14:41] hasn't it scaled up more, and why did

[01:14:43] bigger scale-up domains help.

[01:14:45] >> So we talked through one aspect of

[01:14:47] that, which is, we kind of said it's

[01:14:49] not because of memory capacity. We

[01:14:52] have a solution to memory capacity,

[01:14:53] at least with respect to model size.

[01:14:55] Not with respect to KV

[01:14:58] cache size, but at least with respect to

[01:14:59] model size we have a solution to memory

[01:15:01] capacity. The other issue that shows

[01:15:04] up is latency.

[01:15:06] >> I was just about to ask:

[01:15:08] going from rack to rack, what is the

[01:15:11] latency cost per hop?

[01:15:13] >> This is very much dependent on the

[01:15:15] hardware.

[01:15:18] I can't say with a lot of

[01:15:21] authority. I think it's probably on the

[01:15:22] order of a few milliseconds, but it

[01:15:24] could be off by an order of magnitude.

[01:15:25] >> Is four a realistic number for how many

[01:15:27] pipelining stages you might have?

[01:15:28] >> Yeah. Yeah.

[01:15:29] >> Okay. So that's not...

[01:15:30] >> With a small number of pipelining

[01:15:32] stages, this is not a huge latency

[01:15:35] impact.

[01:15:35] >> Wait, I guess it's 10 milliseconds per

[01:15:38] token.

[01:15:39] >> That's right.

[01:15:39] >> Two times four-ish,

[01:15:42] or I don't know how many you said, but

[01:15:44] >> Yeah. 10 milliseconds per token is

[01:15:45] actually a lot.

[01:15:46] >> Yeah. If it goes from 20 to 30,

[01:15:48] right? Or something like that. Yeah.

[01:15:49] So just to chart the

[01:15:52] path that it goes through: here

[01:15:54] you're going from your GPU or

[01:15:57] TPU or whatever to a network card,

[01:16:03] which then goes to a top-of-rack

[01:16:06] switch,

[01:16:08] and then hops over to the other rack and

[01:16:10] does the same thing in reverse.

[01:16:12] So you have to sum up the

[01:16:13] latencies of these different things.

[01:16:15] >> So this is the same thing as the DC...

[01:16:18] >> It may in fact go up to a DC switch

[01:16:20] and back. It depends on deployment

[01:16:22] configuration.

[01:16:22] >> Got it. Yeah. And because decode

[01:16:26] is sequential,

[01:16:30] they stack up across the stages. You

[01:16:32] can't do them at the same time.

[01:16:34] >> That's right. Yeah.
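
The back-of-the-envelope latency estimate from this exchange, as a short sketch. The 2 ms per hop is only the "few milliseconds, could be off by an order of magnitude" guess from the conversation, the 20 ms base is the per-token step time mentioned above, and the function name is invented here.

```python
def per_token_latency_ms(base_ms: float, num_stages: int, hop_ms: float) -> float:
    """Decode latency per token when every pipeline stage adds one rack-to-rack hop.

    The hops add up rather than overlap, because each token depends on the
    previous one and must pass through every stage in order.
    """
    return base_ms + num_stages * hop_ms


single_rack = per_token_latency_ms(base_ms=20.0, num_stages=0, hop_ms=2.0)
four_stages = per_token_latency_ms(base_ms=20.0, num_stages=4, hop_ms=2.0)
print(single_rack, four_stages)  # 20.0 28.0
```

That is roughly the "from 20 to 30" jump mentioned above; with a hop latency ten times smaller the penalty would be under one millisecond.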

[01:16:35] >> Okay. So I I guess this brings us back

[01:16:36] to the question then. Is the size of the

[01:16:39] scale-up at all relevant to why AI model

[01:16:42] sizes have been what they

[01:16:44] have been over the last few years,

[01:16:45] whether through training

[01:16:47] or through inference?

[01:16:47] >> Yeah. So I mean we talked about latency

[01:16:49] of the hop um of the of this hop. Um

[01:16:53] there is also just the same t_mem

[01:16:56] latency. The memory time latency is

[01:16:59] actually massively

[01:17:01] improved by larger scale-up domains. So

[01:17:04] I'll recall t_mem down here: t_mem

[01:17:09] for the weights.

[01:17:12] t_mem of the weights.

[01:17:18] this was equal to the number of total

[01:17:20] parameters

[01:17:24] divided by the memory bandwidth.

[01:17:28] Which memory bandwidth are we talking

[01:17:30] about here? Is it just one GPU?

[01:17:32] In fact, it is the number of

[01:17:35] GPUs that I can use in parallel to to

[01:17:37] load these weights. So, um I can't use

[01:17:40] different pipeline stages in parallel

[01:17:42] because they they're not running at the

[01:17:43] same time, but I can use all the GPUs in

[01:17:46] my scaleup domain in parallel to load

[01:17:47] the weights.

[01:17:48] >> And so, um this is actually extremely

[01:17:51] effective. Um so, uh basically I end up

[01:17:55] with a term here. This this memory

[01:17:57] bandwidth term itself is equal to um

[01:18:00] like scale up size

[01:18:03] >> times memory bandwidth per GPU.

[01:18:04] >> Yeah. Yeah. Times GPU bandwidth. Um uh

[01:18:08] and so this term doesn't increase a lot.

[01:18:11] It maybe increases 1.5 or 2x per

[01:18:13] generation. But this one increased by

[01:18:14] like a factor of eight um from from

[01:18:16] >> So the reason the bigger scale-up

[01:18:17] matters is not the memory capacity of the

[01:18:19] whole scale-up but really the

[01:18:21] memory bandwidth.

[01:18:21] >> Yeah. Yeah. Pipelining totally solves

[01:18:23] the capacity problem, but um but uh uh

[01:18:27] scale up size helps solve the bandwidth

[01:18:29] problem
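
A minimal sketch of the weight-load time described above, assuming every GPU in the scale-up domain streams its shard of the weights in parallel. The function name, the one-byte-per-parameter assumption, and all numeric values are illustrative and do not come from the talk.

```python
def weight_load_time_s(total_params, bytes_per_param, gpus_in_scaleup, hbm_bw_per_gpu):
    """Time to stream all weights once from memory.

    Pipeline stages do not help here because they do not run at the same
    time; only GPUs inside one scale-up domain load weights in parallel.
    """
    total_bytes = total_params * bytes_per_param
    aggregate_bw = gpus_in_scaleup * hbm_bw_per_gpu  # scale-up size x per-GPU bandwidth
    return total_bytes / aggregate_bw


# Illustrative assumptions: 1e12 total params, 1 byte each, 3e12 B/s of HBM per GPU.
t_small = weight_load_time_s(1e12, 1, 8, 3e12)   # small 8-GPU box
t_large = weight_load_time_s(1e12, 1, 64, 3e12)  # 8x larger scale-up domain

print(f"8 GPUs:  {t_small * 1e3:.1f} ms per pass over the weights")
print(f"64 GPUs: {t_large * 1e3:.1f} ms per pass over the weights")
```

With per-GPU bandwidth held fixed, the time falls in direct proportion to the scale-up size, which is why the eightfold jump in domain size matters more than the per-generation bandwidth gain.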

[01:18:30] >> and the bandwidth problem helps you do

[01:18:32] longer context lengths which is more and

[01:18:35] more relevant as these models get more

[01:18:36] agentic.

[01:18:37] >> Yeah, it lets you just run the model at

[01:18:38] lower latency um uh as a first thing

[01:18:41] like if I just do a very fast model and

[01:18:43] it's on like a little like H100 box.

[01:18:45] Yeah.

[01:18:46] >> Um uh the latency will be really high.

[01:18:49] >> Yeah. Okay. A super tangential question.

[01:18:53] There's Chinchilla scaling, which tells

[01:18:55] you how how big should a model be

[01:18:57] relative to the amount of data you're

[01:18:58] going to train it on. Um

[01:19:02] but now obviously you're not just trying

[01:19:03] to optimize for the highest quality

[01:19:07] model you can get with training compute.

[01:19:09] You want the best results a user can get

[01:19:11] with a mixture of training and inference

[01:19:12] compute. Mhm.

[01:19:14] >> So then there's a question of how much

[01:19:16] should you overtrain a model

[01:19:18] >> such that that compute amortized over

[01:19:20] training and inference is minimized to

[01:19:23] get a certain performance. But now with

[01:19:25] RL there's

[01:19:27] another consideration which is you're

[01:19:30] going to do some amount of pre-training

[01:19:32] that pre-training will be used both for

[01:19:34] RL generation

[01:19:36] >> and then for inference for the final

[01:19:38] user. And by overtraining here I mean

[01:19:41] while it would have been more efficient

[01:19:42] just from a training compute

[01:19:42] perspective to have a bigger model

[01:19:45] >> that you train for less time because it

[01:19:47] can learn faster, maybe you get a

[01:19:49] smaller model, you spend more compute training

[01:19:50] it than you otherwise would have, but now

[01:19:52] it's cheaper to give it to users.

[01:19:54] Okay, let me make the question

[01:19:56] more concrete: how much more than

[01:19:58] Chinchilla-optimal are models

[01:19:59] overtrained,

[01:20:01] >> and has that changed as a result of RL

[01:20:02] generation?

[01:20:03] >> this is a place where we have to do a

[01:20:04] bit of guess work because like the um

[01:20:06] the updated scaling laws and the usage

[01:20:08] and model traffic are not reported and

[01:20:10] so we have to guess there. Um but uh one

[01:20:14] way to look at it um

[01:20:19] let me first just make a sort of a

[01:20:22] general heuristic claim: if I have

[01:20:24] some like cost and I've got a total cost

[01:20:28] which is a sum of like cost A and cost B

[01:20:32] like maybe this is the training cost and

[01:20:33] this is the inference cost. Yeah. Um and

[01:20:36] so I want to minimize this sum

[01:20:39] for many curves that end up

[01:20:43] being the case, the minimum tends to be

[01:20:45] where these are where the costs are

[01:20:46] equalized. Um that's something of a

[01:20:48] heuristic claim, but

[01:20:51] there are many examples

[01:20:52] where it's true, like where one is

[01:20:54] 1/x and the other one is x, for

[01:20:56] example. Um they tend to be minimized at

[01:20:59] at the point where uh they equal each

[01:21:02] other. Um it's also true for like um e

[01:21:05] to the x and like e to the minus x and

[01:21:07] all kinds of other things. Um uh like so

[01:21:10] basically I've got some I've got some

[01:21:13] curve that's going down, some other

[01:21:14] curve that's going up and they tend to

[01:21:15] be minimized at this equal point. Um
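
The equal-cost heuristic can be checked numerically for the 1/x versus x example mentioned above. This is a sketch only: the coefficients `a` and `b` and the search grid are arbitrary choices made for illustration.

```python
def cost_a(x, a=4.0):
    return a / x  # falling curve, e.g. a cost that shrinks as x grows


def cost_b(x, b=1.0):
    return b * x  # rising curve, e.g. a cost that grows with x


def total_cost(x):
    return cost_a(x) + cost_b(x)


# Brute-force search over a grid on (0, 10].
xs = [0.01 * i for i in range(1, 1001)]
x_best = min(xs, key=total_cost)

print("argmin x:", x_best)
print("cost_a at argmin:", cost_a(x_best))
print("cost_b at argmin:", cost_b(x_best))
```

For general power laws A·x^(-p) + B·x^q the two terms at the minimum are in the ratio q/p rather than exactly equal, so the heuristic holds up to a constant factor set by the exponents.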

[01:21:20] Heuristically I will conjecture that that

[01:21:22] is true um for the setup you described

[01:21:25] as well. um uh like actually showing

[01:21:28] that that would be true would require

[01:21:30] looking at the scaling laws and um and

[01:21:33] like fitting these like weird exponents.

[01:21:36] Um but but things that do follow power

[01:21:37] laws tend to tend to have this property.

[01:21:39] So I'll just make that claim and move

[01:21:40] on. Um so we're going to say that the uh

[01:21:45] cost of training

[01:21:47] um plus the cost of inference um we want

[01:21:49] to equalize these um

[01:21:53] uh we'll do pre-training only first

[01:21:55] because it's a little well actually we

[01:21:56] can do all of it in general. So so

[01:21:58] actually we'll we'll cost it as um cost

[01:22:00] of pre-training. So number of uh so

[01:22:04] number of number of active params

[01:22:07] um times the data on pre-training.

[01:22:11] So that's the cost of pre-training.

[01:22:13] There's a factor of six out here which

[01:22:14] is the number of flops. Um there's the

[01:22:17] famous 6ND formula. Um and then in

[01:22:21] RL we have approximately the same thing.

[01:22:24] We've got like same number of active

[01:22:25] parameters. Um but now it's uh the

[01:22:28] amount of data is the RL data. Um

[01:22:31] there's this extra like efficiency

[01:22:33] multiplier which is um or inefficiency

[01:22:35] like the um the inefficiency um

[01:22:41] uh

[01:22:42] >> which is the fact that you're not

[01:22:44] training on all your rollouts.

[01:22:45] >> Well, yeah, there there's that. Um and

[01:22:48] then the other perhaps even bigger

[01:22:49] inefficiency is that

[01:22:52] um this involves a substantial amount of

[01:22:54] decode and often decode runs at uh less

[01:22:56] MFU than than than training.

[01:22:58] >> Okay. So if you're doing a backward pass

[01:23:01] on every single generation in RL it

[01:23:04] would be six ND.

[01:23:05] >> Yeah. So this could be a smaller number,

[01:23:07] right? Like this could be somewhere. So

[01:23:09] um

[01:23:09] >> it would at least be two somewhere in

[01:23:11] the range of two to six. So we'll just

[01:23:13] like we'll say somewhere in the range of

[01:23:14] two to six and leave it at that.

[01:23:16] >> Yeah. uh um and then and then we can add

[01:23:19] in the inference cost. Um the inference

[01:23:20] cost is two um number of active uh times

[01:23:25] the data in inference.

[01:23:28] Right, I think the way I said it was

[01:23:29] super garbled. So just for the

[01:23:31] audience:

[01:23:33] forward plus backwards per parameter is

[01:23:36] six.

[01:23:37] >> Forward alone is two. That's why RL

[01:23:41] where you might you're definitely going

[01:23:42] to generate all the trajectories but you

[01:23:44] might or might not train on all the

[01:23:45] trajectories is two to six. Yes. Yeah.

[01:23:47] Thank you. Um and then inference is is

[01:23:49] is just two.

[01:23:49] >> Yeah.
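
The three cost terms just described can be written down directly. This is a sketch under stated assumptions: the function names, the default of 4 FLOPs per parameter per token for RL, and the example sizes are hypothetical; only the 6, 2-to-6, and 2 multipliers come from the discussion.

```python
def pretrain_flops(n_active, tokens):
    # Forward plus backward: about 6 FLOPs per active parameter per token.
    return 6 * n_active * tokens


def rl_flops(n_active, tokens, flops_per_param_token=4):
    # Every rollout needs a forward pass (2); training on it adds the backward
    # pass (up to 6 total). The default of 4 is an arbitrary midpoint.
    if not 2 <= flops_per_param_token <= 6:
        raise ValueError("expected a multiplier between 2 and 6")
    return flops_per_param_token * n_active * tokens


def inference_flops(n_active, tokens):
    # Forward only: about 2 FLOPs per active parameter per token.
    return 2 * n_active * tokens


def machine_cost(flops, mfu):
    # Lower utilization (e.g. during decode) means more machine time per FLOP.
    return flops / mfu


# Illustrative sizes: 100e9 active params, 150e12 tokens in each phase.
n_active = 100e9
pre = pretrain_flops(n_active, 150e12)
inf = inference_flops(n_active, 150e12)
print(f"pre-training FLOPs: {pre:.2e}")
print(f"inference FLOPs:    {inf:.2e}")
print(f"ratio at equal token counts: {pre / inf:.1f}")
```

At equal token counts and equal utilization, pre-training costs three times as much as inference, which is where the one-third factor in the later estimate of beta comes from.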

[01:23:50] >> So we're going to solve for essentially

[01:23:52] maybe equality of all three of these

[01:23:54] terms that is ballpark where people are

[01:23:56] going to be like

[01:23:58] >> uh labs have more information on on what

[01:24:00] is productive in doing more RL for

[01:24:03] example than versus doing more

[01:24:04] pre-training. I don't have that

[01:24:05] information but I think a good ballpark

[01:24:07] is like a 33% split between each

[01:24:10] of them.

[01:24:10] >> Actually I'm not sure I understand the

[01:24:11] intuition for that. Um,

[01:24:15] another naive model could have been that

[01:24:16] RL plus pre-training would be 50%, and

[01:24:19] inference would be 50%.

[01:24:20] >> Yeah, that that's also a valid uh answer

[01:24:22] as well. Because this is heuristic, I

[01:24:25] can't really argue for one versus the

[01:24:27] other. They don't differ by that much

[01:24:28] like 33 versus 25 is only a small

[01:24:30] factor.

[01:24:32] >> Um,

[01:24:34] [snorts]

[01:24:35] uh, so let's pick one of them. Uh, all

[01:24:38] equal seems uh simple enough. Um

[01:24:42] um and so we're just going to solve for

[01:24:43] equality of them. It's pretty

[01:24:44] straightforward. We can immediately see

[01:24:45] that the number of activated parameters

[01:24:47] totally disappears. And so let's factor

[01:24:48] that out. And we're going to just say

[01:24:50] that uh data in pre-training

[01:24:54] I decided to do it your way. It's a

[01:24:56] little bit nicer actually. So data in

[01:24:58] pre-training plus um this uh oh I didn't

[01:25:03] have the inefficiency over here either.

[01:25:05] um inefficiency um data in pre-training

[01:25:09] plus um some multiple of like alpha

[01:25:13] times the data in RL

[01:25:16] is just going to be and end up equal to

[01:25:19] some multiple, beta, times the

[01:25:23] data in inference. Um

[01:25:27] so uh and then let's just like roughly

[01:25:30] size the alpha. This this this alpha

[01:25:32] it's going to be um

[01:25:35] uh this is like the it's maybe somewhere

[01:25:38] in the range of 2 to 6 uh 2 to 6 over 6

[01:25:41] um from this term compared to this term.

[01:25:44] Um and then we've got an inefficiency

[01:25:46] term which uh I would say is maybe in

[01:25:48] the range of like 30% or something like

[01:25:49] that. Um so uh so so this alpha is going

[01:25:54] to be something like 1 in 10, 1/10,

[01:25:58] say, and this beta here is actually

[01:26:02] the same. It's a third, it's 1/3 times

[01:26:05] 30%. So it also equals 1 in 10,

[01:26:09] something like that.

[01:26:11] If both of them are one in 10, that

[01:26:12] kind of implies that there's never a

[01:26:13] backward pass on RL.

[01:26:15] >> Yeah. Okay. We can make this like two

[01:26:16] in 10. Make it a bit bigger. Yeah. So

[01:26:20] yeah, like just write it out once more

[01:26:21] like this is two 2 over 10, this is 1

[01:26:25] over 10. Um, so the number of inference

[01:26:28] tokens you have and this is just a

[01:26:30] function of like I've got hundreds of

[01:26:32] millions of tokens per second um times

[01:26:34] my model is deployed for I don't know

[01:26:37] two months before I shift shift to the

[01:26:39] next version. um that should determine

[01:26:43] the um the number of uh tokens in in RL

[01:26:47] and pre pre-training and then I guess we

[01:26:49] didn't do the equivalence between

[01:26:50] pre-training and and RL so we'll do that

[01:26:52] here data pre-training should be equal

[01:26:54] to like 2 over 10 * data in RL for them

[01:26:57] to be cost equivalent um so

[01:27:03] sorry this one over I got it backwards

[01:27:05] uh like we pay more cost when it's

[01:27:08] inefficient so it's this needs to be one

[01:27:10] over. Um

[01:27:12] uh um so this tracing this back uh back

[01:27:16] forward

[01:27:17] >> um this this thing ends up actually

[01:27:19] being as written here it's like yeah so

[01:27:23] this is like 1.5 and this is one um

[01:27:27] >> um

[01:27:28] >> billions of dollars worth of compute

[01:27:29] just flowed the other direction.

[01:27:30] >> Yeah. Right. Right. [laughter]

[01:27:33] >> I think like if you do it with a

[01:27:34] spreadsheet and actually write it out, you

[01:27:36] might notice when the money is going

[01:27:37] down the drain. Yeah. Yeah. Um so uh

[01:27:40] yeah so I think this yeah all of these

[01:27:42] end up being close in as modeled here.

[01:27:45] This 30% may have been a little bit too

[01:27:46] generous. Um so let's say something like

[01:27:48] 1.5 here and and leave this as a one

[01:27:51] here. So I think it like at this point

[01:27:55] you can almost read it off like the

[01:27:56] number of inference tokens should be

[01:27:58] about the same as the number of

[01:27:58] pre-training tokens should be about the

[01:28:00] same as the number of RL tokens um

[01:28:02] within like factors that we're not able

[01:28:04] to reason about. But then so it looks

[01:28:07] sorry, I may be making a basic arithmetic error, it

[01:28:09] seems like there should be fewer

[01:28:10] RL tokens than pre-training tokens and

[01:28:12] >> yes that's in general right because RL

[01:28:14] is less efficient um in terms of machine

[01:28:17] time and so uh you

[01:28:21] um if you're trying to equalize the RL

[01:28:23] and and pre-training time then then you

[01:28:24] should have fewer tokens in order to

[01:28:26] have the same wall time. This is all

[01:28:29] quite interesting that um I never

[01:28:31] thought about it in terms of how much

[01:28:34] equalizing in terms of data I I I I mean

[01:28:36] I think starting with equalizing and

[01:28:37] cost is right but uh depending on how

[01:28:40] you model the cost this comes close to

[01:28:41] equalizing in data

[01:28:42] >> That is,

[01:28:45] basically, for GPT to be trained

[01:28:48] optimally, across every single user who uses

[01:28:50] GPT-5, the total amount of tokens that

[01:28:52] they stream should equal the

[01:28:54] total amount that have gone into

[01:28:54] pre-training.

[01:28:55] >> Yeah.

[01:28:56] >> And the total amount of tokens that got

[01:28:57] into pre-training is the sum of all

[01:28:59] human knowledge. So like each model

[01:29:02] should generate the sum of human

[01:29:04] knowledge on the output that it gets on

[01:29:05] the input.

[01:29:06] >> Yeah. So I mean which way are people

[01:29:07] going to err? Like uh if you think

[01:29:10] that people's power of prediction is not

[01:29:12] perfect and and also um you run the risk

[01:29:15] that your um that you make a model that

[01:29:18] is not a frontier model and then you

[01:29:19] just throw it away. um then then like

[01:29:22] that kind of changes the cost trade-off

[01:29:24] because there's some like probability

[01:29:26] that applies to the inference and you

[01:29:28] should derate the inference tokens by

[01:29:29] some amount

[01:29:30] >> right

[01:29:31] >> and then can we back out how much more

[01:29:35] yeah compute than chinchilla optimal for

[01:29:37] a given sized

[01:29:39] >> model

[01:29:40] >> so I think we just have to make some

[01:29:41] real world assumptions here in order to

[01:29:43] do that so um so the inference tokens we

[01:29:47] should totally be able to catch right

[01:29:48] like so um let's say a 200 million I

[01:29:51] don't know maybe it's like uh 500

[01:29:52] million tokens a second now I don't

[01:29:54] really know um 500 million tokens a

[01:29:57] second times a model is deployed for 2

[01:30:00] months before it becomes obsolete I

[01:30:02] don't really know um

[01:30:05] uh I can't do this in my head can you

[01:30:08] computer um

[01:30:11] 2.6 × 10^15.

[01:30:15] >> Okay, 2.6 × 10^15. Okay. Um

[01:30:20] um this number is probably too large.

[01:30:22] This um because this is going to be

[01:30:24] multiple models in a family. We So let's

[01:30:27] let's make it like

[01:30:29] five times smaller or 10 times smaller

[01:30:31] or something like that. Um uh okay. So

[01:30:35] we're estimating maybe 50 million tokens

[01:30:38] per second per per specific model. The

[01:30:41] model is live for two months. Um and so

[01:30:45] uh this comes out to around 200 uh

[01:30:48] trillion tokens. Um and then we want to

[01:30:51] compare that to active parameters on a

[01:30:54] um frontier model. I don't actually know

[01:30:56] the latest rumors but um some

[01:31:00] do do you know?

[01:31:01] >> Somebody told me 150 trillion

[01:31:03] >> Active params?

[01:31:03] >> So sorry I meant tokens

[01:31:06] >> trained on 150 trillion tokens.

[01:31:07] Interesting.

[01:31:08] >> Which which is similar.

[01:31:09] >> Yeah. Yeah. That's actually similar. So

[01:31:10] um so data on pre-training

[01:31:13] >> This is not well cited, but

[01:31:14] >> you want me to not remove this um

[01:31:17] >> And I think often the number

[01:31:19] of active params could be in the range of

[01:31:22] like uh 100 billion something like that

[01:31:26] maybe maybe a bit larger um uh so I'm

[01:31:29] assuming active params of about 100

[01:31:30] billion and so multiply by 20 to get the

[01:31:32] Chinchilla token count. So

[01:31:35] D_Chinchilla would be around two

[01:31:40] trillion

[01:31:42] and yeah and we see like we're at 100

[01:31:44] times larger than uh than that
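
The back-of-envelope numbers in this stretch can be reproduced in a few lines. This sketch takes "two months" as 60 days, which is an assumption; the traffic figures, the 100 billion active parameters, and the 150 trillion token count are the guesses and rumors quoted in the conversation, not verified data.

```python
SECONDS_PER_DAY = 86_400
DEPLOY_DAYS = 60  # "two months", taken here as 60 days (assumption)

family_tokens_per_s = 500e6  # guess for a whole model family
model_tokens_per_s = 50e6    # scaled down 10x for one specific model

family_tokens = family_tokens_per_s * DEPLOY_DAYS * SECONDS_PER_DAY
model_tokens = model_tokens_per_s * DEPLOY_DAYS * SECONDS_PER_DAY

n_active = 100e9                   # guessed active parameter count
chinchilla_tokens = 20 * n_active  # rule of thumb: about 20 tokens per parameter
rumored_pretrain_tokens = 150e12   # rumor quoted in the conversation

print(f"family inference tokens: {family_tokens:.2e}")
print(f"model inference tokens:  {model_tokens:.2e}")
print(f"Chinchilla token count:  {chinchilla_tokens:.2e}")
print(f"overtraining vs inference estimate: {model_tokens / chinchilla_tokens:.0f}x")
print(f"overtraining vs rumored tokens:     {rumored_pretrain_tokens / chinchilla_tokens:.0f}x")
```

Both ratios land within a factor of two of the roughly 100 times figure quoted, which is as much precision as the inputs support.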

[01:31:47] >> Actually, what does D_Chinchilla

[01:31:48] actually mean?

[01:31:49] >> like the token count for pre-training

[01:31:51] for um that the chinchilla scaling law

[01:31:55] would recommend I guess um

[01:31:56] >> oh I see so how much is it overtrained

[01:31:59] >> got it

[01:31:59] >> So yeah, the ratio of this 200

[01:32:02] trillion or 100 trillion tokens over

[01:32:04] the Chinchilla optimal

[01:32:08] of two trillion. And that's the amount

[01:32:10] it's overtrained, which is, in effect,

[01:32:11] 100 times overtrained perhaps. That's what

[01:32:13] okay so if you consider this right here

[01:32:16] to the extent this is in the right

[01:32:17] ballpark just by thinking about okay you

[01:32:19] kind of want everything to be equal in

[01:32:21] terms of compute, if

[01:32:25] OpenAI also realizes that and they're

[01:32:26] serving a certain amount of tokens per

[01:32:28] second that tells you how much data went

[01:32:31] into the pre-training of GPT-5.

[01:32:33] >> it even if it's like 50% off or

[01:32:36] something that is that is sort of wild

[01:32:38] that you can sort of first principles,

[01:32:40] >> these kinds of numbers.

[01:32:41] >> This is also I mean this is why you

[01:32:42] should just like approximate everywhere

[01:32:43] because there are such big error bars

[01:32:45] on this but yeah know it's kind of like

[01:32:47] empowering to just like set a equal to b

[01:32:49] and figure it out.

[01:32:49] >> Yeah. Yeah. That's super cool.

[01:32:51] >> Okay. So, in the spirit of trying to

[01:32:53] deduce things. We can publicly look up

[01:32:56] the prices of the APIs of these models

[01:33:00] and um maybe you can learn something

[01:33:02] from that. Uh first with a longer

[01:33:05] context um [clears throat]

[01:33:07] Gemini 3.1 is

[01:33:11] um 50% more expensive if you go over

[01:33:14] 200k tokens than if you're below 200k

[01:33:17] tokens.

[01:33:19] I mean

[01:33:21] at a high level I understand why that

[01:33:22] might be, but why specifically 50%?

[01:33:25] >> Yeah. Um so I mean why specifically 50%.

[01:33:28] Let's let's sort of um so so the high

[01:33:31] level uh even in the first place is um

[01:33:35] >> there is some amount of uh increasing

[01:33:37] cost with with context length.

[01:33:38] >> Yeah.

[01:33:39] >> Um and

[01:33:41] >> and uh we can bring that back up. That

[01:33:43] was the um the the memory time versus

[01:33:47] the compute time. So um okay so we we've

[01:33:50] put up these same equations from before

[01:33:52] of the the time for memory fetches which

[01:33:54] is the weights and the KV cache. um

[01:33:57] and then the the time for the compute

[01:33:59] which is just the uh matrix

[01:34:00] multiplications for the weights.

[01:34:03] I I will also draw the um the cost

[01:34:05] curve.

[01:34:12] Um but this time I'll do it as a

[01:34:13] function of context length instead of as

[01:34:15] a function of batch size. Um so this is

[01:34:18] time over uh yeah just just time. Uh so

[01:34:22] this is the cost curve as a function of

[01:34:23] context length. Um

[01:34:26] we'll draw the compute. Um the the the

[01:34:28] cost of the compute is actually constant

[01:34:30] as a function of context length. There's

[01:34:31] no dependence here on context length. In

[01:34:33] reality, there is some dependence, but

[01:34:35] it is very mild dependence, so we'll

[01:34:36] ignore it. Um so this is the um time for

[01:34:41] the compute

[01:34:47] this one. Uh and then we'll also draw

[01:34:49] the dependence uh of the memory fetch on

[01:34:51] on context length. And this starts at a

[01:34:54] large number for the weights and then

[01:34:56] grows gradually with um with the context

[01:34:58] length. So uh maybe here um and then

[01:35:03] grow gradually with context length.

[01:35:09] And so you take the maximum and you see

[01:35:11] there is this inflection point here. So

[01:35:13] now so this is the costs that uh that

[01:35:15] that for example Gemini might be paying.

[01:35:17] Um and then you think how how how might

[01:35:20] you put a pricing structure on top of

[01:35:21] that? um you would like to ensure that

[01:35:24] no matter what the context length is,

[01:35:25] you are you are still profitable. So

[01:35:28] >> interesting.

[01:35:30] >> And so we've got a two-tier pricing

[01:35:31] structure, maybe we've got something

[01:35:32] that looks like this up to some context.

[01:35:34] >> That's fascinating.
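
A sketch of the cost curve being drawn on the board: compute time is flat in context length, memory time starts at a weight-fetch floor and grows with the KV cache, and the step time is the larger of the two. Every constant here is a hypothetical value chosen so that the crossover falls near 200k tokens; none is a measured or disclosed figure.

```python
N_ACTIVE = 100e9             # assumed active parameters
FLOPS = 1e15                 # assumed accelerator throughput, FLOP/s
MEM_BW = FLOPS / 300         # assumed memory bandwidth, using a ~300:1 ratio
KV_BYTES_PER_TOKEN = 1667    # hypothetical KV-cache bytes per context token
WEIGHT_TIME_PER_SEQ = 2e-5   # hypothetical weight-fetch time amortized over the batch


def compute_time(len_context):
    # Matmul time per sequence; treated as independent of context length.
    return N_ACTIVE / FLOPS


def memory_time(len_context):
    # Weight fetch floor plus KV-cache reads that grow linearly with context.
    return WEIGHT_TIME_PER_SEQ + len_context * KV_BYTES_PER_TOKEN / MEM_BW


def step_time(len_context):
    # Whichever resource is the bottleneck sets the cost of the step.
    return max(compute_time(len_context), memory_time(len_context))


for n in (50_000, 100_000, 200_000, 400_000):
    print(f"{n:>7} tokens: {step_time(n) * 1e6:.0f} us per decode step")
```

Below the crossover the provider's cost is flat, so a single price covers it; above it the cost rises with context, which is one way to motivate a second, higher price tier.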

[01:35:35] >> So I think it says something about um

[01:35:39] given that the bump is at 200k, it

[01:35:41] probably means that this is somewhat

[01:35:42] aligned with this crossover point. Maybe

[01:35:44] not exactly aligned with

[01:35:46] >> fascinating. Um so we can actually

[01:35:48] probably even complete that calculation

[01:35:50] just to see where it lands out. Um we

[01:35:54] can solve for the number of bytes per

[01:35:55] token. Um if if if we sort of make some

[01:35:58] assumptions about the number of active

[01:36:00] parameters. So solving for the number of

[01:36:02] bytes per token. Um we're going to

[01:36:04] assume like the the point where we

[01:36:06] equalize um the time of memory and the

[01:36:08] time of compute is at let's say 200k

[01:36:10] tokens. Um so we equalize these two. Um

[01:36:14] we're also going to just uh assume that

[01:36:16] the batch size is large enough that the

[01:36:18] um the memory time spent on weights is

[01:36:21] is negligible. So we'll forget about

[01:36:23] this and we'll focus on the actual

[01:36:25] memory time spent on KV cache. So

[01:36:29] that ends up saying, copying this term

[01:36:31] over: batch times len context times

[01:36:36] bytes per

[01:36:38] token,

[01:36:40] over mem bandwidth,

[01:36:44] is going to be equal to the number of

[01:36:47] activated params

[01:36:49] over FLOPs.

[01:36:53] And then we're going to solve for bytes

[01:36:55] per token. Um

[01:37:18] size was missing here.

[01:37:20] Shows up here and then it cancels out by

[01:37:22] the time we get to here.

[01:37:28] and uh and I I dropped the len context

[01:37:35] >> so we can plug in numbers. This number

[01:37:37] this is this is this well is the

[01:37:38] reciprocal of the number that we saw

[01:37:39] before. It's yeah this is like one over

[01:37:41] 300 um which is reasonably stable across

[01:37:44] many um different hardware platforms.

[01:37:47] We conjecturally said that maybe number

[01:37:49] of activated params is like 100 billion

[01:37:54] and length of the context we said was

[01:37:56] 200k. Um

[01:37:59] something is wrong here. The length of

[01:38:01] the context should be on the denominator

[01:38:02] not the numerator. Um

[01:38:19] Um 1667, so almost 2

[01:38:23] kilobytes. That is plausible

[01:38:25] actually. Um so we said around 2

[01:38:28] kilobytes. Um
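
The estimate just read off the board follows from setting KV-cache memory time equal to compute time and solving for bytes per token, with batch size cancelling. A minimal sketch, using the form of the equation as stated in the talk (no factor of 2 on the compute side); the function name is hypothetical and the inputs are the conjectured values from the conversation.

```python
def kv_bytes_per_token(mem_bw_over_flops, n_active, len_context):
    """Solve  len_context * bytes / mem_bw == n_active / flops  for bytes.

    Batch size appears on both sides of the original equation and cancels.
    """
    return mem_bw_over_flops * n_active / len_context


estimate = kv_bytes_per_token(
    mem_bw_over_flops=1 / 300,  # bandwidth-to-FLOPs ratio quoted as roughly stable
    n_active=100e9,             # conjectured active parameters
    len_context=200e3,          # pricing tier boundary, assumed to be the crossover
)
print(f"about {estimate:.0f} bytes of KV cache per token")
```

Each input is uncertain by a factor of a few, so the result is best read as "on the order of kilobytes" rather than as a precise figure.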

[01:38:32] so um so let's just do a a sanity check

[01:38:37] for this um for what this could be. Um

[01:38:38] there are two mechanisms that people do

[01:38:40] uh attention with a small number of

[01:38:42] bytes per token. Um one is uh dense

[01:38:46] attention with a lot of reuse across

[01:38:48] layers. Um so Character.AI has a blog

[01:38:51] post talking about that alternating long

[01:38:52] and short context. Um and like in the

[01:38:56] Character.AI kind of model which also

[01:38:58] showed up in the Gemma models the global

[01:39:01] context which is really what we're

[01:39:02] talking about here global context um was

[01:39:04] shared across all the layers. And so to

[01:39:06] get this 2 kilobytes, you could get that

[01:39:08] for example as um a d head of 128 um is

[01:39:13] is typical. Um and then like the number

[01:39:17] of bytes is typically

[01:39:20] um number of attention layers um uh

[01:39:26] times

[01:39:29] 2 * d head uh times

[01:39:34] uh number of uh q heads.

[01:39:39] So um this is the number of unique

[01:39:42] contexts per layer. Do you do you share

[01:39:44] the the context across many layers or do

[01:39:46] you use it only once? Um uh so in

[01:39:49] Character.AI-like models this number

[01:39:52] is one. Um we said this is 128. Um

[01:39:58] and uh this is a choice which typically

[01:40:02] ranges from one uh sorry this is KV

[01:40:04] heads I meant. Um

[01:40:06] >> So there is a Q head and a KV

[01:40:08] head. Is that

[01:40:08] >> the KV heads are the heads that are

[01:40:11] stored in memory like store the contents

[01:40:13] of the previous tokens. The Q heads are

[01:40:15] the um the retrieval heads. They're

[01:40:18] they're only used temporarily and

[01:40:19] they're they're used by the attending

[01:40:21] token. So um in this autoregressive

[01:40:24] context, I've got KV heads associated

[01:40:26] with all of the context and then Q heads

[01:40:28] associated with this new token here. But

[01:40:30] but but this head the 128.

[01:40:32] >> Oh uh this is um this this number is

[01:40:36] actually the same for oh so this d head

[01:40:38] is the dimension of the vector.

[01:40:39] >> Ah yeah uh and number of kv heads is

[01:40:42] typically in the range of 1 to 8.

[01:40:44] >> So um like it is totally plausible to

[01:40:48] get this by for example having eight KV

[01:40:50] heads and and a d head of 128. That

[01:40:53] gives you exactly this number

[01:40:54] >> Or you could have like fewer KV heads

[01:40:57] but more layers. Interesting.
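
The sanity check above multiplies out as follows. This sketch assumes one byte per stored element, which is what makes the quoted configuration come out to about 2 kilobytes; the function and parameter names are hypothetical.

```python
def kv_cache_bytes_per_token(unique_kv_layers, d_head, n_kv_heads, bytes_per_element=1):
    """Bytes of KV cache stored per context token under dense attention.

    unique_kv_layers: layers holding their own global-context KV (1 if one
                      global KV cache is shared across all layers)
    the factor of 2:  one key vector and one value vector per head
    bytes_per_element: 1 is an assumption (8-bit storage), not stated in the talk
    """
    return unique_kv_layers * 2 * d_head * n_kv_heads * bytes_per_element


shared_global = kv_cache_bytes_per_token(unique_kv_layers=1, d_head=128, n_kv_heads=8)
more_layers = kv_cache_bytes_per_token(unique_kv_layers=8, d_head=128, n_kv_heads=1)

print("shared global KV, 8 KV heads:", shared_global, "bytes per token")
print("8 unique layers, 1 KV head:  ", more_layers, "bytes per token")
```

Both configurations give the same footprint, matching the remark that heads can be traded for layers at a fixed bytes-per-token budget.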

[01:40:58] >> Yeah. Um, so this is one way to get

[01:41:00] there via dense attention. There's also

[01:41:02] a way to get there via sparse attention

[01:41:03] where you um increase all of these

[01:41:05] numbers, but then you have like a one

[01:41:07] over sparsity term.

[01:41:11] >> So yeah, I mean I I think this number is

[01:41:13] plausible if if maybe a little bit

[01:41:14] small.

[01:41:15] >> It's funny that they would leak so much

[01:41:16] information through their API pricing.

[01:41:18] >> I mean you are incentivized to price

[01:41:21] close to your costs because otherwise

[01:41:22] someone could script you.

[01:41:24] >> Maybe we can learn something about the

[01:41:25] difference in input versus output

[01:41:27] prices. Yeah.

[01:41:28] >> And what that tells us about decode

[01:41:30] versus prefill in these models. Um,

[01:41:34] and I think last I checked it's like 50%

[01:41:36] more expensive or something like that or

[01:41:38] >> I I don't remember. What I've seen in

[01:41:40] the past is like three or five times

[01:41:41] more expensive. Let's say it's five times

[01:41:43] more expensive. Okay. This is the

[01:41:45] compute to process the next token in

[01:41:49] decode. Suppose you're doing prefill.

[01:41:52] You're not just processing the most

[01:41:54] recent token. You're processing all the

[01:41:55] tokens in parallel.

[01:41:57] So I want to say

[01:42:00] that it would be this times len

[01:42:05] um len prefill length of a pass in

[01:42:08] general.

[01:42:08] >> Yeah. If we say like if we can think of

[01:42:11] decode as being a pass with one token and then

[01:42:13] prefill being a pass with many.

[01:42:14] >> Okay. Yeah. Yeah. Um so maybe like

[01:42:16] prefix. Sure. Whatever.

[01:42:19] >> Um okay. Memory. So you're not storing

[01:42:23] the KV cache if you're for the tokens

[01:42:25] that are the prefill tokens.

[01:42:26] I think maybe maybe sort of let's draw

[01:42:29] actually how prefill shows up here. Um

[01:42:31] uh if I may clarify uh so we do a bit of

[01:42:35] decode like this. Um

[01:42:37] >> we may actually come back and do more

[01:42:39] prefill like like if you think this is a

[01:42:41] chat session the user says something the

[01:42:44] AI generates the response and then the

[01:42:45] user says something else and we prefill

[01:42:47] this. So like maybe this is the more

[01:42:49] common like this is the general case

[01:42:50] rather than this. In fact, this is like

[01:42:52] you read a file or something.

[01:42:54] >> Read a file or just like the AI is

[01:42:56] responding to a user input or tool call

[01:42:58] or anything that's not generated.

[01:43:00] >> Yep. Okay. Okay. So, suppose we're here.

[01:43:04] So,

[01:43:06] you will need to load

[01:43:08] basically

[01:43:10] the you will have calculated all of this

[01:43:12] previously.

[01:43:13] >> Mhm.

[01:43:14] >> So, just the KV of everything that came

[01:43:15] before.

[01:43:19] But what is the memory cost of this?

[01:43:22] Well,

[01:43:26] memory bandwidth cost of this if you're

[01:43:28] doing flash attention, it would Yeah,

[01:43:31] it's it's basically temporary. It it it

[01:43:33] doesn't even go to main memory. Just

[01:43:35] ignore it.

[01:43:35] >> Okay. So, then it would just be

[01:43:36] everything that came before. So, is it

[01:43:40] not just that then?

[01:43:41] >> Yeah, there's actually no adjustment at

[01:43:42] all to the memory time.

[01:43:43] >> Great. Oh, so it's a very trivial

[01:43:45] change. Yeah. To accommodate. So

[01:43:49] this term is making it 5x more

[01:43:51] expensive. Now why would that be? Or

[01:43:53] what does that tell us about

[01:43:56] what what are we trying to learn here?

[01:43:57] What does that actually tell us? What

[01:43:58] what variable does this help us clamp?

[01:44:00] Um

[01:44:03] well the compute has presumably gotten

[01:44:04] five like the only thing that could have

[01:44:05] changed is the compute, 5x more expensive as

[01:44:07] a result. So so yeah this is the time

[01:44:10] for one pass but actually the amount of

[01:44:12] tokens is that that much larger. So I

[01:44:14] guess we want the cost per token in fact

[01:44:17] or the time per token.

[01:44:19] >> So I'm not sure I understood the this is

[01:44:24] for processing the next token in

[01:44:26] prefix.

[01:44:27] >> Uh well actually for processing the

[01:44:28] entire batch um so in this like at this

[01:44:31] cost we have processed this many tokens

[01:44:34] like let it prefilled.

[01:44:35] >> Yeah. Um or I guess pre Yeah. Like the

[01:44:38] of the pass like yeah not not this

[01:44:41] prefix but it's this cost.

[01:44:42] >> Okay. So let's just change as a pass

[01:44:50] we can. So this is 5x more expensive.

[01:44:52] >> Um input is 5x more expensive.

[01:44:54] >> No output is more expensive.

[01:44:55] >> Output is 5x more expensive.

[01:44:58] >> So the the result we want to work

[01:45:00] towards is that prefill is compute

[01:45:03] limited and decode is um memory

[01:45:06] bandwidth limited.

[01:45:08] >> Why don't we do this? Why don't we have

[01:45:09] Why don't we just chart it with like len

[01:45:11] pass on the x-axis?

[01:45:13] >> Yep.

[01:45:14] >> T on the y-axis.

[01:45:16] >> T we want the cost per token. So it'll

[01:45:19] be T over some stuff. T over length of

[01:45:22] the pass.

[01:45:24] >> Mhm.

[01:45:27] Yeah, that'll be right.

[01:45:32] Okay. So

[01:45:46] okay

[01:45:49] confused about this len pass is the

[01:45:53] it seems like this should be higher when

[01:45:54] you're doing prefill.

[01:45:56] >> Prefill has a bigger length pass. Yeah.

[01:45:58] Right.

[01:45:59] >> But then why is it cheaper?

[01:46:01] >> Why is it cost higher? Yeah. Yeah. Um,

[01:46:03] so I mean we're gonna it's this division

[01:46:06] by length pass that that actually makes

[01:46:08] it all uh so okay this is going to

[01:46:12] divide out. This is going to divide out

[01:46:13] but then we're going to get a divi all

[01:46:15] of this is going to divide by length of

[01:46:16] pass and it's going to make the memory

[01:46:18] cost cheaper.

[01:46:19] >> Okay. Yeah, let me let me think about

[01:46:21] this then. Okay. So let's do one line

[01:46:23] for

[01:46:24] basically we'll have four different

[01:46:25] lines. Um let's do the

[01:46:30] let's do prefill first. And so

[01:46:34] actually let's let's do decode first.

[01:46:37] >> Oh. Oh. So actually I will length length

[01:46:40] of the pass when it's one that is

[01:46:42] decode. When it is bigger that is

[01:46:43] prefill.

[01:46:44] >> Okay. I see. I see. I see. That makes

[01:46:46] sense. Okay. Getting back to it. So t

[01:46:48] compute if you have um basically just

[01:46:51] this divided by length pass is just this

[01:46:53] amount. So this actually does not vary

[01:46:56] based on

[01:46:57] >> t. So it'll just be some flat value

[01:47:00] like this.

[01:47:02] Um and this is

[01:47:05] t compute

[01:47:09] and then th this is like

[01:47:12] uh this is

[01:47:12] >> that's decode, right? Um now for t mem

[01:47:16] we have this whole thing divided by len

[01:47:18] pass. Well, it doesn't really matter

[01:47:20] what's up there. It'll just be something

[01:47:21] that looks like this.

[01:47:25] Right. Yeah. Say this is T

[01:47:28] mem.

[01:47:31] This is decode again.

[01:47:33] So

[01:47:35] as

[01:47:37] the length of the prefix goes up or pass

[01:47:42] your memory bandwidth time declines.

[01:47:48] And that means that to the extent that

[01:47:50] you were bottlenecked on memory

[01:47:52] bandwidth before you can avoid being

[01:47:54] bottlenecked on memory bandwidth. The

[01:47:57] fact that they are charging 5x less for

[01:48:02] prefill than decode does suggest that

[01:48:04] they are bottlenecked on memory

[01:48:05] bandwidth to quite a degree such that

[01:48:08] for them at least because t is

[01:48:10] equivalent to cost right it's the cost

[01:48:11] of renting a computer. This is actually

[01:48:15] like this this would be at one and this

[01:48:17] would be at five. That's right. That's

[01:48:18] right. Yeah. So it it is in fact

[01:48:21] tremendously memory bandwidth bottlenecked.

[01:48:23] The real graph looks something like the

[01:48:26] real graph looks something like

[01:48:29] like that.

[01:48:30] >> Yeah. I mean it still crosses but

[01:48:32] >> yeah exactly. So yeah let me do it this

[01:48:33] way.

[01:48:35] >> Yeah that's right.

[01:48:37] Um

[01:48:41] and then the this is the gap on decode

[01:48:46] between the memory and the compute time.
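
The chart sketched on the blackboard here can be reproduced numerically. What follows is a rough sketch, not the speakers' exact formulas: it assumes compute time per pass grows in proportion to the number of tokens in the pass (so it is flat per token), while memory time per pass is roughly fixed (weights plus the earlier KV cache are loaded once) and is therefore shared across the tokens in the pass. The function name and the unit values 1.0 and 5.0 are hypothetical, chosen only to mirror the 5x price ratio discussed above.

```python
def time_per_token(len_pass, t_compute_per_token, t_mem_per_pass):
    """Roofline-style time per token for one forward pass over len_pass tokens."""
    # Compute per pass scales with len_pass, so per token it stays flat.
    t_compute = t_compute_per_token
    # Memory traffic is paid once per pass and amortized over its tokens.
    t_mem = t_mem_per_pass / len_pass
    # Whichever resource takes longer sets the time (and so the cost).
    return max(t_compute, t_mem)


# Hypothetical units: compute per token = 1, memory per pass = 5.
for n in (1, 2, 5, 16, 256):
    print(n, time_per_token(n, 1.0, 5.0))
```

With these made-up units, a pass of length 1 (decode) is memory-bound and costs five times as much per token as a long pass (prefill), which is compute-bound; the two curves cross at a pass length of 5.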

[01:48:50] >> Yeah. Yeah.

[01:48:50] >> Okay. Interesting. Another interesting

[01:48:52] one would be why cache hits are so much

[01:48:55] cheaper.

[01:48:56] >> Yeah. Okay.

[01:48:57] >> So I think if I remember correctly,

[01:48:59] cache hits are like 10x. It's more

[01:49:01] expensive to write to cache according to

[01:49:04] the pricing on all these models, but if

[01:49:06] you do hit a cache, it's 10x cheaper.

[01:49:10] So, what is going on with

[01:49:15] presumably this is the cost of keeping

[01:49:18] something in HBM rather than just

[01:49:21] evacuating it. But if you do keep it in

[01:49:23] HBM, then it's cheaper to load again,

[01:49:25] >> right? So there's two ways you can

[01:49:27] produce um tokens uh or the the KV cache

[01:49:30] for a token. Um you can just produce it

[01:49:32] from scratch by computing it from the

[01:49:34] underlying like token ids which are

[01:49:36] tiny. Um

[01:49:39] or you can um previously have produced

[01:49:42] it and stored it in memory somewhere.

[01:49:44] >> Uh so the cost ratio is really talking

[01:49:46] about the ratio between those two

[01:49:48] mechanisms of producing it. A cache miss

[01:49:50] means you've deleted it from all your

[01:49:52] memories and you have to recompute

[01:49:53] it from the tokens directly.

[01:49:55] >> In fact, you can maybe even take that a

[01:49:57] step further and think about which

[01:49:59] memory tier do you store it in. So you

[01:50:01] could store it in HBM. Um there are

[01:50:04] other slower and cheaper memories than

[01:50:05] HBM like DDR on your host or flash um as

[01:50:10] well. And so one of the things you can

[01:50:12] do is a is a calculation of um where it

[01:50:16] makes sense to be in each memory tier.

[01:50:18] Um and this is related to um how long

[01:50:22] you're going to store for. So so we want

[01:50:24] to look at the cost of storage in in a

[01:50:26] few different memory tiers and also the

[01:50:28] cost of rematerialization. So um uh

[01:50:31] remat means the cost to rematerial like

[01:50:35] rebuild all of the KV cache from scratch

[01:50:38] having it after you deleted it. So we

[01:50:39] rematerialize it. Um and so basically

[01:50:43] this is going to cost the uh length of

[01:50:45] the context. Um

[01:50:48] uh actually we'll look at uh cost per

[01:50:50] token so that we don't need to carry

[01:50:52] around this length of context

[01:50:53] everywhere. So to rematerialize one

[01:50:56] token of KV cache um I just need to run

[01:51:01] I need to run a forward pass on the

[01:51:03] whole model and um and then so this is

[01:51:07] going to be the compute time. I have to

[01:51:08] rerun the compute um

[01:51:11] at whatever speed my GPU does it and

[01:51:13] then I multiply it by my like GPU

[01:51:17] dollars per second. Um

[01:51:19] >> sorry, extremely question. Why is there

[01:51:21] not a quadratic term?

[01:51:23] >> Yeah. So, uh there is a quadratic term

[01:51:26] um in it shows up in the compute um

[01:51:31] uh

[01:51:35] as an approximation. I chose to remove

[01:51:37] it. um the what that I I'll just show

[01:51:40] you sort of quickly what that looks

[01:51:41] like. It's because so you have the um

[01:51:46] if you look at the cost per token um or

[01:51:48] the number of flops per token. There is

[01:51:51] the flops that are coming from doing the

[01:51:54] weight matrix multiplies as a function

[01:51:56] of context lengths. Um and then there is

[01:51:59] the number of multiplies that comes from

[01:52:01] doing the KV cache. Um which is which

[01:52:03] goes up linearly with the the amount of

[01:52:05] stuff you attend to. um the slope on

[01:52:07] this is so low that like when you when

[01:52:10] you draw it like this it's like it's

[01:52:11] very well approximated by a flat line.

[01:52:13] >> So like it starts to like you start to

[01:52:15] notice the effect of the quadratic or

[01:52:17] the linear term up in the in the

[01:52:19] millions of tokens or so. So just not

[01:52:21] super relevant.

[01:52:22] >> So what is the reason that there's no

[01:52:25] company which has over a million token

[01:52:27] context length

[01:52:28] >> um

[01:52:28] >> if this is true.

[01:52:30] >> Yeah. So there are two costs of long

[01:52:31] context. One is the memory bandwidth

[01:52:33] cost which we've spent a lot of time

[01:52:34] analyzing. That's this thing. Um and

[01:52:37] then the other one is the compute cost.

[01:52:38] The compute cost is almost always um and

[01:52:41] sort of actually forced by um

[01:52:45] fundamental principles uh to be a much

[01:52:48] smaller slope than than the memory

[01:52:50] bandwidth cost. And so the primary thing

[01:52:53] that limits you to have really large

[01:52:55] contexts are memory band memory capacity

[01:52:58] which is exactly this effect like

[01:53:00] >> um and so there's this idea that Dario

[01:53:02] said on the podcast and others have said

[01:53:04] which is we don't need continual

[01:53:06] learning for AGI in context learning is

[01:53:08] enough and if you believe that then you

[01:53:10] have to think that we had to get to 100

[01:53:12] million token 100 million billion

[01:53:14] context length to have an employee that

[01:53:16] is the equivalent to working with you

[01:53:18] for a month. Now maybe that's no longer

[01:53:20] true with sparse attention or something.

[01:53:22] Yeah.

[01:53:22] >> But um

[01:53:25] >> yeah, if you think that then as a some

[01:53:28] ML infra thing would have to change to

[01:53:29] allow for 100 million like the memory

[01:53:32] bandwidth to allow for 100 million

[01:53:34] >> token context lengths.

[01:53:36] >> I mean sparse attention gives you get

[01:53:37] out for sure because you get this um

[01:53:40] square root like you know gives you a

[01:53:42] big improvement. Um

[01:53:46] but I think it's like if you look at the

[01:53:48] history of um context lengths of models

[01:53:51] um

[01:53:54] from like earlier models like GPT-3 maybe

[01:53:57] to GPT-4 I don't remember when the

[01:53:59] transition happened exactly like they

[01:54:00] shot up from like about 8K to 100K 200K

[01:54:04] um and then for the last year or two

[01:54:06] they've all been hovering around there.

[01:54:08] Um I think that actually indicates that

[01:54:11] that that's sort of the reasonably

[01:54:12] balanced cost point and going massively

[01:54:15] beyond that would be cost prohibitive.

[01:54:17] >> Not because of the compute cost because

[01:54:20] >> the memory bandwidth cost. Yeah.

[01:54:22] >> Um so I actually don't see a very good

[01:54:27] path to solving that. Like the memory

[01:54:31] the HBM is at where it is.

[01:54:33] Uh it's not getting hugely better. And

[01:54:35] and why doesn't sparse attention solve

[01:54:37] that?

[01:54:38] >> The sparse attention is a big

[01:54:39] improvement. Um uh maybe that is priced

[01:54:42] in already perhaps. Um uh it's not an

[01:54:44] infinite improvement because if you go

[01:54:46] too sparse, you lose too much quality.

[01:54:48] >> Yeah.

[01:54:48] >> But yeah, I mean the empirical result is

[01:54:50] that uh the context lengths haven't been

[01:54:52] increasing that much. Um uh and I think

[01:54:54] it's because there is no solution to the

[01:54:57] memory wall. Yeah. Interesting.

[01:54:59] >> Like so going too sparse just means like

[01:55:01] you're attending to a very small subset

[01:55:03] of the tokens and the quality will get

[01:55:04] worse. Yeah. So what is the cost of of

[01:55:07] of these different ways of producing um

[01:55:10] uh resynthesizing the KV cache? Computing

[01:55:13] it from scratch is based on my GPU time.

[01:55:15] I have to do a certain amount of

[01:55:17] multiplies in order to um uh of GPU time

[01:55:22] that I spend in order to produce it. Um

[01:55:25] storing in HBM.

[01:55:30] Um,

[01:55:33] this really goes as my um I think I had

[01:55:35] a number here which was the bytes per

[01:55:37] token.

[01:55:39] Um, so I need to I need to have some

[01:55:42] number of bytes per token

[01:55:44] and then I need to store this in the uh

[01:55:46] HBM. So it's going to use up some of my

[01:55:48] HBM capacity.

[01:55:50] Uh so a way to think of this is that

[01:55:53] like if I have too many of these things

[01:55:55] sitting in HBM like if I fill up my HBM

[01:55:58] with just KV caches that I'm not using I

[01:56:00] can't use that GPU and so how do I price

[01:56:03] that? Maybe I say that the cost of it is

[01:56:05] proportional to the fraction of the HBM

[01:56:06] I'm using. So so there's also times GPU

[01:56:09] dollars. Um

[01:56:11] uh

[01:56:13] um and then let's just do one more

[01:56:15] memory tier and say something like uh

[01:56:16] DDR um store in DDR instead. Um

[01:56:22] uh

[01:56:23] the same kind of thing goes up for flash

[01:56:25] and and for DDR. Um I put these in the

[01:56:28] wrong columns actually. Um I meant to

[01:56:30] make two columns.

[01:56:32] The the distinction I want to make is

[01:56:34] that there is the time to cost to

[01:56:36] retrieve

[01:56:41] And then there's a uh cost cost to store

[01:56:45] um cost to hold hold on.

[01:56:47] >> Um um and so this is like there's a cost

[01:56:50] per second whereas this is like a

[01:56:52] instantaneous cost. Um so

[01:56:54] rematerialization has a cost to retrieve

[01:56:57] and has zero cost to store it because

[01:56:59] we've deleted it. Um

[01:57:02] this is the one that I put in the wrong

[01:57:03] location. This is this is actually the

[01:57:05] cost to to hold on. So I will rewrite

[01:57:07] it.

[01:57:24] Okay. Um so we have this is the uh like

[01:57:27] if we're just storing it in HBM, it has

[01:57:29] this sort of cost profile. Um

[01:57:32] uh and then if we store in DDR um it's

[01:57:36] actually going to take some time. So

[01:57:37] it's like we get the same thing here.

[01:57:38] Bytes

[01:57:42] per token over DDR capacity

[01:57:47] times DDR

[01:57:49] cost um

[01:57:53] a second. Um but but now this has a um a

[01:57:56] cost to retrieve that is is higher than

[01:57:58] the HBM because we need to copy it into

[01:57:59] the HBM. And so this is um bytes per

[01:58:03] token

[01:58:06] uh over DDR bandwidth um uh bandwidth

[01:58:11] uh and then this consumes some amount of

[01:58:13] the DDR as well

[01:58:14] >> and every scale up has DDR and flash.

[01:58:18] >> This is really a deployment question and

[01:58:19] so you can choose that. Um Nvidia does

[01:58:22] deploy in this form. Uh it has it has

[01:58:23] both.
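
The two-column table being built on the blackboard (a one-off cost to retrieve, and a per-second cost to hold) can be written out as a small model. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, every number in the usage note is made up for illustration, and pricing the DDR copy as transfer time multiplied by a DDR dollars-per-second rate is an assumption, since the conversation only gives the transfer time.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    retrieve_cost: float   # dollars per token, paid once when the cache is needed again
    hold_cost_rate: float  # dollars per token per second while the cache is kept

    def total(self, hold_seconds):
        return self.retrieve_cost + self.hold_cost_rate * hold_seconds


def build_tiers(bytes_per_token, t_compute_per_token, gpu_dollars_per_s,
                hbm_capacity, ddr_capacity, ddr_bandwidth, ddr_dollars_per_s):
    return [
        # Rematerialize: rerun the forward pass; nothing is stored in the meantime.
        Tier("remat", t_compute_per_token * gpu_dollars_per_s, 0.0),
        # HBM: already where inference needs it; pay for the fraction of HBM occupied.
        Tier("hbm", 0.0, bytes_per_token / hbm_capacity * gpu_dollars_per_s),
        # DDR: cheaper to hold, but must be copied into HBM before use (assumed pricing).
        Tier("ddr",
             bytes_per_token / ddr_bandwidth * ddr_dollars_per_s,
             bytes_per_token / ddr_capacity * ddr_dollars_per_s),
    ]


def cheapest(tiers, hold_seconds):
    """Name of the option with the lowest total cost for a given hold time."""
    return min(tiers, key=lambda t: t.total(hold_seconds)).name
```

For example, with invented values `build_tiers(1e5, 1e-5, 1e-3, 1e11, 1e12, 1e11, 1e-5)`, the cheapest option moves from `"hbm"` for a millisecond hold, to `"ddr"` for a one-minute hold, to `"remat"` for a hold of about a million seconds. The pattern, rather than the numbers, is the point: options with a higher retrieve cost should have a lower hold cost.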

[01:58:24] >> Why isn't the cost to retrieve

[01:58:25] HBM the memory bandwidth or the bytes

[01:58:28] divided by memory bandwidth? Yeah, I

[01:58:30] mean it depends what what you define a

[01:58:32] retrieve to be. Here I'm defining

[01:58:33] retrieve to be um uh move it into HBM so

[01:58:36] that you can start actually doing

[01:58:37] inference on it and so like sort of by

[01:58:39] definition

[01:58:40] >> and because if it's already in HBM you

[01:58:41] can be doing compute while you're

[01:58:43] getting it from HBM, for example.

[01:58:45] >> Yeah. Um so so these are three things

[01:58:47] and I I guess I ordered them wrong. Um,

[01:58:50] in general, if you if you're balancing

[01:58:51] two costs and you've got different

[01:58:52] memory uh different tiers in the memory

[01:58:54] hierarchy, you should expect as as this

[01:58:57] cost goes up, this cost should go down.

[01:59:00] Um, so you can kind of see where the

[01:59:03] zeros are and um like I should have

[01:59:06] ordered them. This one first, this one

[01:59:09] second, and this one third. So if you're

[01:59:12] going to hold on to it for for a very

[01:59:14] short amount of time,

[01:59:16] >> then the um all of this is like

[01:59:18] multiplied by the um hold time.

[01:59:21] >> Yeah.

[01:59:24] >> This one is and so is this one. Um

[01:59:29] and interestingly they have different

[01:59:31] prices to write for and is you specify

[01:59:33] this in the API for 5 minutes versus an

[01:59:36] hour.

[01:59:37] >> Yeah. which which suggests that the 5

[01:59:39] minutes is HBM and the hour is DDR.

[01:59:41] >> I think that's like I think that's a

[01:59:43] pretty good assumption. It could if you

[01:59:45] look at the numbers it might also turn

[01:59:46] out that it's one tier down and it's DDR

[01:59:48] versus flash is

[01:59:49] >> Yeah. Okay. Interesting. And the price

[01:59:51] difference I think was I'll look it up.

[01:59:54] Okay. So the um base

[01:59:59] uh base input tokens is five per million

[02:00:03] tokens, basically, which means remat.

[02:00:05] >> Yeah, that's five. Um,

[02:00:06] >> is this five $5

[02:00:08] >> to like

[02:00:10] retrieve quote unquote and then the um

[02:00:14] to write to um

[02:00:19] uh

[02:00:21] presumably HBM write for 5 minutes is

[02:00:24] 6.25.

[02:00:26] >> So actually we might actually be able to

[02:00:27] determine the um which memory tier it is by

[02:00:31] um by the durations. Actually, the

[02:00:33] duration probably tells it to actually

[02:00:35] 5 minutes versus 1 hour.

[02:00:37] >> Yeah, exactly. I think this will

[02:00:39] probably end up being um it's going to

[02:00:42] be the drain time of the memory uh tier

[02:00:44] that you're in. And so what that means

[02:00:46] is like uh like given that I'm I know

[02:00:50] I'm going to be holding something for 5

[02:00:51] minutes. I would like to

[02:00:54] pick a memory that I can

[02:00:56] read every 5 minutes like I can read the

[02:00:58] whole memory once per 5 minutes

[02:01:00] ballpark. Um so that is the drain time

[02:01:02] of the memory. So if I take the the like

[02:01:04] call or the storage storage capacity

[02:01:07] over storage bandwidth

[02:01:09] bandwidth um I would like this to be

[02:01:12] like equal to 5 minutes or something

[02:01:13] like that.

[02:01:14] >> Um and so actually we did this

[02:01:16] calculation for HBM. For HBM we know

[02:01:18] that this number is 20 milliseconds. Um

[02:01:21] so HBM is much too short like much too

[02:01:24] small. Um DDR

[02:01:27] could be about an order of magnitude or

[02:01:29] or two off from this. And so this is

[02:01:31] probably in the order of like actually I

[02:01:33] think it might even be in the in the

[02:01:34] seconds like 1 to 10 seconds. Um and

[02:01:37] then

[02:01:39] this is really I don't have these

[02:01:41] numbers memorized but generally as you

[02:01:42] go to slower tiers uh flash is plausibly

[02:01:45] in the order of 1 minute. Um and then

[02:01:47] like spinning disc uh which is massively

[02:01:50] different I think is on the order of 1

[02:01:51] hour. So this might actually identify

[02:01:54] that the tiers are probably flash and

[02:01:56] spinning disc. Sorry, why why is this

[02:01:58] the calculations the storage cap divided

[02:02:00] by the bandwidth?

[02:02:01] >> So, um you you've got a bunch of

[02:02:02] different memory tiers like we've listed

[02:02:04] four of them. Um

[02:02:06] >> uh the your choice like your choice of

[02:02:09] which memory tier is a like you want to

[02:02:11] minimize the cost.

[02:02:12] >> Um

[02:02:13] >> and so you are like what fraction of the

[02:02:16] device are you using? You're using some

[02:02:17] fraction of the device for the holding

[02:02:20] onto it and then you're using some

[02:02:21] fraction of the device to retrieve it.

[02:02:24] Um, and so let's say I'm using like 10%

[02:02:30] of the device. Um, and and I want to

[02:02:31] equalize those two fractions. Uh, that

[02:02:33] that's a sign that I've hit the right um

[02:02:35] the right thing. So let's say I've got

[02:02:37] some runtime here. Like I I'm going to

[02:02:39] hold on for all of this time. Um, uh,

[02:02:42] and then so this is the time hold uh,

[02:02:47] and then there's going to be some amount

[02:02:49] of time here which is time to retrieve.

[02:02:50] >> Mhm. Uh

[02:02:53] and I want I mean basically to equalize

[02:02:56] the costs these two costs. Um I want the

[02:02:59] retrieval time to be equal to the hold

[02:03:02] time

[02:03:04] uh

[02:03:06] times the like fraction of capacity.

[02:03:10] >> Mhm.

[02:03:13] >> Um because like this is the the

[02:03:15] retrieval time. Uh yeah I mean this is

[02:03:18] >> this is how many other things I can hold

[02:03:19] simultaneously. Basically just like,

[02:03:20] hey, you want to you want to store

[02:03:23] things in there for so long such that

[02:03:27] the amount of time it's in there is kind

[02:03:29] of the time to get all your things in

[02:03:31] there and out.

[02:03:32] >> Yeah, basically it makes sense.
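
The drain-time rule of thumb (storage capacity divided by storage bandwidth should be roughly the time you plan to hold the data) can be sketched as below. The capacities and bandwidths are hypothetical values picked only so that the resulting drain times land near the ballparks quoted in the conversation (about 20 ms for HBM, seconds for DDR, about a minute for flash, about an hour for spinning disk); they are not specifications of any real device. Choosing the tier whose drain time is nearest in log space is also an assumption made for this illustration.

```python
import math

# Hypothetical (capacity in bytes, bandwidth in bytes/s) per tier.
TIERS = {
    "hbm": (160e9, 8e12),
    "ddr": (1e12, 200e9),
    "flash": (6e12, 100e9),
    "disk": (900e9, 250e6),
}


def drain_time(capacity_bytes, bandwidth_bytes_per_s):
    """Seconds needed to read the whole memory once."""
    return capacity_bytes / bandwidth_bytes_per_s


def best_tier(hold_seconds, tiers=TIERS):
    """Tier whose drain time is closest (in log space) to the intended hold time."""
    def mismatch(name):
        capacity, bandwidth = tiers[name]
        return abs(math.log(drain_time(capacity, bandwidth) / hold_seconds))
    return min(tiers, key=mismatch)


for label, seconds in (("5 minutes", 300), ("1 hour", 3600)):
    print(label, "->", best_tier(seconds))
```

Under these invented numbers a 5-minute hold maps to flash and a 1-hour hold maps to disk, which is the same reading of the two cache durations that the speakers arrive at.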

[02:03:33] >> I I think that probably indicates that

[02:03:35] this is the two tiers of flash and and

[02:03:37] spinning disc. I'm kind of shocked to

[02:03:39] see spinning disc being used at all

[02:03:40] because it's such an old technology.

[02:03:43] Yeah.

[02:03:43] >> I mean, it's also crazy that it's so

[02:03:45] slow that it takes an hour to load its

[02:03:47] full capacity into it and then

[02:03:48] >> like it's a really unattractive

[02:03:49] technology, but it's useful in some

[02:03:51] places.

[02:03:51] >> Yeah. So, we're sitting down because I

[02:03:53] want to ask you some questions that uh I

[02:03:54] guess don't need a blackboard. Um, you

[02:03:57] have this extremely interesting blog

[02:03:58] post where you talk about how at a high

[02:04:01] level the architecture of different

[02:04:03] cryptographic protocols looks a lot like

[02:04:06] neural networks. And there's this

[02:04:08] convergent evolution where they both

[02:04:10] need to jumble information across all

[02:04:13] their inputs. For cryptographic

[02:04:14] protocols, it's to make sure that

[02:04:15] there's like each new input into a hash

[02:04:18] function will totally scramble what

[02:04:19] happens. For neural networks, of course,

[02:04:22] they need to consider information how

[02:04:25] this piece of information changes what

[02:04:26] you should make of this other piece of

[02:04:28] information. That's an extremely

[02:04:30] interesting point. I guess at a high

[02:04:32] level that the difference in what

[02:04:33] they're trying to do in some sense,

[02:04:34] they're trying to do the inverse thing,

[02:04:36] right? which is um cryptographic

[02:04:39] protocols are trying to take information

[02:04:41] which has structure and make it look

[02:04:44] indistinguishable from randomness.

[02:04:45] >> Yeah.

[02:04:46] >> And uh neural networks are trying to

[02:04:47] take things which look like random

[02:04:50] protein sequences, DNA, garbled text, and

[02:04:55] extract higher level structure from it.

[02:04:57] So

[02:04:59] they have similar high-level mechanisms but

[02:05:01] they're actually kind of trying to do

[02:05:02] the opposite things. Um yeah I wonder

[02:05:04] what you make of that.

[02:05:05] >> Yeah. Um, so I mean the like the mixing

[02:05:08] like I tried to look for other examples

[02:05:11] where mixing like scrambling mixing

[02:05:13] shows up as well. There's actually

[02:05:14] almost even like a physical example

[02:05:16] where like you're stirring something,

[02:05:18] you're making a cake and you want to

[02:05:19] stir the batter and like literally the

[02:05:21] idea like first stir it this way and

[02:05:23] then stir it this way is like actually

[02:05:24] not too bad of an approach. Um, but

[02:05:26] beyond that like in back to the digital

[02:05:28] world um

[02:05:31] there are some differences and the one

[02:05:32] you talk uh call out is is a pretty

[02:05:34] strong difference. um the way it shows

[02:05:37] up um like what makes neural nets uh

[02:05:43] like if you just randomly initialize a

[02:05:44] neural network actually maybe it's a

[02:05:46] reasonable cryptograph like cipher as

[02:05:49] well because like the random

[02:05:50] initialization is it going to jumble

[02:05:51] stuff in a complicated way it may even

[02:05:53] like do what you want who knows um uh

[02:05:56] the thing that makes it interpretable is

[02:05:58] the gradient descent so you can

[02:05:59] differentiate a neural network um and

[02:06:01] get a meaningful derivative um and we do

[02:06:05] a lot of work to

[02:06:07] like not over complicate the derivative.

[02:06:09] So the residual connection keeps it like

[02:06:11] contained and simple. Um and the uh and

[02:06:14] so does like the layer norm uh stuff

[02:06:16] that we do. Um

[02:06:18] one of the biggest attacks against uh

[02:06:21] cryptographic ciphers is also to

[02:06:23] differentiate the cipher. Um ciphers run

[02:06:26] in a different number field. They run in

[02:06:28] um uh the field of two elements. So just

[02:06:32] binary. Um whereas neural nets run like

[02:06:34] in theory in the field of real numbers.

[02:06:36] Um uh and so you have to differentiate

[02:06:38] with respect to like binary numbers. Um

[02:06:42] but you can absolutely differentiate a

[02:06:44] cipher and this is called differential

[02:06:47] cryptanalysis. And uh like basically

[02:06:50] what it says is that if you take a small

[02:06:51] difference of the input how like uh it's

[02:06:54] quite difficult to make uh the

[02:06:55] difference of the output be small like

[02:06:57] oh like uh the whole job of a of a

[02:07:00] well-designed cipher is to make the

[02:07:01] difference out very large.
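
The property described here, that a tiny input difference should produce a large output difference, is easy to observe directly. The sketch below uses SHA-256 from Python's standard `hashlib` as a stand-in; it is a hash function rather than a cipher, and it was chosen only because it is readily available, not because it is mentioned in the conversation. The helper names are hypothetical.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def avalanche(message: bytes, bit_index: int = 0) -> int:
    """Flip one input bit and count how many of the 256 digest bits change."""
    flipped = bytearray(message)
    flipped[bit_index // 8] ^= 1 << (bit_index % 8)
    original_digest = hashlib.sha256(message).digest()
    flipped_digest = hashlib.sha256(bytes(flipped)).digest()
    return bit_difference(original_digest, flipped_digest)


print(avalanche(b"a small difference in the input"))
```

For a well-mixed function the count should sit near half of the 256 output bits. A trained neural network is expected to behave the opposite way on most inputs, which is why an adversarial perturbation that flips the output is treated as a failure there.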

[02:07:02] >> Um so I I guess the distinction is that

[02:07:06] the the optimization goals at that point

[02:07:08] are about complexifying. They they don't

[02:07:10] have the same residual connections or um

[02:07:12] or like layer norms that that would

[02:07:14] >> Yeah. I mean, I I guess a place where

[02:07:16] the the two merge is back doors.

[02:07:21] >> Um, okay. So, with a back door LLM,

[02:07:24] you're trying to hide um

[02:07:27] what do you consider an input? It's not

[02:07:28] an input into the forward pass, but it's

[02:07:30] an input into the backward pass, but

[02:07:31] you're trying to hide an input into the

[02:07:32] backward pass.

[02:07:33] >> Like you're like this is like an

[02:07:35] adversarial uh

[02:07:37] Yeah. So, yeah. I mean in fact this is

[02:07:39] like this is actually a place where you

[02:07:41] get exactly the um sort of avalanche

[02:07:44] property that ciphers have as well. Um

[02:07:48] like adversarial attacks on typically

[02:07:51] like image classification models right

[02:07:53] are can I find a perturbation of the

[02:07:55] image that a very very small

[02:07:57] perturbation of the image that totally

[02:07:58] changes the classification totally

[02:07:59] changes the output

[02:08:01] >> that is the common case in ciphers

[02:08:02] whereas it that's the like undesired

[02:08:05] case in in neural nets for sure. Yeah.

[02:08:07] >> Okay. So I was asking you uh has have

[02:08:10] neural networks actually been used for

[02:08:12] cryptography and um we realized it might

[02:08:15] be better to just do this on the

[02:08:16] blackboard.

[02:08:16] >> Yeah.

[02:08:17] >> Um so I'm curious are they actually

[02:08:19] being used for cryptography?

[02:08:20] >> Yeah. So using neural nets for

[02:08:22] cryptography well in general

[02:08:24] cryptography like creating a new cipher

[02:08:26] is a very very dangerous proposition.

[02:08:27] Like uh almost all of them are broken

[02:08:29] like 99% of them are broken. So uh

[02:08:33] probably a bad place to start but the

[02:08:34] other direction has been very like in in

[02:08:37] at least one very clear case quite

[02:08:39] productive. Um so there's this

[02:08:41] construction in so a construction that

[02:08:44] exists in in ciphers and then was

[02:08:46] imported into neural nets um called a

[02:08:48] Feistel cipher, Feistel network. Um so the idea

[02:08:52] is that um you you may have some some

[02:08:54] some function f uh which is not

[02:08:56] invertible. Um

[02:09:00] but you like the function because it

[02:09:01] like does interesting things like it it

[02:09:03] it um it does an MLP for example or or

[02:09:06] it mixes it in an interesting way. Um

[02:09:08] you'd like to build something out of

[02:09:09] this that is invertible. So the

[02:09:11] construction we're going to make is

[02:09:12] going to actually be a two-input function

[02:09:13] rather than a one input function. um

[02:09:18] and we're going to apply uh

[02:09:22] f of x

[02:09:25] we need to actually remember what x was.

[02:09:27] So we're going to stick x over here so

[02:09:29] that we can uh work backwards and then

[02:09:32] we also can't drop y. So we're going to

[02:09:33] remember y and we're going to add them

[02:09:35] together. And so we form this tuple.

[02:09:40] So, um, the the way to invert this, like

[02:09:42] if you think I have this output and I

[02:09:44] want to recover X and Y, well, I can

[02:09:46] easily recover X. It's right there. I

[02:09:48] just read it off. And then to recover Y,

[02:09:50] I like if this thing was called Z, um, I

[02:09:53] can I can recover Y by Z minus F of X

[02:09:58] because I've already recovered X. So, so

[02:10:01] that means that this construction is

[02:10:02] invertible.
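
The construction just described can be written in a few lines. This is a minimal sketch assuming 32-bit integers and modular addition as the combining step, following the "add them together" wording above; ciphers commonly use XOR for this step instead. The function `f` is a hypothetical stand-in that is deliberately not invertible.

```python
MASK = 0xFFFFFFFF  # work modulo 2**32


def f(x):
    """Arbitrary non-invertible mixing function (many inputs share an output)."""
    return ((x * x) >> 7) & MASK


def forward(x, y):
    """Map (x, y) to (x, y + f(x)): keep x so the step can be undone."""
    return x, (y + f(x)) & MASK


def inverse(x, z):
    """Recover (x, y) from (x, z): x is read off directly, y = z - f(x)."""
    return x, (z - f(x)) & MASK


x, y = 123456789, 987654321
assert inverse(*forward(x, y)) == (x, y)
```

The inverse never needs to invert `f`; it only evaluates `f` again on the `x` that was carried along, which is why `f` can be any function at all.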

[02:10:03] Um,

[02:10:05] this was used in ciphers like a ton. Um,

[02:10:08] still is used. It's one of the main uh

[02:10:09] mechanisms of constructing ciphers.

[02:10:11] Often you want ciphers to be invertible,

[02:10:12] especially the layers of ciphers you

[02:10:14] want to be invertible um because that

[02:10:16] has better cryptographic properties.

[02:10:19] This has actually been ported over into

[02:10:22] um

[02:10:24] into neural nets. Um there's a 2017 18

[02:10:28] paper called RevNets, reversible

[02:10:30] networks. Um and what it does is it

[02:10:33] actually makes the entire like you can

[02:10:35] apply it to any network like a

[02:10:36] transformer network. you can make I do a

[02:10:38] forwards pass but then I can actually

[02:10:40] run the entire pass backwards as well.

[02:10:42] Um so the whole neural network is

[02:10:43] invertible

[02:10:45] um with exactly this construction and so

[02:10:48] this paper reversible networks um like

[02:10:50] applied to some layer like a transformer

[02:10:52] layer for example we've got this

[02:10:54] function f which is our transformer

[02:10:56] layer um now normally we would have um

[02:10:59] just an input and then a residual

[02:11:01] connection coming out um and it gets

[02:11:04] added like this um over here. Mhm.

[02:11:07] >> Um but now, uh the variation of this is

[02:11:11] going to be we've got two inputs, X and

[02:11:12] Y. Um so we've got X

[02:11:17] and Y inputs.

[02:11:19] Um X goes through the function gets

[02:11:22] added to Y

[02:11:28] and then this becomes the new X, the

[02:11:31] output X.

[02:11:34] And then this x

[02:11:37] becomes the output y.

[02:11:40] So um really what this is doing this is

[02:11:42] like this is actually sort of doing if

[02:11:44] you think of two layers uh back this is

[02:11:46] actually the thing you mentioned before

[02:11:48] it's actually doing the residual

[02:11:49] connection from two layers back. Um like

[02:11:52] this y came from the previous layer and

[02:11:53] was the residual connection there.

[02:11:55] >> Um but because of this construction it

[02:11:57] the whole thing is invertible.
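The two-input layer described above can be sketched as follows. This is an illustration under assumptions, not the RevNets paper's code: `layer_fn` is a hypothetical stand-in for the transformer layer `f`, and scalar integers stand in for activation tensors so the arithmetic is exact.

```python
def layer_fn(x: int) -> int:
    # Hypothetical stand-in for the transformer layer f.
    return 3 * x + 1


def rev_layer(x: int, y: int) -> tuple[int, int]:
    # x goes through f and is added to y; that sum is the new x.
    # The old x is passed along unchanged as the new y.
    return y + layer_fn(x), x


def rev_layer_inverse(x_out: int, y_out: int) -> tuple[int, int]:
    # The old x is just y_out; the old y is x_out minus f(old x).
    x = y_out
    return x, x_out - layer_fn(x)


x0, y0 = 5, 2
x1, y1 = rev_layer(x0, y0)
x2, y2 = rev_layer(x1, y1)

# The skip term added in the second layer is x0 rather than x1,
# i.e. the residual reaches one layer further back than usual.
assert x2 == x0 + layer_fn(x1)

# Each layer can be undone from its outputs alone.
assert rev_layer_inverse(x2, y2) == (x1, y1)
assert rev_layer_inverse(x1, y1) == (x0, y0)
```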

[02:11:59] >> Why do I care? What does invertible

[02:12:01] matter for? Um the big thing that it can

[02:12:03] be interesting for is for training. Um

[02:12:05] if I think of a forward pass of training,

[02:12:08] um so I will let's say I have four

[02:12:10] layers, I run them in 0, 1, 2, 3 order. Um, I

[02:12:13] have to write all of the um activations

[02:12:16] to HBM

[02:12:18] >> Um, and so I get an HBM footprint, um, here

[02:12:21] that is kind of like linear in

[02:12:26] uh number of layers.

[02:12:27] >> Yep.

[02:12:29] Um, so this actually can be, uh, the

[02:12:32] largest memory footprint during

[02:12:34] training. Um, and so this is normal

[02:12:36] training and then and then I run the

[02:12:37] backwards pass and I read it kind of in

[02:12:39] reverse, like I run them sort of

[02:12:41] forward pass goes forward, backward pass

[02:12:42] goes backwards and I have to read them

[02:12:44] back out. Um, the idea of this RevNets

[02:12:48] paper is that because it's invertible,

[02:12:51] um I don't need to store this at all. I

[02:12:53] can completely rematerialize it when I'm

[02:12:55] running my backwards pass. So I run my

[02:12:56] forwards pass and then when I'm running

[02:12:58] my backwards pass, I'm simultaneously in

[02:13:01] lock step undoing all of the forwards

[02:13:03] pass steps that I did in order to um uh

[02:13:06] to have the activations that I need

[02:13:08] here. So this ends up being a memory

[02:13:09] saving, which is a nice idea.
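The memory trade described above can be sketched as a toy comparison. Everything here is an assumption for illustration: `make_layer`, `LAYERS`, and the three function names are hypothetical, scalars stand in for activation tensors, and no gradients are computed. Integers keep the reconstruction exact; with floating point, the recovered activations would match only up to rounding.

```python
def make_layer(k: int):
    # Hypothetical layer function; each layer gets a different constant.
    return lambda x: (x * x + k) % 97


LAYERS = [make_layer(k) for k in range(4)]  # four layers, run in 0, 1, 2, 3 order


def forward_storing(x, y):
    # Normal training: save every layer's input, so the number of saved
    # activations grows linearly with the number of layers.
    saved = []
    for f in LAYERS:
        saved.append((x, y))
        x, y = y + f(x), x
    return (x, y), saved


def forward_no_store(x, y):
    # Reversible training: keep only the final output.
    for f in LAYERS:
        x, y = y + f(x), x
    return x, y


def backward_rematerialize(x, y):
    # Walk the layers in reverse, undoing each step to recover its input.
    # A real backward pass would compute that layer's gradients here.
    recovered = []
    for f in reversed(LAYERS):
        x, y = y, x - f(y)
        recovered.append((x, y))
    return recovered[::-1]


final, saved = forward_storing(3, 4)
assert forward_no_store(3, 4) == final
assert backward_rematerialize(*final) == saved
```

The rematerializing version evaluates each layer function a second time during the backward walk, which is the extra compute being traded for the smaller memory footprint.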

[02:13:11] >> Interesting. And in some sense,

[02:13:13] you're spending more compute to save

[02:13:15] memory.

[02:13:15] >> That's right. Yeah.

[02:13:16] >> Interesting.

[02:13:17] >> Huh. Actually, it's kind of the opposite

[02:13:19] of what you're doing with the KV cache.

[02:13:20] In the KV cache,

[02:13:22] >> you're spending more memory to save

[02:13:23] compute.

[02:13:24] >> Yeah. Uh, spending more memory to save

[02:13:26] compute is generally profitable given

[02:13:27] where, yeah, hardware is today.

[02:13:29] >> Yeah. Interesting. Cool. Uh, that was

[02:13:31] super fun, Reiner. Thank you so much for

[02:13:33] doing it. I feel like it really

[02:13:35] vindicated the vision behind the studio

[02:13:36] and and the blackboard.

[02:13:38] >> Cool. Thanks so much for doing it.

[02:13:39] >> Thanks.