# Chip design from the bottom up – Reiner Pope

https://www.youtube.com/watch?v=oIk3R-sMX5o
Translation: zh-CN

[00:00] I'm back with Rainer Pope, who is the CEO of Maddock, which is a new AI chip company.
  我和 Rainer Pope 一起回来了，他是 Maddock 的首席执行官，Maddock 是一家新的人工智能芯片公司。

[00:04] Last time we were talking about what happens inside a data center.
  上次我们谈论了数据中心内部发生的事情。

[00:07] Now I want to understand what happens inside an AI chip.
  现在我想了解人工智能芯片内部发生的事情。

[00:09] How does a chip actually work?
  芯片到底是如何工作的？

[00:13] Full disclosure, by the way, I am an angel investor in Maddock.
  顺便说一句，坦白说，我是 Maddock 的天使投资人。

[00:16] Um, so hopefully you have designed a good chip.
  嗯，所以希望你设计了一款不错的芯片。

[00:19] >>> [laughter]
  >>> [笑声]

[00:20] So I'll start with uh, sort of the very smallest fundamental unit of of of chip design and we'll sort of build up into what an overall um, like actual production chip, uh, what are the components of that?
  所以，我将从芯片设计的最小基本单元开始，然后逐步构建一个整体的、实际的生产芯片，它的组成部分是什么？

[00:31] Yeah.
  是的。

[00:31] At the very bottom level of a chip, the primitives that we work with are uh, logic gates, um, which are very simple things like and or not.
  在芯片的最底层，我们使用的基本单元是逻辑门，它们是非常简单的东西，比如与、或、非。

[00:39] Um, and then these are connected together by by wires that have to be laid out uh, physically as metal traces on a chip.
  然后，这些通过导线连接在一起，导线必须在芯片上物理地布线，作为金属痕迹。

[00:45] The main function that that AI chips want to uh, compute is uh, multiplication of matrices and really inside that is the fundamental primitive is multiply accumulator just like of of pairs of numbers.
  人工智能芯片想要计算的主要功能是矩阵乘法，而其内部的基本单元是乘加器，就像对数字对进行运算一样。

[00:59] So we're going to uh, sort of
  所以，我们将要，嗯，某种程度上

[01:00] demonstrate what that calculation looks like by hand.
  演示一下那个计算手工看起来是什么样的。

[01:04] And then sort of infer what what a uh, circuit would look like for that.
  然后推断一下，呃，电路会是什么样子。

[01:08] It'll turn out to be sort of easiest if if I do multiplication uh, accumulator something like a um, a four-bit number um, with another four-bit number.
  事实证明，如果我做一个乘法累加，比如一个四位数的，和一个另一个四位数的，会比较容易。

[01:18] and then we're going to the the actual clearest primitive is actually um, multiply accumulate.
  然后我们要做的最清晰的原语实际上是乘法累加。

[01:24] So there's a multiply these two terms and then we're going to add in.
  所以有一个将这两个项相乘，然后我们要加上。

[01:30] So, product of these two terms and then we're going to add in an eight-bit uh, number.
  所以，这两个项的乘积，然后我们要加上一个八位的数字。

[01:39] And then you can ask a clarifying question.
  然后你可以问一个澄清性的问题。

[01:41] Why is this the um, natural primitive for um, you know, what what whatever computation happens inside a computer?
  为什么这是，呃，自然的基元，用于，你知道的，计算机内部发生的任何计算？

[01:51] Yeah. So um, the there's a few reasons for this.
  是的。所以，呃，这有几个原因。

[01:53] Uh, it it's a little bit more efficient, um, but the the reason it's natural for AI chips is that if you look
  呃，它效率稍微高一点，但是它对AI芯片来说是自然的，是因为如果你看

[02:01] What's happening during a matrix multiply?
  矩阵乘法过程中发生了什么？

[02:02] What is matrix multiply?
  什么是矩阵乘法？

[02:05] In very short, it is um uh there's a for loop over I and over J and over K.
  简而言之，它有一个对 I、J 和 K 的 for 循环。

[02:18] Of output I K plus equals to um input I J times other input J K.
  输出 I K 加上等于输入 I J 乘以其他输入 J K。

[02:33] And so multiply accumulate happens at every single step of of a matrix multiply.
  因此，乘法累加发生在矩阵乘法的每一步。

[02:37] Makes sense.
  说得通。

[02:38] And then the other observation is that um the precision will almost always be higher in the accumulation step than in the multiplication step.
  然后另一个观察是，精度在累加步骤中几乎总是高于乘法步骤。

[02:45] Um this is maybe specific to AI chips, uh but you're you're multiplying low precision numbers, but then when you accumulate, errors accumulate quickly and so you need more precision here.
  这也许是 AI 芯片特有的，但你正在将低精度数字相乘，然后当你累加时，误差会迅速累积，所以你需要更高的精度。

[02:54] So, this is why we've chosen to do a 4-bit multiplication and an 8-bit addition.
  所以，这就是为什么我们选择进行 4 位乘法和 8 位加法。

[02:56] Let me make sure I understood that.
  让我确保我理解了。

[03:00] There's two ways to understand that.
  有两种理解方式。

[03:00] One
  一种

[03:01] is that the value will be larger than the inputs.
  该值将大于输入值。

[03:07] And the other is that if it was a floating-point number, it would be maybe that that part is like less intuitive to me, but it's maybe the same principle.
  另一个是，如果它是一个浮点数，它可能是我不太直观的那一部分，但它可能是相同的原理。

[03:14] It It is really the same principle.
  这确实是相同的原理。

[03:16] Um I guess the sort of I mean I guess the the separate principle is that as you are summing up this number, you are summing up a whole bunch of numbers and so you've got a lot of rounding errors accumulating.
  嗯，我想，我的意思是，我想，分开的原理是，当你对这个数字进行求和时，你是在对很多数字进行求和，所以你积累了很多舍入误差。

[03:26] Whereas in this case, there's like there's there's only one multiplication in that chain and so there's not a lot of rounding errors accumulating in the multiplication.
  而在这种情况下，链中只有一个乘法，因此乘法中没有多少舍入误差在积累。

[03:32] Why are you summing up a whole bunch of numbers?
  你为什么要对很多数字求和？

[03:33] There's just two numbers, right?
  只有两个数字，对吧？

[03:34] Just I mean, the summation happens it's repeated J J many times.
  我的意思是，求和发生了，它被重复了很多次。

[03:39] Any errors accumulate, I see.
  任何错误都会累积，我明白了。

[03:41] Yes, makes sense.
  是的，有道理。

[03:41] So, how would we perform this calculation by hand?
  那么，我们如何手动进行此计算？

[03:43] Um I mean, as a human we would probably separate it into two, but we can sort of do it all all in one um using long multiplication.
  嗯，我的意思是，作为一个人，我们可能会把它分成两部分，但我们可以用长乘法把它全部完成。

[03:52] So, the multiplication term first, we're going to multiply this number that this 4-bit number here by every single bit position in the other 4-bit number.
  所以，首先是乘法项，我们将把这个数字，这个4位数字，乘以另一个4位数字中的每一个位。

[04:03] So, we write that out.
  所以，我们把它写出来。

[04:04] Um first 1001 multiplied by this bit position.
  嗯，首先是 1001 乘以这个位。

[04:10] Um that is this number itself.
  嗯，那就是这个数字本身。

[04:12] Then, shift it up across by one.
  然后，向上移动一位。

[04:14] We're multiplying by zero.
  我们乘以零。

[04:16] That gives us an all zeros number.
  这给了我们一个全零的数字。

[04:20] Shifted uh across even one more to multiply by this one.
  再向右移动一位以乘以这个一。

[04:24] We get 1001.
  我们得到 1001。

[04:28] And then finally for this last last bit position, we get an all zeros uh number again.
  最后，对于这个最后的最后一位，我们再次得到一个全零的数字。

[04:37] So, this this sort of gives us a bunch of terms that we're going to have to add for the um for the multiplication.
  所以，这给了我们一堆项，我们将不得不为乘法相加。

[04:43] And then while we're doing that uh summation of this, we might as well add in the the the actual accumulator term as well.
  然后，在我们进行这个求和时，我们不妨也加上实际的累加器项。

[04:49] So, we just copy that directly across.
  所以，我们直接复制过去。

[04:58] So, so this is the sum.
  所以，这就是总和。

[05:00] It's a it's a five-way sum that we're going to want to compute.
  这是一个我们要计算的五路和。

[05:06] So, firstly, what what logic gates did it take us to even get to this intermediate step?
  所以，首先，我们用了哪些逻辑门才得到这个中间步骤？

[05:12] We needed to uh produce all 16 of these um partial products.
  我们需要呃生成这16个 um 部分积。

[05:17] How do I produce one of these partial products?
  我如何生成这些部分积中的一个？

[05:19] So, let's take this this number one for example here.
  所以，我们以这里的数字一为例。

[05:23] It is one.
  它是1。

[05:23] So, what was it How do we produce this number by multiplying this number by this one over here.
  那么，它是怎么通过将这个数字乘以这边的这个数字来生成的呢？

[05:31] We can actually produce that by an AND gate.
  我们实际上可以通过一个与门来生成它。

[05:32] This this number is one if both this bit is one and this bit is one.
  这个数字为1，当且仅当这个位是1并且这个位是1。

[05:37] If either of them is zero, then the multiplication of one times one of of zero times anything is zero.
  如果其中任何一个为零，那么1乘以1或0乘以任何数都等于零。

[05:41] So, to produce all of this stuff, we ended up consuming 16 and gates.
  所以，为了生成所有这些东西，我们最终消耗了16个与门。

[05:49] Or in the general case, if I were doing a like a P bit multiply times a Q bit multiply then this will be like P times Q many hands.
  或者在一般情况下，如果我进行一个 P 位乘 Q 位乘，那么这将是 P 乘以 Q 个与门。

[06:06] Exactly.
  确实。

[06:10] Finally, I sum them.
  最后，我将它们相加。

[06:12] Actually, most of the work is going to happen in the summing.
  实际上，大部分工作将在求和中完成。

[06:17] Um and so let me describe sort of the other logic gate that we use here.
  嗯，所以让我来描述一下我们在这里使用的另一种逻辑门。

[06:20] And is almost the simplest logic gate that exists on a chip.
  它是芯片上几乎最简单的逻辑门。

[06:22] It's it's it's it's almost the smallest.
  它，它，它，它几乎是最小的。

[06:27] At the other extreme, typically the very largest logic gate that you'll use is something called a full adder.
  在另一个极端，通常你将使用的最大的逻辑门是所谓的全加器。

[06:36] And what this does is sort of it does like it coming from software, you might think that a full adder it like adds 32-bit numbers together.
  它的作用是，就像从软件角度来看，你可能会认为全加器是将32位数字相加。

[06:41] Uh in this case, it just adds three single-bit numbers together.
  呃，在这种情况下，它只是将三个单比特数字相加。

[06:44] And so you can think of it as like adding zero, one, and one together.
  所以你可以把它想象成将零、一和一相加。

[06:50] Now, when I add this together, the result can be zero, one, two, or three.
  现在，当我将它们相加时，结果可以是零、一、二或三。

[06:53] So, I can express that in binary using just two bits.
  所以，我可以用两个比特在二进制中表示它。

[06:55] So, it it it as input it has three bits and as output it has it has two bits, which in this case are one and like uh the number two in binary is one zero.
  所以，它，它，它作为输入有三个比特，作为输出有两个比特，在这种情况下是“一”和“零”，呃，数字二的二进制是“一零”。

[07:06] So, so this is also known as a
  所以，所以这也被称为一个

[07:09] three to two compressor.
  三转二压缩器。

[07:15] Because it takes three bits of input and produces two bits of output.
  因为它接收三个输入位并产生两个输出位。

[07:26] sorry, the the the the three inputs are um all all bits that are in the same sort of bit position, like three bits that are in a column here.
  抱歉，三个输入位都是处于相同位位置的位，就像这里一列中的三个位一样。

[07:34] Yeah, yep. And then the two outputs um I I have sort of drawn them vertically here and horizontally here to kind of match this vertical versus horizontal layout here, which is expressing that um things that are in the same column are in the same like bit position.
  是的，嗯。然后两个输出我在这里垂直绘制，在这里水平绘制，以匹配这里的垂直与水平布局，这表明同一列中的事物处于相同的位位置。

[07:48] Yeah. Whereas things that are in adjacent columns, like this is a carry out where this was the sum.
  是的。而相邻列中的事物，比如这是进位输出，而这是和。

[07:51] So, if the inputs in the full adder are let's say like 101, then the output will still be 10.
  所以，如果全加器的输入是，比如说 101，那么输出仍然是 10。

[07:57] If it was 111, it'd be 11.
  如果是 111，那就是 11。

[07:59] If it was 000, it'd be 00.
  如果是 000，那就是 00。

[08:02] If it was like 010, it'd still be 01.
  如果是 010，那仍然是 01。

[08:05] So, yeah, it's just counting essentially the number of the number of
  所以，是的，它基本上只是在计算数量，数量的

[08:09] things in each expressing that in binary.
  每个东西都以二进制形式表达。

[08:12] So, this circuit actually can sort of capture what we as humans naturally do when when we're when we're doing summing along the column.
  所以，这个电路实际上可以捕捉到我们人类在沿着列进行加法时自然而然会做的事情。

[08:21] So, I can show that sort of I'll show you sort of one iteration of using the full adder to sum.
  所以，我可以展示一下使用全加器进行一次加法的过程。

[08:26] The way I sum here is going to be a little bit unnatural for humans.
  我在这里进行加法的方式对人类来说会有点不自然。

[08:29] Humans, we would sort of sum along the column and then remember the carry.
  人类会沿着列进行加法，然后记住进位。

[08:33] But instead of remembering the carry, we'll actually just explicitly write it out.
  但我们不会记住进位，而是会明确地把它写出来。

[08:38] So, in this we proceed from the rightmost column towards the left.
  所以，在这个过程中，我们从最右边的列向左进行。

[08:42] On the rightmost column, we sum the one and the one.
  在最右边的列，我们将一和一相加。

[08:45] And that produces like a a zero here and a carry of one.
  这会产生一个零和一个进位一。

[08:50] So, we've we've sort of used a this full adder circuit on this pair of bits and produced a pair of bits as output.
  所以，我们在这个比特对上使用了这个全加器电路，并产生了一个比特对作为输出。

[08:59] Now, we can do the same thing with this column.
  现在，我们可以对这一列做同样的事情。

[09:00] We've got a column of 1 2 3 4 numbers.
  我们有一列1、2、3、4个数字。

[09:04] And so, maybe we'll like take the first three of them.
  所以，也许我们会取其中的前三个。

[09:07] Um run a full adder on them and that gives us a zero and a zero as output.
  嗯，在它们上面运行一个全加器，这会给我们一个零和一个零作为输出。

[09:08] Um so,
  嗯，所以，

[09:11] it's like the sum of these is is 00.
  这些的总和是00。

[09:13] And so, that's the full compressor full adder applied to all of these bits.
  所以，这就是应用于所有这些位的完整压缩器全加器。

[09:17] As I've used up bits, I'm going to sort of just cross them out to indicate that I've I've I've handled them.
  当我用完位时，我会将其划掉，以表明我已经处理了它们。

[09:25] Let's just keep going a little bit more.
  让我们继续一点。

[09:27] So, we we'll we'll go here.
  所以，我们将在这里进行。

[09:29] I take these three numbers.
  我取这三个数字。

[09:29] I add them.
  我将它们相加。

[09:31] That gives me a a one and a zero.
  这给了我一个一和一个零。

[09:35] I've dealt with these three numbers.
  我已经处理了这三个数字。

[09:36] And now I take one, two, and I can even take these three numbers for example right now and add them and that gives me a a one and a zero and I've dealt with these numbers.
  现在我取一、二，我甚至可以举例来说，现在就取这三个数字并将它们相加，这给了我一个一和一个零，我已经处理了这些数字。

[09:47] So, I can sort of like the way I should view this is that I have this whole grid of numbers that need to be added.
  所以，我应该这样看待这个问题，我有一个需要相加的数字网格。

[09:52] I'm going to just keep applying full adders to all the bits that are here, constantly removing three numbers from a column and then writing out two numbers as output.
  我将继续将全加器应用于这里的所有位，不断地从一列中移除三个数字，然后写出两个数字作为输出。

[10:03] Keep going with this over and over and over again until I eventually get like some just one single number coming out here.
  一遍又一遍地继续下去，直到最终得到一个单独的数字在这里输出。

[10:09] Um, something like that.
  嗯，差不多是这样。

[10:12] This is probably the wrong sum.
  这可能是不正确的总和。

[10:16] So, so this approach that I've described here, this is called a um, data multiplier.
  所以，我在这里描述的这种方法，它被称为一个嗯，数据乘法器。

[10:26] Um, and this is sort of like the standard for how how you do area efficient multipliers using full adders.
  嗯，这有点像标准，说明如何使用全加器来实现面积效率高的乘法器。

[10:36] Let's try and quantify the circuit size of this just so we have got a a sense of like how big things are that so we can compare to them later.
  让我们尝试量化这个电路的大小，这样我们就能大致了解事物的大小，以便以后进行比较。

[10:44] How many full adders do I did I use?
  我用了多少个全加器？

[10:47] I started with how many numbers?
  我从多少个数字开始？

[10:47] I have the 16 partial products, which is the product of all of these terms with all of these terms.
  我有 16 个部分积，这是所有这些项与所有这些项的乘积。

[10:57] Plus the eight um, terms that I'm adding here.
  加上我在这里添加的八个嗯，项。

[10:59] So, I I started off with 24 bits.
  所以，我开始时有 24 位。

[11:06] And then on I produced eight bits on the output eventually.
  然后最终我在输出上产生了八位。

[11:10] And in every step, I was sort of
  在每一步中，我都在某种程度上

[11:13] crossing off three numbers and writing two numbers out as a result.
  划掉三个数字，结果写出两个数字。

[11:17] And so, every single use of a full adder eliminates one of the bits here.
  所以，每次使用一个全加器都会消除这里的一个比特。

[11:21] Mhm.
  嗯。

[11:21] And so, how many full adders it must be the 24 minus the eight.
  所以，有多少个全加器，必须是24减去8。

[11:25] So, that there were 16 full adders in in this circuit.
  所以，这个电路中有16个全加器。

[11:28] In general, this is true in the general case as well.
  总的来说，在一般情况下也是如此。

[11:30] There will be P * Q many full adders Uh, in this circuit.
  这个电路中将有 P * Q 个全加器。

[11:35] I'm just trying to understand the logic of that.
  我只是想理解其中的逻辑。

[11:39] So, the input bits 24 is P * Q + P + Q.
  所以，输入比特24是 P * Q + P + Q。

[11:43] That's right.
  没错。

[11:43] And the output bits is just P + Q.
  输出比特就是 P + Q。

[11:52] And so, P * Q + P + Q P + Q = P * Q.
  所以，P * Q + P + Q P + Q = P * Q。

[11:59] So, I think this explains sort of or at least hints that the second reason why we chose to do a multiply accumulate.
  所以，我认为这解释了或者至少暗示了我们选择进行乘法累加的第二个原因。

[12:06] First reason being that's actually what shows up in matrix matrix multiplication, but second reason being it gave us this very slick
  第一个原因是这实际上就是矩阵乘法中出现的情况，第二个原因是它给了我们这个非常简洁的

[12:14] P * Q very simple algebra.
  P * Q 非常简单的代数。

[12:17] So, we've sort of described like this.
  所以，我们已经大致描述了像这样。

[12:19] this whole procedure every single atomic step that I took here becomes a logic gate and then sort of the wires connected together.
  这个整个过程，我在这里采取的每一个原子步骤都变成了一个逻辑门，然后是连接在一起的导线。

[12:25] Like when I had these three inputs that that I summoned to produce these two outputs, like if I think of mapping this to a physical device, there would be a wire that runs sort of connecting all three of these things together into a logic gate that produced this output.
  就像当我拥有这三个输入来产生这两个输出时，如果我设想将其映射到一个物理设备，就会有一根导线将这三者连接到一个产生该输出的逻辑门。

[12:42] Okay.
  好的。

[12:42] So, this is the the main primitive at different bit widths that that is that is inside an AI chip.
  所以，这是不同位宽内的主要基本单元，它存在于 AI 芯片中。

[12:52] We're going to build up from here to how would you use that to run all of the other operations you want?
  我们将从这里开始构建，如何使用它来运行您想要的所有其他操作？

[12:58] This may be the wrong time to ask this question, but whenever Nvidia reports that this chip can do X many FP4 or half as many FP8, it seems to imply that those circuits are fungible.
  现在问这个问题可能不合适，但每当英伟达报告说该芯片可以执行 X 个 FP4 或一半的 FP8 时，似乎就意味着这些电路是可互换的。

[13:11] That there's not as dedicated like FP4 versus FP8,
  也就是说，没有像 FP4 与 FP8 那样专门的区分，

[13:15] but the way you're you're mapping it out here it seems like you would need if it has to be mapped out in the logic,
  但你在这里的映射方式似乎需要，如果它必须在逻辑中映射出来，

[13:22] you would need a dedicated FP4 multiply accumulate and then a dedicated FP8 accumulate?
  你需要一个专用的 FP4 乘累加器，然后是一个专用的 FP8 累加器？

[13:30] Basically, can you can you fund them?
  基本上，你能，你能资助它们吗？

[13:33] As drawn, they're they're actually not particularly fungible.
  如所绘制的，它们实际上并非特别可替代。

[13:36] Um This is actually one of the main choices you have to make when designing a chip,
  嗯，这实际上是设计芯片时必须做出的主要选择之一，

[13:38] which is how much of FP4, how much of FP8 do I have?
  即我拥有多少 FP4，多少 FP8？

[13:41] And then that sometimes I'll make that consideration from the point of view of like uh what do I think that is the customer requirement?
  然后有时我会从“嗯，我认为客户要求是什么？”的角度来考虑这一点。

[13:47] Another way to take an angle on that is to say um what is the what is the power budget for uh equalize the power budget between FP4 and FP8.
  另一种看待这个问题的方式是说，嗯，FP4 和 FP8 之间的功耗预算是多少，平均功耗预算是多少。

[13:53] But so then when they report those numbers and they just happen to be the case that like it does 2x as many FP4 as FP8,
  但当他们报告这些数字时，碰巧是 FP4 的数量是 FP8 的两倍，

[13:59] they just happen to choose like give um equivalent die areas to all the floating points and as a result that ended up being
  他们碰巧选择给予所有浮点数等效的芯片面积，结果就是

[14:10] Like why is the ratio exactly 2x?
  为什么比例恰好是 2 倍？

[14:12] yeah, exactly.
  是的，没错。

[14:12] Yeah.
  是的。

[14:12] So um part of it is I mean the the surely that wouldn't be exactly like
  所以，嗯，其中一部分是，我的意思是，肯定不会正好是

[14:16] exactly equivalent to die area.
  正好等于芯片面积。

[14:19] There's uh there's a data movement reason actually and we'll maybe come back to this when we sort of look through how it goes into and out of memories.
  实际上有一个数据移动的原因，当我们回顾它如何进出内存时，我们可能会回到这一点。

[14:27] Um there's something really nice just from a software level of the fact that I can pack two four-bit numbers into into the same storage as an eight-bit number.
  嗯，从软件层面来看，我能够将两个四位数字打包到与一个八位数字相同的存储空间中，这一点非常棒。

[14:32] And so when I store that to a memory or something like that uh it's the the sizing of the um the buses that I wire within the chip um actually makes that work out really really nicely.
  所以当我将它存储到内存或类似的东西时，嗯，我芯片内部连接的总线的大小，嗯，实际上使得这一点运行得非常好。

[14:43] Actually come to think of it, it's not just 2x, it would be uh the the amount of area it takes it sounds like is um quadratic.
  说实话，它不仅仅是2倍，它所占用的面积听起来是二次方的。

[14:54] It's quadratic in fact.
  事实上，它是二次方的。

[14:55] Yeah.
  是的。

[14:55] with the with the bit length.
  与位长度有关。

[14:57] So uh that's why a smaller precision is like even more favorable than you would have expected.
  所以，嗯，这就是为什么较低的精度比你预期的更有利。

[15:01] a really big reason.
  一个非常重要的原因。

[15:01] So um in fact, Nvidia made a change um historically uh up until B100 or B200, every time you have the bit precision, you you double the um the flop count.
  所以，嗯，事实上，英伟达在历史上做了一个改变，嗯，直到B100或B200，每次你改变位精度，你就会将浮点运算次数翻倍。

[15:14] Um that ratio is
  嗯，这个比例是

[15:17] exactly like for the reason you said, because of this quadratic scaling, that ratio is actually slightly wrong.
  正如你所说，由于这种二次方缩放，该比例实际上略有错误。

[15:22] Um it should be like an even bigger you should get an even bigger speed up than than you than you might otherwise think.
  嗯，它应该是一个更大的，你应该获得比你可能想象的更大的加速。

[15:28] Um, Nvidia's like product specs have sort of started acknowledging that in B3 100 and beyond where the FP4 is three times faster than the FP8.
  嗯，英伟达的产品规格似乎已经开始在 B3 100 及更高版本中承认这一点，其中 FP4 的速度是 FP8 的三倍。

[15:37] Though it should be 4x.
  尽管它应该是 4 倍。

[15:38] Yeah, what I've shown here is like the simplest case of integer multiply.
  是的，我在这里展示的是整数乘法的最简单情况。

[15:42] When you're when you're dealing with floating point as you do in FP4 and FP8, um, there's this sort of other term which is the the the exponent that just complicates this this calculation.
  当你处理 FP4 和 FP8 中的浮点数时，嗯，还有这个指数项，它使这个计算变得复杂。

[15:53] So so what can we see already from this?
  那么，我们已经从这里看到了什么？

[15:54] It's like I think the the big observation you've made is that there's this quadratic scaling with bit width, um, which which is like very effective and and is the single reason why um, low precision arithmetic has worked so well for neural nets.
  我认为你提出的一个重要观察是，存在这种具有位宽的二次方缩放，嗯，这非常有效，并且是低精度算术在神经网络中如此成功的原因。

[16:07] Um, but the other thing we're going to do now is we're going to compare sort of the area spent on the multiplication itself with all of the circuitry that is around it.
  嗯，但我们要做的另一件事是比较用于乘法本身的面积与围绕它的所有电路的面积。

[16:18] So, we'll we'll walk back in time a little bit and and see how did GPUs prior to Tensor Cores work?
  所以，我们稍微回顾一下历史，看看在张量核心出现之前，GPU是如何工作的？

[16:26] Um, as which is the same way as the way that CPUs worked in fact.
  嗯，这和事实上的 CPU 的工作方式是一样的。

[16:30] Mhm. So which is like where do we stick this multiply accumulate unit?
  嗯哼。那么，我们把这个乘加单元放在哪里呢？

[16:37] So, generically I'll just describe like a CUDA core or a CPU.
  所以，泛泛地说，我将描述一个 CUDA 核心或一个 CPU。

[16:43] You'll have some register file which stores some number of entries.
  你将拥有一些寄存器文件，其中存储着一定数量的条目。

[16:48] Maybe maybe it's like eight um, eight entries.
  也许，也许是八个，嗯，八个条目。

[16:53] Um, uh, of of like in this case I guess four-bit numbers but typically like 32-bit numbers or something like that of uh, which are not numbers.
  嗯，呃，就像在这种情况下，我猜是四位数的数字，但通常是三十二位数的数字或类似的东西，呃，它们不是数字。

[17:00] Um, so this is the like inside the CUDA core I'll have some register file of some depth.
  嗯，所以这就是 CUDA 核心内部的样子，我将拥有一些具有一定深度的寄存器文件。

[17:06] And then I will have my multiply accumulate circuit multiply uh, and accumulate circuit.
  然后我将拥有我的乘法累加电路，乘法呃，和累加电路。

[17:14] And what it's going to do it's going to like it's going to take three arbitrary registers from this register file
  它将要做的是，它将从这个寄存器文件中取出三个任意寄存器

[17:19] perform the multiply accumulate and then write back to the register file.
  执行乘法累加，然后写回寄存器文件。

[17:24] So, it's going to maybe write to this one, but it it was able to read from this one.
  所以，它可能会写入这个，但它能够从这个读取。

[17:29] This one and and another random one.
  这个和一个随机的另一个。

[17:31] So, it'll take three inputs like this.
  所以，它将接受三个这样的输入。

[17:36] So, this is the core data path of of many processors.
  所以，这是许多处理器的核心数据路径。

[17:38] Most processors look like this.
  大多数处理器看起来都像这样。

[17:40] You've got some set of registers and then you've got some set of logic units or ALUs.
  你有一组寄存器，然后你有一组逻辑单元或ALU。

[17:48] We want to analyze the cost of the data movement from the register file to to the ALU and back.
  我们想分析从寄存器文件到ALU再返回的数据移动成本。

[17:56] So, ultimately there's going to be some circuit that says, "Well, I I don't always have to select this guy.
  所以，最终会有一个电路说，“嗯，我不必总是选择这个家伙。

[17:59] I might select any of the registers in it in any point in time."
  我可能在任何时候选择其中的任何一个寄存器。”

[18:03] And so, sort of a first question is how can I build a circuit?
  所以，第一个问题是如何构建一个电路？

[18:07] The circuit that I'm going to look for is is a mux.
  我要找的电路是一个多路复用器。

[18:10] So, in this case it's going to have eight inputs, one from each entry of the register file,
  所以，在这种情况下，它将有八个输入，每个输入来自寄存器文件的一个条目，

[18:17] and it's going to have one output, which is actually
  它将有一个输出，实际上是

[18:21] producing this output.
  生成此输出。

[18:25] And then like what is the cost of this thing?
  然后，这个东西的成本是多少？

[18:29] It's it's like all we have to to build it out of is and and or.
  我们能用来构建它的只有与和或。

[18:32] Uh and so, how do we build it?
  呃，那么，我们如何构建它？

[18:36] We we do the dumbest thing possible.
  我们做最愚蠢的事情。

[18:38] We like form a mask saying we okay, when we want to read like the third entry,
  我们创建一个掩码说，好的，当我们想读取第三个条目时，

[18:43] um we're going to and every single entry with either one or zero based on whether that's what we want to read and then we're going to or all of them together.
  呃，我们将与每个条目进行与运算，根据我们是否想读取它来决定是 1 还是 0，然后我们将对它们进行或运算。

[18:50] Okay, just to make sure I understand the basics.
  好的，只是为了确保我理解了基本原理。

[18:51] What the mux is doing is it's just like selecting.
  多路选择器所做的是它只是在选择。

[18:53] Just selecting.
  只是在选择。

[18:55] Just selecting an input.
  只是在选择一个输入。

[18:58] Yeah. So, like invisible to software is like you say I want input number three, that means there's a mux here.
  是的。所以，对软件来说就像你说我想要输入三号，这意味着这里有一个多路选择器。

[19:00] And so, like what is the cost of this mux?
  所以，这个多路选择器的成本是多少？

[19:02] So, an n input mux uh operating on P bits.
  所以，一个 n 输入的多路选择器呃，在 P 位上操作。

[19:15] Well, I'm going to So, I have N N rows, that's this eight
  嗯，我将要，所以我 N N 行，这是这个八

[19:20] rows, and I've got uh, like each row is P bits wide.
  行，而且我每行有 P 位宽。

[19:22] Well, I have to AND every single bit.
  好吧，我必须对每一位进行与运算。

[19:24] So, I get N * P many AND gates.
  所以，我得到 N * P 个与门。

[19:31] Every single input I have to say, am I going to like mask it out or not?
  每一个输入我都要说，是要屏蔽掉还是不屏蔽？

[19:35] Um, and then and then I'm going to OR them all together.
  嗯，然后我将把它们全部或起来。

[19:38] And so, uh, there's going to be like N - 1 * P many OR gates.
  所以，嗯，将会有 N - 1 * P 个或门。

[19:44] Which is saying, uh, I've got all of these different things.
  也就是说，嗯，我有了所有这些不同的东西。

[19:46] Almost all of them are zeros, but I need to sort of collapse them down into into like from my eight options down into one option.
  几乎所有都是零，但我需要将它们从我的八个选项压缩成一个选项。

[19:53] And so, every step I need to OR like one row into into an existing row.
  所以，每一步我都需要将一行或到一个现有的行中。

[19:54] Got it.
  明白了。

[19:56] Yeah.
  是的。

[19:59] It's actually kind of funny that you would sort of um, you don't think at the level of hardware.
  实际上很有趣的是，你不会在硬件层面思考。

[20:01] You sort of just think like, oh, I'll just select element three, and something as simple as that is uh, a sort of like in and of itself a quite complicated circuit.
  你只是觉得，哦，我将选择第三个元素，而像这样简单的事情本身就是一个相当复杂的电路。

[20:13] Yeah.
  是的。

[20:13] I mean, this is the first step of all of the hidden data movement costs that that shop.
  我的意思是，这是所有隐藏数据移动成本的第一步。

[20:16] Yeah, yeah.
  是的，是的。

[20:18] Um, and so, like the thing like we're we're just going to like compare like I
  嗯，所以，就像我们要比较一样

[20:23] have to pay this cost, and I've got one

[20:25] mux here, and then in fact I have two

[20:26] more copies of that for each of the

[20:27] three inputs to my multiply accumulate

[20:29] operation. And so, I have this cost

[20:31] which is like like 3 * N * P AND gates

[20:35] over here.

[20:36] Um, compared to this P * Q um,

[20:40] like uh, sort of gates in the actual

[20:42] circuit that I that is doing the thing I

[20:44] care about.

[20:46] And if we plug in actual numbers like

[20:47] this N being eight, um, like I get like

[20:51] 24

[20:53] * P

[20:54] gates over just just in the data

[20:56] movement compared to like if Q is four,

[20:59] like 4 * P gates just in the in in the

[21:01] adder uh, multiply adder. And sorry,

[21:03] where is the three coming from?

[21:05] Three different inputs here.

[21:07] Got it. Okay.

[21:09] So, the case I like really just what I'm

[21:11] hinting at here is that like

[21:13] all of this work which scales like as as

[21:16] the size of the register file, and this

[21:17] is a very small register file, all of

[21:19] this work just moving the data from the

[21:20] register file

[21:21] to the to the to the logic unit um

[21:25] is many many times more expensive than a

[21:27] logic unit.

[21:28] In the most recent Cloud Spectator 7

[21:30] analysis ranked almost 100 different GPU

[21:32] clouds. Crusoe was one of only five that

[21:35] made the gold tier. 7 analysis found

[21:37] that gold tier providers like Crusoe had

[21:39] a total cost of ownership that was 5 to

[21:41] 15% lower than silver tier ones. Even

[21:44] when they had identical GPU pricing.

[21:46] This makes sense because total cost of

[21:47] ownership is downstream of a bunch of

[21:49] different things that don't necessarily

[21:51] show up in the sticker price, but that

[21:52] Crusoe has optimized. Things like

[21:54] [music] how well you detect faults and

[21:56] how quickly you replace failed nodes.

[21:58] For example, Crusoe was one of the first

[22:00] clouds to adopt NV Sentinel, [music]

[22:02] Nvidia's own GPU monitoring and

[22:04] self-healing software for enhanced GPU

[22:06] uptime, utilization, and reliability.

[22:09] This is Crusoe makes use of everything

[22:11] that Nvidia has learned about why chips

[22:13] fail across all their different fleets

[22:14] and deployments so that Crusoe can catch

[22:17] faults earlier in the process. And once

[22:19] they identify a failure, Crusoe can swap

[22:21] in a healthy node in less than [music]

[22:22] 10 minutes. Because they're not running

[22:24] bare metal, Crusoe doesn't have to spend

[22:26] time installing an operating system or

[22:27] configuring drivers. They can just spin

[22:29] up a new VM on an already running and

[22:31] pre-qualified host. If you want to learn

[22:33] more about this or the other reasons

[22:35] that Crusoe made gold tier, go to

[22:37] crusoe.ai/dwarfcash.

[22:41] It may be helpful to just see what a mux

[22:43] looks like, maybe like a two-bit or a

[22:45] four-bit mux. Yeah, great.

[22:47] So, we'll take some inputs. We'll we'll

[22:49] have maybe um

[22:50] like we'll just do a a two-way mux. So,

[22:53] we've got um two different numbers

[22:56] um

[22:58] Uh we've got these two inputs.

[23:00] And then we have a

[23:02] um so the are the inputs

[23:04] um um

[23:05] that are being selected between and then

[23:07] we have a selector.

[23:13] Which says which can either be like

[23:15] I want this one

[23:17] or

[23:19] or it could be I want the other one. So

[23:21] this is a one-hot encoding.

[23:24] Um so so this is what we all start with.

[23:29] And then we The output we want to

[23:31] produce like let let's focus on this

[23:33] case. So so this is the actual input we

[23:35] got.

[23:35] >> Yeah. Um

[23:37] we just want to produce this guy as as

[23:39] the result. Um and so like sort of very

[23:43] laboriously what we do is we and this

[23:46] bit with all of these and so

[23:48] um

[23:50] that produces like ending this bit with

[23:52] this row and likewise we end this bit

[23:54] with this row that produces all zeros.

[23:58] So this was the

[24:00] there's four ANDs here.

[24:03] There's four ANDs here.

[24:06] And then finally we just OR these two

[24:08] together. This gives a one. We OR these

[24:09] two together. This gives a one. We OR

[24:11] these two together gives a zero. We OR

[24:13] these two together and gives a one.

[24:15] And so this is the four ORs.

[24:21] So like this actually ends up looking a

[24:22] little bit like addition in fact like we

[24:24] we did exactly the set of same same set

[24:26] of ANDs here.

[24:28] Um so if we've added all of these things

[24:30] together but then instead of

[24:32] collapsing it by using these full adder

[24:34] circuits we just could get a very simple

[24:36] collapsing with OR gates.

[24:38] And how but that I guess that doesn't

[24:39] look like N times P. Um so so yeah so

[24:41] this was this was with um N equals two

[24:45] um inputs.

[24:46] >> Uh right.

[24:47] Um in the general case we will have

[24:50] um

[24:51] N

[24:52] uh

[24:55] like and so this this is N rows um and

[24:59] then we'll have uh P bits um

[25:02] uh per row. Um so, that gives us the

[25:05] um the N by N times P many AND gates.

[25:08] So, this circuit I've described here

[25:10] almost all of the cost, like uh

[25:13] 7/8 of the cost uh is in the reading and

[25:16] writing the register file. And only a

[25:19] tiny fraction of the cost is in the

[25:20] logic unit itself. So, this is the

[25:22] problem to solve. Um this this

[25:25] essentially was the the state of play um

[25:28] prior to the Volta generation of Nvidia

[25:31] GPUs. This is what what this kind of

[25:33] thing is what was inside the CUDA cores.

[25:36] Um and this and this sort of problem

[25:38] statement is uh what motivated

[25:40] introduction of Tensor cores, which are

[25:41] more generically called systolic arrays.

[25:43] >> Mhm. So, if we think about how we're

[25:45] going to solve this problem, like we

[25:46] we're spending almost all of our circuit

[25:47] area on something that we just really

[25:48] don't care about and is and is like

[25:50] hidden to the software programmer and

[25:51] and the thing that we actually care

[25:52] about is is is not much of the area. Um

[25:55] well, make this one bigger somehow while

[25:56] while keeping this at the same size.

[25:58] That's the goal.

[25:59] So,

[26:00] so the evolution was like uh we had

[26:03] baked this much into hardware um uh in

[26:05] this stage that this single line is a

[26:07] multiply accumulate um and this was this

[26:09] single thing was baked into hardware. Um

[26:11] the idea of a systolic array is to sort

[26:14] of go two levels of loop up and uh and

[26:16] and bake uh this entire loop out here in

[26:19] into hardware. And so, the idea being

[26:20] that if we have a much bigger

[26:22] granularity fixed function piece of

[26:24] logic, um

[26:26] maybe the taxes we pay on the input and

[26:27] output are much smaller.

[26:29] It sounds like you're suggesting that if

[26:30] you have um

[26:32] if you go up one step in the

[26:34] uh

[26:36] in the matrix multiply loop, that

[26:38] there's some you you can tilt the

[26:40] balance more towards compute than

[26:42] communication. That's right. Sort of

[26:44] there's two effects that we're going to

[26:46] take advantage of here. One is just that

[26:47] we can do more stuff before per every

[26:50] trip through a register file.

[26:51] >> Mhm. Um and then the other thing we're

[26:53] going to take advantage of is in fact um

[26:56] in in in some of this loop we can take

[26:58] advantage of for example

[27:00] uh some certain things staying fixed.

[27:04] So uh let's sort of visually we're going

[27:06] to look at this matrix multiplication.

[27:08] So so this portion of the loop

[27:09] corresponds to a matrix vector

[27:12] multiplication in fact. So we'll we'll

[27:13] take a matrix um

[27:16] and multiply it by

[27:18] uh a vector.

[27:21] So how do we do this? We take every

[27:23] column gets multiplied by the vector and

[27:25] then summed. So we're going to sum sort

[27:26] of along columns. Um and so this zero

[27:29] and three gets multiplied by the three

[27:30] and seven and gets summed.

[27:31] And then the one and two gets multiplied

[27:33] by the three and seven and gets summed.

[27:36] So there is a multiply accumulate

[27:38] associated with every single one of

[27:39] these

[27:40] um entries in the matrix. So we'll just

[27:42] draw out these four multiply

[27:43] accumulates.

[27:53] I just want to make sure I understand

[27:54] why there's four multiply accumulates.

[27:56] So

[27:58] if each entry in the

[28:01] uh column that corresponds to the output

[28:03] vector

[28:05] is

[28:06] uh

[28:07] a dot product and in this case it'll be

[28:09] like

[28:11] two multiplications and then the

[28:13] addition of these two multiplications.

[28:15] So like you're accumulating Yeah, so the

[28:17] addition so really there's only one

[28:19] addition in per dot product but like

[28:20] when do we like to start with the

[28:21] initialization

[28:22] >> the the what we're going to aim for is

[28:23] Yeah, so the the what we're going to aim

[28:25] for is to have

[28:27] So we've got a we want to have

[28:29] quadratically more compute. We do. We

[28:30] have We've got sort of

[28:32] x times y

[28:34] as much compute

[28:37] as as we had before.

[28:39] Um

[28:41] Um but we're going to want to somehow

[28:43] aim for having only x times as much um

[28:48] uh, communication.

[28:50] And And this is sort of the the

[28:52] intention so that we get this, uh,

[28:53] advantage, uh, term going as Y.

[28:57] So, we've laid down the multiplications.

[29:00] Bringing in like, we're going to want to

[29:02] bring in a vector of size

[29:05] two. And so, that's sort of already is

[29:07] in line with our columns target. That's

[29:08] fine.

[29:10] However, we need to somehow manage the

[29:11] communication of this matrix which which

[29:13] which does which exceeds our budget of

[29:15] X.

[29:16] And so, uh, the idea is that

[29:20] in the in an AI context, this this

[29:22] matrix is actually going to stay fixed

[29:24] for a long period of time. And so,

[29:26] instead of like bringing it in from the

[29:29] outside, so we've got the some register

[29:30] file sitting over here, we

[29:33] we don't want to have like the amount of

[29:35] stuff coming out of this register file.

[29:36] This is the term that we want to go, uh,

[29:38] sort of as as X in some sense.

[29:41] Um,

[29:43] we don't want to bring this full matrix

[29:45] in from the register file every every

[29:46] cycle because we don't have enough, um,

[29:48] that would cost too much in in terms of

[29:50] wiring from the register file.

[29:52] And so, instead we're going to store,

[29:54] um, our key trick is that this matrix

[29:56] can be stored locally,

[29:58] uh, to the systolic array.

[30:00] And so,

[30:02] um,

[30:03] where

[30:04] we'll store these numbers zero, one,

[30:06] two,

[30:07] and three.

[30:10] In in just like a gate called a register

[30:13] that like

[30:14] physically stores these numbers and

[30:16] we're going to reuse these numbers, uh,

[30:17] over and over again for for a large

[30:19] number of different different vectors.

[30:21] >> And so, the optimization here is that

[30:22] like the nature of matrix multiplication

[30:25] is you can store this like

[30:29] uh, square quadratic thing

[30:32] directly where the logic is happening.

[30:35] Mhm.

[30:36] Um,

[30:37] and which is like higher dimension than

[30:39] the

[30:40] uh, Uh, or has an extra dimension

[30:42] compared to the inputs which you keep

[30:44] swapping in and out. That's right. I

[30:45] mean, this is the nature of what a

[30:46] matrix multiplication is, is that you do

[30:49] uh, a lot of multiplication to get one

[30:53] value out. Like a dot product is a the

[30:55] result of a lot of multiplications, and

[30:56] so that optimization means that you

[30:58] you're just like you can stuff a lot of

[31:00] like multiplication in before you get

[31:02] some value out of it.

[31:03] >> That's right. That's right.

[31:03] >> Yeah. So, like just to complete the

[31:05] picture here of of concretely how that

[31:06] looks, um,

[31:08] uh, I I swapped the between the two

[31:10] here. Um, three and two. Um,

[31:14] so just like

[31:16] this zero and three is going to multiply

[31:17] by the three and seven, and so, um,

[31:21] uh, we're going to form a dot product

[31:22] sort of along columns here.

[31:24] So,

[31:26] somehow we're going to feed a three and

[31:28] a seven in here.

[31:30] These participate in sort of this feeds

[31:33] into this multiplication, and also feeds

[31:35] into this multiplication.

[31:38] Likewise, the three feeds into here, and

[31:40] also into here.

[31:41] And then we're going to sum

[31:43] uh,

[31:46] sort of

[31:47] along here with like starting at the top

[31:49] of a column we feed in zeros, and then

[31:51] coming out the bottom, uh, we get

[31:53] results coming out. So, sort of just to

[31:55] visually see what we've got, um, there

[31:58] is a dot product that is performed along

[31:59] columns in a matrix, and that that sort

[32:01] of maps exactly to what is done

[32:03] spatially in the systolic array here.

[32:06] Um, so this is one dot product summed

[32:08] vertically, and this is a second dot

[32:10] product also summed vertically.

[32:12] Um, and then what is the data that needs

[32:14] to go into and out of the register file?

[32:17] We have

[32:19] X amount of data that's coming out here

[32:21] on the output.

[32:22] Um, and we also have sort of

[32:24] uh,

[32:25] this data

[32:27] coming from the input, X amount of data

[32:28] from the input. Um, and so, with respect

[32:32] to the input and output vectors at

[32:33] least, um, we we sort of met our goal of

[32:36] having only X as as much data going in

[32:39] and out of the

[32:41] um

[32:41] uh the register file.

[32:44] This leaves open the question of like I

[32:47] said that the the weight matrices weight

[32:49] matrix is stored locally in the systolic

[32:51] array. How did it get there in the first

[32:52] place?

[32:53] Um is sort of a like at some point you

[32:55] need to boot your chip and populate this

[32:57] data and so where did that come from?

[32:59] The trick is just we just do it very

[33:01] slowly. So we we are very slowly trickle

[33:03] feed it in into the systolic array. The

[33:06] um sort of the simplest strategy is that

[33:08] we we sort of run this daisy chain that

[33:10] says like feed a number into here and

[33:13] then and then on the next clock cycle

[33:15] it'll move down to the next entry of the

[33:16] systolic array. And so we can do that in

[33:18] every

[33:19] um in in every column in parallel and

[33:22] that gives us um sort of

[33:24] uh this is also going to come from here

[33:26] and this is going to be another factor

[33:27] of approximately X units of of bandwidth

[33:30] coming in.

[33:31] Uh you can can you just would you mind

[33:32] repeating that sentence one more time?

[33:34] So like we're sort of like we know that

[33:37] we're going to be bringing in uh numbers

[33:39] only rarely into the matrix. Yeah. Um

[33:41] and so

[33:43] we just want to come up with any

[33:44] construction at all such that the amount

[33:46] of wiring that actually feeds into sort

[33:48] of crosses this boundary of the systolic

[33:50] array like this boundary right here. We

[33:52] just want to keep that bounded to X and

[33:54] not be not go as X Y. And so a

[33:56] particularly simple strategy is that we

[33:59] um

[34:00] we sort of bring in a a number into the

[34:02] top row of the systolic array. That's

[34:04] what we can do in one clock cycle and

[34:06] then

[34:07] for like uh for Y consecutive clock

[34:11] cycles we're going to be bringing in the

[34:12] top row every time and then sort of

[34:14] shift all of the other rows down by one.

[34:16] And that keeps the the sort of the

[34:18] wiring that needs to come from this

[34:20] expensive register file uh only down to

[34:22] a factor of X rather than X

[34:23] >> I see. Okay. So uh there's two questions

[34:26] in terms of communication. There's like

[34:28] communication time and then there's

[34:31] communication bandwidth. Yes.

[34:33] >> And you're saying, since we're only

[34:35] going to be loading this in once, let's

[34:36] maximize let's minimize bandwidth cuz

[34:39] bandwidth equals diarrhea, Mhm. and

[34:42] let's just like load it in slowly over

[34:44] like smaller lanes cuz we're just going

[34:46] to keep this value in there for a while.

[34:48] >> Exactly. Exactly. Interesting. So, it's

[34:50] interesting to me that when we were

[34:51] talking last time about inference across

[34:54] many chips, the big high-level thing

[34:57] we're trying to optimize for is increase

[34:58] the amount of compute Mhm. per memory

[35:01] bandwidth, that is to say per

[35:02] communication. And here also we're

[35:04] trying to increase the amount of like uh

[35:07] actual multiplies or actual additions

[35:09] relative to transporting um information

[35:13] from registers to uh the logic. So, in

[35:16] both cases you're trying to maximize

[35:17] compute relative to communication. Yeah.

[35:19] Um this this shows up sort of all the

[35:21] way up and down the stack. Um this is

[35:23] sort of close to the bottom, um sort of

[35:25] like into to the gates. Um there's sort

[35:28] of a a version that's made even closer

[35:30] to the gates of just like um even the

[35:33] precision of number format that you

[35:35] choose to use. Um

[35:37] we saw that same effect. Uh there's like

[35:39] a square-cube law or a like squared

[35:41] versus linear term going on both in just

[35:44] purely the precision of this um ALU, but

[35:47] then also in in terms of the size of

[35:48] this the the matrix. Yeah.

[35:51] Interesting. So, um so this unit is sort

[35:54] of the next bigger unit. We had like the

[35:55] multiplication the circuit, and then uh

[35:58] on top of that we have um a like a

[36:00] pretty large array. I drew it as 2 by 2,

[36:02] but in like um for example, older TPUs

[36:05] they would describe it as 128 by 128 uh

[36:08] of of this circuit uh shown here. Um and

[36:12] this circuit ends up being um this is

[36:14] the the most efficient known um

[36:16] mechanism for uh

[36:18] circuit [clears throat] for implementing

[36:19] a matrix multiply.

[36:20] >> I see. So, we we've talked about um sort

[36:23] of

[36:24] it seems obvious that you should try to

[36:25] maximize compute relative to

[36:27] communication. Um,

[36:28] what are non-obvious tradeoffs that

[36:31] um

[36:32] actually you are you know, keep you up

[36:35] at night about what should we do X or

[36:36] should we do Y and it's non-obvious what

[36:38] the answer is. Yeah. So, I mean, I think

[36:41] most of the decisions in chip design are

[36:44] um sizing decisions. Mhm. And so,

[36:48] uh already in what we've drawn so far,

[36:50] um

[36:51] like so

[36:53] AI chips all have this this circuit, you

[36:56] know, they have a a systolic array and

[36:58] then somewhere near it a register file

[37:00] uh providing inputs and outputs.

[37:02] Um

[37:03] the two sort of

[37:05] uh

[37:06] like even within this scope sizing

[37:07] questions that you have are how big

[37:09] should I make my systolic array and how

[37:11] big should I make the

[37:13] um the the register file?

[37:15] The

[37:16] um so and and then the tradeoff for the

[37:18] size of this

[37:19] size of the systolic array, actually

[37:21] these two questions are coupled, is

[37:24] um

[37:25] one way to think of it is to say, I'm

[37:27] going to like have a budget for um

[37:31] uh how much what percentage of my chip

[37:32] area I want to spend on data movement.

[37:34] So, maybe I just say that I want this to

[37:36] be 10% and the systolic array to be 90%.

[37:39] Um and then sort of I can size my

[37:41] register file. Bigger register files are

[37:43] more flexible. I they allow me to run

[37:44] sort of more I can get more application

[37:47] level performance out. Um

[37:49] but

[37:51] uh but then they sort of take away from

[37:52] this area spent on the systolic array.

[37:54] >> Yeah, makes sense. I recently ran an

[37:55] essay contest where I asked people to

[37:57] write about what I consider to be some

[37:59] of the biggest open questions about AI.

[38:01] The submission window closed last week,

[38:02] so I used Cursor to create a couple of

[38:05] different interfaces to help me review

[38:06] the entries. One interface anonymizes

[38:08] submissions and hides unnecessary

[38:10] information. It lets me group responses

[38:11] by question, add notes, and record my

[38:13] scores. The other interface helps me

[38:15] review entrants who also want to be

[38:17] considered for the researcher role that

[38:18] I'm hiring for. The UI puts the

[38:20] applicant's essay right next to the

[38:22] resume and their personal website so

[38:23] that I can see everything at once.

[38:25] Cursor's harness is really good at

[38:27] helping these models see and improve

[38:28] their UIs. I watched it render these

[38:30] interfaces in the built-in browser, take

[38:32] screenshots, click through sections, and

[38:35] keep iterating. At this point, Cursor is

[38:37] where I do most of my work. Whether I'm

[38:38] reading and visualizing a bunch of

[38:40] research papers, or coding up an

[38:41] interface to review applications, or

[38:44] making flash cards for my Blackboard

[38:45] lectures, Cursor just makes it very easy

[38:48] for an AI to look at whatever I'm

[38:49] looking at and help me understand it and

[38:52] work with me on it. So, whatever you're

[38:53] working on, you should do it in Cursor.

[38:55] [music] Go to cursor.com/dwarkesh.

[39:00] Where does the clock cycle of a chip

[39:04] come in? What what determines

[39:06] what that is?

[39:07] Yeah. And what is a clock cycle of a

[39:08] chip? So, I guess at at baseline, it's

[39:11] it's sort of worth observing that chips

[39:12] are incredibly incredibly parallel,

[39:14] right? You've got a hundred billion

[39:15] transistors in a chip.

[39:16] A key thing that you need to do whenever

[39:18] you have like massive parallelism is you

[39:20] need to synchronize between the

[39:21] different parallel units.

[39:23] In software, typically you like you you

[39:26] have these very expensive

[39:27] synchronization methods like a mutex.

[39:29] So, one thread will finish what it's

[39:30] doing, it will in like it will

[39:33] grab a lock somewhere stored in memory

[39:36] and then notify the other thread that

[39:37] it's done.

[39:38] Um

[39:39] on chips, we take a very different

[39:40] approach and say that uh

[39:43] uh

[39:44] every like nanosecond or so, um all

[39:49] all circuitry in the chip will kind of

[39:50] pause for a moment and then and then

[39:51] synchronize every So, it actually

[39:53] synchronizes every single nanosecond or

[39:54] so. And so, that is the clock cycle.

[39:56] Mhm. Um the entire chip, typically all

[40:00] in sort of one fell swoop, um goes in

[40:02] lockstep to the next operation that

[40:04] happens.

[40:05] And so, what this looks like in

[40:06] circuitry is that

[40:08] um

[40:09] you will have

[40:14] it's typically drawn so, the clock is

[40:16] sort of mediated by registers, which are

[40:17] these storage devices that we've drawn

[40:19] elsewhere.

[40:20] Um and the way to think of it is that I

[40:22] have I have some storage uh which is

[40:24] storing like a bit, which might be zero

[40:26] or one.

[40:27] And then I have some sort of cloud of

[40:28] logic, which maybe is like

[40:31] this systolic array or this multiplier

[40:33] or something like that.

[40:35] Um

[40:36] and then I've got some and that's going

[40:38] to produce some output. So, my inputs

[40:40] I've got a bunch of inputs feeding to

[40:41] this cloud of logic. And then eventually

[40:43] later there's going to be some output

[40:45] register that this writes to.

[40:48] There is a global clock signal

[40:52] which drives all of these registers.

[40:55] And it says

[40:57] at a certain instance in time when the

[40:58] clock uh

[40:59] uh strikes um

[41:02] uh whatever value happens to be on this

[41:05] wire at that instant, that's what's

[41:06] going to get stored in there.

[41:08] And so, the

[41:11] the sort of the challenge here is like I

[41:14] I would like to have my clock speed run

[41:16] as fast as possible because if I can run

[41:18] at 2 GHz, I can get twice as many

[41:19] operations done per per per second than

[41:21] if I run at 1 GHz.

[41:23] Um

[41:24] but what that ends up meaning is that uh

[41:28] I'm very sensitive to the delay through

[41:29] this cloud of logic because uh

[41:32] any computation that is going to happen

[41:34] in here needs to sort of finish before

[41:36] the next clock cycle hits.

[41:37] >> Right.

[41:39] So,

[41:40] uh a major point of sort of optimization

[41:43] on on any chip then is is to make is to

[41:45] sort of make uh this delay um

[41:48] delay from here

[41:50] uh

[41:51] uh as short as possible. Interesting.

[41:53] And um

[41:55] is there ever because the constraint

[41:57] here seems to be that if you

[41:59] add

[42:01] too much logic, then you might risk

[42:03] missing the clock cycle. But if you

[42:05] don't add enough, then you're you know,

[42:07] leaving potential compute on the table.

[42:10] Um is there ever a situation where

[42:13] you're like you'd

[42:15] take a probabilistic chance that like a

[42:18] compute competition finishes and

[42:21] or is it just like no either it's going

[42:23] to finish by clock cycle or not? Yeah.

[42:25] In standard chip design you you imagine

[42:28] it such that I mean there is a

[42:29] probability but it's like many many

[42:31] standard deviations like way standard

[42:33] deviations out such that for all

[42:34] intensive purposes it is a reliable

[42:36] part. It will it will always meet the

[42:38] clock.

[42:39] Um

[42:40] there are some weird exceptions to that.

[42:42] There are clock domain crossings where

[42:44] you go from one clock to the other clock

[42:45] and then you actually do have to reason

[42:46] about this probability but

[42:47] >> Interesting.

[42:48] Uh in the main path uh you just like you

[42:51] imagine it such that you'll get there

[42:53] like

[42:54] 25% of the clock cycle in advance um

[42:56] uh so that it's very unlikely that

[42:58] >> And in this in this um

[43:00] the clock

[43:01] uh

[43:03] where the uh clock synchronize I guess

[43:07] where the registers are this is not

[43:09] something you determine as a chip

[43:10] designer. This is sort of just like an

[43:12] artifact of hey I want whatever sequence

[43:15] of logic and then the software you use

[43:18] to convert your Verilog into the thing

[43:20] you send to TSMC that just determines

[43:23] like hey in order to make this work

[43:25] you're going to kind of you got to put a

[43:26] register here here and here to make sure

[43:28] that there's a there's no one step that

[43:31] is like too long such that it makes the

[43:33] whole clock cycle of the entire chip

[43:35] longer than it has to be. Yeah. So this

[43:36] is actually a huge part of the work of

[43:38] of designing a chip actually is is is

[43:40] inserting them. So it is done in a

[43:41] combination of manually and

[43:43] automatically. Um so I mean like just

[43:45] like to show the very sort of dumb uh

[43:48] version of like what you can do here you

[43:50] can take this logic and split it in half

[43:51] and so like say actually instead of just

[43:54] one uh cloud of logic I'm going to um

[43:58] have two smaller clouds of logic um

[44:01] which do the same thing but split them

[44:03] up by a register.

[44:05] Right. Um feeding in like this.

[44:08] Um, and this is like

[44:10] like if you split it like in the middle,

[44:13] you can hit twice the clock frequency.

[44:14] That's great. You get twice the

[44:15] performance um, at the cost of this

[44:18] extra register. And so, at the cost of

[44:20] some more uh, storage. And so, stepping

[44:22] back, why do we need to synchronize the

[44:23] whole chip? Like if you have like if you

[44:25] imagine playing Factorio or something,

[44:27] there's no like global clock cycle. It

[44:28] just should is done when it's done.

[44:30] There's iron on the plate. You can take

[44:31] it if you want. Yeah. So, uh

[44:35] taking that analogy, um, the the the the

[44:37] thing that you need to be mindful of is

[44:39] if I've got two different paths through

[44:40] some logic. So, I have to do a

[44:42] computation like F here. Um, and then

[44:45] computation G here. Um, and then they're

[44:48] going to come and meet for computation H

[44:50] somewhere here.

[44:50] >> Yeah.

[44:51] Um

[44:52] and so

[44:53] uh

[44:56] there's going to be manufacturing

[44:57] variance here. Uh, in some chips F will

[45:00] take a little longer, maybe in some

[45:01] chips G will take a bit a little bit

[45:02] longer. And so, if I've got some signal

[45:04] that's propagating through here, um

[45:07] and the results from F and G have to

[45:09] sort of meet up at H. Um what can the

[45:12] the thing that can go wrong is that F

[45:13] can get there early and it meets like

[45:15] the previous value of G or the next

[45:17] value of G or something like that.

[45:18] >> And H needs to know when to start.

[45:20] Exactly. Like when when has this next

[45:22] iteration of

[45:23] And so, this explains why

[45:26] different chips made at the same process

[45:28] node, the same like TSMC uh,

[45:31] the the technology um can have different

[45:34] clock cycles. Like

[45:36] two chips made at three nanometer might

[45:37] have different clock cycles based on

[45:40] uh, whether they were able to optimize

[45:42] making sure that like there's no one

[45:44] critical path that is so long that um it

[45:48] slows down the whole chip's clock cycle.

[45:50] That's right. This optimization that I

[45:52] that I showed here, this is just the uh,

[45:54] this is sort of um pipeline register

[45:56] insertion it's called.

[45:57] >> Yeah. Um, we've inserted it in the

[45:58] middle of the pipeline or register here.

[46:00] Um, this is a sort of pure trade-off

[46:02] between clock speed and an area. Yeah.

[46:04] Um,

[46:06] this is the easy case.

[46:08] There is a harder case, too, which is

[46:10] um,

[46:12] uh, sort of I've drawn it as a pipeline

[46:13] of logic here, but in in other cases,

[46:15] you may have some, uh,

[46:18] uh, some calculation which actually

[46:20] feeds back in on itself. So, uh, it runs

[46:22] some function F and then writes back to,

[46:25] uh, itself like this.

[46:27] So, for example, this might be this

[46:28] addition, like you've got some number

[46:30] that you're adding to into every clock

[46:31] cycle, um, and so this this could be

[46:34] like a, um, this could be like plus, uh,

[46:37] where adding in some number every clock

[46:39] cycle.

[46:41] Mhm. So, this like this little circuit,

[46:43] uh,

[46:44] um, it it essentially it's just going to

[46:46] sum all of the numbers that get

[46:47] presented on on different clock cycles.

[46:49] And the challenge is, if this plus takes

[46:51] too long,

[46:53] what can I do? If I like split it in if

[46:55] I try and put a pipeline register like

[46:57] right in the middle of it,

[46:59] um, like like here in the middle of it,

[47:02] um,

[47:04] this will end up changing the

[47:06] computation that's done. Instead of

[47:07] forming a running sum of everything that

[47:09] comes here,

[47:11] I will actually have two different run-

[47:12] running sums. I'll end up having a

[47:13] running sum of the even numbers and a

[47:14] running sum of the odd numbers.

[47:17] So, sort of this constraint where I

[47:19] have, um, a loop in my logic which all

[47:22] chips have somewhere, um, this is

[47:24] actually the thing that is the hardest

[47:25] thing to to to address and and sets the

[47:27] clock cycle.

[47:28] I don't understand why it'd be a problem

[47:29] to have that or I'm not sure even what

[47:30] it would mean to have a register there.

[47:33] Like cuz it's is it sort of, uh, atomic

[47:35] operation, right? Yeah, well, so plus is

[47:37] not really atomic. Uh, like I think, um,

[47:39] As you just demonstrated.

[47:40] >> Yeah, yeah, it took a whole lot of work

[47:41] to do a summation. And so, like you can

[47:43] take the early parts of that work and

[47:45] then and then stick a register in the

[47:46] middle and then do take the late parts

[47:47] of that work. Got it. Okay. Yeah. Uh,

[47:49] and I guess it's then up to so TSMC

[47:52] offers a, uh, PDK which sets size, hey,

[47:55] here's the primitives of,

[47:56] um,

[47:58] logic that we can grant you in the in

[48:00] the chip, and it's up to them to

[48:01] determine that no no primitive is bigger

[48:05] than like the clock cycle they're hoping

[48:07] a process node targets. But, other than

[48:10] that, is there like what further

[48:12] optimize Can't you just say like, "Hey,

[48:14] here's all the primitives from TSMC,

[48:16] and

[48:17] if

[48:18] keep adding registers in between the

[48:20] primitives as much as is needed until

[48:21] you get to your desired clock cycle?"

[48:23] Yeah, as a logic designer, like the chip

[48:26] architect set the clock cycle. So, just

[48:28] for for one example,

[48:30] the primitives you get from TSMC are on

[48:33] the order of like AND gates or full

[48:34] adders.

[48:34] >> Mhm. Um

[48:36] they depends a lot on voltage and

[48:39] frequen- and and and which library you

[48:41] choose and so on. But, generally, they

[48:44] you can typically have about like 10 or

[48:46] 20 or 30 of these in in a in a clock

[48:48] cycle sequentially. So, these primitives

[48:50] are very very fast, like 10 picoseconds

[48:52] or something like that. Uh and so, as a

[48:55] logic designer, I mean, like

[48:57] [clears throat] in principle, if you

[48:58] literally just had like um like register

[49:01] and then AND gate um it kind of in a

[49:04] loop like that, you could get an

[49:05] insanely fast clock speed, like more

[49:07] than four or five six gigahertz or

[49:08] something like that.

[49:09] Um

[49:11] but, if you take this this like really

[49:13] sort of like simple circuit, um and you

[49:17] look at the area you're spending here,

[49:18] like um this is maybe like one I mean,

[49:21] this is this is called one gate

[49:22] equivalent um in size, so like unit of

[49:25] one in area, and this thing is like unit

[49:27] of eight in area or something like that.

[49:28] And so, like this is just almo- again,

[49:31] almost all of your cost is being this

[49:33] like synchronization or communication

[49:34] cost compared to the actual logic. And

[49:37] so, uh

[49:38] so, this would be a case where you've

[49:40] gone too far. You've made your clock

[49:41] speed really really fast at the cost of

[49:43] spending almost all of your area on reg-

[49:45] on on on pipeline registers.

[49:47] Interesting. So, what you're hinting at

[49:49] is a dynamic where you can have really

[49:51] fast clock speed, but you're not getting

[49:53] that much work done. Yeah. Yeah. Yeah.

[49:55] Um and so you can have like low latency,

[49:58] but

[49:59] but low bandwidth.

[50:00] Or throughput, rather. Yeah. The it

[50:03] hurts your throughput, in fact, because

[50:04] like uh like the the throughput of your

[50:06] chip you can think of as the product of

[50:09] um how much I can get done per clock

[50:10] cycle, which is based on this area

[50:12] efficiency thing, times how many clocks

[50:14] I get per second.

[50:15] >> This is this is actually so similar to

[50:16] the thing we were discussing last time

[50:17] about like batch size,

[50:20] um where if you have a low batch size,

[50:23] then you can anyone user can receive

[50:25] their

[50:26] uh next token really fast, but the total

[50:29] number of tokens that are processed in

[50:30] say an hour will be kind of lower than

[50:33] it could otherwise be. Yeah, exactly.

[50:34] You get you get less parallelism out if

[50:36] you if you drive your clock speed up

[50:38] really high.

[50:39] Language models are starting to compete

[50:40] against the best human forecasters. I

[50:42] sat down with two senior Jane Streeters,

[50:44] Ron Minsky and Dan Pontecorvo, and

[50:46] asked, "At some point, does AI just do

[50:49] what Jane Street does?" There's a world

[50:50] that we should take seriously where, you

[50:52] know, we're going to build large

[50:54] language models or some other AI systems

[50:56] that are like strictly smarter than all

[50:58] humans on the planet and more capable at

[51:00] all cognitive tasks. Trading in

[51:02] particular feels to me as like a kind of

[51:05] AGI-complete, sort of like NP-complete,

[51:08] because at the end of the day,

[51:10] trading involves figuring out what

[51:13] things are worth, which means making

[51:15] predictions about the future. Jane

[51:16] Street isn't betting against AI. They

[51:18] just signed a $6 million computer chip.

[51:20] But Ron's view is that the edge keeps

[51:22] moving. I have never been more desperate

[51:24] to hire more engineers and more traders

[51:26] than I am today. You know, you have the

[51:28] usual thing of like the other hard parts

[51:29] that we don't yet know how to automate.

[51:31] Well, that ends up being where the

[51:32] competitive edge lies. You can find

[51:34] these open positions and watch the full

[51:36] interview at janestreet.com/thorkast.

[51:40] Okay, so I remember talking to um an

[51:42] FPGA engineer at Jane Street, Clark, um

[51:46] who actually helped me prep for your the

[51:48] the previous interview we did together.

[51:50] And he was explaining why they use FPGAs

[51:52] and, um,

[51:53] uh,

[51:55] I I imagine that for high-frequency

[51:56] trading,

[51:58] the throughput is less important than

[52:00] latency. And so having very specific

[52:02] control over the clock cycle in a

[52:03] deterministic way is the most important

[52:05] thing. I um

[52:08] Maybe it'd be interesting to talk about

[52:09] how you

[52:12] why you can't just achieve that with an

[52:13] ASIC or why a FPGA is the

[52:16] uh

[52:17] Yeah, why you might use an FPGA to

[52:20] have deterministic clock cycles for like

[52:22] high-frequency trading. Yeah. So, I

[52:24] mean, firstly, like, let's consider the

[52:26] sort of the business case for an FPGA

[52:27] versus an ASIC. Um

[52:31] FPGAs and ASICs

[52:33] use largely the same sort of uh

[52:35] conceptual model, which is that I have,

[52:38] um, a series of gates built from ANDs,

[52:40] ORs, XORs, those like very small

[52:42] primitives, um, connected together with

[52:44] a fixed clock cycle, um,

[52:47] uh, and connected together with wires,

[52:48] uh, that are running in a fixed clock

[52:49] cycle. So, anything you can express in

[52:52] an FPGA, you can express in an ASIC,

[52:53] too. And it'll be about an order of

[52:55] magnitude cheaper and and like, uh,

[52:57] better energy efficiency on an ASIC than

[52:59] an FPGA.

[53:01] Uh, the trade-off is that the first FPGA

[53:04] costs you $10,000, whereas the first

[53:07] ASIC you make costs you $30,000 because,

[53:09] uh, because of, uh, it requires an

[53:11] entire tapeout. So, sort of the business

[53:13] use case for an FPGA would be that I

[53:15] want something that has this very

[53:17] deterministic latency and and and and

[53:19] fast run time and high parallelism,

[53:21] uh, but

[53:23] um

[53:25] uh, but, you know,

[53:26] I I I'm going to change it very

[53:28] frequently. I change what I do every

[53:29] month or something like that. Uh, and so

[53:30] that I don't want to pay the tapeout

[53:32] cost every time.

[53:33] Now, how does an FPGA actually implement

[53:36] It's sort of like it emulates the ASIC

[53:38] programming model,

[53:39] but in in in a fixed piece of hardware.

[53:41] And so, how does that work actually?

[53:43] So,

[53:46] what it has at the at the base is um

[53:50] it's got a it's got a um

[53:52] it's got the two components we just

[53:53] talked about. It's got these registers

[53:55] as as uh as storage devices.

[53:58] And then it's got uh these

[54:00] these are called LUTs, look-up tables.

[54:02] Um

[54:04] uh which actually provide all of the

[54:05] gates. Mhm.

[54:07] So, and then and then we're going to see

[54:09] even the sort of the third component, we

[54:11] we then have like

[54:13] sort of a a swarm of these um registers

[54:16] and LUTs.

[54:23] Um and all of these are available and

[54:25] then they're connected by sort of this

[54:27] big set of um sort of

[54:29] uh muxes. So, in front of uh every

[54:32] single one of these we've got uh

[54:35] something like one of these muxes which

[54:37] selects one input from everywhere else.

[54:39] Um

[54:40] sort of selecting from all of these for

[54:42] things. Um

[54:43] you know, we've got a whole bunch of

[54:44] different options feeding into into into

[54:49] into all of these things.

[54:51] So, um

[54:53] so, what what what this allows is

[54:55] essentially a

[54:57] uh when I program my FPGA, I can say

[54:59] that I'm going to take all of these

[55:01] components and I'm going to sort of

[55:02] superimpose on top of this a particular

[55:05] wiring which like goes through

[55:07] this LUT and then feeds into this LUT

[55:10] and then goes to this register and then

[55:12] feeds into this LUT or something like

[55:13] that. Mhm.

[55:15] So, what I've drawn in orange is like

[55:18] how you like FPGA means field

[55:20] programmable gate array. This is the the

[55:22] orange is what has been programmed in

[55:23] the field, whereas the white is all of

[55:25] the wires that must exist in the FPGA in

[55:27] order to actually take to make the

[55:29] device in the first place. What does it

[55:30] mean to be programmed in the field? Like

[55:32] programmed in the field. So, like the

[55:34] device has been deployed in a data

[55:35] center, it's sitting in the field and

[55:36] then you can

[55:37] come and program Not field as in

[55:38] electric field, not field as in like out

[55:40] there in the world field. Okay.

[55:42] Um and so if I see look at the

[55:45] how the

[55:47] the field programming comes out of the

[55:48] first

[55:50] lookup table and goes into the second

[55:52] one.

[55:54] How is it

[55:55] How Yeah, how? Like like where where

[55:57] where are the wires that make that

[55:58] happen? I guess. Yeah. So I I I got a

[56:00] little bit like lazy in drawing all of

[56:02] these Every single device here has a mux

[56:05] sitting in front of it. Um uh

[56:08] uh which can select from all of the like

[56:10] nearby um like uh

[56:14] circuits that are available. Yeah. Um

[56:16] And so uh the actual configuration of

[56:19] the FPGA is like

[56:22] amounts to it it is the mux control. So

[56:24] like we in this mux here we have the

[56:26] data inputs and then we have like the

[56:29] control that selects. And so like

[56:31] there's a little storage device sitting

[56:33] next to every single one of these um

[56:37] uh muxes saying this is where you're

[56:40] going to source your input from. Right.

[56:42] Um and so programming it consists of

[56:45] like configuring every single one of

[56:46] these muxes. So that makes sense. What

[56:48] is happening inside of the lookup table?

[56:49] Yeah. So the purpose of the lookup

[56:52] table, so it's going to also have a

[56:53] little bit of control feeding telling it

[56:55] what to do as well. The purpose of the

[56:57] lookup table is to function

[56:59] to be able to configurably take the role

[57:01] of an AND gate, OR gate, XOR, any of

[57:04] those different gates.

[57:06] So there's many ways you could consider

[57:08] doing that. Um

[57:10] the way it is done in sort of

[57:12] traditional FPGAs is to say um

[57:15] it will support

[57:19] So it will be a a lookup table will be

[57:23] um

[57:24] it will have four bits of input,

[57:26] one bit of output.

[57:30] How many different functions are there

[57:31] from four to one bit? There are 16

[57:33] different functions.

[57:34] Um, and so,

[57:36] uh, you can actually just tabulate this

[57:38] as like 16 different like 16 different

[57:41] numbers. You've got a table of 0111001.

[57:46] 16 entries.

[57:49] And so, what it does is, um,

[57:51] this this table is

[57:53] stored in this blue configuration bit,

[57:55] um,

[57:56] and then, uh,

[58:00] it views these four bits as binary,

[58:02] looks up the relevant row of the table,

[58:03] and emits that bit.

[58:07] So, this is a truth table view of of

[58:09] lookup tables, essentially. Okay, so the

[58:11] lookup table, if you think about it an

[58:13] AND gate, OR gate, NOR gate, XOR gate,

[58:16] these are all like take as input Those

[58:18] are like two input functions.

[58:20] >> Yeah. Uh, so, sometimes we have like

[58:21] more complicated like a three input

[58:23] function would be a three-way XOR

[58:24] >> Right.

[58:25] >> or a four-way XOR. And in this case, how

[58:27] many it just depends how how big it is,

[58:29] but Typical size for LUTs is four input,

[58:32] um, which is sort of just a sweet spot

[58:34] between, um,

[58:36] there's another computer communication

[58:37] trade-off like here. Like, if it if it

[58:39] has too few inputs, then you need to use

[58:41] more LUTs.

[58:41] >> Yeah.

[58:42] >> If it if it has too many But, basically,

[58:43] the lookup table is like a truth table.

[58:45] It's a truth table, yeah. And with a

[58:47] truth table, you can program in any gate

[58:49] you want.

[58:49] >> That's right.

[58:50] And so, instead of a lookup table, just

[58:52] think like a programmable gate. That's

[58:54] right.

[58:55] And so, I mean, one of the things you

[58:56] can do here is you can see why where the

[58:58] rule of thumb that a FPGA is like an

[59:00] order of magnitude more expensive than

[59:02] an ASIC comes from,

[59:03] um, is to count how many gates would be

[59:06] inside this lookup table. Yeah.

[59:09] So, we can view this lookup table

[59:11] essentially as one of these muxes, um,

[59:13] and so, it is a mux with has to select

[59:16] between 16 different values.

[59:18] Uh, and so, it is a mux with,

[59:21] uh, sort of

[59:23] n equals 16, uh, options.

[59:28] P equals one bits.

[59:31] And so,

[59:32] what we saw way earlier is that, um,

[59:36] it

[59:36] this circuit costs like n times p many

[59:38] gates. And so, it's like, um, so it

[59:42] costs like n times p equals 16 and

[59:45] gates.

[59:47] And also 16 ORs. This circuit being the

[59:50] mux. Uh, yeah, exactly. The mux. The mux

[59:53] is the core of the

[59:53] >> The mux that goes into the lookup table.

[59:55] Uh, so the lookup table itself you can

[59:57] think of as being actually a big mux

[59:59] that like selects from all 16 rows down

[01:00:03] to one output. Yeah, okay. I see. That

[01:00:04] that is the lookup table. But the but

[01:00:06] the way you've drawn it here there's

[01:00:07] like a mux and then a lookup table.

[01:00:08] >> It's it's muxes all the way down. So, I

[01:00:09] mean there's a there is a second mux

[01:00:11] that is inside here. This mux is is this

[01:00:13] mux. Got it. Okay. And then the the

[01:00:15] other mux is just saying,

[01:00:18] uh, where it came from in this sort of

[01:00:20] mess of of of gates.

[01:00:21] >> Right. And then the second mux is, okay,

[01:00:23] now you have one value, but that value

[01:00:25] is still,

[01:00:26] um,

[01:00:28] still a four-bit value.

[01:00:29] Yeah, so I've selected four bits from

[01:00:31] the soup.

[01:00:32] >> Um, and then and then I use those four

[01:00:34] bits to select which entry in the lookup

[01:00:36] table

[01:00:36] >> Yeah. uh, I'm going to use. Right. Okay.

[01:00:38] And it's just like so like, suppose in

[01:00:40] the first mux there's like eight nearby

[01:00:44] eight you're you're pulling from eight

[01:00:45] nearby registers as input.

[01:00:47] >> Mhm. Um, [clears throat]

[01:00:48] and so that's like a total of like 32

[01:00:50] bits going in.

[01:00:51] Um, and then out of that

[01:00:54] uh, four bits come out. Those four bits

[01:00:57] go into the second mux which is inside

[01:00:59] the lookup table. So, actually I would

[01:01:00] say in yeah, in this case these

[01:01:02] registers are single bit registers. So,

[01:01:04] if there are eight like eight nearby

[01:01:06] registers and lookup tables, then I have

[01:01:08] eight bits total coming in in in in

[01:01:10] nearby.

[01:01:11] I select from eight down to four

[01:01:13] different values. So, there's actually

[01:01:14] like four different muxes, one

[01:01:16] associated

[01:01:18] with each of these input like

[01:01:21] little mux associated with each of input

[01:01:23] bits. Each of them is selecting one out

[01:01:25] of eight. And where are those eight

[01:01:26] coming from?

[01:01:28] Nearby registers and other lots. And

[01:01:30] each register is one bit? Yes. Yeah.

[01:01:33] And so I guess you AMD or whoever makes

[01:01:35] these um

[01:01:36] uh FPGAs still has to be opinionated

[01:01:39] about

[01:01:40] what uh

[01:01:41] what registers are connected to which

[01:01:42] registers.

[01:01:44] Uh and then you can program in the

[01:01:47] actual gates, but they added wire in the

[01:01:48] connect like the communication topology,

[01:01:51] right? Yeah, so there's there's sort of

[01:01:53] like you get flexibility

[01:01:55] in a local grain thing. There's a sort

[01:01:57] of nearby neighborhood where you can

[01:01:58] select from.

[01:01:59] Um but then more grossly, like more

[01:02:01] coarsely, longer distance con-

[01:02:03] connections they they form an opinion

[01:02:05] on. Yeah.

[01:02:05] >> Right.

[01:02:06] And the reason it's 10x lower is why? So

[01:02:09] if you look at the cost of like building

[01:02:10] this lookup table, it's like

[01:02:13] 32 gates. Yeah.

[01:02:15] And then it can give me the equivalent

[01:02:17] of like what's one an interesting thing

[01:02:19] I can do here. I can do a four-way AND

[01:02:20] gate. And so that's like I I'm using 32

[01:02:23] gates of lookup table to sort of

[01:02:25] implement like a four-way AND means like

[01:02:29] what is a four-way AND? I would do like

[01:02:31] AND AND and then AND of AND.

[01:02:35] So to input like this is a a circuit

[01:02:37] that I could implement in an ASIC

[01:02:38] directly using these three AND gates.

[01:02:40] Um but using a LUT, I can also implement

[01:02:43] it, but it's going to take like these 32

[01:02:45] gates instead of three.

[01:02:46] >> Right.

[01:02:46] And so the overhead is really coming

[01:02:48] from the uh like um

[01:02:50] the fact that the lookup table, the mux

[01:02:52] in the lookup table

[01:02:54] is

[01:02:57] there's a more concise way to describe a

[01:02:59] truth table

[01:03:01] than listing out every single possible

[01:03:03] combination of inputs.

[01:03:04] >> Yes.

[01:03:05] Uh but which is just to like write out

[01:03:07] the gate. Yeah, like to to like place

[01:03:10] down the polysilicon and the and the

[01:03:11] wires and so on.

[01:03:13] >> Interesting.

[01:03:14] One important point he made to me is

[01:03:15] that the reason they prefer FPGAs to

[01:03:17] CPUs is because they get deterministic

[01:03:20] clock cycles. They know when a packet

[01:03:22] will come in and go out.

[01:03:24] Um why is it not a guarantee in CPUs?

[01:03:27] Mhm.

[01:03:28] So, you can actually design a CPU that

[01:03:30] has deterministic latency as well.

[01:03:32] Um

[01:03:33] uh And in fact, like the the processors

[01:03:36] that are inside a lot of AI chips

[01:03:38] actually also have deterministic

[01:03:39] latency, too. Groq has advertised this.

[01:03:41] TPUs have that in the core as well.

[01:03:43] Um

[01:03:46] The challenge is getting sort of

[01:03:47] deterministic latency and high speed at

[01:03:49] the same time. And so, um

[01:03:52] where does the non-determinism in

[01:03:54] latency come from? Um

[01:03:56] Non-deterministic latency um comes from

[01:03:59] specific design choices in a CPU.

[01:04:01] Um

[01:04:02] uh It's actually possible to remove

[01:04:04] those design choices and make a CPU that

[01:04:06] has deterministic latency. Um

[01:04:09] Those are not very attractive in the

[01:04:10] market, and so people don't make those

[01:04:12] CPUs anymore. Um but

[01:04:15] but but but actually in some sense, like

[01:04:17] uh deterministic latency is maybe a sort

[01:04:19] of a simpler designing starting point.

[01:04:22] Um and then and then like some chip

[01:04:24] designers have added uh things in to to

[01:04:26] be non-deterministic. To take a concrete

[01:04:28] example of that, um

[01:04:30] the probably the most important example

[01:04:34] is on a CPU just like the the the uh

[01:04:37] the CPU cache itself.

[01:04:39] So,

[01:04:40] uh

[01:04:43] in a CPU, you have the the CPU This is

[01:04:46] This is the CPU die itself, and then

[01:04:48] there's a memory uh off on the side.

[01:04:50] This is the DDR memory um off on the

[01:04:53] side. And then you have a cache system

[01:04:55] here inside it. This is the cache.

[01:05:01] Uh that sort of remembers recent

[01:05:04] accesses to DDR and and and and stores

[01:05:06] them. And so,

[01:05:08] uh when when I'm running through my CPU

[01:05:10] instructions, uh every time I uh I have

[01:05:12] an instruction that accesses memory, it

[01:05:14] first checks in cache uh was was the

[01:05:16] data stored in cache, and then if not,

[01:05:18] it goes uh fetches out to DDR. Yep.

[01:05:22] This is a huge optimization. The cache

[01:05:24] is like two orders of magnitude faster

[01:05:25] than the DDR. Um if you never

[01:05:28] uh if if if you never like use the

[01:05:30] cache, like all basically all programs

[01:05:32] would run 100 times slower. So, uh the

[01:05:34] presence of a cache is absolutely

[01:05:35] necessary for a CPU to to run at

[01:05:37] reasonable speed. Um

[01:05:39] but

[01:05:40] whether or not you get a cache hit is

[01:05:42] dependent on the sort of ambient

[01:05:43] environment of the CPU. Like what other

[01:05:46] programs are running, what has run

[01:05:47] recently, what is the random number

[01:05:48] generator inside the cache system doing,

[01:05:50] and so So, that is a big source of

[01:05:51] nondeterminism in in the runtime of a

[01:05:53] CPU.

[01:05:54] >> Yeah.

[01:05:55] Um

[01:05:56] so this is sort of the memory system for

[01:05:57] a CPU.

[01:05:59] Um

[01:06:01] The The big thing that you can do

[01:06:02] differently is uh

[01:06:04] instead of having the hardware say, "I'm

[01:06:06] going to like read read memory and then

[01:06:08] decide the hardware decides whether or

[01:06:10] not it comes comes comes from cache or

[01:06:11] not." You can actually bake this in this

[01:06:13] decision into software.

[01:06:15] So, uh a different design philosophy is

[01:06:20] is to

[01:06:21] um So, and you see this in maybe, for

[01:06:23] example, TPUs.

[01:06:25] Um the the TPU instead has I mean, I'll

[01:06:27] draw the same diagram, but I'll call it

[01:06:29] a scratchpad. And so, the the main

[01:06:31] difference is um

[01:06:34] So, this would be like a TPU, and then a

[01:06:37] HBM in this case rather than DDR, but

[01:06:39] it's still an off-chip memory.

[01:06:41] Um

[01:06:43] and instead of like the software saying

[01:06:45] first access like memory and then the

[01:06:47] hardware decides, um you've got some

[01:06:48] instructions that that go here. This is

[01:06:50] like one kind of instruction, and then a

[01:06:52] totally different kind of instruction

[01:06:53] that goes to HBM. Yeah.

[01:06:55] And so, this style is is generically

[01:06:57] known as scratchpad um

[01:07:00] instead of cache. The key distinction

[01:07:02] being that you have like one kind of

[01:07:04] instruction that says read or write

[01:07:05] scratchpad,

[01:07:07] and a totally different instruction that

[01:07:08] says read or write HPM.

[01:07:10] So, is scratchpad being the cache? Yeah,

[01:07:12] this this thing here is the is the

[01:07:14] scratchpad. Um Just to be clear. So,

[01:07:16] stepping way back, people say computers

[01:07:19] have the quote unquote John

[01:07:21] uh uh uh Von Neumann architecture where

[01:07:24] there's this

[01:07:25] uh serial processing of information. And

[01:07:28] maybe just because we've been talking

[01:07:29] about a parallel accelerators, but I

[01:07:31] just don't like the FPGA is super

[01:07:33] parallel.

[01:07:34] >> Yeah. Yeah.

[01:07:36] the

[01:07:37] AI the kinds of AI accelerator TVUs are

[01:07:39] super parallel. Um even CPUs are super

[01:07:42] parallel if you think about all the

[01:07:44] cores they have. And so, is it actually

[01:07:46] like in what sense is modern hardware

[01:07:48] actually the Von Neumann architecture?

[01:07:49] Is that actually a fair way to describe

[01:07:51] modern hardware?

[01:07:52] Um I think it's a fair way to describe

[01:07:53] CPUs. Like just the amount of

[01:07:55] parallelism like on a CPU the amount of

[01:07:57] parallelism you get is about 100 cores

[01:07:59] times maybe like 16-way vector units.

[01:08:01] So, 106 uh about 1,000-way parallelism

[01:08:04] on a CPU.

[01:08:05] Yeah, I I is it One question is what is

[01:08:08] the There is a die that is being used

[01:08:10] for the CPU. Mhm. And if there's fewer

[01:08:13] threads

[01:08:15] just as a matter of like transistor uh

[01:08:19] voltages or like switching on and off is

[01:08:21] it just that there's like literally one

[01:08:22] control flow like a small part of the

[01:08:24] die where like voltages are switching on

[01:08:25] and off or like

[01:08:27] in what How how do you actually occupy

[01:08:29] the

[01:08:30] die area of a CPU if there's only as

[01:08:33] opposed to

[01:08:34] >> If there's so few cores like what are

[01:08:35] you Am I spending a lot of die in there?

[01:08:37] The cores are just much bigger and more

[01:08:38] complicated.

[01:08:40] Um

[01:08:40] so

[01:08:41] uh I mean like so I guess we we should

[01:08:44] compare like a CPU core which takes up

[01:08:46] 1/16 1/100 of the die to like I mean to

[01:08:49] a lot like a lot is just only these 16

[01:08:52] gates. So, like it's clear why there are

[01:08:54] so many more lots in an FPGA than cores

[01:08:56] in a CPU.

[01:08:57] Um

[01:08:58] but then sort of maybe the the like why

[01:09:02] are there more CUDA cores for example

[01:09:04] than than than CPU cores. I think would

[01:09:06] be like why a like what's the difference

[01:09:08] between a CPU and a GPU or something

[01:09:10] like that would would would be a a big

[01:09:11] difference.

[01:09:12] Um

[01:09:13] inside the CPU you have um so one big

[01:09:17] use of so the sort of the top unit uses

[01:09:19] of area in the side of CPU are the

[01:09:21] cache.

[01:09:22] Um

[01:09:23] very little is actually the ALUs. Um

[01:09:26] like mostly it's like these register

[01:09:28] files rather than the logic units. Um

[01:09:31] and then the

[01:09:33] uh

[01:09:33] both of these things have equivalents in

[01:09:35] in a GPU and so that's not a big

[01:09:36] difference. But the thing that does not

[01:09:38] have an equivalent in a GPU is the um

[01:09:41] sort of this branch predictor. And so

[01:09:43] there is a whole big area in the CPU

[01:09:46] which is uh

[01:09:47] sort of uh just

[01:09:49] a whole bunch of predictors that are

[01:09:52] saying um when will my next branch be

[01:09:54] and where the where's the branch target

[01:09:56] for that.

[01:09:57] And so uh stripping a lot of that out uh

[01:10:00] as as well as sort of making these

[01:10:02] register files tighter in in a sense um

[01:10:04] is

[01:10:06] uh driving a lot of where the like GPU

[01:10:08] gains are coming

[01:10:08] >> is the purpose of the branch predictor

[01:10:10] to like execute both branches at once or

[01:10:12] what does it do? So the issue is that

[01:10:14] when I've got a series of instructions

[01:10:15] like um like

[01:10:17] instructions instructions instructions

[01:10:19] instructions I

[01:10:20] uh if I have a branch like here if this

[01:10:22] instruction is branch

[01:10:24] um

[01:10:27] the the actual processing step of

[01:10:29] processing an instruction um

[01:10:31] takes a really long amount of time. It

[01:10:32] takes like maybe five nanoseconds or

[01:10:34] something like that. Um

[01:10:36] so like uh the time to actually notice

[01:10:38] that I've got a branch and then like

[01:10:41] uh

[01:10:42] uh

[01:10:43] evaluate the boolean whe- whether it's

[01:10:45] true and then and then update the

[01:10:48] program counter to the new target and

[01:10:50] then read from the instruction memory

[01:10:51] for that. Um that that could take like

[01:10:53] like actually five nanoseconds to to

[01:10:56] finish. And so in reality this may

[01:10:57] finish somewhere down here.

[01:10:59] Um

[01:11:01] I don't want to like but like I want to

[01:11:04] run a clock speed that is much faster

[01:11:05] than what 5 nanoseconds allows. Like

[01:11:07] like 5 nanoseconds is 200 MHz clock

[01:11:09] speed. I would like to run at 1 or 2 GHz

[01:11:11] or something like that. And so

[01:11:14] um

[01:11:15] I need to run other instructions while

[01:11:17] the branch is is being evaluated.

[01:11:20] And so I

[01:11:22] like I really just want to keep running

[01:11:23] the following instructions that happen

[01:11:25] after me.

[01:11:26] Um but that might have been wrong. Like

[01:11:28] if the branch ended up being taken, then

[01:11:30] I need to know that instead of it

[01:11:31] evaluating these instructions, I

[01:11:32] actually need to like jump to wherever

[01:11:34] the target is and and run run these

[01:11:36] instructions instead.

[01:11:37] And so the purpose of the branch

[01:11:38] predictor is like like genuinely to

[01:11:41] predict based on like before you even

[01:11:43] get to this instruction to be like five

[01:11:45] cycles earlier to predict there was

[01:11:47] going to be a branch that's going to

[01:11:48] happen. So if I think about how the

[01:11:50] brain works versus what you're

[01:11:51] describing here at a high level, the

[01:11:53] differences might be that

[01:11:55] while you can do structured sparsity in

[01:11:59] these accelerators and then save

[01:12:00] yourself some

[01:12:02] um

[01:12:03] area that you would have otherwise had

[01:12:04] to dedicate to these gates, in the brain

[01:12:08] there's unstructured sparsity. You know,

[01:12:10] any neuron can connect to any other

[01:12:12] neuron and not in like ways where they

[01:12:13] have the column lined or whatever.

[01:12:15] >> Yeah. Um

[01:12:17] then there's the fact that memory and

[01:12:18] computer co-located. I guess you could

[01:12:21] say in a way the memory and computer

[01:12:22] co-located on This this is exactly the

[01:12:24] co-location in some sense of the memory

[01:12:26] and computer.

[01:12:26] >> That's right.

[01:12:27] Yeah, so maybe maybe that actually isn't

[01:12:28] a big difference. And the other maybe a

[01:12:31] big

[01:12:31] a big difference is that the clock cycle

[01:12:33] on the brain is much slower

[01:12:35] than on

[01:12:37] um

[01:12:38] on computers.

[01:12:40] And partly that's to preserve energy

[01:12:42] because the faster the clock cycle, the

[01:12:45] bigger the voltage

[01:12:47] uh needs to be in order to

[01:12:49] identify

[01:12:50] uh for the signal to settle and to like

[01:12:52] identify what state a transistor is in.

[01:12:54] >> right. That's right.

[01:12:55] I don't know if you have other

[01:12:56] high-level takes about like how

[01:13:01] any commentary on what you know, what

[01:13:03] the brain might be doing versus how

[01:13:04] these chips work. Yeah, I mean, so so

[01:13:06] let's take the clock speed one first

[01:13:08] actually. Yeah. Um

[01:13:12] the

[01:13:14] clock speed is quite high on on a chip

[01:13:16] because that I mean drives higher

[01:13:18] throughput. Um

[01:13:20] like

[01:13:21] when we compare a like a GPU running

[01:13:24] some some workload, it's running batch

[01:13:26] size 1,000 or something like that.

[01:13:27] Whereas like the brain is not running

[01:13:29] batch size 1,000. There's only one of

[01:13:30] me. And so you could sort of imagine

[01:13:33] saying we'll take a GPU and like instead

[01:13:36] of running at a gigahertz run at a

[01:13:37] megahertz or something like that. And

[01:13:38] and that would start to look maybe a

[01:13:39] little bit more like like um

[01:13:43] sort of equivalent things that you're

[01:13:44] talking about in the brain.

[01:13:45] Um

[01:13:47] There is

[01:13:48] in the way that silicon works, um

[01:13:51] there are

[01:13:52] like that does not give you an 1,000 x

[01:13:54] advantage in energy efficiency.

[01:13:56] So

[01:13:57] um

[01:13:58] what it ends up looking like is you can

[01:14:02] uh like you sort of just end up running

[01:14:05] the circuit once to stabilization and

[01:14:08] then it'll sit idle for a long period of

[01:14:09] time.

[01:14:10] Um it doesn't consume a lot of energy

[01:14:11] while it's sitting idle because uh most

[01:14:14] of the energy is consumed in in sort of

[01:14:16] toggling uh bits from zero to one and

[01:14:18] back. Um

[01:14:20] uh

[01:14:22] So actually let's let's talk about the

[01:14:23] energy consumption uh of of a circuit

[01:14:25] like this.

[01:14:26] The way to think of uh a bit being

[01:14:28] stored is you've actually deposited some

[01:14:30] charge in in a capacitor somewhere

[01:14:32] sitting somewhere in in the chip uh

[01:14:33] implicitly. So it becomes charged when

[01:14:36] it is um bit becomes a one and then it

[01:14:39] becomes discharged when it next goes to

[01:14:40] a zero. Um and that cycle of like

[01:14:42] charging the capacitor and then dumping

[01:14:44] that charge out to ground, that is where

[01:14:46] the energy is consumed. Yeah. This is

[01:14:48] called the dynamic or switching power uh

[01:14:50] This is most of the energy consumption

[01:14:51] of a chip. There is some other energy

[01:14:53] consumption just coming from the fact

[01:14:54] that

[01:14:56] um

[01:14:57] insulators aren't perfect insulators,

[01:14:59] but we'll just discard that. Most of the

[01:15:01] energy consumption actually comes from

[01:15:03] just the charging and discharging of

[01:15:04] like toggling from zero to one and and

[01:15:06] back to zero.

[01:15:07] >> Right. Um so if you run a chip much

[01:15:09] slower and you only clock it once every

[01:15:11] thousand clock cycles or something, you

[01:15:12] will have a thousand times fewer

[01:15:13] transitions. It'll be about a thousand

[01:15:15] times less energy consumption. Um but

[01:15:18] but like but not a substantial advantage

[01:15:21] in energy efficiency. Um okay, so you

[01:15:23] described how a TPU works at a high

[01:15:25] level.

[01:15:26] Um what is the difference

[01:15:29] at a high level between how a GPU and a

[01:15:30] TPU work?

[01:15:32] Yeah, so I mean I think there's sort of

[01:15:33] a

[01:15:34] high-level organization principle that

[01:15:37] is different. Um

[01:15:39] and then there's sort of inside the

[01:15:40] cores what are different. But we'll look

[01:15:42] at sort of

[01:15:43] outside the like at the high level. Um

[01:15:45] so we'll take a GPU

[01:15:47] um

[01:15:48] and a TPU and what does like sort of the

[01:15:50] top-level block structure look like? Um

[01:15:52] if you think of this as the whole chip

[01:15:54] um in each case um

[01:15:57] the organization of of the GPU um

[01:16:00] uh is mostly a bunch of uh almost

[01:16:03] identical units which are these

[01:16:05] um these are the SMs.

[01:16:08] Um and then they've got a an L2 memory

[01:16:11] in the middle and then a bunch more of

[01:16:13] these SMs on the bottom.

[01:16:14] Um

[01:16:16] and so

[01:16:17] uh there's sort of this fairly regular

[01:16:19] grid of cores. Um

[01:16:23] uh and then like if we look at a at a

[01:16:24] TPU in

[01:16:26] in comparison, um you you uh

[01:16:29] you end up with much coarser grained

[01:16:31] units of

[01:16:33] logic and so you you end up with

[01:16:34] something like um

[01:16:36] some large number of

[01:16:38] um maybe like maybe just a few

[01:16:41] matrix units. These are the

[01:16:43] um

[01:16:44] These are the big like systolic arrays.

[01:16:48] And then in the middle you've got some

[01:16:49] vector unit.

[01:16:52] And then you've got your um matrix units

[01:16:54] at the bottom. So um

[01:16:59] Now sort of like matrix units with a

[01:17:02] vector unit in the middle sort of this

[01:17:04] is the whole TPU chip. You can sort of

[01:17:06] think of scaling this thing down into a

[01:17:08] really tiny unit with a smaller matrix

[01:17:11] unit, smaller vector unit. And that is

[01:17:13] sort of what a an SM is. So sort of at a

[01:17:16] very high-level point of view, the

[01:17:19] the GPU has a lot of tiny tiny TPUs um

[01:17:22] sort of tiled across the whole Oh,

[01:17:24] interesting. So like you're suggesting

[01:17:26] the Tensor Core within a streaming SM is

[01:17:29] analogous to an MXU. Yeah, it's very

[01:17:32] very similar. Yeah. I see. And so if you

[01:17:34] had more like

[01:17:36] more lack of structure, having a bunch

[01:17:39] of tiny TPUs makes a lot of sense.

[01:17:41] Whereas if you kind of just have like

[01:17:43] huge matrix multiplications,

[01:17:45] you're like why don't we just

[01:17:47] why don't we avoid the co- the cost of

[01:17:50] having the individual SMs with their own

[01:17:54] registers and work schedulers and things

[01:17:55] like that. Why don't we just like make a

[01:17:57] huge thing and like amortize those costs

[01:18:00] across the whole thing. And I mean I

[01:18:01] think this shows up in how large you can

[01:18:03] grow things. We've we've sort of seen

[01:18:05] this theme like especially with the

[01:18:06] systolic array where larger systolic

[01:18:08] array amortizes the register file costs

[01:18:10] better. Yeah.

[01:18:11] >> Um this sort of design allows you to

[01:18:13] have larger systolic arrays. This

[01:18:15] whereas the sort of GPU design

[01:18:17] constrains you to having small units of

[01:18:19] everything.

[01:18:20] Um

[01:18:21] There's a trade-off, however. Um

[01:18:24] the

[01:18:26] There ends up being um because of this

[01:18:28] sort of coarse-grained separation of

[01:18:29] things, there you need to move a lot of

[01:18:31] data from the vector unit to the matrix

[01:18:33] units. Um and so like you need to move a

[01:18:36] lot of data through through

[01:18:38] a sort of uh

[01:18:40] like two two lines of perimeter here.

[01:18:43] Um whereas if you sort of look at the

[01:18:45] equivalent thing here, you've got vector

[01:18:47] units everywhere, and you need to move

[01:18:49] data through this line, through this

[01:18:50] line, through this line, through this

[01:18:52] line, through this line, through this

[01:18:54] line. So, the amount of data you can

[01:18:56] move between a vector unit and matrix

[01:18:58] unit is actually much higher in in a GPU

[01:19:02] than in a in a TPU because

[01:19:05] because like it's like instead of having

[01:19:07] to like move all the data through these

[01:19:09] just two lines, you're moving all these

[01:19:11] data through like 16 lines or something

[01:19:13] of of of of wiring instead in a GPU.

[01:19:16] Right, but also you you might have to

[01:19:17] move

[01:19:19] across less area.

[01:19:21] Which I mean is also a saving. Like it's

[01:19:22] an energy saving as well. So, so data

[01:19:24] ends up moving like if you can operate

[01:19:26] entirely within a an SM, the data

[01:19:29] movement is much smaller. But then the

[01:19:31] moment you want to operate across SMs,

[01:19:33] like it it becomes sort of more more

[01:19:35] complicated and expensive. So, you don't

[01:19:36] have a comment, but one might expect

[01:19:38] that a thing in Matrox might try to do

[01:19:40] is to get the GPU-like smaller structure

[01:19:44] of um systolic arrays surrounded by um

[01:19:49] SRAM, but also at the same time make it

[01:19:52] so that like the

[01:19:54] things you need in an SM to support the

[01:19:56] CUDA architecture uh

[01:20:00] but take a bunch of bunch of space, you

[01:20:01] might discard.

[01:20:03] Yeah. Um

[01:20:05] We've talked publicly about something

[01:20:06] which we call a splittable systolic

[01:20:08] array, which is sort of in in some sense

[01:20:09] you can think of it as like big systolic

[01:20:11] arrays that can be small systolic arrays

[01:20:12] as well. Cool.

[01:20:14] Um okay, I think this is a good note to

[01:20:15] close on. Ritesh, thank you so much.

[01:20:18] Thanks, Ritesh.
