Full Transcript
https://www.youtube.com/watch?v=eGPLf6RAAic
[00:00] Thank you everybody.
[00:02] I have a lot to cover today, so I will most probably take questions afterwards, if you catch me.
[00:10] I'm the Spectrum-X product manager at NVIDIA, and today we're going to talk about all the revolution and innovation we're working on for gigawatt AI factories.
[00:23] When we talk about an AI factory, you can think about it as a five-layer cake.
[00:29] Meaning you can't just have the top layers.
[00:31] You need to have the foundation.
[00:32] And the basic foundation for an AI factory is energy.
[00:35] Of course, on top of the energy, you enable components that build out an infrastructure and on that infrastructure you run models and applications.
[00:46] The applications themselves are evolving every day.
[00:49] On a daily basis you see a new model come up, improving on the previous model. We started off with pre-training, post-training, agentic AI, inference,
[01:05] and all of those place very significant requirements on the network.
[01:11] It starts with bandwidth requirements, latency, and jitter.
[01:14] And this is where at NVIDIA we decided to put a lot of investment into innovation and co-designing hardware and software together to meet the requirements of the AI workload.
[01:29] The AI factory is comprised of four distinct networks.
[01:34] Some of my colleagues here talked about it before, so I'll do it a bit briefly.
[01:36] The scale-up network's mission is to form a single GPU out of multiple GPU ASICs.
[01:41] To do that, it has to be a network with very demanding requirements.
[01:44] Sometimes we even do in-network computing within that network to free as many GPU cycles as we can.
[01:49] Once you form that GPU, you want to connect hundreds, thousands, and our largest customers are connecting half a million GPUs in a facility.
[02:05] To accommodate that you need a scale-out network,
[02:07] and the scale-out network again has rigorous requirements from a latency, bandwidth, and jitter perspective that are quite similar to scale-up.
[02:19] There will come a time when your energy will not suffice in a single facility, and this is where you would like to scale across and connect two different remote facilities to form a single AI factory across different locations.
[02:30] The newly introduced network is the context memory storage network, which is relevant mostly to inferencing and the use cases of KV cache movement between storage and the network, and between several prefill and decode nodes within the scale-out network.
[02:53] For the rest of the presentation I will focus on the scale-out network.
[02:54] When you look at the Ethernet products available today, you see products that were designed for hyperscaler clouds or enterprises.
[03:03] For that network, the main characteristic was a single-server workload: data coming into the server, going out of the server.
[03:16] When we talk about AI workloads, it is more of a distributed computing problem, where data moves between servers and not only in and out of the servers.
[03:28] That means that some characteristics are very different.
[03:31] The main one is jitter. Jitter is the worst thing that can happen when you run an AI workload.
[03:37] It takes only a single GPU to stall the complete AI workload.
[03:44] And that is massive when you want to optimize your AI factory for the lowest possible power consumption and the most effective bandwidth for your customers.
[03:57] The way we structured Spectrum-X, it is specially optimized for the workloads that I mentioned before.
[04:08] You cannot build an AI factory with a single component.
[04:11] It's impossible to create a single component that has low latency, zero jitter, and low power consumption.
[04:16] So this is why we split it across an end-to-end integrated solution.
[04:21] Let's start with the switch fabric.
[04:24] The switch fabric's whole mission is to unconditionally spread the traffic.
[04:29] A packet gets into the switch fabric.
[04:32] It needs to get out of the switch fabric by the best route possible.
[04:37] The way the switch fabric does it is through what we call adaptive routing, where each packet is considered for the most effective route, on a per-packet basis.
[04:49] The switches are connected.
[04:52] All of them have the same information.
[04:55] So each switch knows where the new hotspots in the network are, or where there are black holes that you should not go to.
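To make the contrast with conventional flow hashing concrete, here is a minimal sketch, not NVIDIA's implementation. It compares pinning each flow to one port against choosing the least-loaded port for every packet. The function names, the four-port switch, and the traffic mix are all assumptions made for illustration.

```python
def flow_hash_load(flows, num_ports):
    """Pin every packet of a flow to one port, as static flow hashing does."""
    load = [0] * num_ports
    for flow_id, packet_count in flows:
        load[flow_id % num_ports] += packet_count
    return load


def per_packet_adaptive_load(flows, num_ports):
    """Send each packet to whichever port currently has the shortest queue."""
    load = [0] * num_ports
    for _flow_id, packet_count in flows:
        for _ in range(packet_count):
            port = min(range(num_ports), key=load.__getitem__)
            load[port] += 1
    return load


# Hypothetical traffic: three large flows whose IDs collide on port 0,
# plus one small flow.
flows = [(0, 100), (4, 100), (8, 100), (1, 10)]
static_load = flow_hash_load(flows, num_ports=4)
adaptive_load = per_packet_adaptive_load(flows, num_ports=4)
print(static_load)    # [300, 10, 0, 0]
print(adaptive_load)  # [78, 78, 77, 77]
```

With only a handful of very large flows, which is typical of AI collectives, hashing can leave most ports idle while one becomes a hotspot. Per-packet spreading evens out the load, at the cost of packets arriving out of order.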
[05:05] The SuperNIC has to handle a situation that happens when the switch unconditionally spreads the traffic. It's called out of order.
[05:14] So the SuperNIC consumes all the packets in an out-of-order manner and places them directly in GPU memory, so as not to stall GPU cycles.
[05:23] We don't have buffer copies or anything like that.
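The idea of placing out-of-order packets directly at their final location can be sketched as follows. This is an illustration under assumptions, not the SuperNIC's actual data path: each packet is assumed to carry its byte offset within the message, and a `bytearray` stands in for the destination GPU memory.

```python
def place_out_of_order(packets, message_size):
    """Write each (offset, payload) packet straight into its final position."""
    buffer = bytearray(message_size)  # stands in for the destination GPU memory
    bytes_received = 0
    for offset, payload in packets:
        # No reorder queue and no intermediate copy: the offset says where it goes.
        buffer[offset:offset + len(payload)] = payload
        bytes_received += len(payload)
    return bytes(buffer), bytes_received == message_size


message = b"ALLREDUCE-CHUNK!"
# Split into 4-byte packets and deliver them in a scrambled order.
packets = [(i, message[i:i + 4]) for i in range(0, len(message), 4)]
arrival_order = [packets[2], packets[0], packets[3], packets[1]]
data, complete = place_out_of_order(arrival_order, len(message))
```

Because every packet is self-describing, the receiver never waits for a missing earlier packet before it can place a later one.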
[05:29] The other activity of the SuperNIC is to control the injection rate.
[05:35] Meaning the SuperNIC gets telemetry from across the network and is able to select how to steer the traffic to the right switch in the fabric, and if it senses there is going to be congestion, it will limit the injection rate.
[05:50] It is much better to limit the injection rate than to get congestion in the network.
[05:55] Congestion is the worst enemy of AI workloads, and when you limit the injection rate you effectively maximize the available bandwidth you can get for the GPU.
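One generic way to picture telemetry-driven injection-rate control is an additive-increase, multiplicative-decrease loop. The talk does not describe the actual Spectrum-X algorithm, so everything here is an assumption made for illustration: the function name, the 400 Gb/s line rate, the 0.9 utilization target, and the step sizes.

```python
def next_injection_rate(rate_gbps, link_utilization,
                        line_rate_gbps=400.0, target=0.9,
                        backoff=0.5, step_gbps=10.0):
    """Return the sender's next rate given telemetry from the path.

    link_utilization is the busiest link's load as a fraction of capacity.
    """
    if link_utilization > target:
        # Congestion is building: cut the rate before queues fill up.
        return max(step_gbps, rate_gbps * backoff)
    # Headroom available: probe upward, never beyond line rate.
    return min(line_rate_gbps, rate_gbps + step_gbps)


rate = 400.0
for utilization in [0.95, 0.60, 0.60]:
    rate = next_injection_rate(rate, utilization)
print(rate)  # 220.0
```

The point of the sketch is the ordering: the sender slows down on an early signal, so the network itself never reaches the congested state.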
[06:11] We are very proud of the list of partners we have for the scale-out element of Spectrum-X.
[06:17] We produce the Spectrum-X ASIC, and we have partners that are building their own systems and deploying their own operating system.
[06:26] And we have partners that are taking our boxes and deploying their own operating system.
[06:30] So we are very proud of our friends at Meta, our friends at Cisco, all of our amazing SONiC partners around the world, and of course the customers who decided to pick our NVIDIA solution with our Cumulus Linux operating system.
[06:51] Performance is a key indicator of how successful a scale-out network is.
[06:56] There is, I would say, the trivial performance metric in the case of training, which is called step time, and your key goal is to minimize step time.
[07:11] But there is also another perspective on performance, which relates to the term I mentioned earlier, jitter.
[07:16] The performance of your scale-out network has to be consistent.
[07:22] Once your line is straight, it means that all of your GPUs are synchronized.
[07:28] It means that no GPU is left behind.
[07:30] When you get this bump, as with off-the-shelf Ethernet, the effects can increase the AI workload's time to completion, time to first token, and all of those metrics.
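Why one late GPU matters can be shown with a small worked example. In synchronous training, a step ends only when the slowest participant finishes, so step time is the maximum over GPUs, not the average. The timings below are made-up numbers for illustration.

```python
def step_time_ms(per_gpu_finish_ms):
    """A synchronous step completes when the slowest GPU completes."""
    return max(per_gpu_finish_ms)


def idle_gpu_time_ms(per_gpu_finish_ms):
    """Total time the faster GPUs spend waiting for the slowest one."""
    step = step_time_ms(per_gpu_finish_ms)
    return sum(step - t for t in per_gpu_finish_ms)


steady = [10.0] * 8            # no jitter: every GPU finishes together
jittery = [10.0] * 7 + [25.0]  # one GPU delayed by network jitter

print(step_time_ms(steady), idle_gpu_time_ms(steady))    # 10.0 0.0
print(step_time_ms(jittery), idle_gpu_time_ms(jittery))  # 25.0 105.0
```

A single delayed GPU stretches the step from 10 ms to 25 ms, and the other seven GPUs idle for a combined 105 ms.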
[07:46] Once you deploy the Spectrum-X solution, you have the hardware in place.
[07:52] You invested billions of dollars, but you need to make sure you are future-proof in case a new model comes up.
[07:58] And this is where the co-design between hardware and software in Spectrum-X comes into play.
[08:04] We have quarterly software releases for the Spectrum-X software suite.
[08:09] And this is a new model coming out of NVIDIA.
[08:10] It is called Nemotron.
[08:14] What we figured out is that those NCCL collectives account for up to 85% of how Nemotron trains.
[08:21] So we actively decided to optimize these specific NCCL collectives to increase AI workload performance.
[08:31] We improved one collective by 2x, and some collectives by 14x.
[08:38] Overall the change to the application was dramatic, and we do it on a quarterly basis.
[08:42] Every time a new model comes up, you get the best performance out of your scale-out network, with hardware and software co-designed based on the model itself.
[08:53] When we talk about scale-out, you reach the max capacity of your single location and you want to scale across between different locations.
[09:01] There is an inherent penalty when you go scale-across: latency.
[09:09] You cannot avoid it.
[09:11] Distance dictates a latency cost, a few microseconds per kilometer, whenever you go outside of the data center,
[09:14] and what we target in Spectrum-X is to minimize the penalty you pay.
[09:17] Traditional off-the-shelf Ethernet would use deep buffers.
[09:22] A deep buffer is great because you can put all the traffic on one node at the exit point, and then at the entry point of the next location, and all the traffic will reach its destination.
[09:31] The question is how much latency you pay using that mechanism, and it turns out the deep buffer will have twice or triple the latency of the wire itself.
[09:47] So in Spectrum-X we're not using deep buffers.
[09:54] We're actually optimizing the load-balancing and congestion-control mechanisms between the two facilities.
[09:57] It means that sometimes we will have to limit the injection rate, but the effective bandwidth we get between the two sites is almost 100%, and it's twice that of off-the-shelf Ethernet.
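The unavoidable part of the scale-across penalty is propagation delay in fiber, which can be estimated from the speed of light. The sketch below assumes a refractive index of 1.468, a typical value for silica fiber. It reads the deep-buffer claim from the talk as "total latency is two to three times the wire latency", which is one interpretation, not a measured figure.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIBER_REFRACTIVE_INDEX = 1.468  # assumed typical value for silica fiber


def one_way_fiber_delay_us(distance_km):
    """Propagation delay of light through fiber, in microseconds."""
    speed_in_fiber = SPEED_OF_LIGHT_M_PER_S / FIBER_REFRACTIVE_INDEX
    return distance_km * 1_000 / speed_in_fiber * 1e6


def deep_buffer_path_delay_us(distance_km, wire_latency_multiplier):
    """Interpretation of the talk: deep buffering costs 2x to 3x the wire latency."""
    return wire_latency_multiplier * one_way_fiber_delay_us(distance_km)


for km in (1, 10, 100):
    wire = one_way_fiber_delay_us(km)
    print(f"{km:>4} km: wire {wire:7.1f} us, "
          f"deep buffer {deep_buffer_path_delay_us(km, 2):7.1f} "
          f"to {deep_buffer_path_delay_us(km, 3):7.1f} us")
```

The wire delay works out to roughly 5 microseconds per kilometer, so two sites 100 km apart start at about half a millisecond one way before any buffering is added.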
[10:16] I believe everybody in this room is concerned about the energy consumption of the factory, and energy, from our perspective, is the ultimate ceiling of an AI factory.
[10:30] It's the lowest layer of the cake, but it's probably the most important one.
[10:34] It dictates how many GPUs you can connect in a single facility.
[10:39] For a traditional data center, what was used was mostly copper.
[10:46] And copper is great. Latency is good, zero power consumption, very reliable.
[10:51] But when you go to the scale-out network, distance does not allow you to use copper as much as you would want to.
[11:08] When you look at the options in the market today for optics, you have many options: you have DR, you have ZR, you have LRO, FRO, TRO.
[11:16] If you want to maximize your AI factory for flexibility, that's great, do it. But we at NVIDIA care about minimizing power consumption. Even if you consider the most advanced technology today, like LPO, you still get twice the power consumption of a solution that was co-designed between the optical engine and the electrical circuits, which we call co-packaged optics.
[11:48] Another aspect you should consider is reliability and the number of components.
[11:53] When you use pluggable modules, you have six pluggable modules per GPU in a two-tier fat-tree topology, and each module can have up to eight lasers. That's a staggering number of components, and it takes only one component failure to get your GPU black-holed or unreachable, and then your AI workload performance will decrease.
[12:19] NVIDIA and TSMC have set out to solve that situation, and we designed a new process that integrates the electrical substrate and the optical substrate together, minimizing the distance between the electrical engine and the optical engine, which allows us to reduce the power significantly.
[12:45] Now, it's not only about shrinking the package.
[12:48] It is a high priority to make it a robust, mass-production-ready device.
[12:53] We are also innovating with our laser partners, and we managed to reduce the number of lasers by 4x compared to a pluggable data center. This is a diagram of how it looks inside.
[13:08] We match and integrate the electrical and optical surfaces with a new TSMC process called COUPE.
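The component counts quoted in the talk can be turned into a quick back-of-the-envelope calculation. The per-GPU figures come from the talk (six pluggable modules per GPU, up to eight lasers per module, 4x fewer lasers with co-packaged optics). Applying the 4x reduction uniformly to the upper-bound count is a simplifying assumption.

```python
MODULES_PER_GPU = 6      # from the talk: pluggable modules per GPU, two-tier topology
LASERS_PER_MODULE = 8    # from the talk: upper bound per module
CPO_LASER_REDUCTION = 4  # from the talk: 4x fewer lasers with co-packaged optics


def pluggable_laser_count(num_gpus):
    """Upper bound on lasers in a factory built with pluggable modules."""
    return num_gpus * MODULES_PER_GPU * LASERS_PER_MODULE


def cpo_laser_count(num_gpus):
    """Same factory, assuming the 4x reduction applies uniformly."""
    return pluggable_laser_count(num_gpus) // CPO_LASER_REDUCTION


gpus = 500_000
print(pluggable_laser_count(gpus))  # 24000000
print(cpo_laser_count(gpus))        # 6000000
```

At half a million GPUs that is up to 24 million lasers with pluggables versus about 6 million under the same assumption with co-packaged optics, and each laser is a potential point of failure.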
[13:24] This is the industry's first 200-gig SerDes lanes.
[13:29] This is what we position in the Spectrum-6 Ethernet box.
[13:33] And if you look at iso-power across the whole AI factory, when you use CPO,
[13:37] the same power allows you to connect 3x more GPUs compared to a pluggable data center.
[13:49] In terms of reliability and its effect on AI factory performance, this is important to mention.
[13:57] First of all, the package is closed, meaning there is no human touch, and everything is liquid-cooled.
[14:05] What we are able to see right now is that, since the MTBF of CPO is much better than that of pluggables, the AI workload runs at higher performance, up to five times compared to when we use pluggable modules.
[14:25] And again, when you do all the aggregated calculations of the ROI of building your AI factory, 5x AI uptime makes a lot of difference.
[14:36] We do acknowledge that building an AI factory is a challenging activity.
[14:38] This is why at NVIDIA we provide reference designs and reference architectures.
[14:40] Our newest one is called DSX.
[14:44] DSX has instructions on how to operate your AI factory, how to cool it, how to power it, how to lay out the facility.
[14:45] Another aspect of it is the small square over there called DSX Sim.
[14:48] In DSX Sim, we are targeting simulation of the AI factory.
[14:53] One of the key products of DSX Sim is called DSX Air.
[14:56] DSX Air is our digital twin for simulating an AI factory.
[14:57] What it does is take all of your equipment, GPU, DPU, NIC, switch, management software, orchestration software, storage appliances, and simulate it in one place, effectively creating a digital twin.
[15:36] When you're deploying a billion-dollar AI factory, the stakes are high.
[15:40] You will get to the moment where you need to deploy it in production, in the physical facility.
[15:47] What we hear in testimonials from our current customers who are using NVIDIA Air is that once you place the PO, it takes a few months until you get the equipment at your own facility, but you can start planning your digital AI factory right now.
[16:06] So eventually it boils down to validation of the factory going from six months to one week.
[16:16] Software bring-up: you can start bringing up your software even before you have purchased.
[16:24] You can go to NVIDIA Air today, build an AI factory, and use all of the components, and it will boil down to one week. Most effectively, what we saw is the actual bring-up time and deployment.
[16:37] It goes from a few weeks to one day.
[16:39] And the reason it gets to one day is that you just copy-paste the configuration.
[16:43] Everything was validated up front.
[16:46] On the right you'll see all of our partners that are participating in DSX Air.
[16:50] If you belong to a company that would like to participate, just reach out to me and I will make the right connection.
[17:00] Topology is critical for an AI factory.
[17:04] It's critical for power consumption and for the number of GPUs you can connect in a facility, yet people often do not fully understand how important it is.
[17:19] In the past, when people started connecting GPUs, it was a top-of-rack design, and a top-of-rack design inherently conflicts with the way the GPU nodes are constructed.
[17:30] Within the GPU node there is a scale-up network.
[17:34] The scale-up network's characteristics allow GPUs to communicate with each other within the scale-up domain, so it makes no sense to connect all of the GPUs in the node to a top-of-rack switch.
[17:47] You consume too much switch radix, and you inherently create congestion where you shouldn't, because if you go outside, the bandwidth is lower than in the scale-up network.
[17:58] So this is where NVIDIA introduced the multi-rail design, which some of you call rail-optimized, where each GPU within the node communicates through the scale-up, and if you want to get to GPUs on other nodes, you go to the scale-out. This also allows you to increase the scale of the fabric.
[18:19] This is just an example, an 8K illustration, and our latest innovation is called Spectrum-X Multiplane, and this is where all the magic comes true.
[18:31] We are able to show, in the Vera Rubin generation, where each GPU has 1.6 terabits per second of bandwidth with ConnectX-9, that we can reach half a million GPUs in a single facility,
[18:45] if your power and space allow it, of course. The way it works is that we use multiple ports in the NIC, which allows the switch radix to be larger, which in turn increases the radix of the two-tier fat-tree topology.
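Why a larger radix matters so much follows from standard fat-tree arithmetic: in a non-blocking two-tier (leaf/spine) fat tree built from switches of radix k, each leaf uses half its ports for endpoints, and each spine can reach at most k leaves, giving k²/2 endpoints. The sketch below uses hypothetical port counts, not Spectrum-X specifications, to show the effect of splitting each physical switch port into several lower-speed ports.

```python
def two_tier_endpoints(switch_radix):
    """Max endpoints of a non-blocking two-tier fat tree: radix^2 / 2."""
    down_ports_per_leaf = switch_radix // 2  # the other half go up to spines
    max_leaves = switch_radix                # each spine port reaches one leaf
    return max_leaves * down_ports_per_leaf


def effective_radix(physical_ports, splits_per_port):
    """Radix after splitting each physical port into lower-speed ports."""
    return physical_ports * splits_per_port


base = two_tier_endpoints(effective_radix(64, 1))   # hypothetical 64-port switch
split = two_tier_endpoints(effective_radix(64, 4))  # each port split four ways
print(base, split, split // base)  # 2048 32768 16
```

Because capacity grows with the square of the radix, a 4x larger radix yields 16x more endpoints per plane. Each split port is narrower, which is why the NIC needs multiple ports to keep the same total bandwidth per GPU.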
[19:02] Now, multiport is not something new.
[19:04] What's innovative in what we did is that we actually took mechanisms from the switch hardware and put them in the SuperNIC.
[19:13] So now the SuperNIC is part of this global load-balancing mechanism, and it makes decisions on a per-packet basis, at within-collective resolution.
[19:23] An AI collective is like the step time.
[19:26] If you make decisions within the step time, your AI performance gets a boost compared to doing load balancing with a software mechanism.
[19:36] So in terms of what you should explore: DSX Air is available for everybody.
[19:42] Go and log into the website. We have partners and we want to get more partners;
[19:48] some of them are sitting here in this room.
[19:50] SONiC is heavily investing in AI infrastructure.
[19:55] So if you see a workstream that you are interested in, for example the CPO workstream, please take part. And for anything else that I described in this presentation,
[20:07] all the information is located on the NVIDIA website. Feel free to reach out to me and, yeah, I think I'm just on