# Optica Executive Forum @ OFC March 16th 2026:  Panel: Scale Across Session

https://www.youtube.com/watch?v=VpUE_GZA7WE

[00:00] Highlights from the Optica Executive Forum at OFC.
[00:02] Thanks to these sponsors.
[00:20] We're releasing seven sessions from Optica's Executive Forum because the world is watching.
[00:27] Feedback encouraged.
[00:27] Now, enjoy.
[00:32] Thanks, Richard.
[00:33] That's a great kickoff to all the sessions today.
[00:39] Scale up, scale out, scale across.
[00:39] How many times are we going to hear that this week, right?
[00:45] I think we've got it there.
[00:47] We're going to get it all through the day and all through the week.
[00:48] Those are the words of the week, and I think everyone who is presenting, I give them a lot of credit for submitting slides before Jensen started talking today.
[00:58] So, you know, we'll see how
[01:00] things go with GTC and how much changes
[01:02] as he talks.
[01:04] So you'll see, if you look at the agenda,
[01:07] the panels that we've set up,
[01:10] the key topics: scale up, scale out, scale across.
[01:12] And so, we're going to kick off with the first one, which is scale across.
[01:19] What I think is actually interesting is that the terms sound very similar, but they each have very different dynamics, right?
[01:23] Scale out for optics is much more mature and is being deployed.
[01:28] Scale up is this trend of, you know, copper versus optical.
[01:32] And scale across, I think that's the point of today's panel. I actually took a look:
[01:36] it's been 207 days since Jensen first used the term scale across.
[01:42] It feels like it's been in the industry forever to me.
[01:45] You know, it's just a part of the vernacular today.
[01:49] But it's been 207 days, and, you know, there's still a lot of confusion about what it means.
[01:55] I think when you talk to people, you get a lot of different perspectives in terms of what scale
[02:02] across actually is.
[02:04] And I think what you're actually going to see, so we have a panel set up.
[02:07] The first three speakers, Jeff, Yawei, and Tad, all come from the web-scale perspective.
[02:14] They're all deploying. You know, to me, it was a very simple question.
[02:19] When we started talking about this panel, I said, "Who are the people I ask when I want to understand these things?"
[02:25] Jeff, Yawei, Tad.
[02:28] They're the people.
[02:28] So, I think you'll get a great perspective from the end user, and I think you're going to find some diversity in where they're going to go with the topic, right?
[02:38] And I think that kind of speaks to the scale across topic.
[02:40] And then we have Vijay and Rakesh, who are going to speak from the perspective of a, you know, switch/router platform, and what it means for the IP layer.
[02:51] And the idea here, I think, is that we want to focus on a couple of things, right?
[02:57] So, what is scale across?
[02:59] What does it mean to different people?
[03:00] How does scale across differ
[03:02] from the DCI that we're familiar with?
[03:06] How does it differ in terms of application?
[03:08] How does it differ in terms of bandwidth demand?
[03:12] And then what does it mean for how we develop solutions for the future, right?
[03:18] What are we using today?
[03:20] How does that change as the network evolves and scale across becomes more of a, you know, standard thing that is driving bandwidth in the network, and how do we think about the development of new products?
[03:32] So, I think, you know, we've got a great panel and great perspective.
[03:37] The way we're going to do this: each of the speakers is going to come up and present their material.
[03:44] We're going to hold questions until after each of them has presented, and then we'll bring some chairs up, and we'll have a panel discussion.
[03:53] So, you know, at that time we'll take Q&A from the audience, and the last section will be all discussion.
[04:00] So, with that, I would like
[04:05] to bring up Jeff to kick it off.
[04:09] Boy, it's an interesting challenge to go after Richard.
[04:14] He did a very nice job.
[04:16] And actually, you know, I like the fact that he spent some time really reflecting on what an interesting time we are in and the consequence of the things that we are doing here.
[04:25] I mean, it's really something that's worth reflecting on.
[04:30] This is a very big impact that we are having.
[04:32] The other thing I liked about Richard is that he, you know, reflected on that and then was like, "Okay, well, let's dive into the technology."
[04:41] Because really, it's the technology that we're developing that is going to drive all these innovations.
[04:45] So, very excited about this.
[04:48] I'm Jeff Rahn.
[04:50] I'm part of the architecture team for the backbone network at Meta.
[04:57] And so, we're responsible for building out a lot of Meta's infrastructure.
[05:02] And today, I'm going to talk through these points.
[05:05] I'll start out with
[05:06] an introduction of Meta's infrastructure and global backbone network.
[05:10] It's probably fairly familiar, but it's worth dwelling on.
[05:13] I'm going to talk about how our network is evolving to enable the gigawatt scale that we're getting to,
[05:23] and what we need on the optics side to enable this evolution.
[05:30] Meta's well known. We have a lot of the platforms that people use very frequently, and these platforms are enabled by some pretty impressive global infrastructure.
[05:43] And this is a map of our currently announced data centers.
[05:46] Mostly in North America, but across the globe.
[05:52] These data centers, in order to be useful to our end users and to each other, are connected by a global backbone network, which our team is responsible for deploying.
[06:07] Think about what that network has looked like over the years.
[06:10] When we first started building out data centers, imagine just a couple of data centers in North America.
[06:16] The main task that we saw for the backbone was: how do we connect these data centers to our end users?
[06:20] So, that was straightforward.
[06:22] They want to have access to upload and download images, that sort of thing.
[06:28] So, that's the first use case.
[06:30] But as we continued to build out our global infrastructure, we realized that there was an additional task, which was machine-to-machine traffic.
[06:36] And in the past we've always just described this as machine-to-user and then machine-to-machine.
[06:45] But as we talk about scale across, I want to distinguish that a little bit from the way things look going forward: the task it had to take care of was really more of what I call a data synchronization task.
[06:56] So, whatever workload we had to deploy on our network, it was always enabled with local data in
[07:07] order to get its job done.
[07:09] And so, if there was data that different workloads would need, you would want to make sure that there was a copy of that data locally, and making sure that all those databases were in sync was the responsibility of our backbone network.
[07:22] I call that out because now we're evolving the network to add an additional piece, which is the piece that enables our gigawatt-scale clusters.
[07:35] Now, what we need to be able to do is access data that's not local.
[07:39] So, the data is, for whatever reason, situated in a location farther away than a local cluster can get to.
[07:51] So, why is this different?
[07:54] This is different because we need to enable large-scale training.
[07:58] We need to enable workloads where the data is remote.
[08:02] So, it's good that Tom had a chance to say, "Well, we don't really know what a scale across network is."
[08:05] So I'm going to say this is kind of a broad
[08:08] definition for us of what scale across would be.
[08:12] It includes GPU-to-GPU traffic, but also CPU-to-CPU traffic.
[08:19] You know, if you look at the way our network has grown, I've said that we started out with a lot of traffic between the data center and the user.
[08:27] And then we have this additional data center machine-to-machine traffic, which is a synchronization task.
[08:34] And this has grown year over year since I've been at Meta, and even before then.
[08:39] This data center synchronization role is eclipsing the size of the user role, but now we've got this new task, which is enabling gigawatt-scale clusters.
[08:51] And surprisingly, even from a backbone perspective, it is larger in network capacity than our backbone network.
[09:04] So, it's good that I'm going to get a chance to talk with Yawei and Tad.
[09:10] Interesting perspectives.
[09:11] For what we focus on in scale across, it's a lot of the longer-distance reaches.
[09:17] But it all comes back to the same basic problem, which is that we need to access power.
[09:21] There's power in domains of various sizes, and there are various geographies that allow us to get to it.
[09:28] Traditionally, we tried to build everything within a 3-km radius so that you could connect everything with the cheapest IM-DD optics available.
[09:38] But when that runs out, you start looking at what we have within the range of the longer-reach optics, and for us that still remains an IM-DD solution.
[09:49] And it's like, "Okay, well, for something that's up to 10 km, we can still use gray optics.
[09:53] That seems straightforward enough."
[09:57] But I'd just like to dwell on this. Saying, "Oh, let's use gray optics,"
[10:02] turns into quite a large civil engineering challenge, because gray optics means one wavelength per fiber, or one unit of capacity per fiber.
[10:12] And deploying a large amount of capacity at one channel per fiber requires you to pre-build pretty big trenches and deploy a lot of fibers.
[10:21] So, when the distances get to be even a little bit greater, beyond 10, probably around the 20-km range, it's just not really viable to do gray optics anymore.
[10:33] You need to go to WDM to achieve the scale that we need to get to.
[10:40] So, at that point we start leveraging 800ZR+ and DWDM, and we get to much more reasonable fiber counts.
[10:49] Still large scale, but at least reasonable.
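The fiber-count argument can be put in rough numbers. The sketch below is illustrative only: the 1 Pb/s target, the 800 Gb/s per channel, and the 64 wavelengths per fiber pair for a filled C+L system are assumed values, not figures from the talk, and `fiber_pairs_needed` is a hypothetical helper.

```python
import math


def fiber_pairs_needed(total_gbps: float, channel_gbps: float, channels_per_pair: int) -> int:
    """Fiber pairs needed to carry total_gbps when each pair holds channels_per_pair channels."""
    channels = math.ceil(total_gbps / channel_gbps)
    return math.ceil(channels / channels_per_pair)


# Illustrative assumptions, not figures from the talk.
TARGET_GBPS = 1_000_000   # 1 Pb/s between two sites
CHANNEL_GBPS = 800        # one 800G channel (gray optic or 800ZR+ wavelength)
DWDM_CHANNELS = 64        # assumed wavelengths per fiber pair with C+L filled

# Gray optics: one channel per fiber pair. DWDM: many channels share a pair.
gray_pairs = fiber_pairs_needed(TARGET_GBPS, CHANNEL_GBPS, 1)
dwdm_pairs = fiber_pairs_needed(TARGET_GBPS, CHANNEL_GBPS, DWDM_CHANNELS)
print(f"gray: {gray_pairs} fiber pairs, DWDM: {dwdm_pairs} fiber pairs")
```

Under these assumptions the gray build needs 1,250 fiber pairs against 20 for DWDM, which is the difference between a trenching project and a cable pull.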
[10:55] You know, it's clear that we're in a different moment right now.
[10:59] But at the same time, in the time I've spent at Meta, we've had the same basic question come up over and over again.
[11:06] Like, when is it worthwhile to start deploying WDM and enable clusters to work
[11:13] across larger regions instead of having everything done locally?
[11:16] And over and over again, the answer has been that we just can't make a use case.
[11:23] And so, I want to call out not just that the use case is shifting, but also that the technology is really enabling.
[11:29] What is it about what we have in our toolbox today that is different from what we had even just a few years ago?
[11:39] And the key things that work for us are, first of all, having standardized interoperable interfaces.
[11:45] This is really big for us.
[11:49] You know, in the 800 gig generation we had plans to try to bookend, and those plans fell apart immediately, just because there are too many constraints on how we build our network.
[12:00] So, having things be interoperable is a huge deal.
[12:04] The other one is an open line system that allows us to connect those optics seamlessly.
[12:11] We really rely on generational
[12:14] interoperability.
[12:16] When you start talking about a network that spans multiple regions,
[12:18] doing upgrades is impossible to do simultaneously across multiple sites.
[12:25] It's just absolutely impossible.
[12:27] So, having an upgrade scenario where you can upgrade one site and have it operate in the old mode toward a different site is absolutely critical.
[12:35] So, we're doing that today: upgrading our network from 400 gig to 800 gig, interoperating while the old site has not yet been upgraded, and then completing the upgrade later.
[12:48] I will say one of the great benefits in our future upgrade is that now we can get to port counts on wavelength selective switches where we won't even have to recable anything when we do the 800 gig to 1600 gig upgrade.
[13:00] But the expectation is that we'll operate in that old mode.
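The "operate in the old mode" behavior amounts to picking the fastest line rate both ends of a link support. The sketch below is a hypothetical illustration of that idea, not a description of Meta's tooling; `common_line_rate` and the rate sets are assumptions made for the example.

```python
def common_line_rate(site_a_rates_gbps: set[int], site_b_rates_gbps: set[int]) -> int:
    """Return the highest line rate (Gb/s) supported by both ends of a link.

    Raises ValueError when the two ends share no interoperable mode.
    """
    common = site_a_rates_gbps & site_b_rates_gbps
    if not common:
        raise ValueError("no interoperable mode between the two sites")
    return max(common)


# Hypothetical capabilities: new-generation optics can still run the older mode.
old_site = {400}
upgraded_site = {400, 800}

mid_upgrade = common_line_rate(upgraded_site, old_site)         # one end upgraded
post_upgrade = common_line_rate(upgraded_site, upgraded_site)   # both ends upgraded
print(mid_upgrade, post_upgrade)
```

The link keeps running at 400G while only one site is upgraded and moves to 800G once the far end catches up, so the two sites never have to be touched at the same time.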
[13:06] We leverage full utilization of the C and L bands. Often, on day one,
[13:08] we will deploy all the channels on a fiber to fill out C+L.
[13:13] That's just critical
[13:16] for us to get to the scale we need to.
[13:21] The other thing that's worth dwelling on a little bit is that, with a lot of these scale across types of applications,
[13:26] the projects can come and go, and they're very large, and you've got to place the bets on what technology you're going to deploy a long time in advance.
[13:38] And one thing we really rely on is this kind of fungibility between solutions.
[13:43] So, having a dedicated ZR if it's under 100 km and then going to ZR+ if it's over 100 km, that's a great idea from a technology point of view, but it's very hard for us to execute on, just from a sourcing and supply management standpoint.
[13:59] Because these projects can come and go, and they're very large in quantity.
[14:03] So, we love having the fungibility across applications, and that's why we've focused on ZR+ as our standard, to the point where I forget to put the plus on.
[14:11] I see I put only ZR on this slide.
[14:13] I generally think of it as all ZR+.
[14:17] And then the other thing we're doing, given the space and power requirements for the capacities of the deployments we do:
[14:23] we really need to use optical protection anytime there's any chance of a fiber cut in the field, so that we can reroute using optical switching instead of overbuilding on the IP layer, which is just too much of a scale problem for us at this point.
[14:40] And I will close with this kind of open question: where are we going to go?
[14:47] And I think for me there's no question that at some point we need to figure out how to build a network, and build network technology, that allows us to imagine a continental-scale compute cluster as our aspiration over the next, you know, 3 to 5 years.
[15:02] So.
[15:05] Okay, and that was the end of my talk, and I believe I hand off directly to Yawei, right?
[15:10] Okay.
[15:12] It's my honor to be here speaking after Jeff and Richard.
[15:16] It makes my talk also
[15:19] difficult, but I will try my best.
[15:21] There's a scale across this nurse.
[15:23] This is the topic of this forum.
[15:25] I'm Yawei Yin from Microsoft, here on behalf
[15:28] of the Azure fiber and AI team in Azure
[15:31] networking at Microsoft.
[15:33] So,
[15:34] yeah, the agenda we'll be talking about
[15:36] is mainly: is scale across needed
[15:37] by training and inference?
[15:44] So, I will first review the cloud and AI
[15:47] data centers Microsoft
[15:50] has built and has been building, mainly
[15:52] from a networking point of view.
[15:54] And then scale out, scale up, scale
[15:56] across, many scales, but
[15:58] how do we actually even define them, and how
[16:01] do we address each of them?
[16:03] And then
[16:05] some pictures about the energy,
[16:07] which is the ultimate limit, right?
[16:09] And also other limiting factors that are
[16:11] preventing us from, or not
[16:14] helping as much in,
[16:17] getting to the scale we really want:
[16:19] reliability, energy efficiency, bandwidth, latency, and jitter.
[16:23] And then some of the enabling technologies I'm seeing that can help us get there.
[16:27] The cloud data center: this picture has been shared many times over the past many years.
[16:31] This is a snapshot of the 400G cloud data center we are building and have built, and it's mainstream for cloud computing.
[16:42] As you can see, within the data centers there are many IM-DD technologies with optical transceivers:
[16:45] DR, DR4, DR4+, AOC, and DAC cables. Between the data centers, including the metro region and the long distance, you need coherent technologies.
[16:56] 400ZR is currently mainstream.
[16:59] Microsoft is one of the enablers of 400ZR technology, together with other partners in the industry, and we are adopting it at large scale.
[17:10] Fast forward: when AI kicked off, we started building another type of data center, the AI data center. It's
[17:20] similar but also different, right? As you can see, there's this so-called AI back-end network, used for training and for inference, for the GPU/XPU servers and the chips to exchange data at large scale.
[17:36] And then they are also connected to the traditional cloud, which we now call the front-end network, in a lot of senses trying to bring the end-user data, and also the data for training and inference, close to the GPU servers. And we need to build many of them, as you can imagine.
[17:54] A quick review of yet another exponential growth: a few years back, when we were hit by the pandemic, we were showing a chart of the exponential growth pushed by the pandemic. But now, in the AI era, the new chart of growth after the pandemic has dwarfed even the pandemic growth chart. So we're seeing tremendous growth in bandwidth demand, both inside this network
[18:20] and across this network, boosted by AI.
[18:25] And then how do we even define scale across, right? Scale up and scale out are clearly defined: scale up means a short distance, for the XPUs to synchronize and do the job together; scale out means bringing that to more of a data center scale, cluster by cluster.
[18:46] Then how do we define scale across? There was another workshop I attended yesterday on how far is too far when we talk about scale across, right? Is it between two data centers, between two cities, or even between continents?
[19:00] It's not a very clearly defined terminology, but I would like us to maybe take a step back and just focus on the term "scale".
[19:06] Do we need to scale? Why do we need to scale, right? Is scaling really needed for training and inference? I think the short answer is quite obvious.
[19:12] It's yes.
[19:13] The simple reason is that Moore's law is not giving us a bigger and bigger chip so that you can do the whole job inside one chip, right?
[19:20] So,
[19:21] therefore, we need a distributed system that connects tens of thousands or even hundreds of thousands of chips together to do the job, both for training and for inference.
[19:30] And this is referencing the famous paper "AI and the Memory Wall".
[19:36] It shows the growth rate of the compute resources the models need versus the growth rate of the network bandwidth; the difference is about three orders of magnitude.
[19:50] So, that is saying that networking is needed, but networking is not getting there to really close the gap between what's needed and what can be offered.
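A quick compounding calculation shows how a gap of that size can open up. The per-two-year growth factors below (3.0x for peak compute, 1.4x for interconnect bandwidth) are assumptions recalled from the cited paper rather than numbers stated in the talk, so treat them as placeholders; `growth_over` is a hypothetical helper.

```python
def growth_over(years: float, factor_per_period: float, period_years: float = 2.0) -> float:
    """Total growth after `years` if capacity multiplies by factor_per_period every period."""
    return factor_per_period ** (years / period_years)


# Assumed growth factors per two years (placeholders, not from the talk).
COMPUTE_X_PER_2Y = 3.0
INTERCONNECT_X_PER_2Y = 1.4

compute = growth_over(20, COMPUTE_X_PER_2Y)        # compute growth over 20 years
network = growth_over(20, INTERCONNECT_X_PER_2Y)   # interconnect growth over 20 years
gap = compute / network
print(f"compute x{compute:,.0f}, network x{network:,.1f}, gap x{gap:,.0f}")
```

With these inputs the gap after twenty years lands in the low thousands, which is consistent with the "three orders of magnitude" described above.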
[20:02] So, yeah, many paradigms, data, model, and pipeline parallelism, are all needed.
[20:09] And also, if you think about inference: you use ChatGPT and other leading AI models from close to your home, close to your work.
[20:20] So, you naturally want your
[20:22] own data to also be co-located close to
[20:25] you, so you can use the models together
[20:26] with
[20:27] your own data.
[20:29] But looking at the limiting
[20:31] factors, this is
[20:34] the heat map of the energy
[20:35] distribution inside the United States,
[20:37] referenced from the Open Infra Map
[20:40] that is built by this gentleman
[20:42] called Russ Garrett. I found it on
[20:44] the public internet, right? So,
[20:46] as you can see, I clearly showed the
[20:48] labels I clicked. There's power,
[20:50] there's hydro, there's wind, all the
[20:52] different types of energy. The
[20:53] distribution is not even. We don't
[20:55] expect it to become even either,
[20:57] right? So, because of that,
[21:00] just to simply get the power to finish
[21:01] the whole job, we would need to
[21:03] distribute the training and inference
[21:05] resources. A traditional cloud data center
[21:08] is typically 50 to 300 MW.
[21:11] But for the AI clusters, we're looking
[21:13] at 1 to 5 GW. So, that's a lot of
[21:16] power, right? The whole power
[21:18] grid needs to be upgraded to support
[21:19] that. But also, one point I want
[21:22] to mention here is that regulations are also
[21:24] kind of slow to actually get us to the
[21:27] place where we need the power and where
[21:28] power is available.
[21:30] So, it's needed.
[21:33] But then when we try to distribute
[21:35] things, this is what happens, right?
[21:37] So, a few key factors that I really want
[21:39] to mention. Reliability is number one.
[21:41] The picture I'm showing here:
[21:42] over the years of operation at
[21:46] Microsoft, I developed a hobby of
[21:48] collecting fiber cut pictures. As you
[21:51] can see,
[21:52] in the first one the fiber was hit
[21:54] by a bullet.
[21:56] The second one
[21:57] is an arrow.
[21:59] So, if they were actually targeting the
[22:01] fiber, this is a really good archer.
[22:05] The third one is mice. I don't know
[22:06] what those rodents are going after, but
[22:08] they are chewing up the fiber. The
[22:10] fourth one is just a normal
[22:13] fiber pinch, which happens a lot during
[22:15] construction. The fifth one is an
[22:17] excavator. The last one is a
[22:19] landslide. So, all kinds of issues
[22:20] happen to the fiber.
[22:22] It's kind of,
[22:25] if you try to control fiber so it's not
[22:27] cut, it's like boiling the ocean, not
[22:28] possible. So, when you go outside into the
[22:30] world, all kinds of things will happen.
[22:31] So, reliability is a concern. We
[22:34] cannot stop this from happening, but
[22:36] what can we build so that
[22:38] the traffic, the load, does not experience
[22:40] it?
[22:41] The general long-haul outside
[22:44] plant fiber availability is only
[22:47] 97.7%, which is far away from the
[22:50] two, three, four, or five nines
[22:51] you would want for your service, right?
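To put 97.7% next to the "nines" targets, the sketch below converts availability into downtime per year and shows what diverse parallel routes would buy. It assumes route failures are independent, which real fiber routes sharing ducts or bridges often are not, so the parallel numbers are an optimistic bound. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of outage per year at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def parallel_availability(single_path: float, paths: int) -> float:
    """Availability of `paths` independent routes when any one route is enough."""
    return 1.0 - (1.0 - single_path) ** paths


single = 0.977  # long-haul outside plant figure quoted in the talk
for n in (1, 2, 3):
    a = parallel_availability(single, n)
    print(f"{n} route(s): {a:.5%} available, {downtime_hours_per_year(a):.2f} h/year down")
```

A single route at 97.7% is down for roughly 200 hours a year. Under the independence assumption, two diverse routes reach about three nines and three routes about four nines, which is why protection and route diversity matter so much at this distance.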
[22:53] So, the hardware FIT failure rate
[22:56] has a tremendous need to be improved.
[22:58] Link budgeting, margin, link
[23:01] flaps, all those things are the things
[23:04] we need to address if we really want to
[23:05] build a cross-continent distributed
[23:09] system for the jobs.
[23:12] Traditional cloud computing, in some
[23:14] sense, depending on your data type, is
[23:16] more tolerant to link failures. You can
[23:17] retransmit, you can use buffers.
[23:20] But then, as Richard also mentioned,
[23:22] this is a
[23:23] great high-performance computing
[23:25] large distributed system, but it's
[23:27] less tolerant to those kinds of failures.
[23:30] Synchronous jobs are not
[23:32] tolerant to failures.
[23:34] Another thing is energy
[23:35] efficiency. I summarized this chart
[23:37] also with the help of
[23:39] the leading AI models. As you
[23:41] can see, I also cross-validated the
[23:42] data, and it aligns with what I'm
[23:44] seeing. The further you go in
[23:48] distance, the more
[23:50] energy you will need to consume.
[23:52] That's natural, by
[23:55] physical laws. But if you look at
[23:56] the numbers, there are also roughly three
[23:58] orders of magnitude of difference in
[24:00] power efficiency. When we go to long
[24:02] distance, it's easy to go beyond
[24:04] 1,000 pJ per bit to get the data to
[24:06] another place, right? If you
[24:08] walk down this chart, I won't go into
[24:10] much detail to save some time, but
[24:12] you can see CPO is where we
[24:13] are looking for the most power-efficient
[24:15] domain, and there are
[24:16] different technologies:
[24:18] silicon photonics micro-rings,
[24:20] VCSELs, micro-LEDs, all the different
[24:22] directions trying to push this to
[24:24] the limit. But the gap is also big,
[24:26] right? So, I want people to pay
[24:27] attention to that. I will mention how we
[24:30] try to address it later.
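Energy per bit translates directly into watts once a throughput is fixed, since power is energy per bit times bits per second. The sketch below applies that identity; the 1.6 Tb/s throughput and the 1 pJ/bit short-reach figure are assumptions chosen for illustration, while the 1,000 pJ/bit long-distance figure is the one quoted above.

```python
def link_power_watts(energy_pj_per_bit: float, throughput_gbps: float) -> float:
    """Power in watts = (pJ/bit * 1e-12 J/pJ) * (Gb/s * 1e9 bit/s)."""
    return energy_pj_per_bit * 1e-12 * throughput_gbps * 1e9


THROUGHPUT_GBPS = 1600  # assumed 1.6 Tb/s of traffic, for illustration

short_reach_w = link_power_watts(1, THROUGHPUT_GBPS)      # assumed ~1 pJ/bit class optics
long_haul_w = link_power_watts(1000, THROUGHPUT_GBPS)     # 1,000 pJ/bit long-distance figure
print(f"{short_reach_w:.1f} W vs {long_haul_w:.0f} W for the same {THROUGHPUT_GBPS} Gb/s")
```

Moving the same 1.6 Tb/s costs about 1.6 W at 1 pJ/bit and about 1.6 kW at 1,000 pJ/bit, so every bit that has to cross the long-distance segment is charged at the expensive rate.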
[24:32] Another limiting factor, and
[24:34] there's a disclaimer here: on
[24:36] bandwidth, latency, and jitter, this
[24:38] is what
[24:39] I learned from
[24:41] publications and in interaction with
[24:44] the leading AI models. I didn't actually
[24:46] run the training jobs myself.
[24:49] So, these learnings are not
[24:51] first-hand. I did
[24:53] my best effort, and what I want to
[24:55] share with you guys is,
[24:57] bandwidth, of course. This was already
[24:58] mentioned many times. We need lots and
[25:00] lots of bandwidth, right? Then latency.
[25:03] Latency actually means your
[25:05] GPU, your XPU, is waiting for data.
[25:08] And that waiting is not free.
[25:10] You've got it burning power while it is
[25:12] waiting, right? For inference,
[25:14] time to first token is also an important
[25:17] metric, and everyone wants
[25:19] less time when trying to do
[25:22] a query or something.
[25:24] And jitter:
[25:26] all-reduce and all-gather operations are
[25:28] essentially synchronous, which means
[25:30] they're blocking. They have to wait
[25:31] until all the other parties finish their
[25:33] job and then move to the next step. So, if
[25:35] there's one link flap, or one
[25:37] failure from a computing node,
[25:40] they need to wait. And that waiting,
[25:41] again, is not free, right? So,
[25:43] network jitter, which is actually the
[25:45] consistency of the latency, is very
[25:48] important.
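Because the collective is blocking, the step time follows the slowest participant rather than the average one. The toy model below makes that concrete; the 10 ms compute time, the 5 ms link latency, and the single 50 ms straggler are made-up values, and `sync_step_time_ms` is a hypothetical helper rather than a model of any real collective library.

```python
def sync_step_time_ms(compute_ms: float, link_latencies_ms: list[float]) -> float:
    """A blocking collective completes only when the slowest participant completes."""
    return compute_ms + max(link_latencies_ms)


COMPUTE_MS = 10.0                    # assumed per-step compute time
steady = [5.0] * 8                   # eight links with consistent latency
one_slow = [5.0] * 7 + [50.0]        # same links, one delayed by a flap or retry

steady_step = sync_step_time_ms(COMPUTE_MS, steady)
jittery_step = sync_step_time_ms(COMPUTE_MS, one_slow)
mean_latency = sum(one_slow) / len(one_slow)
print(steady_step, jittery_step, mean_latency)
```

One slow link out of eight roughly doubles the mean latency (5 ms to about 10.6 ms) but quadruples the step time (15 ms to 60 ms), and every other accelerator sits idle for the difference.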
[25:51] So, enabling technologies. One of the
[25:53] directions is, we go back to the power
[25:56] efficiency,
[25:57] this three orders of magnitude
[25:59] lower, I would
[26:02] say, power efficiency, meaning
[26:04] more power consumption, in the long
[26:06] distance, right? This seems to be
[26:08] like a low-hanging fruit.
[26:11] People are already doing a lot of work
[26:14] over there, but have we done enough?
[26:17] We were thinking about this idea:
[26:19] how about if we can put the
[26:22] coherent DSP and the IM-DD DSP
[26:25] together?
[26:26] So, this is a concept that we call
[26:28] media converters.
[26:29] In the longer distance, if we are
[26:32] trying to address the challenge of
[26:35] getting to better power efficiency,
[26:38] there's such a
[26:40] chance, right? And another
[26:42] thing I'm thinking is,
[26:43] if the intra-data-center
[26:46] short-reach optics are all going into a CPO
[26:49] or NPO type of format, then the
[26:51] traditional IP-over-WDM coherent DSPs
[26:54] we're putting into the router, that is,
[26:57] 400ZR and 1.6T ZR, where are we going to
[27:00] put them? We're going to
[27:02] need to take them out and put them in another
[27:04] box. In this box, the traditional
[27:06] high-end transponders,
[27:09] the pizza-box transponders we're
[27:10] currently using, are not power efficient
[27:12] enough. So, one way of doing this is,
[27:14] we're trying to think about it: if we
[27:15] put these together, then we also save the
[27:18] high-speed electrical signals, which
[27:21] not only improves the signal integrity
[27:23] but also saves power. We remove two
[27:24] SerDes in this equation, right? And
[27:27] SerDes are another power-hungry
[27:29] component in the system.
[27:32] So, we are trying to leverage the
[27:33] already published ELSFP 2.0
[27:36] standard, putting these kinds of things,
[27:38] maybe
[27:40] 400ZR or 1.6T ZR DSPs, together with
[27:44] the retimer DSP, so that we can actually
[27:46] save SerDes, save power, and then bring
[27:48] this into a new format. We're
[27:49] working with some of the suppliers in
[27:51] this direction. And then inside it,
[27:53] right? Inside it, we can see that if we
[27:55] actually
[27:58] separate the high-speed
[27:59] optical interfaces from
[28:01] the low-speed electrical control signals
[28:04] and power supply, then
[28:06] we can achieve this with the already defined
[28:08] ELSFP form factor.
[28:10] So, that's one direction we're
[28:11] looking at. Another enabling technology
[28:13] I want to mention is hollow core
[28:14] fiber. As
[28:16] a lot of you may have already noticed,
[28:17] Microsoft is
[28:19] building hollow core fibers.
[28:21] A natural benefit we
[28:23] can imagine is about 33% lower
[28:25] latency
[28:26] when light is traveling inside air
[28:28] instead of inside glass. And also
[28:31] lower loss. A lot of you may not have noticed this,
[28:32] since it has not been brought to its full
[28:34] potential yet: the lower loss of
[28:36] hollow core fiber holds across many bands.
[28:38] The low-loss
[28:40] window is way bigger than what
[28:42] traditional glass can give us. It's
[28:44] not only C-band and L-band, it's also O-band,
[28:46] even T-band, as you can see here, right?
[28:48] So,
[28:50] no fiber nonlinearities. This also
[28:51] means we can put more power,
[28:53] total aggregated power, into the fiber
[28:55] without causing the trouble we
[28:57] traditionally experience in
[28:58] single-mode fiber.
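The latency benefit follows from propagation delay being distance times group index divided by the speed of light. The sketch below uses assumed typical group indices (about 1.468 for standard single-mode fiber and about 1.0003 for an air core) and an arbitrary 100 km span; none of these values come from the talk.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum


def one_way_latency_ms(distance_km: float, group_index: float) -> float:
    """Propagation delay in milliseconds over distance_km of fiber."""
    return distance_km * group_index / C_KM_PER_S * 1e3


# Assumed typical values, for illustration only.
N_GLASS = 1.468    # standard single-mode fiber
N_AIR = 1.0003     # light guided in an air core
SPAN_KM = 100.0

smf_ms = one_way_latency_ms(SPAN_KM, N_GLASS)
hcf_ms = one_way_latency_ms(SPAN_KM, N_AIR)
reduction = 1.0 - hcf_ms / smf_ms
print(f"SMF {smf_ms:.3f} ms, hollow core {hcf_ms:.3f} ms, {reduction:.0%} lower")
```

With these indices the reduction comes out at roughly a third, in line with the figure quoted above, and it scales linearly with route length.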
[29:01] Another enabling technology I want to
[29:02] mention is optical circuit
[29:03] switching.
[29:06] To me, why we need this is
[29:08] also due to the dramatic bandwidth
[29:10] demand, and in electrical packet switching,
[29:12] the ASICs, the radix is not growing
[29:15] fast enough. So, if we can
[29:18] no longer build a fat-tree topology that
[29:20] can connect a hundred thousand GPUs,
[29:24] then we will have to extend it and
[29:26] reduce it to a non-all-to-all type of
[29:28] topology, and that's where OCS can
[29:30] come into play. Dragonfly, 3D
[29:32] torus,
[29:33] as Google has already published a lot on the
[29:35] 3D torus topology, all these are
[29:37] things that we can look at to actually
[29:39] extend the connectivity.
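The link between switch radix and cluster size can be sketched with the standard folded-Clos (fat-tree) count: a non-blocking tree of identical radix-k switches with L tiers supports 2·(k/2)^L endpoints, which is k³/4 for three tiers. The helper names below are hypothetical, and the model ignores oversubscription, multi-rail designs, and ports per GPU.

```python
def fat_tree_hosts(radix: int, tiers: int = 3) -> int:
    """Max endpoints in a non-blocking fat tree of identical radix-`radix` switches."""
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even number >= 2")
    return 2 * (radix // 2) ** tiers


def min_radix_for(hosts: int, tiers: int = 3) -> int:
    """Smallest even radix whose fat tree reaches at least `hosts` endpoints."""
    radix = 2
    while fat_tree_hosts(radix, tiers) < hosts:
        radix += 2
    return radix


print(fat_tree_hosts(64))        # three tiers of radix-64 switches
print(fat_tree_hosts(64, 2))     # two tiers of the same switches
print(min_radix_for(100_000))    # radix needed for 100k endpoints in three tiers
```

Three tiers of radix-64 switches stop at 65,536 endpoints, so a hundred thousand GPUs needs either a higher radix (at least 74 in this model), a fourth tier, or a topology that gives up full all-to-all connectivity.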
[29:42] Um, and for the "how" of optical
[29:44] circuit switching, we notice MEMS
[29:46] mirrors, MEMS waveguides, digital liquid
[29:48] crystals, piezoelectric actuators. Those
[29:51] are all enabling technologies that can
[29:52] get us there.
[29:54] Uh, and another enabling technology is
[29:55] co-packaged optics.
[29:58] There are many things; I just want to
[30:00] highlight one, which was published by
[30:02] the Microsoft Research team in 2025 at
[30:05] SIGCOMM. It's the micro-LED based
[30:06] solution, right? So, it can actually,
[30:08] if utilized well, bring
[30:11] the energy efficiency to even
[30:13] sub-picojoule per bit.
[30:16] And then we also need industry
[30:17] alignment, right? So, these are the MSAs.
[30:19] I noticed that right before this OFC,
[30:21] there's the Open CPX MSA, there's the XPO
[30:24] MSA, and there's the OCI MSA, which was also
[30:27] mentioned before. These MSAs
[30:30] uh do bring us
[30:32] uh one step closer in alignment, getting
[30:35] the industry to where we want. And also
[30:37] there are existing ones I mention here
[30:38] as, uh, founders, which we
[30:40] uh largely drove and also benefit from.
[30:43] And, uh, YSAP, we are trying to look at
[30:45] to get a together, right? So, quickly
[30:48] summarize, uh, we do need distributed
[30:50] systems, no matter whether it's called scale
[30:51] up, scale across, or scale, uh, scale
[30:54] out.
[30:54] Uh, but there are
[30:56] blockers to, uh, getting us there. And
[30:59] these are technologies we notice. There's
[31:01] more than this, right? But, uh, these are
[31:02] just a few examples I want to mention
[31:05] that can help us get there. And also
[31:06] industry alignment is needed to get us
[31:08] there. Thank you very much.
[31:15] So, I'm Tad Hoffmeister. I'm with the
[31:17] platforms infrastructure team at Google.
[31:20] And, um, I'd like to thank many
[31:22] contributors
[31:23] in our, um, AI infrastructure team,
[31:25] which, uh,
[31:26] the PI, the platforms team is part of.
[31:29] Okay. So, first, um, a heads-up, I'm
[31:32] going to be talking about, um,
[31:35] design concepts, and I don't want
[31:36] pictures taken and posts saying,
[31:38] "Hey, this is how Google's building
[31:40] their network." Cuz, uh, it's not
[31:41] necessarily the case. And it's
[31:43] certainly, uh, especially when I talk
[31:44] about future things, I don't want it to
[31:46] be, "Oh, this is how Google's going to
[31:47] do it." Um, it's how we may do it.
[31:50] And Google's a, uh, heterogeneous
[31:52] network. We have a lot of different
[31:53] designs to meet different needs of
[31:55] different customers.
[31:57] All right. So, um,
[31:59] Jeff and Yawei have talked a little bit
[32:01] about the different scale up, uh, or
[32:03] scale networks.
[32:05] And, uh, first I want to point out,
[32:06] Google, we have two types
[32:08] of, uh, AI clusters. One is an Nvidia
[32:12] type GPU.
[32:14] And there, the scale up is using, uh,
[32:16] NVLink switches and NVLinks. And today,
[32:19] um, the latest and greatest, you have 72
[32:21] nodes
[32:22] within one of those fabrics. Um,
[32:25] and a node, uh, is is a GPU package.
[32:28] Um, and over time these are getting
[32:31] larger and larger in terms of the number
[32:33] of nodes.
[32:34] Uh, and they're also redefining what a
[32:35] node is by going from package to die.
[32:38] Um on the TPU side in the green box
[32:41] is um
[32:42] uh Google's internally developed uh
[32:44] Tensor Processing Units. And here we
[32:47] use a um
[32:48] like Yawei was showing on one of his
[32:49] later slides um a multi-dimensional
[32:51] torus rather than an electronic fabric. And
[32:55] um in combination with that torus, uh, we
[32:57] have these cube building blocks of 64
[33:00] nodes in a cube
[33:02] um using copper-based interconnect
[33:03] today.
[33:04] Uh then we use an optical circuit switch
[33:06] or multiple optical circuit switches to
[33:08] interconnect um 144 of these cubes. So
[33:11] within one scale-up domain for TPUs, we
[33:14] have about 9,000 nodes. Um but my point is not to go
[33:18] too much into the details, just showing
[33:19] we have both. But um there are similar
[33:21] trends between these in that with each
[33:24] generation of new GPU or TPU
[33:26] the the processing power and the amount
[33:28] of memory per node is increasing, which
[33:30] is driving up the bandwidth coming in
[33:32] and out of that TPU. And at the same
[33:35] time um
[33:36] uh
[33:37] some of the jobs that are being run on
[33:38] these are requiring clusters with many
[33:40] many more uh TPUs. So the trends are um
[33:44] more
[33:45] in multiple dimensions.
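The scale-up domain sizes quoted above (72 nodes per NVLink fabric, and 144 cubes of 64 nodes per TPU pod) can be checked with a few lines of arithmetic. This is a sketch; the 50,000-node job size is an assumed example and `domains_needed` is a hypothetical helper.

```python
NVLINK_DOMAIN_NODES = 72  # nodes per NVLink scale-up fabric, as quoted
NODES_PER_CUBE = 64       # TPU cube building block, as quoted
CUBES_PER_POD = 144       # cubes joined by optical circuit switches, as quoted

TPU_POD_NODES = NODES_PER_CUBE * CUBES_PER_POD  # 9,216, the "about 9,000"


def domains_needed(job_nodes: int, domain_nodes: int) -> int:
    """Number of scale-up domains a job of job_nodes must span (ceiling division)."""
    return -(-job_nodes // domain_nodes)


if __name__ == "__main__":
    job = 50_000  # assumed job size, in the "tens of thousands" range
    print(f"TPU pod size: {TPU_POD_NODES:,} nodes")
    print(f"NVLink domains for {job:,} nodes: {domains_needed(job, NVLINK_DOMAIN_NODES)}")
    print(f"TPU pods for {job:,} nodes: {domains_needed(job, TPU_POD_NODES)}")
```

Either way, a job in the tens of thousands of nodes spans more than one scale-up domain, which is where the scale-out network comes in.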
[33:46] So
[33:48] for these large jobs that require tens
[33:50] of thousands of nodes,
[33:52] uh this is what's driving the need for a
[33:54] scale-out network. Um
[33:56] on the left, for GPUs, where you have
[33:58] these building blocks of 72 nodes,
[34:01] you're now aggregating tens of thousands
[34:02] of these. And you build out the
[34:04] scale-out fabric and uh
[34:07] you want to minimize latency, but
[34:10] you also want very high-capacity
[34:11] bandwidth. So um you want to minimize
[34:14] the oversubscription so that uh the
[34:17] GPUs have full access for any-to-any
[34:20] interconnect.
[34:21] And uh we're seeing the trend again more
[34:23] bandwidth and more nodes. On the TPU
[34:26] side, since each of those pods is
[34:27] already 9,000 nodes,
[34:29] um there's less
[34:31] uh or there was in the past less need of
[34:34] a scale-out network, but now the ask is
[34:36] for tens of thousands of these to be
[34:38] interconnected, which is now requiring,
[34:40] um, a scale-out network to interconnect
[34:42] uh, TPUs.
[34:45] Uh, taking a step back, this is just a
[34:47] picture of a, uh, typical Google, uh,
[34:49] data center campus.
[34:52] And, uh, the main reason I'm showing
[34:53] this is to show we have multiple buildings
[34:56] and with different generations of data
[34:59] centers and buildings, the power
[35:01] capacity of those buildings is evolving.
[35:04] Um, you know, it's easy to show a
[35:06] diagram where, "Okay, I want to build a
[35:07] 100,000 GPU node." And if I'm starting
[35:10] from scratch and building this ideal
[35:12] data center, I can build a single
[35:14] gigawatt building. But the reality is
[35:16] we have existing infrastructure we want
[35:18] to utilize, or, um, there may be other
[35:20] reasons that limit the, uh, power of a
[35:23] given building, which requires
[35:24] distributing these large clusters across
[35:27] multiple buildings.
[35:28] And then the other thing I want to point
[35:29] out is this picture was taken when, um,
[35:31] buildings one through four are up and
[35:33] online, building five is in
[35:35] construction. So,
[35:36] uh, the other thing to keep in mind is
[35:38] as the campus evolves, you're not
[35:42] built out to 100% on day one. So, you want a
[35:45] uh, design that can scale as the
[35:47] campus scales out and you're adding more
[35:49] buildings and adding more capacity.
[35:53] All right. So, if you have now a single
[35:56] cluster across multiple buildings,
[35:59] you want to, um,
[36:01] it's not practical to have a single
[36:03] scale-out fabric that spans
[36:04] multiple buildings. One reason is, uh,
[36:07] you'd add latency, but the other reason
[36:08] is just the cross-sectional bandwidth
[36:10] internal, uh, within that fabric is just
[36:14] not practical. So, instead, this is, um,
[36:17] uh, we need an interconnect, and this is
[36:19] what I'm calling uh, scale across. I
[36:21] think when Tom did his intro, he said
[36:23] scale across means different things. Um,
[36:27] you know, and I'll be
[36:28] describing in a couple of slides where
[36:30] you're actually crossing uh tens of
[36:32] kilometers or thousands of kilometers,
[36:34] but I wanted to point out um this use
[36:36] case that we have. It's an emerging
[36:37] application, and I wanted to point out
[36:39] the special uh requirements of it. And
[36:42] that is basically scale out uh within
[36:44] the campus or um what I'm also calling
[36:47] uh or what I am going to call moving
[36:49] forward uh point-to-point interbuilding
[36:51] scale out. So, the simplest design is
[36:54] you have a point-to-point connection
[36:56] between two buildings,
[36:57] and you just basically
[37:00] add the optics or the
[37:02] transceivers and the fiber to
[37:04] directly interconnect those scale-out
[37:06] fabrics. And the motivation for doing
[37:08] that directly, rather than having, um,
[37:10] like a backbone fabric there,
[37:12] is you're saving on the cost and power of
[37:14] another switching stage and another
[37:16] layer of electrical-to-optical
[37:20] uh
[37:21] conversions.
[37:23] Um but it's also uh you're reducing
[37:25] latency by having this direct connect.
[37:27] And just to get an idea of the order of
[37:29] scale. So, if you have tens of thousands
[37:31] of nodes, and each node is on the order
[37:33] of a terabit per second of capacity, this now
[37:35] means that the cross-sectional
[37:36] bandwidth between those fabrics is tens
[37:39] to hundreds of petabits per second.
[37:41] Um, and a petabit is 1,000 terabits. Uh
[37:44] so, these are tens of thousands of um
[37:47] uh 1.6 T generation uh transceivers,
[37:49] OSFPs.
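The order-of-magnitude estimate above can be reproduced as follows. This is a sketch that assumes every node's full bandwidth crosses the inter-building link (no oversubscription) and counts transceivers at one end only; the node counts are example values and `cross_section` is a hypothetical helper.

```python
def cross_section(nodes: int, node_gbps: int = 1_000, transceiver_gbps: int = 1_600):
    """Return (cross-sectional bandwidth in Pb/s, transceivers needed per end).

    Assumes the full per-node bandwidth crosses the link, with no oversubscription.
    """
    total_gbps = nodes * node_gbps
    petabits_per_s = total_gbps / 1_000_000        # 1 Pb/s = 1,000,000 Gb/s
    transceivers = -(-total_gbps // transceiver_gbps)  # ceiling division
    return petabits_per_s, transceivers


if __name__ == "__main__":
    for nodes in (20_000, 100_000):  # assumed examples of "tens of thousands" and up
        pbps, count = cross_section(nodes)
        print(f"{nodes:>7,} nodes: {pbps:6.1f} Pb/s, {count:,} x 1.6T transceivers per end")
```

At about a terabit per node, 20,000 nodes already land at 20 Pb/s and 12,500 transceivers of the 1.6T generation per end, consistent with the "tens of thousands" figure in the talk.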
[37:51] Um Oh, and then as far as distance. So,
[37:54] uh
[37:55] I didn't really show a scale in that
[37:56] picture, but in most cases we have less
[37:59] than 2 km for this interconnect for the
[38:02] distance,
[38:03] but we always have sort of outliers. So,
[38:05] there are cases where we want to go
[38:06] higher. So, we want a solution that can
[38:07] go significantly beyond 2 km.
[38:12] All right. So, now as your uh campus
[38:15] scales and if you want to interconnect
[38:16] more than two buildings,
[38:18] if you follow the simple approach, you
[38:20] have to build out a full mesh of
[38:22] point-to-point connections.
[38:23] And the downside of this approach, uh, or
[38:25] on the one side, it's simple and it
[38:27] keeps it low latency, low power,
[38:30] and minimizes the cost. But the downside is
[38:32] you have a limited bandwidth coming out
[38:34] of each of these fabrics. And now you're
[38:36] dividing that bandwidth across multiple
[38:38] links. So, your
[38:41] oversubscription is now going to increase as
[38:42] you add more and more buildings.
[38:44] And probably your utilization isn't
[38:46] going to be as good.
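The bandwidth dilution of a static full mesh can be sketched as below, assuming each fabric's fixed inter-building egress is split evenly across its peers. The 12,000 Gb/s egress figure is an arbitrary illustration and `per_pair_gbps` is a hypothetical helper.

```python
def per_pair_gbps(egress_gbps: float, buildings: int) -> float:
    """Bandwidth between any two buildings when a fixed egress is split evenly."""
    if buildings < 2:
        raise ValueError("a mesh needs at least two buildings")
    return egress_gbps / (buildings - 1)


if __name__ == "__main__":
    egress = 12_000.0  # assumed inter-building egress per fabric, Gb/s
    for n in range(2, 6):
        share = per_pair_gbps(egress, n)
        print(f"{n} buildings: {share:8.1f} Gb/s per building pair "
              f"({egress / share:.0f}x less than point-to-point)")
```

With a fixed egress, the bandwidth available to any one building pair falls as 1/(N-1), so a job that mostly talks between two buildings sees its effective oversubscription grow with every building added.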
[38:48] So, one solution is to put, you
[38:50] know, that electronic spine layer in
[38:52] there, but for the reasons I described
[38:53] before
[38:54] we'd rather not do that.
[38:57] So now, another application
[38:59] for an optical circuit switch is putting
[39:01] that there.
[39:03] And, um, the optical circuit
[39:06] switch is not going to give you per-packet
[39:07] switching and statistical
[39:09] multiplexing. So, you're still going to
[39:11] be more and more oversubscribed coming
[39:13] out of each fabric as you're adding more
[39:15] and more buildings, because you're
[39:16] basically setting up the previous
[39:18] connections, but statically, for a certain
[39:20] amount of time.
[39:22] Um but
[39:23] where the OCS adds value is
[39:27] you may have jobs that don't require the
[39:29] full capacity of what's across all
[39:32] three buildings. So, you
[39:34] can reconfigure that and optimize
[39:36] for say connecting between buildings A
[39:38] and B and now you can increase the
[39:40] bandwidth you have there.
[39:41] And then also what I was describing
[39:43] before about that all the buildings
[39:44] aren't miraculously built at the same
[39:46] time.
[39:47] So, um the OCS allows you to add more
[39:50] buildings as they're built into that
[39:52] fabric and allow um the existing
[39:54] buildings to to connect to it without
[39:57] spending a lot of labor and a lot of
[39:59] cost in reconfiguring everything.
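One way to picture the two OCS benefits described above (re-steering capacity toward a busy building pair, and absorbing a new building without re-cabling) is a toy circuit map. This is an illustrative model only, not a description of any real OCS control plane; the port count of 96 and all function names are assumptions.

```python
from itertools import combinations


def even_mesh(buildings, ports_per_building):
    """Split each building's OCS-facing ports evenly across all of its peers."""
    if len(buildings) < 2:
        raise ValueError("need at least two buildings")
    per_pair = ports_per_building // (len(buildings) - 1)
    return {pair: per_pair for pair in combinations(sorted(buildings), 2)}


def ports_used(circuits, building):
    """Count the OCS ports a building consumes under a given circuit map."""
    return sum(count for pair, count in circuits.items() if building in pair)


def fits(circuits, ports_per_building):
    """True if no building exceeds its port budget."""
    buildings = {b for pair in circuits for b in pair}
    return all(ports_used(circuits, b) <= ports_per_building for b in buildings)


if __name__ == "__main__":
    ports = 96  # assumed OCS-facing ports per building
    mesh3 = even_mesh(["A", "B", "C"], ports)
    print("even mesh, 3 buildings:", mesh3)

    # A job that only spans A and B: steer every circuit to that pair.
    focused = {("A", "B"): ports, ("A", "C"): 0, ("B", "C"): 0}
    print("focused on A-B:", focused, "valid:", fits(focused, ports))

    # Building D comes online: same cabling into the OCS, new circuit map.
    print("even mesh, 4 buildings:", even_mesh(["A", "B", "C", "D"], ports))
```

The cabling from each building into the OCS stays fixed in this model; only the circuit map changes, which is the labor and cost saving the speaker points to.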
[40:04] All right. So, um
[40:06] now, uh, taking a step back and looking at
[40:08] the other type of scale across,
[40:10] when you are trying to go between
[40:12] campuses,
[40:13] and Google does this as well, um,
[40:16] you're now going tens to thousands of
[40:17] kilometers.
[40:18] And here, um,
[40:20] typically you would have a uh,
[40:23] DWDM transport or um, what we call a
[40:26] data center interconnect, also in
[40:28] terms of a box, or
[40:30] Yawei was describing a media converter,
[40:32] which I guess could also be used for
[40:33] this.
[40:34] And, um,
[40:36] this because you have the technology to
[40:38] go farther and do uh, with
[40:41] uh, with DWDM you can increase the
[40:43] capacity per fiber pair because
[40:45] the fibers between campuses are going to
[40:48] be a lot more, um, limited in terms of
[40:50] the the availability of how many you
[40:52] have.
[40:53] So, um,
[40:54] not only you're going to have more
[40:55] expensive equipment to to maximize the
[40:58] util- or the capacity on those fiber
[41:00] pairs,
[41:01] but, um,
[41:02] you're also typically going to uh,
[41:05] effectively have more over subscription
[41:07] cuz you're going to have less capacity
[41:09] on those links.
[41:11] Um, so
[41:13] in addition to being able to
[41:14] interconnect it with, um,
[41:16] DWDM transport, you can also use these
[41:19] pluggable, um, coherent modules, uh, ZR
[41:22] and ZR plus.
[41:24] Uh, Jeff talked about the 800 gig
[41:27] that they're using
[41:28] quite a bit, and Google uses it quite a bit,
[41:30] too. Uh, 400 and 800 and eventually
[41:33] 1.6T.
[41:34] Um,
[41:35] and you'll have these gray optics to go
[41:37] from your back end
[41:39] because, um,
[41:41] these back end router or the back end
[41:43] fabric
[41:44] Uh, sorry. I didn't describe this a little
[41:46] bit. I meant to also talk about
[41:47] front end versus back end because,
[41:50] um,
[41:52] if you don't have a direct interconnect
[41:54] into your scale out network, you can
[41:56] also interconnect via, um, the front end
[41:58] network or the DCN network. But, that's
[42:00] typically going through, um,
[42:03] a CPU host that's associated with your
[42:06] XPU and then going out through the NIC
[42:08] for that host. So, that's typically less
[42:10] bandwidth.
[42:11] Um, so if you're going to have a scale
[42:14] out interconnect or scale across
[42:16] interconnect directly on your scale
[42:17] outs,
[42:18] you typically wouldn't have both
[42:21] your front end and your back end, but
[42:22] I'm just showing that how you
[42:24] interconnect those
[42:25] would be the same from a DWDM standpoint.
[42:28] Okay.
[42:29] Um
[42:31] I guess the point I wanted to make,
[42:32] though, is
[42:34] uh the ZR and ZR+, although they're
[42:36] pluggable,
[42:37] um they're higher power. You typically
[42:39] can't load them directly onto your scale-out
[42:41] um
[42:43] switching fabric ports. The
[42:45] port density is just too high to support
[42:47] those.
[42:48] So, um now if you're looking at this
[42:51] interconnect within the campus, the
[42:52] scale across within the campus,
[42:55] you really want to avoid having those
[42:56] additional layers of either the DWDM
[42:58] transport or that other switching layer
[43:01] that can support ZR or ZR+.
[43:04] Also, ZR and DWDM have a higher-gain FEC,
[43:06] which has higher latency.
[43:08] So, um
[43:10] this new application, or
[43:11] new requirement, is that we want to maximize
[43:14] uh
[43:15] uh the density
[43:16] um of this uh
[43:19] optical layer that can support going
[43:21] between buildings within a campus.
[43:23] Traditionally, we can do that with IMDD.
[43:25] So, that's intensity modulated direct
[43:27] detect. But as we look forward to um
[43:30] future TPUs and GPUs that have 400 gig
[43:32] and higher uh serial IO,
[43:35] IMDD's most likely not going to work uh
[43:38] over these distances and, uh, with the loss
[43:40] budget. Um, so this is where coherent
[43:43] lite comes into play. So, it's like
[43:44] ZR, but a stripped-down version of ZR.
[43:48] And, uh, so with
[43:50] coherent lite you can close a longer
[43:51] distance, but then the other advantage
[43:53] is if we couple it with um passive
[43:55] multiplexing,
[43:56] we can now increase how many wavelengths
[43:58] or increase the capacity per fiber pair
[44:01] even within this campus. And that also
[44:02] reduces not only the fibers going
[44:04] between buildings, but also the number
[44:05] of ports required on your OCS.
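The effect of passive multiplexing on fiber and OCS port counts is simple ceiling arithmetic. The sketch below assumes each transceiver contributes one wavelength and that OCS port count scales with fiber-pair count; the 12,500 transceiver figure and the wavelength counts are illustrative assumptions, and `fiber_pairs_needed` is a hypothetical helper.

```python
def fiber_pairs_needed(wavelengths: int, wavelengths_per_pair: int) -> int:
    """Fiber pairs required when wavelengths_per_pair share one pair (ceiling division)."""
    if wavelengths_per_pair < 1:
        raise ValueError("need at least one wavelength per fiber pair")
    return -(-wavelengths // wavelengths_per_pair)


if __name__ == "__main__":
    transceivers = 12_500  # assumed: one wavelength per transceiver
    for per_pair in (1, 4, 8):  # assumed multiplexing factors
        pairs = fiber_pairs_needed(transceivers, per_pair)
        print(f"{per_pair} wavelength(s) per pair: {pairs:,} fiber pairs between buildings")
```

Multiplexing eight wavelengths per pair cuts the inter-building fiber count, and with it the OCS ports those fibers land on, by roughly a factor of eight.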
[44:11] And this is just showing a comparison: with
[44:13] Moore's law improvements in DSPs
[44:16] for coherent, we're getting down to the
[44:19] power density of IMDD. And IMDD is kind
[44:22] of leveling out as you're getting higher
[44:24] and higher speeds. It's not coming down
[44:26] in power efficiency.
[44:28] The other required enabling technology
[44:31] for coherent lite is
[44:33] low-loss
[44:34] modulators with newer technologies, so
[44:37] like thin-film lithium niobate or
[44:39] silicon organic. These not only have the
[44:42] higher um
[44:43] bandwidths required for the higher symbol
[44:45] rates of 1.6T and above, but also
[44:49] have lower drive voltages. So overall
[44:51] you can have lower power in your
[44:53] module. And this is what's enabling
[44:55] putting
[44:56] coherent lite at full density on
[44:59] the scale-out fabric switches.
[45:03] So
[45:04] in conclusion,
[45:06] the trend of having more and
[45:08] more
[45:10] XPUs within clusters
[45:12] and the capacity bandwidth requirements
[45:15] between those is driving up the need
[45:17] to have more and more
[45:20] bandwidth between them, but because
[45:23] it's exceeding the capacity
[45:25] of a single building, this leads to the
[45:27] scale across, or scale-out inter-building,
[45:31] requirements. And as we go to 400 gig
[45:34] and above IO coming out of the XPUs,
[45:37] we really need a coherent solution and
[45:40] in order to put that directly on the
[45:41] fabrics and at scale with a competitive
[45:45] power and price,
[45:46] coherent lite is the way to go.
[45:49] Um
[45:50] and again,
[45:51] uh
[45:52] please note the caveat at the beginning.
[45:54] Thanks. Uh hi everybody.
[45:56] Uh, for those of you who don't know me,
[45:58] my name is Rakesh Chopra and I do
[46:00] hardware architecture work here at
[46:02] Cisco. Uh, like many of the speakers
[46:04] before me commented, it's a hard thing
[46:06] to follow everybody who was amazing
[46:08] before. So, I think my main goal here is
[46:11] to reset expectations so VJ comes out
[46:13] looking really good.
[46:16] Okay. So, let's jump into it. I wanted
[46:18] to start today, although I'm a systems
[46:20] vendor, I want to actually take a
[46:22] holistic approach at the problem to
[46:24] understand where we need to optimize.
[46:26] So, let's look at how data centers are
[46:27] built from my perspective. We have
[46:30] the DCI, or the data center
[46:31] interconnect. And that is a network that is
[46:34] used to connect data centers together
[46:36] and to end networks. Connecting into
[46:39] that is the front-end network, and it is
[46:41] about seven times higher bandwidth than
[46:43] the DCI and that connects CPUs together,
[46:47] CPUs to storage, and to end users. Now,
[46:50] as we've gone from deploying sort of
[46:52] traditional servers in data centers and
[46:55] we start thinking about data centers
[46:57] being these AI powerhouses, we replace
[47:00] CPUs with GPUs. And we scale those
[47:03] within a rack today and eventually
[47:04] across multiple racks with optics with a
[47:07] scale up network and that has a shocking
[47:09] 504 times higher bandwidth than the
[47:12] DCI networks.
[47:15] Now, the problem is that it can only
[47:17] connect roughly 100 GPUs with copper, or
[47:19] maybe 1,000 GPUs with optics. When you
[47:22] look at how you train
[47:24] AI/ML models today, it's trained on a
[47:27] very large number of GPUs and to solve
[47:29] that problem, we build what's called the
[47:31] scale out network and that is connecting
[47:33] uh racks of of GPUs together. And
[47:37] roughly speaking, a scale out network is
[47:39] about 56 times the bandwidth of a DCI
[47:42] and you can connect on the order of
[47:44] 50,000 to 100,000 GPUs depending on the
[47:46] size of your data center.
[47:48] Now, where do we get into problems?
[47:51] We get into problems with the
[47:52] realization that even today, ChatGPT was
[47:56] trained on about 50,000 to 100,000 GPUs.
[48:00] And in order to unlock new levels of of
[48:02] intelligence in AI, you have to continue
[48:05] to scale the cluster as many people
[48:06] before me have already represented.
[48:09] So, we are not at a problem tomorrow. We
[48:12] are at a problem yesterday. So, what
[48:14] have we been doing about that? Right,
[48:16] Jensen introduced the notion of the scale
[48:18] across network. And what that is doing
[48:20] is that is connecting multiple scale out
[48:23] networks together
[48:25] to connect data centers together and
[48:27] that allows you to scale to hundreds of
[48:28] thousands or millions of GPUs. And
[48:31] roughly speaking, that is about 14 times
[48:34] the bandwidth of a traditional data
[48:35] center interconnect network.
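The relative bandwidth figures quoted in this part of the talk, all expressed as multiples of a traditional DCI, can be lined up in one place. This is a sketch that only tabulates the speaker's numbers; reading the ratio between adjacent tiers as an implied oversubscription is an assumption, and `ratio` is a hypothetical helper.

```python
# Multiples of traditional DCI bandwidth, as quoted by the speaker.
RELATIVE_BANDWIDTH = {
    "traditional DCI": 1,
    "front end": 7,
    "scale across": 14,
    "scale out": 56,
    "scale up": 504,
}


def ratio(a: str, b: str) -> float:
    """How many times more bandwidth network a has than network b."""
    return RELATIVE_BANDWIDTH[a] / RELATIVE_BANDWIDTH[b]


if __name__ == "__main__":
    for name, multiple in sorted(RELATIVE_BANDWIDTH.items(), key=lambda kv: kv[1]):
        print(f"{name:<16} {multiple:>4}x DCI")
    print(f"scale up vs scale out:     {ratio('scale up', 'scale out'):.0f}:1")
    print(f"scale out vs scale across: {ratio('scale out', 'scale across'):.0f}:1")
```

On these figures the scale across network carries about a quarter of the scale-out bandwidth while still being 14 times a traditional DCI.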
[48:38] So,
[48:39] we are here today to talk about
[48:42] terminology and sort of wrap our heads
[48:44] around how we talk about this stuff. So,
[48:47] if I redraw that previous picture in a
[48:49] very simple way, really there's two
[48:51] distinct networks that connect data
[48:53] centers. There's a DCI and the scale
[48:55] across network. And there's a lot
[48:57] similar about these things. They do both
[48:59] connect data centers and typically that
[49:02] is across sort of 10 km plus based on
[49:04] coherent networks. But as Tad mentioned
[49:07] earlier, there's cases where you might
[49:08] actually be doing a DCI or a scale
[49:10] across network for short distances as
[49:13] well.
[49:14] If you're leaving the facility,
[49:16] typically these need to be secured
[49:18] infrastructure. You can do security at
[49:20] different layers of the network stack,
[49:21] but because it's leaving your facility,
[49:23] you usually do want that to be secure.
[49:26] And because of the reality of the cost,
[49:28] power, and the operational challenges of
[49:30] deploying all of this fiber, typically
[49:33] these are over subscribed networks. Now,
[49:35] I think unfortunately, what we find is
[49:38] that legacy terminology really struggles
[49:41] to sort of keep up with today's
[49:42] innovations. And this term data center
[49:44] interconnect or DCI is a legacy term.
[49:48] So, I'd like to introduce the notion
[49:49] that we don't call it DCI anymore. We
[49:52] talk about the first network as
[49:54] traditional DCI and the second network
[49:56] as a scale across. Both those are
[49:58] obviously data center interconnects.
[50:01] Now, if we start thinking about what is
[50:04] actually different about them,
[50:06] okay? A traditional DCI not only
[50:08] connects data centers together, it also
[50:10] connects to end users as well through a
[50:13] wide area network.
[50:15] But it's not actually just connecting
[50:16] data centers, it's connecting front-end
[50:19] networks inside of those data centers,
[50:21] which are built up out of a bunch of
[50:23] CPUs. So, a traditional DCI connects
[50:27] CPUs over a front-end network
[50:30] to each other
[50:31] and to the wide area network and end
[50:33] users. A scale across network on the
[50:35] other hand is connecting scale out
[50:37] networks together, and that is built on
[50:40] a bunch of high-bandwidth GPUs
[50:43] interconnected directly together today.
[50:46] Now, if you think about what's running
[50:47] on a CPU, really that's built out of
[50:50] many low-bandwidth, primarily
[50:52] loss-tolerant, asynchronous flows.
[50:55] Whereas GPUs, those are running very
[50:57] high-bandwidth, loss-intolerant,
[51:00] synchronous, and long-lived flows.
[51:02] And if you think about the growth
[51:04] trajectory of these networks, you have a
[51:06] reasonably linear scale bandwidth growth
[51:08] in a traditional DCI, and you have an
[51:11] exponential bandwidth growth with scale
[51:14] across. So, at the end of the day, if
[51:16] you think about what I've just said, I'm
[51:18] not actually sure you could have two
[51:20] more distinct set of things to describe
[51:23] the term DCI.
[51:24] You have distinct endpoints, networks,
[51:27] workloads, and growth.
[51:29] Now, let's talk about what some of the
[51:31] challenges of scale across networks are.
[51:34] So,
[51:35] in the vast majority of use cases today,
[51:38] when you talk about trying to do a
[51:40] lossless network, we build it out
[51:42] of what's called reactive congestion
[51:44] control. Basically, what you do is you
[51:46] launch traffic into a network, the
[51:48] network monitors for congestion, and
[51:50] marks the packets that have seen
[51:52] congestion. It gets to the endpoint, which
[51:53] says, uh,
[51:55] "Congestion's happened. I'm going to
[51:56] signal the source to back off." And
[51:59] what's happening here is that the
[52:00] buffering inside of the network, inside
[52:02] of the switches, is acting like a shock
[52:05] absorber, absorbing that
[52:07] traffic as the bandwidth converges.
[52:10] So, what's the problem with that when
[52:12] you think about scale across network?
[52:14] Out of all of the innovation that we can
[52:15] do here, we have yet to crack the very
[52:18] pesky problem of the speed of light.
[52:21] And if you run some math, roughly
[52:23] speaking,
[52:24] uh if you look at a 100 km link, you'd
[52:26] have about 100 MB of traffic in flight
[52:29] before flow control can respond to that.
[52:32] Okay? And if you compare that to modern
[52:34] switches today, that's about half the
[52:36] buffer of a modern switch,
[52:39] for one port and one priority.
[52:42] Or said more simply, switch buffering
[52:44] can't work to deal with this sort of
[52:46] bandwidth delay product that exists
[52:47] across long links. And that's why
[52:50] typically people deploy large bandwidth
[52:53] or deep buffered routers in these cases.
[52:56] Now, hollow core fiber is an interesting
[52:58] technology that will improve the latency
[53:00] by about 30%, but that's not going to
[53:03] fix this problem.
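The "about 100 MB in flight" figure is a bandwidth-delay product, and it can be reproduced under stated assumptions. The sketch below assumes an 800 Gb/s port (the talk does not name the port rate), a group index of about 1.468 for conventional fiber and about 1.0003 for hollow core, and a full round trip before flow control takes effect; the helper names are hypothetical.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s


def round_trip_s(length_km: float, group_index: float = 1.468) -> float:
    """Round-trip propagation time over a fiber link, in seconds."""
    return 2 * length_km * group_index / C_KM_PER_S


def bytes_in_flight(length_km: float, port_gbps: float, group_index: float = 1.468) -> float:
    """Bandwidth-delay product: bytes sent before a congestion signal can take effect."""
    return port_gbps * 1e9 * round_trip_s(length_km, group_index) / 8


if __name__ == "__main__":
    glass = bytes_in_flight(100, 800)                        # assumed 800 Gb/s port
    hollow = bytes_in_flight(100, 800, group_index=1.0003)   # assumed hollow core index
    print(f"RTT over 100 km of glass: {round_trip_s(100) * 1e3:.2f} ms")
    print(f"in flight, conventional fiber: {glass / 1e6:.0f} MB")
    print(f"in flight, hollow core fiber:  {hollow / 1e6:.0f} MB")
```

Under these assumptions a single port holds roughly 98 MB in flight over 100 km, and hollow core only trims that to roughly 67 MB, which is why lower fiber latency alone does not remove the need for deep buffers.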
[53:05] So, what's amazing? What's great? The
[53:08] thing about AI workloads is that they're actually
[53:10] very different from what's running on a
[53:12] CPU. The workloads themselves are
[53:14] predictable, and that allows you to do
[53:16] what's called a proactive congestion
[53:18] control. So, before you ever launch
[53:20] traffic into the network, you figure out
[53:22] if you're going to hit congestion, and
[53:24] you work around it before you cause the
[53:26] problem. So, this is an amazing
[53:28] accomplishment. But what's the problem
[53:31] with that? The thing we know about scale
[53:34] is failures happen at scale. It is
[53:36] guaranteed, and we have to design for
[53:39] the failures. We can't just design for
[53:41] everything's working all the time.
[53:43] And with AI workloads, because we're all
[53:45] working on sort of a ginormous job in
[53:47] a synchronous fashion, losses are very
[53:50] expensive and you have to roll back to a
[53:52] checkpoint and restart.
[53:54] Okay? When you're deploying huge AI
[53:56] clusters, that is a killer when you look
[53:58] at ROI.
[54:00] So, what do we recommend that you do?
[54:02] You make use of proactive congestion
[54:04] control, but you still use deep buffered
[54:07] routers to absorb that traffic during
[54:09] failure conditions.
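The idea of deciding before launch whether traffic will hit congestion can be illustrated with a toy admission check. This is a deliberately simplified sketch and not any vendor's or the speaker's scheme: real proactive schemes are considerably more elaborate, and the flow names, rates, and the `admit_flows` function are all hypothetical.

```python
def admit_flows(flows, link_capacity_gbps):
    """Greedy pre-launch check on one link.

    flows: list of (name, demand_gbps) in priority order.
    Returns (admitted, deferred): flows that fit within capacity, and flows
    held back before they are ever sent into the network.
    """
    admitted, deferred = [], []
    remaining = link_capacity_gbps
    for name, demand in flows:
        if demand <= remaining:
            admitted.append(name)
            remaining -= demand
        else:
            deferred.append(name)
    return admitted, deferred


if __name__ == "__main__":
    planned = [("allreduce-0", 400), ("allreduce-1", 400), ("checkpoint", 400)]
    ok, held = admit_flows(planned, link_capacity_gbps=1_000)  # assumed link size
    print("launch now:", ok)
    print("held back: ", held)
```

A check like this only works while the planned capacity is actually there. When a link fails mid-job, traffic already in flight still needs somewhere to go, which is the role the deep buffers play in the recommendation above.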
[54:12] What's the next thing or the next
[54:14] challenge of scale across? The
[54:16] bandwidth. If you look at DCI, and you
[54:18] heard this from some of my fellow panelists
[54:20] earlier, roughly speaking, again, every
[54:22] data center is a little different, but
[54:24] you have about 1,000 to 2,000 ports of
[54:26] bandwidth leaving your data center. And
[54:29] typically, many customers deploy large
[54:30] modular chassis to do that. Not
[54:33] everybody does, but quite a few people
[54:34] do.
[54:35] If you look at the number of ports
[54:37] necessary for a scale across network,
[54:39] you're talking about 12,000 to 32,000 ports
[54:43] that need to leave the
[54:44] data center. So, this is a huge
[54:47] opportunity for coherent optics demand
[54:50] and we should think about those quite
[54:52] uniquely from the data center
[54:53] interconnect of the traditional years.
[54:56] The second thing to realize is when you
[54:58] think about that many ports and you
[54:59] think about power being the fundamental
[55:01] restriction that we have in the AI
[55:03] industry,
[55:04] you have to be able to remove layers of
[55:06] switching which aren't necessary for the
[55:08] task at hand. And you can do that by
[55:10] using fixed boxes
[55:12] scaled out as sort of
[55:14] as is done in traditional data center
[55:16] architectures versus big modular
[55:18] systems.
[55:19] So, at the end of the day, if I
[55:21] summarize what I've just said,
[55:23] traditional DCI and scale across are
[55:26] completely unique networks. There are a
[55:27] few things which are similar, but there
[55:30] are a lot of things which are different.
[55:32] Scale across needs proactive congestion
[55:34] control and deep buffers for failure
[55:36] reaction, and you should build it with
[55:38] disaggregated topologies for
[55:40] scalability. Now, let me bring it back to
[55:42] my final thoughts about what this all
[55:44] means from an optics industry perspective.
[55:47] So, I think when people deploy scale
[55:49] across, they're going to leverage
[55:52] traditional DCI optics to deploy
[55:56] that, but at significantly higher volume
[55:58] than we've seen in the past. Second, is
[56:01] that reliability, like many people
[56:03] before me have said, is critical. We
[56:06] can't treat this as an afterthought. We
[56:08] must design that into the way we build
[56:11] the optics, the way we qualify the
[56:13] optics, the way we deploy the optics.
[56:15] Uh and third, um based on the volume
[56:18] that scale across is going to drive, it
[56:20] is quite likely that we
[56:22] see optimized technology being deployed
[56:26] uh into these roles in the future.
[56:28] So, with that, I'll say thank you and
[56:29] I'll pass it off to VJ.
[56:36] Rakesh is always a tough act to follow.
[56:39] So,
[56:40] I will never forgive Tom for sticking me
[56:42] at the end of the panel.
[56:46] So, my name is VJ Vusirikala.
[56:49] Um I'm at Arista. I cover our customer
[56:53] engineering with the hyperscalers and
[56:54] our cloud titan customers. So, I work
[56:57] closely with Andy on some of the
[56:59] next-generation optics and systems. So,
[57:02] I will cover scale across in this panel.
[57:05] Andy has got a few talks uh later today
[57:07] and the rest of this week, so we'll
[57:09] round it out with scale up, scale across
[57:12] as well.
[57:13] So, the challenge with being the fifth
[57:15] in the panel is you'll hear this for the
[57:17] fifth time, but apparently behavioral
[57:21] psychologists believe that you have to
[57:22] hear a message seven times for it to be
[57:25] firmly tattooed in your brain. So,
[57:27] you've got two more to go.
[57:30] So, if you look at the totality of AI
[57:32] networks, there are four critical
[57:34] networks.
[57:35] Scale up.
[57:37] Scale out from the front end
[57:39] perspective, or as Tad mentioned, this
[57:42] is the DCN one. That's the one that
[57:44] connects through the CPUs to storage and
[57:47] to the users. The scale out back end,
[57:50] which is essentially the back end racks
[57:51] connecting the XPUs to form a cluster.
[57:54] And the new term, the scale across. I
[57:57] think if we remember OFC 2026 by one
[58:01] word or one hyphenated word, it's going
[58:03] to be scale across. So, you'll hear a
[58:05] lot about it. It's an exciting area.
[58:07] It's a new segment that is growing quite
[58:09] rapidly. Uh so, from the picture you can
[58:12] see the role that each of these play.
[58:15] So, scale up is within an enclosure.
[58:17] Uh typically, a rack, going into multi-rack
[58:20] configurations soon. The front end,
[58:22] the role of it is through the CPUs,
[58:25] connect to storage, connect to uh any
[58:27] other applications, and connect out to
[58:29] the users. This is where the traditional
[58:31] DCI uh
[58:33] term was used. And we'll come to
[58:36] the distinction between DCI and the
[58:38] scale across part in a short while. So,
[58:41] uh the scale out and then the scale
[58:43] across part, uh we'll focus on the scale
[58:45] across for the rest of this talk. Uh so,
[58:47] I I realize that the pictures here have
[58:50] a lot of colors. Just want to clarify
[58:53] that these colors do not signify WDM.
[58:56] There's no WDM between the host and the
[58:59] leaf switches. There's a lot of WDM in
[59:02] the scale across side, but as we build
[59:04] these pictures, we realized that these
[59:06] are not intended to be WDM. They're just
[59:09] signifying multiple rails uh in terms of
[59:12] the connection between the hosts and the
[59:13] leaf switches.
[59:17] So, the previous speakers did a great
[59:19] job in terms of motivating scale across.
[59:22] I'm just going to... just view this
[59:24] as essentially a summary. So, the core
[59:27] point is it's a physical space and power
[59:30] limitation. If you have that limitation,
[59:32] you've got to spread out your compute.
[59:34] And then, if you spread out your
[59:35] compute, you've got to connect that
[59:37] compute. And that spreading out can be
[59:40] across the campus, across the metro, or
[59:43] across the region. So, you have
[59:44] different use cases, 10, 100, and 1,000
[59:48] km use cases. And we'll talk about some
[59:51] of the volumes, etc. But, since the
[59:53] previous speakers um motivated this uh
[59:56] very significantly, I'll leave that part
[59:58] out. So, what are the three key
[01:00:00] requirements?
[01:00:01] Um
[01:00:02] Again, um Rakesh and I, we did not
[01:00:04] compare our slides. So, but it's very
[01:00:06] interesting that the themes are
[01:00:07] essentially very similar. So, the first
[01:00:10] part is it's high capacity over long
[01:00:13] distances. The long distances can be
[01:00:15] anywhere from, as I said, 10 to 1,000
[01:00:17] km.
[01:00:19] Um high oversubscription. So, depending
[01:00:22] on the use case, there is
[01:00:24] oversubscription because of fiber scarcity,
[01:00:26] etc. And once there is
[01:00:28] oversubscription, you have to have
[01:00:31] congestion control, QoS, and flow
[01:00:33] steering, etc. So, this is uh
[01:00:36] beginning to look from a system
[01:00:38] perspective, a requirements perspective,
[01:00:41] like uh what people have deployed in the
[01:00:43] traditional WAN backbone DCI.
[01:00:46] And the moment it leaves a data center,
[01:00:49] security is important. So, these do
[01:00:51] require encryption as well. So, the type
[01:00:54] of solutions that you need for scale
[01:00:56] across are now beginning to look
[01:00:58] different from the switches that you use
[01:01:01] within the data center for the scale-out
[01:01:03] switches.
[01:01:05] So, I'm using the same term that Rakesh
[01:01:08] used in terms of traditional DCI.
[01:01:10] Traditional DCI refers to uh the DCN
[01:01:14] connectivity or the backbone
[01:01:16] connectivity that all of you are
[01:01:17] familiar with. And how is it distinct
[01:01:20] from the AI scale across.
[01:01:23] The first thing is the volumes.
[01:01:26] Um
[01:01:26] I'm using 10x higher volumes. I think
[01:01:29] Rakesh had 14. Arista has always been a
[01:01:32] little more conservative than Cisco.
[01:01:38] So, with these higher volumes
[01:01:40] naturally there are significant cost and
[01:01:43] deployment efficiency considerations.
[01:01:46] And based on where some of these are
[01:01:49] deployed so traditional backbone
[01:01:51] switches or routers, they were not
[01:01:53] liquid cooled or they're not uh liquid
[01:01:56] cooled in the immediate horizon. For
[01:01:57] these, we are looking at uh liquid
[01:01:59] cooling in a shorter horizon earlier
[01:02:03] than the traditional DCI solutions.
[01:02:06] The reliability is going to be
[01:02:08] substantially higher at a system FIT
[01:02:11] level compared to the traditional DCI
[01:02:13] because the traditional DCI had
[01:02:16] different resiliency mechanisms uh
[01:02:19] etc. So, this one has a much higher bar
[01:02:22] in terms of reliability that's needed.
[01:02:24] It's essentially a GPU interconnect
[01:02:25] fabric.
[01:02:27] And then the optics range. So, you need
[01:02:30] uh
[01:02:31] as was motivated by the previous uh
[01:02:33] speakers, Coherent-Lite, ZR, ZR+, for
[01:02:36] different segments of the reach.
[01:02:38] Coherent-Lite is uh a big new use case.
[01:02:41] Coherent-Lite's been uh essentially in
[01:02:44] discussion for a few years, but this is
[01:02:46] the time when Coherent-Lite finds a
[01:02:49] very very compelling application.
[01:02:51] And then with the type of capacities
[01:02:54] that people are looking at, these are
[01:02:56] multi-rail architectures. So, these are
[01:02:58] multi-line systems that are being
[01:03:00] deployed in parallel. So, traditionally
[01:03:03] you had a
[01:03:04] uh you had a transport system that had
[01:03:07] one line system with one amplifier chain
[01:03:09] etc. So, those days and uh those
[01:03:12] architectures are not sufficient for the
[01:03:14] type of capacities people are looking
[01:03:15] at. So, these are uh multi-rail systems
[01:03:19] and uh you can think of this as the
[01:03:21] fiber is the new wavelength at these
[01:03:23] levels of capacities.
[01:03:28] So, what are the solutions
[01:03:30] for the scale across capacities?
[01:03:34] Um couple of days ago we announced or
[01:03:36] maybe 3 days ago we announced an XPO
[01:03:39] MSA. So, let me quickly recap what the
[01:03:42] XPO MSA is and then uh motivate how an
[01:03:46] XPO will work in this context.
[01:03:50] An XPO is eight times the capacity of
[01:03:53] today's OSFP and as you know OSFP is
[01:03:56] currently the workhorse for not just the
[01:03:58] data center optics but the coherent
[01:04:00] pluggables as well, ZR, ZR+, and then
[01:04:04] Coherent-Lite.
[01:04:05] Uh in terms of the size, it is 2.7 times
[01:04:09] wider than an OSFP. It is a little
[01:04:12] taller but the key part is you can put
[01:04:15] two of the XPOs in one OU which you'll
[01:04:18] see in the next slide. It gives the
[01:04:20] density and it gives the capacity and it
[01:04:23] is natively liquid cooled up to 400 W.
[01:04:26] So, think of it as eight times OSFP.
[01:04:29] Each OSFP can handle up to 50 W. So, you
[01:04:31] can throw in anything that you want in
[01:04:33] terms of the power dissipation.
[01:04:36] So, you'll see that.
[01:04:37] So, from a data center switch
[01:04:39] perspective, you get a 4x improvement
[01:04:43] with an XPO compared to an OSFP. The
[01:04:45] picture on the left shows 200 terabits
[01:04:48] of switch capacity that takes four rack
[01:04:50] units. On the right, the same capacity
[01:04:53] shrunk down by a factor of four in terms
[01:04:55] of the front panel density.
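As a rough check on the numbers quoted for the XPO, the arithmetic can be sketched in a few lines of Python. This is an illustrative sketch only: the constants are the figures stated in the talk, while the function and variable names are made up for the example and do not come from the XPO MSA. It also assumes the 4x front panel improvement translates directly into rack units, as the slide description suggests.

```python
# Figures quoted in the talk; names are illustrative only.
OSFP_MAX_POWER_W = 50         # "Each OSFP can handle up to 50 W"
XPO_CAPACITY_VS_OSFP = 8      # "eight times the capacity of today's OSFP"
SWITCH_CAPACITY_TBPS = 200.0  # "200 terabits of switch capacity"
OSFP_SWITCH_HEIGHT_RU = 4.0   # "takes four rack units"
XPO_DENSITY_GAIN = 4.0        # "a 4x improvement with an XPO"


def xpo_power_envelope_w() -> int:
    """Power budget if one XPO stands in for eight 50 W OSFPs."""
    return XPO_CAPACITY_VS_OSFP * OSFP_MAX_POWER_W


def density_tbps_per_ru(capacity_tbps: float, height_ru: float) -> float:
    """Switch capacity per rack unit of height."""
    return capacity_tbps / height_ru


osfp_density = density_tbps_per_ru(SWITCH_CAPACITY_TBPS, OSFP_SWITCH_HEIGHT_RU)
xpo_density = osfp_density * XPO_DENSITY_GAIN
xpo_switch_height_ru = SWITCH_CAPACITY_TBPS / xpo_density

print(f"XPO power envelope: {xpo_power_envelope_w()} W")
print(f"OSFP switch: {osfp_density:.0f} Tb/s per RU")
print(f"XPO switch:  {xpo_density:.0f} Tb/s per RU, {xpo_switch_height_ru:.0f} RU total")
```

Under these assumptions the eight-OSFP power budget lands on the 400 W liquid-cooled envelope, and the 200 terabit switch shrinks from four rack units to one.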
[01:04:57] So, now
[01:04:58] the most interesting and the most
[01:05:01] exciting part as we worked through the
[01:05:03] XPO the last few months is while it
[01:05:06] started out as a data center density
[01:05:09] optimization and a reliability play,
[01:05:12] we realized that
[01:05:13] it is
[01:05:15] a very uh optimum vehicle for coherent
[01:05:18] pluggables as well. So, this is
[01:05:20] the exciting part. The thermal envelope
[01:05:23] of 400 W,
[01:05:24] thanks to liquid cooling,
[01:05:26] it can enable any optic without any
[01:05:28] compromises in terms of power.
[01:05:31] And I mentioned that reliability is
[01:05:34] super important.
[01:05:35] The combination of liquid cooling
[01:05:37] and the component reduction by going
[01:05:40] from eight OSFPs to one uh XPO, all the
[01:05:44] components like voltage regulators,
[01:05:45] microcontrollers, etc. are shrunk by
[01:05:48] effectively a factor of four. The
[01:05:50] combination of component reduction,
[01:05:52] liquid cooling where you have about 20
[01:05:55] to 25° lower temperature compared to
[01:05:57] air, and the third part is there are no
[01:06:00] thermal variations with liquid. With
[01:06:02] air, you're just subject to a lot of
[01:06:04] variations as uh the switch or the uh
[01:06:07] XPU um heat up and cool down. Without
[01:06:11] variations, lower temperature, component
[01:06:13] reduction, the combination of those
[01:06:15] three get you an overall 6 to 8x
[01:06:18] reliability improvement. So, think of
[01:06:20] this as an XPO with eight times the
[01:06:22] capacity can have a FIT number similar
[01:06:25] to that of an OSFP. So, overall your
[01:06:28] system FIT will be substantially
[01:06:30] improved.
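The reliability claim can be made concrete with a small FIT calculation (FIT is failures per billion device-hours, and the FIT rates of independent modules that must all work add up). The sketch below is a hedged illustration: the OSFP FIT value of 100 is a hypothetical placeholder, not a figure from the talk, and only the 8-to-1 module consolidation and the 6 to 8x improvement range are taken from the speaker.

```python
OSFPS_REPLACED = 8            # "going from eight OSFPs to one XPO"
IMPROVEMENT_RANGE = (6, 8)    # "an overall 6 to 8x reliability improvement"
HYPOTHETICAL_OSFP_FIT = 100.0 # placeholder value, NOT from the talk


def system_fit(modules: int, fit_per_module: float) -> float:
    """FIT of a group of independent modules that must all work."""
    return modules * fit_per_module


def xpo_fit(osfp_fit: float, osfps_replaced: int, improvement: float) -> float:
    """FIT of one XPO carrying the capacity of several OSFPs.

    Assumes the quoted improvement applies to the total FIT of the
    OSFPs being replaced.
    """
    return system_fit(osfps_replaced, osfp_fit) / improvement


baseline = system_fit(OSFPS_REPLACED, HYPOTHETICAL_OSFP_FIT)
for gain in IMPROVEMENT_RANGE:
    fit = xpo_fit(HYPOTHETICAL_OSFP_FIT, OSFPS_REPLACED, gain)
    print(f"{gain}x gain: 8 OSFPs = {baseline:.0f} FIT, 1 XPO = {fit:.1f} FIT")
```

At the 8x end of the range, one XPO comes out at the same FIT as a single OSFP while carrying eight times the capacity, which is the comparison the speaker draws.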
[01:06:32] And the third point is
[01:06:34] compared to an OSFP, you have a much
[01:06:36] bigger canvas. So, the way XPO is
[01:06:39] designed is it's got two cards, one at
[01:06:41] the top, one at the bottom. Each one has
[01:06:43] 32 channels. That is four times an OSFP.
[01:06:46] Now, that enables you to do a
[01:06:51] on the DSP side, and that leads to more
[01:06:53] efficiency from a power and a cost
[01:06:55] perspective. So, it's essentially a
[01:06:57] bigger module. Uh think of it as like a
[01:07:00] sled in a standardized liquid-cooled
[01:07:02] format.
[01:07:03] And the fourth one is also super
[01:07:05] interesting, right? Now, with eight
[01:07:07] channels,
[01:07:09] you have 12.8 terabits of coherent
[01:07:11] capacity.
[01:07:13] 12.8 terabits is half of your C band.
[01:07:16] So, effectively have half a band's worth
[01:07:20] of capacity in one module.
[01:07:23] So, you can look at novel architectures
[01:07:25] where you do not need tunable lasers.
[01:07:28] You can do fixed lasers with all the
[01:07:30] associated benefits of cost and
[01:07:32] reliability improvements. And if you do
[01:07:35] fixed lasers,
[01:07:36] you can hit the entire C band
[01:07:38] with two SKUs and the full spectrum of
[01:07:40] C plus L with four SKUs.
[01:07:43] And it's a path to getting to these half
[01:07:47] band or the full band transponders that
[01:07:49] people have been talking about. And you
[01:07:50] can either mux them inside or you can
[01:07:52] just leave them as individual fixed
[01:07:55] wavelengths if you want to shuffle them
[01:07:56] across.
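The SKU counts follow from simple division, sketched below. This is a minimal sketch under stated assumptions: the C-band capacity is inferred from the remark that 12.8 terabits is half of the C band, the L band is assumed to carry the same capacity as the C band, and each SKU is assumed to be one fixed-wavelength module variant covering its own slice of spectrum. The names are illustrative and not from any specification.

```python
import math

MODULE_CAPACITY_TBPS = 12.8   # "12.8 terabits of coherent capacity"
COHERENT_CHANNELS = 8         # "with eight channels"

# Inferred: "12.8 terabits is half of your C band".
C_BAND_CAPACITY_TBPS = 2 * MODULE_CAPACITY_TBPS
# Assumption: the L band carries the same capacity as the C band.
C_PLUS_L_CAPACITY_TBPS = 2 * C_BAND_CAPACITY_TBPS


def per_channel_tbps(module_tbps: float, channels: int) -> float:
    """Capacity carried by each coherent channel in the module."""
    return module_tbps / channels


def skus_needed(band_tbps: float, module_tbps: float) -> int:
    """Number of fixed-laser module variants needed to fill a band."""
    return math.ceil(band_tbps / module_tbps)


print(f"Per channel: {per_channel_tbps(MODULE_CAPACITY_TBPS, COHERENT_CHANNELS):.1f} Tb/s")
print(f"C band SKUs: {skus_needed(C_BAND_CAPACITY_TBPS, MODULE_CAPACITY_TBPS)}")
print(f"C+L SKUs:    {skus_needed(C_PLUS_L_CAPACITY_TBPS, MODULE_CAPACITY_TBPS)}")
```

With these inputs each channel carries 1.6 Tb/s, and the counts match the two SKUs for the C band and four SKUs for C plus L mentioned in the talk.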
[01:07:57] And finally,
[01:07:59] this XPO is not only useful for fixed
[01:08:03] form factors, where you get the density
[01:08:04] improvement, but in a modular chassis as
[01:08:07] well. For the next generation line
[01:08:09] cards, you get a density improvement for
[01:08:11] modular chassis as well. So, it's
[01:08:12] applicable broadly,
[01:08:14] fixed and modular, as well as data
[01:08:17] center switches,
[01:08:18] and scale across DCI.
[01:08:22] So, I'm showing one example. This is
[01:08:24] from Ciena. I'm showing this with their
[01:08:27] permission. So, the idea is to show that
[01:08:29] XPO
[01:08:31] the concept works for a 12.8 module. So,
[01:08:35] the one on the left is a Coherent-Lite
[01:08:38] XPO and one on the right is a ZR/ZR+
[01:08:42] module.
[01:08:43] From a physical fit perspective, it all
[01:08:46] fits in there. Obviously, these are
[01:08:48] mechanical renders, so it'll take a
[01:08:50] while for any of these products to come,
[01:08:51] but just wanted to
[01:08:53] show this as
[01:08:55] a proof of concept that
[01:08:59] this is
[01:09:01] a very suitable module for
[01:09:04] 12.8 T coherent.
[01:09:06] So, let me I think that's my last slide.
[01:09:09] So, thank you very much.
[01:09:16] We are a little bit tight on time. We've
[01:09:18] got about 10 minutes for
[01:09:20] for Q&A. If anybody has questions, don't
[01:09:22] be shy. So, maybe
[01:09:25] to to start cuz one of the topics we
[01:09:28] wanted to cover was terminology and
[01:09:30] somehow through no coordination, Cisco
[01:09:33] and Arista both used the term
[01:09:34] traditional DCI. So,
[01:09:37] that might be the news of the show.
[01:09:39] Um
[01:09:41] I I'd like to ask our our web scalers,
[01:09:43] what do you think of that term? What do
[01:09:45] you think about DCI scale across
[01:09:47] terminology? I I think there's a little
[01:09:50] bit of the terminology is broken. Just
[01:09:51] kind of
[01:09:52] thoughts on how do you think about DCI
[01:09:55] and scale across and and how do you
[01:09:58] refer to it internally? This is
[01:10:00] the
[01:10:02] "why did the bit cross the road"
[01:10:04] question? Pretty much. Pretty much. I
[01:10:06] mean, I tried to lay it out for us. I
[01:10:08] you know, we're we're trying to say
[01:10:09] broadly that when we're trying to enable
[01:10:12] compute on a gigawatt scale, that's the
[01:10:14] scale across value add.
[01:10:17] Yeah. Uh yeah, so I would rather focus
[01:10:20] on the term scale. We do need to scale,
[01:10:22] right? So, whether it's scaling
[01:10:23] up, out, across, yes, we need all of
[01:10:27] them.
[01:10:30] So, uh personally, I've used scale
[01:10:32] across as when you need to interconnect
[01:10:34] like VJ said, your scale out across.
[01:10:38] But, uh
[01:10:39] when I had my slides
[01:10:41] approved internally,
[01:10:42] uh one of the product managers in
[01:10:44] our cloud team said, "Oh, no, scale
[01:10:46] across is geographically diverse because
[01:10:48] that's what um Jensen defined it as."
[01:10:51] So, that's why I had to re-edit to
[01:10:53] inter-building scale across, but um
[01:10:55] yeah. Did you want me to comment on
[01:10:57] DCI or leave it at scale across?
[01:10:59] Well, I mean you've got a bit of a
[01:11:00] unique DCI perspective. So, yeah, go
[01:11:02] ahead.
[01:11:03] Well, um, just just cuz Google, uh, I
[01:11:06] guess DCI, I think Microsoft sort of
[01:11:08] coined when they were first doing these
[01:11:10] really high scale direct, um,
[01:11:12] connections between data centers in the
[01:11:14] metro.
[01:11:15] And, uh, at Google, rather than
[01:11:17] the application, we've always used
[01:11:19] DCI to mean a physical box. And
[01:11:23] it was because it was right around where
[01:11:25] you had traditional DWDM transport
[01:11:27] carriers were these giant chassis and
[01:11:29] often they had an OTN layer of
[01:11:31] switching. And and Google, I I think VJ
[01:11:34] is probably who coined
[01:11:35] >> Yeah, you can blame me for that, Tad.
[01:11:37] Um,
[01:11:38] where, uh, oh, no, now we started having
[01:11:40] these pizza boxes, uh, just one or two
[01:11:42] RUs where there was no switching fabric
[01:11:44] and it was just, uh, gray optic
[01:11:46] conversion to DWDM. And we used those,
[01:11:49] uh, all the way from metro as well as
[01:11:51] subsea.
[01:11:54] Yeah, uh, Microsoft has, uh, heavily
[01:11:56] invested in and also benefits from, uh, the
[01:11:58] large-scale DCI network. Um, we also
[01:12:01] see challenges as I mentioned in the
[01:12:02] slides, right? So, the the bottom line
[01:12:05] is like, uh, we need to get power.
[01:12:07] Uh, otherwise, without that, we cannot
[01:12:10] build anything. So, scale, yeah, scale
[01:12:12] across: as long as we can address the
[01:12:15] challenges of, uh, getting multiple data
[01:12:18] centers across the city or even across
[01:12:20] the continent working together, then
[01:12:22] it's needed.
[01:12:24] Thank you. I think we had a question
[01:12:26] from the audience. Chris Pfistner,
[01:12:28] Avicena.
[01:12:29] Um, I have a sort of a higher level
[01:12:31] question. What's the long-term vision?
[01:12:34] Do you envision that eventually each
[01:12:36] data center will be one scale out
[01:12:38] network and then you scale across
[01:12:40] multiple data centers? Or do we have a
[01:12:42] tiered architecture where you have,
[01:12:44] um, multiple scale out clusters inside a
[01:12:47] data center
[01:12:48] tied together with scale out and then
[01:12:50] scale across to get to another data
[01:12:52] center.
[01:12:55] Uh I I'll I'll go first and uh but I
[01:12:57] think everyone should should chime in on
[01:12:59] cuz I think it's a great question.
[01:13:01] Uh I I think if you think about it from
[01:13:03] an ideal solution
[01:13:06] we would love to have a completely
[01:13:07] uniform interconnect uh completely
[01:13:10] uniform interconnect um
[01:13:13] across all of our compute. That is the
[01:13:15] most generic solution that allows a
[01:13:19] programmer to sort of target the
[01:13:21] hardware of the structure.
[01:13:22] I think the reality though is quite
[01:13:24] different from the ideal, which is at
[01:13:26] the end of the day the thing which is
[01:13:28] limiting what we can build as as many of
[01:13:31] us have said so far is is around the
[01:13:34] reality of power constraints and
[01:13:36] physical constraints. And that drives us
[01:13:39] to optimize what is actually needed
[01:13:42] rather than the sort of nirvana solution
[01:13:44] that is completely uniform. So, my
[01:13:47] suspicion is that we don't go away from
[01:13:49] the notion that there's different levels
[01:13:52] of bandwidth interconnect trading off
[01:13:55] power, bandwidth, latency, and density.
[01:13:58] And if you think about it even before
[01:14:00] the AI revolution, if you think about
[01:14:02] the way general compute works, within a
[01:14:05] chip itself, there's massively more
[01:14:07] bandwidth than there is to the memory,
[01:14:10] than there is to the network. This
[01:14:11] notion of having uh different bandwidth
[01:14:13] points and different optimizations I
[01:14:15] think is nothing new. I'll add on I mean
[01:14:17] the trend is
[01:14:19] more and more power in a single
[01:14:20] building, but I don't think that trend
[01:14:23] is keeping up with the demand. So, I
[01:14:25] think we're always going to have um
[01:14:27] scaling between buildings and campuses
[01:14:30] and between campuses just because uh you
[01:14:33] can only get so much power in in an
[01:14:34] area, I think.
[01:14:36] The way we're showing that. Um
[01:14:38] and uh the other point I wanted to make
[01:14:40] too is like
[01:14:41] uh I I I mentioned in my talk this idea
[01:14:44] of not everything's built and turned up
[01:14:46] at once.
[01:14:47] So, as you get bigger and bigger, it's
[01:14:50] less and less or the time it would take
[01:14:52] to coordinate having everything
[01:14:53] installed and turned up at the same time
[01:14:55] gets larger and larger. So, just from a
[01:14:57] practical point of view, you want to be
[01:14:59] able to design a fabric or or or an
[01:15:02] interconnect solution that can evolve
[01:15:04] and scale over time.
[01:15:07] Yeah, 100% agree. And also, I want to
[01:15:09] add to that is not everything is built
[01:15:11] at the same time and not everything is
[01:15:13] built
[01:15:14] up to the same target of power
[01:15:16] efficiency, density, and reliability,
[01:15:18] right? So, even in the cloud era, we
[01:15:20] have applications where it's more
[01:15:23] latency-sensitive, for example, gaming,
[01:15:25] right? So, and we have other
[01:15:26] applications. But, in the AI era,
[01:15:29] training and inference are different.
[01:15:31] Training tends to be, at least today,
[01:15:34] more synchronous and it requires more
[01:15:36] reliability, power efficiency, power
[01:15:39] bandwidth density.
[01:15:40] Inference, on the other hand, gets
[01:15:42] closer to the users, right? So, based on
[01:15:45] the
[01:15:45] application, we actually need to set a
[01:15:48] realistic target. We cannot expect a
[01:15:50] long distance to be the same power
[01:15:52] efficiency as, for example, scale up.
[01:15:54] But, as long as we can set the right
[01:15:55] target and then work on the engineering
[01:15:57] challenges,
[01:15:58] we should be able to get there.
[01:16:00] Yeah,
[01:16:01] maybe I'll just
[01:16:03] as a different point of reference,
[01:16:05] often it's not as clean in terms
[01:16:08] of the kind of data center
[01:16:09] infrastructure we receive as, you know,
[01:16:11] there's a building in a certain 50
[01:16:13] megawatts or whatever. Comes in all
[01:16:15] different shapes and sizes. And I I just
[01:16:17] maybe call out there was a nice blog
[01:16:18] post by my
[01:16:20] colleagues on our our back-end
[01:16:21] aggregation network, which is exactly to
[01:16:23] fill this role of how do you connect
[01:16:25] together kind of a heterogeneous
[01:16:27] infrastructure ecosystem, which gives a
[01:16:31] little bit of insight into the
[01:16:32] complexity of of what that world looks
[01:16:33] like.
[01:16:35] Tom, there are a lot of questions, so I
[01:16:37] would like to ask the people who ask
[01:16:39] questions to target one of the speakers
[01:16:41] for your question.
[01:16:42] The next question comes from Mark
[01:16:43] Lipkovich.
[01:16:45] Thanks a lot. VJ, a couple of quick
[01:16:47] questions.
[01:16:48] Uh the XPO looks like a great idea. I'm
[01:16:50] just wondering if the rhetoric's getting
[01:16:52] a little bit out of hand in terms of
[01:16:53] some of the things that need to be
[01:16:54] developed uh maybe being rushed.
[01:16:57] Hopefully uh it won't be anything even
[01:16:59] reminiscent of LPO, but uh we'll put
[01:17:01] that aside for now. And my
[01:17:04] second question is related to
[01:17:07] 1.6T ZR/ZR+. Uh what percentage, if you
[01:17:11] have an opinion uh is going to be indium
[01:17:14] phosphide versus TFLN? Thank you.
[01:17:17] Um
[01:17:18] So, I know you meant the first one as a
[01:17:20] comment, but I'm going to actually
[01:17:22] answer that. And the reason is when we
[01:17:25] look at form factors, these are
[01:17:26] generational form factors. And when
[01:17:29] equipment vendors build systems, it is
[01:17:31] super important to have a universal form
[01:17:34] factor that hits multiple reaches and
[01:17:38] multiple technologies, right? So, all
[01:17:41] the way from say DR, FR, ZR, ZR+, RF
[01:17:45] microwave, micro LEDs, uh fully retimed,
[01:17:48] reverse gearboxes, linear, everything. So,
[01:17:51] we believe that the key part of the XPO
[01:17:53] is that universality. So, some of these
[01:17:56] applications, especially on the coherent
[01:17:58] side, will take a little longer than the
[01:18:00] data center ones to mature, especially
[01:18:03] because the 1.6T ZR/ZR+ ecosystem is not
[01:18:07] yet there. Uh but you can't have a form
[01:18:10] factor a new form factor come like every
[01:18:13] year or every 2 years because that is uh
[01:18:16] uh essentially very unproductive and
[01:18:18] fragmenting from uh uh
[01:18:21] a perspective. So, um on the um on the
[01:18:25] ZR ZR+ uh I'll probably defer to others
[01:18:28] in terms so I don't want to crystal ball
[01:18:31] the relative technologies that go into
[01:18:33] that. So, there are probably people who
[01:18:35] actually make the ZR ZR+. So, what we do
[01:18:38] is uh we work with them, but we are
[01:18:41] agnostic in terms of the technologies
[01:18:43] that go in within that. So, if
[01:18:46] Jeff or anyone wanted to take a crack at
[01:18:49] so uh
[01:18:51] We will qualify whatever people build
[01:18:53] but uh
[01:18:55] Certainly I'm not the one to talk to
[01:18:57] our technologies other than to say, you
[01:18:59] know, we really do target diversity in
[01:19:01] our supply chain, so so we would look to
[01:19:03] make sure that there's enough diversity
[01:19:06] there that whatever bet turns out to be
[01:19:08] more promising, we'd be engaged in that,
[01:19:11] so.
[01:19:13] Okay, I think we have another question
[01:19:14] out here. And Tom, actually you might
[01:19:16] have an opinion on that, so I'm just a
[01:19:18] moderator here, I
[01:19:20] I took a sworn oath not to opinionate on
[01:19:22] this panel, so.
[01:19:26] Hi, thanks. Ruben Roy from Stifel.
[01:19:29] I think this question is for Jeff.
[01:19:32] Is there a way to think about port
[01:19:34] economics here? We got a lot of detail
[01:19:37] on Delta but for bandwidth, obviously
[01:19:38] power is very important, but if you look
[01:19:41] at
[01:19:42] scale out versus traditional DCI, how
[01:19:44] are you thinking about port economics? I
[01:19:46] know
[01:19:47] they're different
[01:19:48] deltas for optical circuit switches if
[01:19:50] you're using packet switches, etc., but
[01:19:52] any way to think about a framework on
[01:19:54] that? Thanks.
[01:19:57] I'm
[01:19:58] not quite sure I understand exactly what
[01:20:00] the question is in terms of port
[01:20:02] economics. Yeah, just, you know, per
[01:20:04] port cost as you move to, you know, sort
[01:20:07] of these new paradigms versus
[01:20:08] traditional DCI.
[01:20:10] Thinking about the problem that gets
[01:20:11] solved is is
[01:20:13] so far been fairly similar in terms of
[01:20:17] you know, the use case is something that
[01:20:18] can go beyond 100 km, certainly can
[01:20:21] scale,
[01:20:23] and so
[01:20:25] we haven't been able to see a clear
[01:20:26] driver for why the port economics can be
[01:20:30] uh
[01:20:32] for scale
[01:20:33] across compared to our traditional DCI.
[01:20:35] So, I guess at the moment uh
[01:20:37] you know, I guess with the the you know,
[01:20:39] with our traditional DCI, we have uh
[01:20:40] more more of the kind of the break
[01:20:42] points. So, like we use a 600 gig mode
[01:20:44] uh for for longer reach interconnects.
[01:20:46] We would never have to fall back to that
[01:20:48] for for scale across networks. So,
[01:20:50] that's certainly a difference. Um
[01:20:52] There's some optimizations on the line
[01:20:54] system, that sort of thing. But, uh uh
[01:20:58] you know, one of the key drivers is
[01:20:59] going to be volume, I think.
[01:21:03] Thank you. I think we have another
[01:21:04] question over here. Uh Manton from Light
[01:21:06] Acceleration. I have a question for
[01:21:08] Yole. Um for scale across situation, can
[01:21:11] you comment on choosing like electrical
[01:21:13] switch over like or optical OCS or
[01:21:16] electrical switch for scale across
[01:21:19] situation? Yeah, that's a good question.
[01:21:21] Uh OCS in terms of uh scale across, um
[01:21:25] I would say like OCS does enable the
[01:21:28] continuous growth of uh in enlarging the
[01:21:31] switching radix. Uh
[01:21:33] So, then the for the for the context of
[01:21:36] a connecting data centers together, uh
[01:21:39] we are evaluating. We uh do not have a
[01:21:42] clear answer. I don't have clear answer
[01:21:43] of whether OCS will play a role there,
[01:21:45] but uh in general, it's like OCS will
[01:21:49] uh in light of the switching radix of
[01:21:51] the traditional ASICs, uh like you can
[01:21:54] see now we are moving roughly at a 51.2
[01:21:57] terabits. Uh and then next generation
[01:21:59] 102.4, and then the
[01:22:02] this is not uh enough as we are
[01:22:05] connecting more uh GPUs and XPUs
[01:22:08] together. So, but then OCS comes into
[01:22:10] play. We are feel do not have auto
[01:22:12] connection and how to do that. Uh scale
[01:22:14] across, yes, will be So, I can see some
[01:22:16] challenges of uh using OCS there. But,
[01:22:19] yes, if it's continuous extend the
[01:22:22] radix, the possibility of connecting
[01:22:24] multiple ASICs together uh and the
[01:22:28] XPUs together, then that can
[01:22:30] be a direction we look at. Thank you.
[01:22:33] And I think with that we we want to wrap
[01:22:35] up the panel. I
[01:22:36] I think it's clear that scale across has
[01:22:40] a lot of different kinds of
[01:22:40] implementations, right? I think it it's
[01:22:42] not just one one architecture. I think
[01:22:44] generally it is expansion of the scale
[01:22:46] out. I think there's some terminology
[01:22:48] things that as an industry we need to to
[01:22:50] continue to to work through, but
[01:22:52] hopefully this was valuable perspective,
[01:22:55] and I appreciate everybody else on the
[01:22:56] panel. I'd also like to call out Mark
[01:22:58] Filer, who
[01:22:59] who was a valuable contributor and
[01:23:01] helped sort of my partner in crime in
[01:23:02] organizing the panel. So, thank you
[01:23:04] everybody, and
[01:23:06] look forward to the rest of the day.
[01:23:09] Thank you.
[01:23:22] Everybody talking
[01:23:24] about scaling AI,
[01:23:27] but the data center's choking deep
[01:23:30] within.
[01:23:34] Copper running hot,
[01:23:36] yeah, the signal's getting thin.
[01:23:40] So, we flip the switch now,
[01:23:43] optics is in.
[01:23:46] Bandwidth climbing fast,
[01:23:49] racks are running red,
[01:23:52] cloud demand exploding overhead.
[01:23:58] Pluggables fading as the limit's
[01:24:02] closing.
[01:24:05] Co-packaged light is how we win.
[01:24:09] It's photonics,
[01:24:11] baby. It's 2026.
[01:24:15] Riding that light wave,
[01:24:19] doing new tricks.
[01:24:21] From the fiber in the ground to the chip
[01:24:24] in my hand, we make that sunshine jump
[01:24:27] on command. Yeah, photonics, baby.
[01:24:33] 2026.