# Energy‑Efficient Optical Interconnects for Multi‑Rack AI Pods

https://www.youtube.com/watch?v=DyLg5j576hg

[00:01] Hi everyone.
[00:02] Thanks a lot for the invitation to speak in this session.
[00:06] It's really great to be at OCP and talk about photonics.
[00:09] It's great to see photonics getting more and more attention.
[00:13] We have more and more to talk about in future sessions.
[00:15] So, I'm Fotini Karinou.
[00:17] I am from the systems planning and architecture team within the Azure Hardware Systems and Infrastructure organization at Microsoft.
[00:26] And today we'll be talking about optical interconnects and how they can be used to enable multi-rack AI pods.
[00:35] So, just to ground the context: Azure builds across the stack, right?
[00:39] From custom silicon, our own AI accelerators and CPUs, all the way up to platform integration and full rack-scale systems.
[00:51] And that really matters when we think about optical technologies and introducing those in our platforms, because it's not only about peak performance or module performance.
[01:01] It is about how the optics really affect how you design the tray, the serviceability, operations at scale, fiber management, etc.
[01:12] So, when we discuss optical integration in AI pods, that's how we should be thinking, end-to-end:
[01:18] device capability, system architecture, and what can be deployed and sustained at scale.
[01:26] And we have seen a massive deployment of fiber for scale-out architectures and networks, in our AI WAN, for example.
[01:34] New builds and new routes, hundreds of thousands of kilometers of fiber across the US.
[01:41] And this is what has really enabled, and has been the foundation of, the current distributed AI.
[01:48] And the next step we're seeing is optics really penetrating the scale-up domain:
[01:52] tightly coupled accelerator domains spanning trays, racks, and even rows.
[02:04] And the scale-up domain, as previously mentioned also by Iponics,
[02:08] really stresses the requirements in terms of bandwidth density, energy efficiency, and reliability.
[02:16] And this is where optics can play a key enabling role.
[02:21] So, here's the roadmap of my talk.
[02:25] We will discuss the requirements that are driving optics, specifically for the scale-up domain.
[02:33] Then we'll discuss industry directions, and how the industry is thinking about form factors and technologies to enable optics on the accelerator side.
[02:43] Also, optical innovation, packaging, and integration will be key in order to introduce those technologies in our scale-up networks.
[02:56] And finally, this stack of optical interconnects is a diverse stack, and it consists of many different components.
[03:07] And this is where standards can really help a lot, in order to align on some specifications and help move forward with the development of those technologies.
[03:21] Just coming out of OFC last week in March in California, there were just three MSAs announced around optics form factors and interfaces.
[03:34] And this is where OCP can play a big role, and that's kind of a call to action for us to work collaboratively to define the rest of the stack and the other things that need to be specified.
[03:46] So, diving more into the idea of what a scale-up network looks like, right?
[03:51] This diagram illustrates the basic topology, and it's just an example topology.
[03:57] You can think about high-radix switches in a single tier interconnecting a number of accelerators to form a uniform communication domain.
[04:08] And optics are used to interconnect those accelerators with the switches.
[04:12] And what they really bring is rich bandwidth and connectivity, and they can extend this domain beyond physical boundaries like trays and racks.
[04:24] And the takeaway here is really that as you want to scale and add more and more accelerators to your network, you will need to introduce more and more tiers of switches.
[04:34] But that also brings up the cost and the power.
[04:36] So the takeaway here is that several topology variations are possible, and the right one will really depend on your AI workload performance requirements, the cost, and the overall reliability of the system.
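The point about tiers can be made concrete with a small sizing sketch. The talk does not name a topology or a switch radix, so everything here is an assumption: a non-blocking folded-Clos (fat-tree) network, one switch port per accelerator, and hypothetical function names. Under those assumptions, `t` tiers of radix-`k` switches reach at most `2 * (k/2)**t` endpoints and need `(2t - 1) * (k/2)**(t - 1)` switches at full size.

```python
def _check(radix: int, tiers: int) -> None:
    """Validate inputs for the folded-Clos formulas below."""
    if radix < 4 or radix % 2 != 0:
        raise ValueError("radix must be an even number >= 4")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")


def max_endpoints(radix: int, tiers: int) -> int:
    """Largest non-blocking domain for a given radix and tier count."""
    _check(radix, tiers)
    return 2 * (radix // 2) ** tiers


def switches_at_full_scale(radix: int, tiers: int) -> int:
    """Switch count of a fully built folded Clos with this many tiers."""
    _check(radix, tiers)
    return (2 * tiers - 1) * (radix // 2) ** (tiers - 1)


def tiers_needed(accelerators: int, radix: int) -> int:
    """Smallest tier count whose domain fits the requested accelerators."""
    if accelerators < 1:
        raise ValueError("accelerators must be >= 1")
    tiers = 1
    while max_endpoints(radix, tiers) < accelerators:
        tiers += 1
    return tiers


if __name__ == "__main__":
    # Radix 64 is an illustrative value, not a figure from the talk.
    for n in (64, 512, 4096):
        t = tiers_needed(n, radix=64)
        print(n, "accelerators ->", t, "tier(s),",
              switches_at_full_scale(64, t), "switches at full scale")
```

Each added tier multiplies the switch count (and the optical links between tiers), which is the cost and power growth the talk refers to.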
[04:52] So what are those requirements that the optical scale-up domain brings, right?
[04:58] This is just a graph illustrating two of the main ones we care a lot about.
[05:03] This is energy per bit on the x-axis and bandwidth density on the y-axis.
[05:10] And what it really says is that scale-up optics really need to be better than copper.
[05:17] And we need to think about how we move from the bottom right of this graph to the upper left, where we get power efficiencies below 4 pJ per bit, while we push bandwidth densities close to the accelerator above 2 or 3 Tbps per millimeter.
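To see what these two axes mean for a package, here is a minimal arithmetic sketch. The 4 pJ per bit and 2 Tbps per millimeter targets come from the talk; the 32 Tbps per-accelerator bandwidth in the demo and the function names are hypothetical, chosen only for illustration.

```python
def optical_io_power_watts(bandwidth_tbps: float,
                           energy_pj_per_bit: float) -> float:
    """I/O power = bit rate x energy per bit.

    Tb/s is 1e12 bit/s and pJ is 1e-12 J, so the scale factors cancel
    and the product is already in watts.
    """
    return bandwidth_tbps * energy_pj_per_bit


def shoreline_mm(bandwidth_tbps: float,
                 density_tbps_per_mm: float) -> float:
    """Package edge length needed to escape the given bandwidth."""
    if density_tbps_per_mm <= 0:
        raise ValueError("density must be positive")
    return bandwidth_tbps / density_tbps_per_mm


if __name__ == "__main__":
    bw = 32.0  # hypothetical scale-up bandwidth per accelerator, in Tbps
    print("power at 4 pJ/bit:", optical_io_power_watts(bw, 4.0), "W")
    print("edge at 2 Tbps/mm:", shoreline_mm(bw, 2.0), "mm")
```

Halving the energy per bit halves the I/O power at fixed bandwidth, and doubling the bandwidth density halves the package edge the optics occupy, which is why both axes matter together.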
[05:41] These are just two of the main requirements, but there are many more, right?
[05:48] And in this graph and this table, we summarize the main ones we care about:
[05:54] reach, bandwidth per AI SoC, latency, bandwidth density, energy efficiency, BER.
[06:03] And when we think about those requirements, we need to think about a future accelerator and the type of interfaces that it needs to have.
[06:12] And that means that there are a lot of different kinds of communication that need to happen, right?
[06:17] For example, you can think about the scale-up communication as one of the main interfaces.
[06:24] And also new applications like off-package memory could drive requirements.
[06:29] Now, each of those interfaces can come with different requirements.
[06:33] For example, scale-up communication dictates the reach.
[06:39] And reach is not something we can compromise on.
[06:41] We need solutions that can for sure reach 20 or 40 m at least.
[06:47] And this is what drives the requirement on the reach.
[06:51] Energy efficiency and bandwidth density are also driven by scale-up communication.
[06:58] Off-package memory interfaces really set the requirement for latency and a very low BER.
[07:10] So, at the end of the day, you end up with a combined set of requirements that sets the expectation for scale-up optics.
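One way to picture the combined set is as the strictest bound per metric across all interfaces. The sketch below is only an illustration of that idea: the reach, energy, and density figures for scale-up are taken from the talk, while every off-package memory number, the metric names, and the `Requirement` and `combine` helpers are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    metric: str
    bound: str    # "at_least" or "at_most"
    value: float


def combine(*interfaces):
    """Merge per-interface requirements, keeping the strictest bound."""
    combined = {}
    for requirements in interfaces:
        for req in requirements:
            if req.bound not in ("at_least", "at_most"):
                raise ValueError(f"unknown bound: {req.bound}")
            current = combined.get(req.metric)
            if current is None:
                combined[req.metric] = req
                continue
            if current.bound != req.bound:
                raise ValueError(f"conflicting bounds for {req.metric}")
            # A higher floor or a lower ceiling is the stricter bound.
            pick = max if req.bound == "at_least" else min
            combined[req.metric] = Requirement(
                req.metric, req.bound, pick(current.value, req.value))
    return combined


SCALE_UP = [
    Requirement("reach_m", "at_least", 20.0),              # talk: 20 or 40 m
    Requirement("energy_pj_per_bit", "at_most", 4.0),      # talk: below 4
    Requirement("density_tbps_per_mm", "at_least", 2.0),   # talk: above 2 or 3
]

OFF_PACKAGE_MEMORY = [
    Requirement("latency_ns", "at_most", 100.0),           # hypothetical
    Requirement("ber", "at_most", 1e-15),                  # hypothetical
    Requirement("energy_pj_per_bit", "at_most", 5.0),      # hypothetical
]

if __name__ == "__main__":
    for req in combine(SCALE_UP, OFF_PACKAGE_MEMORY).values():
        print(req.metric, req.bound, req.value)
```

Here the scale-up interface sets reach, density, and energy, while the memory interface contributes latency and BER, mirroring how the talk attributes each requirement to the interface that drives it.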
[07:16] So, in short, optics need to be better than copper in reach, bandwidth density, picojoules per bit, and BER.
[07:22] But at the same time, we need to be honest about the constraints and the barriers.
[07:26] And these are the cost, the reliability, and anything that increases service events or connector failures and creates operational friction.
[07:39] So, as I mentioned before, optics need not only to deliver great performance, but also to deliver consistent, scalable value in a real deployment.
[07:53] So, the industry approaches can be viewed along a spectrum.
[07:57] What are those form factors, right, that are being discussed in order to introduce optics on the accelerator side?
[08:04] On the left, we have pluggable optics, optics that we're familiar with, easy to service, but limited in bandwidth density and energy efficiency.
[08:15] In the middle, we have on-board optics or near-package optics, depending on where on the board your optical module sits relative to your SoC.
[08:29] And this is where you start getting better bandwidth densities and energy efficiencies, but also where you start getting into a higher degree of complexity in terms of integration and packaging, which affects your serviceability.
[08:42] And on the right, co-packaged optics gives potentially the best bandwidth density and energy efficiency, but it's also the hardest operationally, even more so when we think about more advanced packaging at the interposer level.
[09:00] So, moving from the left to the right really improves the density and the power, but really raises the bar in terms of packaging, reliability, and operations.
[09:12] And I will double-click a bit on co-packaged optics, which is how we're thinking about driving towards bandwidth-dense and energy-efficient interconnects.
[09:23] There are different ways to implement co-packaged optics.
[09:26] The one illustrated here on the slide uses a high-density copper connector on the substrate, and that is where your CPO module can be connected.
[09:41] And you can think about this as the 200G, 400G approaches, so what we have today.
[09:53] And even though this gives you the benefit of the reach, it's the highest-power approach because it's a SerDes-based approach.
[10:03] And even though the industry has made a lot of effort and progress towards enabling those high-density connectors, there's still some manufacturing risk and some packaging complexity involved.
[10:15] So even though you get the benefits of reach from optics, the overall efficiency of the system might not be too compelling.
[10:24] In the middle is what many see as a near-term, practical first step for co-packaged optics.
[10:31] And this is really based on multi-chip module approaches.
[10:37] This is a more mainstream packaging approach and it really gives you a good trade-off between bandwidth density and power.
[10:44] It's quite well established, but this is where you start getting into serviceability constraints, right? Because now every time you have a failure on your CPO, that really affects your whole package.
[10:58] So your blast radius is a really key concern.
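The blast-radius concern can be sketched numerically. The talk gives no failure rates, so this is a toy model under loud assumptions: optical engines fail independently with the same probability over some period, a non-serviceable engine failure takes down every link on the package, and all names and numbers are hypothetical.

```python
def package_failure_probability(engine_failure_prob: float,
                                engines_per_package: int) -> float:
    """Probability that at least one engine on the package fails.

    Assumes independent, identically likely engine failures.
    """
    if not 0.0 <= engine_failure_prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if engines_per_package < 1:
        raise ValueError("need at least one engine")
    return 1.0 - (1.0 - engine_failure_prob) ** engines_per_package


def blast_radius_links(engines_per_package: int,
                       links_per_engine: int,
                       field_replaceable: bool) -> int:
    """Links lost when a single optical engine fails.

    A field-replaceable engine (pluggable-like) only loses its own
    links; otherwise the whole package is assumed to be affected.
    """
    if field_replaceable:
        return links_per_engine
    return engines_per_package * links_per_engine


if __name__ == "__main__":
    # Hypothetical: 8 engines per package, 1% failure chance per engine.
    print(package_failure_probability(0.01, 8))
    print(blast_radius_links(8, 4, field_replaceable=True))
    print(blast_radius_links(8, 4, field_replaceable=False))
```

The more engines share a package, the more likely some failure hits it, and without field replacement each such failure costs all of the package's links rather than one engine's worth.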
[11:04] On the far right, CPO based on interposer-level integration is the realm where your optics is sitting on the same interposer as your SoC.
[11:15] And then you can imagine using some kind of highly parallel electrical interconnect, like a die-to-die interconnect, UCIe or another custom type, in order to really get rid of the SerDes, reduce the power, minimize the electrical reach, and unlock the very low energy per bit that we were mentioning before.
[11:44] Again, the serviceability and the failure blast radius are really important and need to be figured out in that realm.
[11:54] So, the takeaway is that CPO is not really a single solution.
[11:58] It's a space of trade-offs.
[12:00] And really the right choice depends on how much bandwidth density and energy efficiency you need versus how much operational risk and service complexity you're really willing to absorb.
[12:17] And here we're focusing on the accelerator side, right?
[12:19] But in a system you also need to think about the switch side.
[12:25] And those two parts of the system can really land on different form factors.
[12:30] So, it doesn't mean that you need to have the same form factor on the two sides.
[12:33] However, you do need to be compatible and guarantee some level of interoperability on both sides, which is very critical.
[12:44] So, how do we think about this?
[12:46] How do we really evaluate and prioritize those technologies, and what does it really take to deploy?
[12:49] We have a multi-phased approach, where in phase one we really try to understand ingredient readiness,
[12:58] prioritize technologies, understand the integration and packaging complexity, and validate the link performance and the KPIs.
[13:08] The second phase is about how you take those ingredients and those optics and design your tray and your platform around them.
[13:20] And this is where you really deal with aspects like fiber management, end-to-end link performance, switch interoperability, and others.
[13:31] And eventually, by building small clusters and pilots, proof-of-concept, let's say, AI pods, you can really understand telemetry, firmware, control plane, and data plane constraints.
[13:45] And this is where one can really de-risk all deployment challenges at scale.
[13:49] And that's a multi-year journey, right? With a down-selection of technologies.
[13:55] Something really important I want to emphasize in this talk is that standards are really needed across the whole optical interconnect stack.
[14:04] From optical interfaces, electrical interfaces, form factors, packaging, lasers,
[14:12] all of these aspects are really important to make the full stack work.
[14:17] And on the right-hand side you can see some of the mechanisms we have as an industry to try to tackle some of those aspects.
[14:32] We have MSAs, OIF, OCP, APC, IEEE, different forums that have different focuses and different strengths.
[14:40] And we should be thinking about how we use those forums in order to specify some of the components of the stack.
[14:48] I just want to note that we had a starting point at OIF with publishing a white paper on requirements for scale-up.
[14:58] And the OCI MSA, which stands for optical compute interface MSA, that was recently published, is a starting point.
[15:09] But there is so much still to be done.
[15:11] And as I mentioned in my previous slides, OCP can really play a role here.
[15:15] And I know that there's a lot of background effort in order to coordinate and organize this standardization effort across bodies.
[15:24] So I really call for action to work together on this.
[15:28] I'll close with that slide,
[15:29] saying that scaling up optics is not really a singular point solution.
[15:33] It really requires consistent alignment, and that's why we partnered with our peers, Meta, Broadcom, AMD, Nvidia, and OpenAI, to announce the OCI MSA recently, which is really about an open specification for scale-up, which we believe will eventually get us to moving from in-rack to multi-rack and multi-row implementations and unlock the benefits of optics at scale.
[16:06] So with that, thank you so much for your attention, and I would take some questions.
[16:18] Cliff Grossner, OCP.
[16:20] I realize that the OCI MSA has just been formed.
[16:27] I'm kind of wondering what the thinking is of the founding members around how OCP can help.
[16:33] I think we have a lot of discussions on that, and I'm sure that in that stack that I just showed there are a lot of things that make sense to do through an MSA, but there are other aspects where for sure we can think about OCP as a forum to specify some of those, and we can have those discussions, absolutely open.
[16:51] Okay, so if there's an opportunity to look at our work stream at OCP, please reach out.
[16:58] Sure, and I think this is something in the works already, and we can see how we take that forward.
[17:02] Okay.
[17:09] Good morning great presentation of me at B networks.
[17:12] Do you have experience with CPO right now for scale-out? Like, I know Meta has been very vocal about their CPO on the scale.
[17:21] The second thing is, you talked about serviceability, and, you know, there's a lot of FUD, especially from the LPO people, talking about failures, etc.
[17:30] But then Meta showed that CPO significantly improved it versus transceivers and anything else.
[17:36] So maybe you can elaborate on that.
[17:38] >> Yeah, definitely. Big credit to Meta for really driving a lot of this work.
[17:44] Yes. So internally in Microsoft, we are now defining, scoping, and building our own proofs of concept to understand reliability, you know, optical reliability at scale.
[18:01] But I think the industry points we're getting are increasing that confidence around reliability.
[18:08] I think the pain point is still the laser. And this is something we want to understand better and work with vendors and the industry to improve, because that could be one of the main points we see right now.
[18:18] So yeah, I don't know if that answered your question, but we are evaluating and definitely looking in this direction, as I pointed out in my slides.
[18:32] I'm Ed Olrich from Microchip.
[18:34] The previous speaker mentioned WDM applications within this.
[18:39] And do you think it's important to design WDM in now from the base, so that if we project out a few years into the future,
[18:47] you've got all those lambdas to really increase the massive amount of throughput effectively?
[18:52] Yes, definitely.
[18:54] And the OCI MSA actually did a WDM kind of architecture.
[19:00] And as we think of multi-generation, this is a gen-one spec and gen two is going to be, you know, coming.
[19:07] We really need to think about all the rest being aligned with that architecture.
[19:13] So, yeah, I think it's very important.
[19:15] Okay, let's thank our speaker.
[19:17] Thank you.