# HUAWEI Tau (τ) Scaling Law Livestream

https://www.youtube.com/watch?v=4a-KfIcpUvI
Translation: zh-TW

[00:04] Ladies and gentlemen,
  各位女士及先生，

[00:06] Good morning.
  早安。

[00:08] I'm Xu Zhimo from Huawei.
  我是華為的徐志摩。

[00:17] It is my honor to attend ISCAS 2026 and to deliver today's opening speech.
  我很榮幸能參加ISCAS 2026並發表今天的開幕致詞。

[00:27] Over the past 6 years, I have often been asked, "In an industry as competitive as mobile and AI, how did you survive and then come back to the top?"
  在過去的6年裡，我經常被問到：「在像行動和AI這樣競爭激烈的行業中，你們是如何生存下來，然後重回巔峰的？」

[00:47] First, I want to thank our customers and partners for their patience and support.
  首先，我要感謝我們的客戶和合作夥伴的耐心與支持。

[00:57] During these 6 years, my team and I have explored new paths
  在這6年裡，我和我的團隊探索了新的途徑

[01:04] in semiconductor
  在半導體領域

[01:07] and have found a way for sustainable evolution.
  並找到了一種可持續發展的途徑。

[01:12] Today, based on Huawei's practice, I will share what we have done, what we have sought, and what we have learned.
  今天，基於華為的實踐，我將分享我們所做的、所追求的以及所學到的。

[01:27] For decades, the booming semiconductor industry has propelled human society into the information era.
  幾十年來，蓬勃發展的半導體產業推動人類社會進入了資訊時代。

[01:39] Behind this miracle is Moore's law, a principle with both technical intuition and economic significance.
  這個奇蹟的背後是摩爾定律，這是一條兼具技術直覺和經濟意義的原則。

[01:52] Moore's law was historically very promising, delivering performance while staying cost-effective.
  摩爾定律在歷史上非常有前景，在保持成本效益的同時提供了更高的性能。

[02:03] Every year, we got more powerful
  每年，我們都獲得了更強大的

[02:05] devices, better smartphones, personal computers, and more advanced AI systems.
  設備、更好的智能手機、個人電腦以及更先進的人工智能系統。

[02:18] But, this evolution relied heavily on geometrical scaling, which began to slow down.
  但是，這種演變嚴重依賴於幾何縮放，而幾何縮放開始減慢。

[02:27] While FinFET extended the road map for another decade,
  雖然 FinFET 將路線圖延長了十年，

[02:34] beyond the 7 nm node, we and our peers met serious challenges.
  但在 7 納米節點之後，我們和我們的同行遇到了嚴重的挑戰。

[02:45] The limits of geometrical scaling arrived earlier and tougher.
  幾何縮放的極限來得更早，也更嚴峻。

[02:53] So, semiconductor evolution is more than geometrical scaling.
  因此，半導體演進不僅僅是幾何縮放。

[03:00] Geometrical scaling itself has always delivered time domain gain.
  幾何縮放本身一直提供了時間域的增益。

[03:09] Faster transistors, shorter response time, higher chip frequencies.
  更快的電晶體、更短的響應時間、更高的晶片頻率。

[03:19] This means space and time are two sides of the same coin.
  這意味著空間和時間是同一枚硬幣的兩面。

[03:26] Losing geometrical scaling does not mean losing time scaling.
  失去幾何縮放並不意味著失去時間縮放。

[03:33] So, we proposed a shift from geometrical scaling to time scaling as the new guiding principle for electronic system evolution.
  因此，我們提出從幾何縮放轉向時間縮放，作為電子系統演進的新指導原則。

[03:46] We saw that time scaling can deliver strong benefits across devices, circuits, chips, and systems.
  我們看到時間縮放可以在設備、電路、晶片以及系統方面帶來巨大的好處。

[03:58] Time scaling is the ultimate goal that system evolution has always pursued.
  時間縮放是系統演進一直追求的最終目標。

[04:06] Continuously raising operating frequency
  持續提高工作頻率

[04:09] for higher performance.
  以獲得更高的性能。

[04:12] At the device level, is it really hard to shift to time scaling?
  在設備層面上，轉向時間縮放真的困難嗎？

[04:20] Quite the opposite.
  恰恰相反。

[04:24] A device's typical operating time is at the picosecond or nanosecond level.
  設備的典型運行時間在皮秒或納秒級別。

[04:32] Even without geometrical shrinking, we can gain performance through front-end and back-end RC optimization.
  即使沒有幾何收縮，我們也可以通過前端和後端 RC 優化來提高性能。

[04:42] Because tau itself equals the RC product.
  因為 tau 本身等於 RC 乘積。

[04:48] Examples such as high-k metal gate and strained silicon are very good technologies that were successfully introduced at past nodes.
  例如高介電常數金屬閘極、應變矽等非常好的技術，已在過去的節點中成功引入。

[05:00] They all can improve device performance.
  所有這些都可以提高設備性能。

[05:09] So, from devices to circuits and
  所以，從設備到電路以及

[05:11] chips, evolution can be centered around tau.
  晶片，演進可以圍繞 τ 展開。

[05:16] From picosecond to nanosecond to second.
  從皮秒到奈秒到秒。

[05:22] In total, 12 full orders of magnitude.
  總共，12個完整的數量級。

[05:28] At the circuit level, signal propagation time is linked to interconnect RC parasitics, pipeline lengths, and circuit depths.
  在電路層級，訊號傳播時間與互連的電阻電容寄生效應、管線長度以及電路深度相關。

[05:42] At the system level, architecture innovation and system optimization dominate, enabled by device and process improvements.
  在系統層級，架構創新和系統優化主導，這得益於元件和製程的改進。

[05:54] Using time as the optimization target for the entire electronic system is more comprehensive, more consistent, and more seamless.
  將時間作為整個電子系統的優化目標，它更全面、更一致、更無縫。

[06:11] Let me show you what time scaling
  讓我向您展示時間縮放

[06:14] has meant in our products.
  在我們的產品中意味著什麼。

[06:22] Let's start with mobile.
  讓我們從手機開始。

[06:26] For a smartphone, one chip is the entire system.
  對於智慧型手機，一個晶片就是整個系統。

[06:32] So, we must work across the device, circuit, and chip levels.
  因此，我們必須跨越裝置、電路和晶片層級進行工作。

[06:39] After 2020, together with our partners, it took a huge effort to bring our mobile chip back to the market.
  2020 年之後，我們與合作夥伴一起，付出了巨大的努力才將我們的行動晶片重新推向市場。

[06:50] Extensive DTCO and STCO co-optimization work delivered pleasing results for our mobile customers.
  廣泛的 DTCO 和 STCO 協同優化工作為我們的行動客戶帶來了令人滿意的結果。

[07:05] But by general expectation, after the Kirin 9030 Pro launched last year,
  但按照普遍的預期，在去年推出 Kirin 9030 Pro 之後，

[07:17] our chip may have reached saturation.
  我們的晶片可能已經達到飽和。

[07:22] It'll be very challenging to maintain the same momentum of evolution.
  要維持同樣的演進動力將會非常困難。

[07:31] But this is also why I'm here to give this speech.
  但這也是我在此發表演講的原因。

[07:38] Have we run out of surprises, or are things about to change?
  我們是否已經用盡了驚喜，還是事情即將改變？

[07:46] Under the tau-centric guideline, we found a new path.
  在以 τ 為中心的指導方針下，我們找到了新的道路。

[08:00] This year, we prepared a surprise for the whole industry.
  今年，我們為整個產業準備了一個驚喜。

[08:07] In fall-winter 2026, we will bring the surprise.
  在2026年秋冬，我們將帶來這個驚喜。

[08:16] Not saturation,
  不是飽和，

[08:19] not continuation,
  不是延續，

[08:21] but a big leap ahead.
  而是向前邁出了一大步。

[08:27] So, how did we achieve this breakthrough
  那麼，我們是如何在飽和區內取得這項突破的？

[08:30] within the saturation zone?

[08:42] When a single die reaches its limits,
  當單一晶片達到其極限時，

[08:46] multi-die approaches naturally come into
  多晶片方法自然會進入

[08:49] play.
  發揮作用。

[08:51] We have adopted
  我們已經廣泛採用了

[08:53] this technology widely.
  這項技術。

[08:57] But the inter-die interconnection
  但是晶片間的互連

[09:00] deserves a close
  值得仔細

[09:03] look.
  研究。

[09:05] The typical
  典型的

[09:07] interconnect pitch has evolved from
  互連間距已從

[09:09] hundreds of microns for BGA balls
  BGA球體的數百微米

[09:13] to around 100 microns for bumps,
  發展到凸塊的約100微米，

[09:17] and 50 to 25 microns for micro bumps.
  以及微凸點的50至25微米。

[09:22] Also, about 10 microns for hybrid bonding.
  此外，混合鍵合約為10微米。

[09:27] We see this play out from HBM3 to HBM5, shifting from micro bumps to hybrid bonding.
  我們看到這種情況在HBM 3到HBM 5之間發生，從微凸塊轉變為混合鍵合。

[09:40] These technologies, including HBM, 3D V-Cache, and others in the industry, improve bandwidth and reduce latency.
  這些技術，包括HBM、3D V-Cache以及業界的其他技術，提高了頻寬並降低了延遲。

[09:58] Those are very good technologies and products, aligned with tau scaling.
  那些都是非常好的技術和產品，與 τ 縮放一致。

[10:03] But for Kirin, where one chip is the entire system,
  但對於 Kirin 而言，一個晶片就是整個系統，

[10:10] they are not enough.
  它們還不夠。

[10:14] We have to take steps forward.
  我們必須向前邁進。

[10:23] What did we do for Kirin?
  我們為 Kirin 做了什麼？

[10:32] Back to the tau scaling process,
  回到 τ 縮放過程，

[10:35] we found the breakthrough.
  我們發現了突破。

[10:38] We named it
  我們稱之為

[10:40] logical folding.
  邏輯折疊。

[10:43] Let me first give a clear definition.
  首先讓我給出一個清晰的定義。

[10:49] Following the time scaling principle,
  遵循時間縮放原則，

[10:52] logical folding is a new but universal
  邏輯折疊是一種新的但普遍的

[10:54] design methodology for digital circuits
  數位電路設計方法學

[10:57] and systems.
  和系統。

[11:00] By sectioning digital systems across
  透過將數位系統劃分到

[11:03] vertically stacked active tiers,
  垂直堆疊的主動層，

[11:07] logical folding optimizes power,
  邏輯折疊優化了功率、

[11:10] performance, density, and cost
  性能、密度和成本，

[11:14] jointly and constantly.
  共同且持續地。

[11:21] With this definition in mind, let me
  記住這個定義，讓我

[11:24] introduce how it works.
  介紹它是如何工作的。

[11:29] Simply speaking, digital circuits split into combinational logic, which implements Boolean functions with no state,
  簡單來說，數位電路分為組合邏輯電路，用於實現沒有狀態的布林函數，

[11:42] and sequential logic, which uses flip-flops to hold state.
  以及序列式電路，使用正反器來保持狀態。

[11:47] The primary timing goal is to reduce the logic depth of circuit paths between flip-flop stages.
  主要的時序目標是減少正反器級之間的電路路徑的邏輯深度。

[11:58] In the back end, placement and routing must balance the clock tree and minimize the critical path delay.
  在後端，佈局和佈線工作必須平衡時脈樹並最小化關鍵路徑延遲。

[12:09] Therefore, critical path timing is the governing metric of digital system performance.
  因此，關鍵路徑時序是數位系統效能的決定性指標。

[12:20] For circuit design, logical folding
  對於電路設計，邏輯折疊

[12:25] aggressively compresses propagation time between adjacent flip-flops.
  積極地壓縮相鄰觸發器之間的傳播時間。

[12:34] By distributing critical path gates across different planes, we shorten signal wiring and lower parasitic RC.
  通過將關鍵路徑門分布在不同的平面上，我們可以縮短信號佈線和降低寄生RC。

[12:50] Clock variation drops sharply.
  時鐘變化急劇下降。

[12:53] Reserved margins are largely eliminated.
  預留裕度在很大程度上被消除。

[12:58] The critical path tightens and the chip runs faster.
  關鍵路徑收緊，芯片運行速度更快。
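
To make the link between critical path timing and clock speed concrete, here is a minimal sketch of a flip-flop-to-flip-flop timing budget. It assumes the textbook model in which the minimum clock period is the sum of clock-to-Q delay, logic delay, wire delay, setup time, and reserved margin. The class, the function, and all delay values are hypothetical and are not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class PathDelays:
    """Delay components of one flip-flop-to-flip-flop path, in picoseconds."""
    clk_to_q: float  # launch flip-flop clock-to-output delay
    logic: float     # combinational gate delay along the path
    wire: float      # interconnect (parasitic RC) delay
    setup: float     # capture flip-flop setup time
    margin: float    # reserved margin for clock variation

    def period_ps(self) -> float:
        """Minimum clock period that this path can meet."""
        return self.clk_to_q + self.logic + self.wire + self.setup + self.margin


def max_frequency_ghz(path: PathDelays) -> float:
    """Convert a period in picoseconds to a frequency in GHz."""
    return 1000.0 / path.period_ps()


# Hypothetical single-layer path versus the same path after folding:
# shorter wiring and a smaller reserved margin, everything else unchanged.
planar = PathDelays(clk_to_q=20, logic=150, wire=60, setup=15, margin=20)
folded = PathDelays(clk_to_q=20, logic=150, wire=40, setup=15, margin=10)

for name, path in (("planar", planar), ("folded", folded)):
    print(f"{name}: period {path.period_ps():.0f} ps, "
          f"f_max {max_frequency_ghz(path):.2f} GHz")
```

In this model only the wire and margin terms change, which mirrors the mechanism described in the talk: shorter wiring, lower parasitic RC, and less reserved margin tighten the critical path.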

[13:07] To enable effective logical folding with 3D logic design, we need a very, very aggressive bonding pitch and fan-out ratio, or gear ratio, of the metal stack.
  為了實現採用 3D 邏輯設計的有效邏輯折疊，我們需要非常非常積極的鍵合間距，以及金屬堆疊的扇出比（或稱齒輪比）。

[13:30] Through many trials, we found that the ratio of hybrid bonding pitch to top metal pitch should be less than three.
  經過多次試驗，我們發現混合鍵合間距與頂部金屬間距的比率應小於三。

[13:42] That will be very good.
  這將會非常好。

[13:45] In other words, we know that the top metal pitch nowadays is 720 nm.
  換句話說，我們知道現今的頂部金屬間距是 720 奈米。

[13:53] That means the hybrid bonding pitch itself should be smaller than 2 micron.
  這意味著混合鍵合間距本身應小於 2 微米。
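
The arithmetic behind this bound can be sketched as follows. The sketch assumes the 720 nm figure is the top metal pitch and uses the ratio limit of three stated in the talk; the function names are hypothetical. The example pitches are the roughly 10 micron hybrid bonding pitch mentioned earlier and the 1.5 micron pitch reported later in the talk.

```python
MAX_PITCH_RATIO = 3.0       # hybrid bonding pitch / top metal pitch, per the talk
TOP_METAL_PITCH_NM = 720.0  # assumed to be the top metal pitch


def max_bonding_pitch_nm(top_metal_pitch_nm: float,
                         max_ratio: float = MAX_PITCH_RATIO) -> float:
    """Upper bound on hybrid bonding pitch implied by the ratio rule."""
    return top_metal_pitch_nm * max_ratio


def meets_folding_rule(bonding_pitch_nm: float,
                       top_metal_pitch_nm: float = TOP_METAL_PITCH_NM) -> bool:
    """True when the pitch ratio is strictly below the limit."""
    return bonding_pitch_nm / top_metal_pitch_nm < MAX_PITCH_RATIO


bound_nm = max_bonding_pitch_nm(TOP_METAL_PITCH_NM)
print(f"bonding pitch bound: {bound_nm:.0f} nm ({bound_nm / 1000:.2f} micron)")

for pitch_nm in (10_000.0, 1_500.0):
    verdict = "meets" if meets_folding_rule(pitch_nm) else "does not meet"
    print(f"{pitch_nm / 1000:.1f} micron pitch {verdict} the ratio rule")
```

The bound works out to 3 x 720 nm = 2160 nm, which is consistent with the rounded "smaller than 2 micron" target.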

[14:05] At this very moment, the logical folding miracle occurs.
  就在此刻，邏輯折疊奇蹟發生了。

[14:19] Kirin 2026 is here.
  Kirin 2026 登場了。

[14:27] Going to market later this year, it marks our first-ever successful implementation of logical folding.
  預計今年晚些時候上市，這標誌著我們首次成功實現邏輯折疊。

[14:39] It is built on a brand-new 3D logic design concept, expanding from a single-layer to a double-layer architecture.
  它建立在一個全新的 3D 邏輯設計概念之上，從單層擴展到雙層架構。

[14:51] Before logical folding, it took 3 years to lift the transistor density from 126 to 155 million transistors per square millimeter.
  在邏輯折疊之前，需要花費 3 年時間才能將晶體管密度從每平方毫米 1.26 億顆提升到 1.55 億顆。

[15:05] That's from a fabrication perspective.
  這是從製造角度來看的。

[15:10] In 2026, logical folding takes it all the way to 238 million transistors per square millimeter in one single step.
  到 2026 年，邏輯折疊將在單一步驟中將其提升到每平方毫米 2.38 億顆晶體管。

[15:28] At the same time, the SoC performance-core power efficiency improved by 41%, and the maximum clock frequency increased by nearly 13%.
  與此同時，SOC 效能核心的功耗效率提高了 41%，最高時脈頻率增加了近 13%。
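
For a sense of scale, the density figures quoted above (126, 155, and 238 million transistors per square millimeter) can be turned into relative gains. The helper names below are hypothetical, and the annualized figure assumes smooth compounding over the three years, which the talk does not claim.

```python
def relative_gain(old: float, new: float) -> float:
    """Fractional improvement going from old to new."""
    return new / old - 1.0


def annualized_gain(old: float, new: float, years: float) -> float:
    """Equivalent constant yearly gain, assuming smooth compounding."""
    return (new / old) ** (1.0 / years) - 1.0


# Densities quoted in the talk, in million transistors per mm^2.
D_START, D_BEFORE_FOLDING, D_FOLDED = 126.0, 155.0, 238.0

gain_three_years = relative_gain(D_START, D_BEFORE_FOLDING)
gain_folding_step = relative_gain(D_BEFORE_FOLDING, D_FOLDED)
gain_per_year = annualized_gain(D_START, D_BEFORE_FOLDING, years=3)

print(f"three years before folding: {gain_three_years:.1%}")
print(f"  equivalent per year:      {gain_per_year:.1%}")
print(f"single folding step:        {gain_folding_step:.1%}")
```

On these numbers the single folding step is worth a little over twice the gain of the previous three years combined.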

[15:49] To realize logical folding, together with partners, we drove a wave of innovation at the device and process levels.
  為了與合作夥伴一起實現邏輯折疊，我們推動了設備和製程層級的創新浪潮。

[16:02] For hybrid bonding, the pitch is now sub-2 micron.
  混合鍵合，間距現在小於兩微米。

[16:05] Actually, it's 1.5 micron.
  實際上，它是 1.5 微米。

[16:10] The alignment overlay error is under 0.5 micron.
  對齊疊加誤差小於 0.5 微米。

[16:15] And with smarter redundancy, the yield reached 100%.
  而藉由更智慧的冗餘，良率達到了 100%。

[16:21] For TSV, the critical dimension and the keep-out zone scale down to sub-1.5 micron.
  對於 TSV，關鍵尺寸和隔離區縮小到 1.5 微米以下。

[16:30] Pitch is sub-6 micron, with a failure rate below 100 parts per million and a repair rate of 99.9%.
  間距低於六微米，故障率低於百萬分之一百，修復率為百分之九十九點九。

[16:43] The technical journey continues.
  技術之旅仍在繼續。

[16:47] We are developing low temperature hybrid bonding to optimize thermal budget and moving TSV landing from the top metal down to the metal six or metal five.
  我們正在開發低溫混合鍵合以優化熱預算，並將 TSV 著陸從頂部金屬移至金屬六或金屬五。

[17:02] This brings more than 30% more high-level metal routing resources.
  這帶來超過 30% 的額外高層金屬佈線資源。

[17:10] These innovations may not all be in mass production this year, but we are introducing them into mass products from this year and beyond.
  這些創新今年可能無法全部投入量產，但我們將從今年起陸續將其引入量產產品。

[17:31] Going further along tau scaling, we redesigned several critical circuits.
  沿著 τ 縮放進一步推進，我們重新設計了幾個關鍵電路。

[17:38] Borrowing from high-performance computing chips, for the data bus we built a high-speed global network-on-chip bus using the top metal layers on both the upper and lower dies.
  借鑒高性能計算芯片，在數據總線方面，我們利用上下兩個芯片的頂層金屬層構建了一個高速的全局片上網絡總線。

[17:55] This shortens transmission and gives more stable power delivery.
  這縮短了傳輸距離，並提供更穩定的供電。

[18:03] It cut the data bus footprint by over 60%.
  將數據總線的佔用空間減少了 60% 以上。

[18:09] Also, for the clock bus, an innovative architecture enables post-silicon clock scheme adjustment.
  此外，對於時鐘總線，創新的架構能夠進行後硅時鐘方案的調整。

[18:21] This design contributes over 5% SoC performance improvement on its own.
  為此設計本身貢獻了超過 5% 的 SOC 性能提升。

[18:28] On this page, I will use SRAM
  這一頁，我將用 SRAM

[18:31] as another example to further explain circuit folding.
  作為另一個例子來進一步解釋電路折疊。

[18:37] SRAM performance depends not only on transistors. Access speed, per-bit energy, and area efficiency depend strongly on interconnect length, such as bit lines and word lines, as well as the communication delay between the bit cell array and the peripherals.
  SRAM 的效能不僅取決於電晶體。存取速度、每位元能量以及面積效率強烈取決於互連線長度，例如位元線和字元線，以及位元單元陣列與周邊電路之間的通訊延遲。

[19:06] As array size grows, interconnect and communication delay dominate over the transistor's intrinsic delay.
  隨著陣列尺寸的增大，互連線和通訊延遲相對於電晶體本身的延遲佔據主導地位。

[19:19] For a 100 1 megabit SRAM today, they account for over 70% of total latency.
  對於今日的 100 1 兆位元 SRAM 而言，它們佔總延遲的 70% 以上。

[19:29] We apply the logical folding to SRAM.
  我們將邏輯折疊應用於 SRAM。

[19:33] First, folding shortens the critical path by reducing the distance between the bit cell array, peripheral circuits, and processing cores.
  首先，折疊縮短了關鍵路徑，減少了位單元陣列、外圍電路和處理核心之間的距離。

[19:49] Second, we optimize the RC of each SRAM component.
  其次，我們優化了每個 SRAM 組件的 RC。

[19:56] Overall, SRAM access latency was reduced, energy per bit dropped, and operating frequency rose by over 40%.
  總體而言，SRAM 訪問延遲降低，每位能量下降，操作頻率提升了 40% 以上。

[20:09] This is a huge number, especially for SRAM.
  這是一個巨大的數字，特別是對於 SRAM 而言。

[20:14] It's really very hard to achieve this at an advanced node.
  在先進節點上要實現這一點真的非常困難。

[20:26] One more thing.
  還有一件事。

[20:28] Logical folding brings benefits for processing cores, especially for clock
  邏輯折疊為處理核心帶來了優勢，特別是對於時鐘

[20:35] tree performance.

[20:39] For this processing core,

[20:42] by shifting to double layer folding

[20:45] architecture,

[20:47] clock buffer count dropped by more than

[20:51] 50%.

[20:53] We know each buffer has at

[20:56] least four transistors.

[21:01] And the clock skew is down by 25%.

[21:06] That's so friendly for the digital system.

[21:09] And the wire length is down by around 30%.

[21:14] These are also very hard to achieve

[21:18] in the advanced processing node.

[21:27] After 6 years,

[21:30] we established a

[21:32] preliminary methodology and

[21:35] toolchain for logical folding.

[21:38] For Kirin 2026,

[21:42] we only applied modest folding on

[21:46] key parts.

[21:49] The pitch of hybrid bonding only reached

[21:51] 1.5 micron. And next year, it

[21:55] will reach 1 micron.

[21:59] And the landing level has only taken its

[22:03] first step.

[22:05] Even so,

[22:07] but CPU performance core

[22:10] frequency will reach.

[22:12] This is just the start.

[22:22] >> [applause]

[22:23] >> It took us 6 years to prepare many

[22:27] things

[22:29] like EDA tools, design methodologies.

[22:33] I used to think that it

[22:35] may take us 10 years.

[22:38] But 6 years,

[22:40] we are here.

[22:44] >> [applause]

[22:47] >> In the next 10 years, we will move

[22:51] from local critical-path folding to

[22:55] full-scale,

[22:57] multi-layer folding.

[23:00] For full stack optimization

[23:03] from devices

[23:05] to systems.

[23:07] From 2026

[23:10] to

[23:11] 2035,

[23:13] as the wide range of

[23:16] R&D exploration goes into

[23:19] products,

[23:20] the transistor density will rise.

[23:24] Operating frequency will surge.

[23:28] And we keep delivering cutting-edge

[23:31] mobile chips to the market.

[23:35] Our solution is feasible and

[23:39] affordable.

[23:41] The performance of the new chip can fully

[23:43] compete

[23:45] with that of other parts.

[23:56] This is just the start, and it's the tau

[23:58] story so far for our mobile applications.

[24:03] One might ask

[24:05] and it's a fair question.

[24:08] Can tau scaling, which works

[24:11] in the mini world,

[24:13] the smartphone world,

[24:15] also work in the gigawatt world of AI

[24:20] data centers?

[24:22] AI training and inference are highly

[24:26] parallel

[24:28] unlike mobile.

[24:30] AI is not about a single chip.

[24:34] Formed by hundreds of thousands of

[24:36] chips,

[24:38] AI system is ultra large scale and

[24:41] highly parallel.

[24:44] In one decade,

[24:47] aggregate computing power at this

[24:51] scale has climbed millions of times.

[24:55] Our AI work

[24:57] takes shape as the supernode line. Last

[25:01] year,

[25:02] Ascend 910C

[25:05] opened the super node era.

[25:08] This year, Ascend

[25:11] 950 takes the game to another level.

[25:16] Before

[25:18] 2030, we are using many technologies

[25:22] that the whole industry is using, such

[25:24] as chiplets,

[25:27] 2.5D fan-out, 3D stacking by

[25:31] micro bump, and normal-size hybrid

[25:34] bonding. We will also make some light use of

[25:36] logical folding techniques.

[25:40] But around 2030,

[25:42] we will launch a new Ascend, the full

[25:45] logical folding version with another

[25:48] performance boost.

[25:50] The concept behind

[25:55] Ascend is still tau scaling.

[25:58] Let me explain how we apply the

[26:01] time-centric tau principle to enable

[26:04] powerful AI systems.

[26:11] AI workloads come in two shapes:

[26:15] large-scale training and inference.

[26:19] Two things are clear.

[26:23] First,

[26:25] networks keep growing

[26:27] from one chip to dozens, hundreds, even

[26:30] thousands.

[26:32] Second,

[26:34] look at the bill of energy and the bill

[26:37] of materials.

[26:39] Over 80% of energy goes into moving

[26:44] data.

[26:45] And over 70% of cost goes into storing

[26:50] it.

[26:52] These days, even more goes into

[26:54] storing the data.

[26:59] So, for both training and inference,

[27:03] the win is not just in

[27:06] shortening compute time.

[27:11] It is in shortening the time

[27:15] that data spends moving between

[27:17] chips and inside a chip.

[27:21] In hardware terms:

[27:23] interconnect for inter-chip,

[27:27] memory bandwidth for intra-chip,

[27:31] and raw

[27:33] compute for computation.

[27:36] These are the three kinds of time.

[27:40] For training,

[27:44] the design space is relatively wide.

[27:48] What we care about

[27:50] first is the throughput.

[27:54] The art

[27:56] is to overlap those three kinds of time,

[28:01] so communication never becomes the

[28:03] bottleneck.
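
The idea of overlapping the three kinds of time can be sketched with an idealized model: run one after another, a step costs the sum of interconnect, memory, and compute time; perfectly overlapped, it costs only the largest of the three. Perfect overlap is an assumption made for illustration, and the function names and millisecond values below are hypothetical.

```python
def serial_step_time(t_interconnect: float, t_memory: float,
                     t_compute: float) -> float:
    """Step time when the three phases run one after another."""
    return t_interconnect + t_memory + t_compute


def overlapped_step_time(t_interconnect: float, t_memory: float,
                         t_compute: float) -> float:
    """Step time under ideal overlap: the slowest phase sets the pace."""
    return max(t_interconnect, t_memory, t_compute)


def bottleneck(t_interconnect: float, t_memory: float,
               t_compute: float) -> str:
    """Name of the phase that limits an ideally overlapped step."""
    phases = {"interconnect": t_interconnect,
              "memory": t_memory,
              "compute": t_compute}
    return max(phases, key=phases.get)


# Hypothetical per-step times in milliseconds.
times = dict(t_interconnect=3.0, t_memory=2.0, t_compute=5.0)

print("serial:    ", serial_step_time(**times), "ms")
print("overlapped:", overlapped_step_time(**times), "ms")
print("limited by:", bottleneck(**times))
```

In this model throughput improves only when the limiting phase shrinks, which is why the talk treats the three times as a joint optimization target.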

[28:05] Latency is not the most important thing

[28:08] here.

[28:11] That's for the training system.

[28:16] Inference is different.

[28:19] Inference puts more emphasis on faster

[28:21] response.

[28:23] Each job

[28:24] is much smaller,

[28:26] but must answer in real time.

[28:29] It demands both throughput and low

[28:32] latency.

[28:33] When we talk to an AI,

[28:36] tokens must come back instantly.

[28:41] As soon as possible. Yeah.

[28:43] The sooner the better.

[28:45] So, tokens per second is what we chase.

[28:53] That is why we currently design two kinds

[28:57] of systems, each tuned to its own tau.

[29:01] For the inference supernode on the

[29:04] right,

[29:05] by jointly optimizing interconnect,

[29:09] memory bandwidth, and compute, each

[29:12] generation moves the needle.

[29:17] When we shifted from a single chip to

[29:19] the entire system,

[29:22] communication time becomes

[29:25] critical.

[29:27] Traditional computing is

[29:30] confined inside an iron box.

[29:33] And

[29:34] once outside,

[29:37] interconnect capability drops quickly.

[29:41] What happens when tens of thousands of

[29:43] AI chips work together?

[29:45] We need communication as strong as

[29:48] inside

[29:50] a single iron box.

[29:53] To reduce system time, we introduce a

[29:57] brand-new

[29:59] bus protocol, Unified Bus,

[30:02] or UB.

[30:06] In traditional multi-node

[30:09] multi-AI chip architecture, data

[30:12] exchange often requires

[30:15] complex and redundant protocol

[30:19] conversions.

[30:21] There are still many, many protocols.

[30:24] Yeah, legacy.

[30:27] High latency, low reliability, and high

[30:31] cost.

[30:33] UB unifies interconnect across the

[30:37] entire computer system.

[30:41] The same protocol and hardware are used

[30:43] inside and outside of the box.

[30:46] Through a

[30:47] fully peer-to-peer architecture, we

[30:50] avoid cross-layer conversion latency.

[30:53] This improves reliability and

[30:56] reduces cost.

[30:59] From multiple protocols to one,

[31:03] UB makes large-scale AI system

[31:07] deployment much simpler.

[31:15] One of UB's

[31:17] new features is memory semantics.

[31:22] In the past,

[31:24] cross-layer data transfer had to be

[31:27] encapsulated up to the

[31:30] application layer

[31:31] and then go through complex

[31:34] cross-protocol conversions.

[31:37] With UB, we achieve conversion-free

[31:41] peer-to-peer transmission at the memory

[31:44] semantics layer,

[31:47] significantly reducing memory access

[31:50] latency.

[31:54] This is especially important for

[31:58] ultra-large-scale parallel systems.

[32:02] Through the fully peer-to-peer UB

[32:05] fabric, we have realized a system as one

[32:10] chip

[32:11] with very low latency, a unique total

[32:15] system.

[32:20] Yeah.

[32:21] Another story, also very important.

[32:25] Optical interconnect is another key

[32:28] technology we developed for AI.

[32:32] Packing more chips into one single rack

[32:38] pushes power density and reliability

[32:41] past its limits.

[32:44] All the engineers are

[32:46] now mad.

[32:49] Recently, there has been some bad news about

[32:53] some very famous chips from a famous

[32:58] company, and other famous chips.

[33:01] Bad news, yeah.

[33:03] All this bad news, I think, is for this

[33:06] reason.

[33:10] So for our tau systems, we choose

[33:13] distributed

[33:15] computing across racks.

[33:18] And High One is a key enabler.

[33:24] 400 gigabits per second per AI chip

[33:28] is easily

[33:31] achieved by electrical cables.

[33:35] So for older-generation chips, there was

[33:38] not so much bad news,

[33:41] because at that speed, those

[33:45] electrical cables are really reliable.

[33:54] But scaling to multi-terabits

[33:57] per second,

[33:59] cabling becomes very challenging.

[34:05] Its reach shortens.

[34:08] Cables grow bulky, and we cannot even

[34:11] attach panels for installation. Also,

[34:15] the power and thermal issues become

[34:18] worse.

[34:20] To address these issues, we developed

[34:23] a higher-density optical engine,

[34:26] High One.

[34:29] A single High One is here. A single High

[34:33] One delivers

[34:35] 8 terabit.

[34:38] Matching the unified bus bandwidth

[34:41] for one AI chip.

[34:45] It shrinks the SerDes reach from

[34:49] 100 cm to 5 cm, or 40 in to 2 in.

[34:57] It eliminates bulky cables.

[35:00] That is good for AI system deployment.

[35:04] It is also friendly for power and

[35:07] thermal.

[35:10] High One also extends reach from

[35:14] under a meter to 100 m.

[35:17] Yeah.

[35:19] It makes high-density interconnect for

[35:21] spread-out

[35:24] gigawatt data centers a physical

[35:27] reality.

[35:29] Every memory signal, every

[35:33] interconnect signal, and every ampere of

[35:37] supplied current

[35:39] must cross the logic die's edge

[35:43] to reach the compute circuits inside.

[35:50] As the chip grows, with more transistors

[35:53] inside,

[35:55] the edge

[35:57] gets more congested.

[36:00] If the chip

[36:02] has a

[36:04] side length n,

[36:08] compute power scales as n squared.

[36:13] But memory bandwidth,

[36:16] interconnect, and power delivery, all

[36:19] carried by

[36:21] 2.5D fan-out along the edge,

[36:25] only scale as n.

[36:31] So, the widening gap between these quadratic

[36:35] and linear curves

[36:38] is the fan-out

[36:40] dilemma that stalls

[36:43] 2.5D scaling.
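
The gap between quadratic and linear scaling can be sketched numerically. The model below assumes a square die of side length n in arbitrary units, compute proportional to area, edge-routed resources proportional to perimeter, and surface-routed resources proportional to area. The function names and unit constants are hypothetical.

```python
def compute_capacity(n: float) -> float:
    """Compute scales with die area: n squared."""
    return n * n


def edge_capacity(n: float) -> float:
    """Resources routed along the edge scale with perimeter: 4n."""
    return 4.0 * n


def surface_capacity(n: float) -> float:
    """Resources routed vertically through the surface scale with area."""
    return n * n


print(" n  compute/edge  compute/surface")
for n in (1, 2, 4, 8, 16):
    edge_gap = compute_capacity(n) / edge_capacity(n)
    surface_gap = compute_capacity(n) / surface_capacity(n)
    print(f"{n:2d}  {edge_gap:12.2f}  {surface_gap:15.2f}")
```

The compute-to-edge ratio grows linearly with n, while the compute-to-surface ratio stays constant, which is the reason given in the talk for moving power, memory, and optical IO onto the surface.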

[36:49] System folding breaks this dilemma

[36:54] by moving power delivery, high-speed

[36:57] memory, and optical IO into the vertical

[37:01] direction

[37:02] onto the surface

[37:06] instead of around the edges. All of

[37:10] these scale quadratically

[37:13] matching the n square pace of compute.

[37:16] We have also heard of a lot of very good

[37:18] technologies, like backside PDN or

[37:21] on-face integrated SRAM or DRAM

[37:25] memories. Those work also for

[37:28] this kind of

[37:29] reason.

[37:31] They are also this kind of trial,

[37:34] also because of the

[37:36] scaling dilemma.

[37:41] So, we think

[37:43] this is the path for the

[37:46] next 10 years.

[37:48] And it is already on the way.

[37:53] Along the tau scaling path, we expect

[37:56] to increase the hardware integration by

[37:59] more than 100 times by

[38:02] 2035.

[38:04] Last uh

[38:05] but not least

[38:09] for high-performance computing systems,

[38:12] memory and logic are not separate

[38:15] domains.

[38:17] They are two sides of the same coin.

[38:21] In the 8086 era,

[38:25] the industry decoupled processors and

[38:29] memories through standardized

[38:33] memory buses,

[38:34] memory IOs.

[38:37] That separation enabled both industries to

[38:42] scale independently.

[38:45] Processor performance advanced rapidly

[38:50] for different applications,

[38:53] while memory vendors grew alongside the

[38:57] expansion

[38:59] of the computer market.

[39:04] The AI era is now reversing that trend.

[39:10] Exploding compute density is pushing

[39:13] memory bandwidth, latency, power, and

[39:15] packaging to their limits,

[39:21] forcing logic and memory into

[39:23] ever-tighter integration through

[39:26] technologies such as HBM, 3D

[39:29] packaging, and so on.

[39:32] We call it 3D folding.

[39:34] As data movement becomes as critical

[39:38] as computation itself,

[39:40] the balance of influence is more and

[39:43] more shifting toward the memory sector.

[39:48] We see that now.

[39:51] However, for the whole industry,

[39:54] lasting success will belong to those who

[39:57] can fuse logic and the memory.

[40:03] And just as important,

[40:06] build an

[40:07] economic partnership that allows

[40:11] both the memory and the processor

[40:13] industries to share the benefits

[40:17] of the fusion over the long term.

[40:30] After all this practice, we know many

[40:33] challenges remain.

[40:39] First,

[40:41] folding needs new design methodologies

[40:44] and toolchains;

[40:46] traditional tools are not yet sufficient

[40:49] for full-scale 3D logic design.

[40:53] We have done preliminary development

[40:56] with

[40:57] useful results.

[40:59] And the details will be released and

[41:02] published

[41:03] over the coming months.

[41:05] You may also find more information

[41:08] during our industry forum tomorrow

[41:11] afternoon.

[41:14] We warmly welcome partners and experts

[41:18] in this field to join us for

[41:21] future improvements.

[41:29] Second, thermal management is

[41:31] another major challenge. Chip power,

[41:34] especially thermal design power, keeps

[41:37] growing each year.

[41:39] The thermal pressure spans devices,

[41:42] circuits, chips, and systems, also 12

[41:45] orders of magnitude

[41:48] from milliwatt to gigawatt.
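
The 12 orders of magnitude quoted for power can be checked from the SI prefix exponents alone. This is a minimal sketch; the table and function name are hypothetical helpers, not anything defined in the talk.

```python
# Powers of ten for the SI prefixes relevant here; "" is the unprefixed unit.
SI_EXPONENT = {
    "pico": -12, "nano": -9, "micro": -6, "milli": -3,
    "": 0, "kilo": 3, "mega": 6, "giga": 9,
}


def orders_of_magnitude(low_prefix: str, high_prefix: str) -> int:
    """Number of decades between two SI-prefixed quantities of the same unit."""
    return SI_EXPONENT[high_prefix] - SI_EXPONENT[low_prefix]


# Milliwatt-class devices up to gigawatt-class data centers.
power_span = orders_of_magnitude("milli", "giga")
print(f"milliwatt to gigawatt: {power_span} orders of magnitude")
```

The same helper applies to any pair of prefixes, for example `orders_of_magnitude("pico", "")` for the picosecond-to-second range of time scales.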

[41:53] We can identify the key issues

[41:57] at every level.

[42:02] To address transient current, we

[42:05] developed high-density integrated

[42:08] capacitors.

[42:11] At the chip interface, we need to

[42:14] control and optimize thermal

[42:17] resistance, conduction, and so

[42:20] on.

[42:22] Looking ahead,

[42:24] we would like to work with industry

[42:27] peers

[42:28] and partners

[42:30] on energy efficiency and thermal

[42:32] challenges along the road map.

[42:40] The road map ahead is challenging for

[42:43] the next 10 years,

[42:45] but the direction is clear and resolute.

[42:52] Six years along the tau scaling path,

[42:56] practice has delivered excellent results.

[43:02] At the circuit level,

[43:04] transistor density

[43:07] by the fabrication standard has climbed

[43:10] from 150

[43:13] toward

[43:15] 240, even 300 million transistors per

[43:19] square millimeter,

[43:22] and is approaching 400

[43:26] million transistors per square

[43:29] millimeter rapidly.

[43:33] For SOC design,

[43:35] effective transistor density climbed

[43:38] from under 100 to more than

[43:41] 250 million transistors per square

[43:44] millimeter.

[43:51] >> [snorts]

[43:51] >> Sustainable density improvement is now

[43:54] within reach.

[43:56] And tau scaling opens a new design

[43:59] space.

[44:01] The CPU performance core is going beyond

[44:05] 5 GHz

[44:07] by 2031.

[44:17] Logical folding holds circuit level

[44:20] efficiency at ISO power.

[44:23] With deep hardware-software

[44:26] co-optimization,

[44:28] Kirin SoC efficiency will more than

[44:31] double in 3 to 5 years on the typical

[44:35] DPU.

[44:38] At the system level,

[44:41] for AI systems, we remain equally

[44:44] confident in delivering

[44:47] high-quality, low-latency,

[44:50] ultra-large-scale

[44:52] solutions.

[45:03] Through practice, we have proved

[45:07] that the tau scaling path is feasible,

[45:11] universal, and sustainable.

[45:16] At different time levels, we can define

[45:19] tau

[45:20] functions as targets to optimize

[45:23] at each level and across the entire

[45:27] system.

[45:28] Here.

[45:34] 6 years

[45:37] 381

[45:39] chips serving different industry sectors

[45:44] markets, and customers.

[45:47] Our vision is to bring digital to every

[45:49] person, home, and organization for a

[45:52] fully connected, intelligent world.

[45:56] This remains

[45:59] our vow.
