GPUs are an interim solution

This post was edited at 2025-11-03 12:13:03 by regular user 胡雪鹽8.

The sharp point here, one that aligns with how senior architects at NVIDIA, Cerebras, and Google are said to describe the situation privately, is this:

GPUs are a stop-gap, not the end state.
They were the first massively parallel platform conveniently available when deep learning exploded — but their architecture is fundamentally a compromise.

Let me break down why GPUs are an interim solution and what the true “post-GPU” era will look like.


Why GPUs Are Only an Interim Solution

1. GPUs were designed for graphics, not AI

The core assumptions behind GPUs:

  • Embarrassingly parallel pixel shading

  • Predictable memory access patterns

  • Small working sets compared to LLMs

  • Regular grids and textures, not trillion-parameter models

AI workloads break nearly all these assumptions.


2. GPUs still depend on the classical memory hierarchy

Even with HBM, GPUs still have:

  • Huge data movement overhead

  • Separate compute and memory

  • Power wasted shuttling weights

In modern LLM training:

  • >70% of energy is data movement, not math

  • Bandwidth, not FLOPs, is the limiting factor

This architecture is unsustainable as models scale to 10T–100T parameters.
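
A back-of-envelope sketch of that energy split, assuming roughly 1 pJ per FLOP for the datapath and on the order of 100 pJ per byte for off-chip memory access (both are ballpark literature figures, not specs for any particular GPU):

```python
# Back-of-envelope: energy split for one decode-step matrix-vector
# product, where every weight is read from off-chip memory once.
# Both energy constants are rough assumptions for illustration.

PJ_PER_FLOP = 1.0         # assumed datapath energy per FLOP (pJ)
PJ_PER_DRAM_BYTE = 100.0  # assumed off-chip access energy per byte (pJ)

def movement_energy_fraction(d_model: int, bytes_per_weight: int = 2) -> float:
    """One (d_model x d_model) weight matrix applied to one token."""
    flops = 2 * d_model * d_model                  # multiply + add per weight
    weight_bytes = d_model * d_model * bytes_per_weight
    e_math = flops * PJ_PER_FLOP
    e_move = weight_bytes * PJ_PER_DRAM_BYTE
    return e_move / (e_math + e_move)

print(f"{movement_energy_fraction(8192):.0%} of energy is data movement")  # ~99%
```

Under these assumptions the movement share is even higher than the >70% figure above; batching and cache reuse pull it back down, which is exactly why training pipelines fight so hard for arithmetic intensity.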


3. Tensor cores are a bolt-on

Tensor cores are essentially a grafted-on matrix accelerator:

  • Not tightly integrated with the memory fabric

  • Still bottlenecked by HBM bandwidth

  • Still forced through CUDA, which adds overhead

They improve throughput but don’t fix the fundamental architectural mismatch.
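
One way to quantify that bottleneck is a roofline-style check: compare the arithmetic intensity needed to keep the tensor cores fed against what a decode-time matrix-vector product actually delivers. The peak numbers below are ballpark H100-class figures, used here only as assumptions:

```python
# Roofline-style sketch: can HBM keep the tensor cores busy?
# Peak figures are ballpark H100-class assumptions, not exact specs.

PEAK_TFLOPS = 990.0    # assumed dense BF16 tensor-core peak (TFLOP/s)
PEAK_HBM_TBPS = 3.35   # assumed HBM bandwidth (TB/s)

# FLOPs per byte required to saturate the tensor cores ("machine balance"):
balance = PEAK_TFLOPS / PEAK_HBM_TBPS        # ~295 FLOP/byte

# A decode-step GEMV performs ~2 FLOPs per 2-byte weight it streams in:
gemv_intensity = 2 / 2                       # 1 FLOP/byte

print(f"machine balance:     {balance:.0f} FLOP/byte")
print(f"utilization ceiling: {gemv_intensity / balance:.2%}")  # ~0.3%
```

At one FLOP per byte, the tensor cores sit idle more than 99% of the time during single-stream decoding, no matter how fast they are.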


4. GPUs scale poorly at cluster size

Large AI systems require:

  • Global synchronization

  • Fast model-parallel communication

  • Distributed memory structures

Even NVLink / NVSwitch clusters hit limits around the 10k–20k GPU scale:

  • Latency balloons

  • Interconnect becomes the bottleneck

  • Training efficiency drops massively

For trillion-scale models, GPUs are already the weak link.
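
The synchronization cost can be sketched with the standard ring all-reduce model, in which each step moves roughly 2(N-1)/N times the gradient volume over every link, plus a latency term that grows linearly with N. Link bandwidth, hop latency, and model size below are illustrative assumptions:

```python
# Sketch: per-step gradient sync time under a classic ring all-reduce
# cost model. All constants are illustrative assumptions, not
# measurements of any real cluster.

def ring_allreduce_seconds(params: float, n: int, bytes_per_grad: int = 2,
                           link_gbps: float = 400.0,
                           hop_latency_s: float = 5e-6) -> float:
    """Cost model: 2*(N-1)/N * data/bandwidth + 2*(N-1) * hop latency."""
    data_bytes = params * bytes_per_grad
    bw_bytes_per_s = link_gbps / 8 * 1e9            # Gbit/s -> bytes/s
    bandwidth_term = 2 * (n - 1) / n * data_bytes / bw_bytes_per_s
    latency_term = 2 * (n - 1) * hop_latency_s      # balloons with N
    return bandwidth_term + latency_term

for n in (1_024, 20_480):
    t = ring_allreduce_seconds(params=1e12, n=n)
    print(f"N={n:>6}: ~{t:.1f} s to sync gradients of a 1T-param model")
```

The bandwidth term barely improves as workers are added, while the latency term keeps growing, which is the arithmetic behind "interconnect becomes the bottleneck."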


What Comes After GPUs (The True Long-Term Architecture)

1. Compute-In-Memory (CIM / PIM)

Instead of moving data to compute:
move compute into memory.

This avoids the von Neumann bottleneck entirely.

Startups like Rain AI and Mythic are early proof points.
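
A minimal numpy sketch of the principle: program the weight matrix once into an array of conductances, apply the input as voltages, and read the matrix-vector product out as summed currents, so the weights never move. The quantization level count and noise scale are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def crossbar_matvec(weights: np.ndarray, x: np.ndarray,
                    levels: int = 256, noise: float = 0.01) -> np.ndarray:
    """Idealized analog crossbar: weights stay in place as conductances;
    the cost is quantized weights and noisy analog summation."""
    w_max = np.abs(weights).max()
    half = levels // 2
    g = np.round(weights / w_max * half) / half * w_max   # programmed levels
    y = g.T @ x                                           # Kirchhoff summation
    return y + noise * np.abs(y).max() * rng.standard_normal(y.shape)

W = rng.standard_normal((512, 512)) / 512**0.5
x = rng.standard_normal(512)
err = np.linalg.norm(crossbar_matvec(W, x) - W.T @ x) / np.linalg.norm(W.T @ x)
print(f"relative error vs digital: {err:.3f}")  # small but nonzero
```

The trade is explicit in the sketch: zero weight movement in exchange for quantization and analog noise, which is why CIM efforts lean on noise-tolerant inference first.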


2. Wafer-scale engines (WSE)

The Cerebras WSE-3 demonstrates:

  • Giant monolithic silicon

  • All memory local

  • No multi-GPU communication

  • Full-model training on-die

This is much closer to the eventual direction than GPUs.
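
Whether "full-model training on-die" holds is simple arithmetic. A minimal sketch, assuming the commonly cited 44 GB on-chip SRAM figure for WSE-3 and a rough 16-bytes-per-parameter mixed-precision training footprint (2-byte weight, 2-byte gradient, 12 bytes of Adam state):

```python
# Sketch: largest model that trains entirely in on-die SRAM.
# 44 GB is the commonly cited WSE-3 SRAM figure; the 16 B/param
# training footprint is an assumption, not a Cerebras number.

ON_DIE_GB = 44
BYTES_PER_PARAM = 2 + 2 + 12   # weight + gradient + Adam state

max_params = ON_DIE_GB * 1e9 / BYTES_PER_PARAM
print(f"fits on-die for training: ~{max_params / 1e9:.1f}B params")  # ~2.8B
```

Larger models rely on Cerebras's weight streaming from external memory, so the decisive advantage is the on-wafer fabric rather than raw capacity.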


3. AI-native distributed memory systems

Think:

  • Unified global memory for the entire cluster

  • Hundreds of TB of accessible memory

  • Zero-copy weight sharing

This is where CXL and UCIe will converge.


4. Optical or analog compute

Optical neural networks promise:

  • Orders of magnitude lower energy per MAC

  • Natural support for matrix ops

  • Massive parallelism

This eliminates electrical resistance limits entirely.
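
The claimed gap reduces to simple arithmetic. The per-MAC energies below (about 1 pJ digital, about 5 fJ optical) are rough ballparks from the photonic-computing literature, used here only as assumptions:

```python
# Sketch: claimed energy gap between digital and optical MACs.
# Both per-MAC figures are rough literature ballparks (assumptions).

E_DIGITAL_MAC_J = 1e-12   # assumed ~1 pJ per digital MAC
E_OPTICAL_MAC_J = 5e-15   # assumed ~5 fJ per optical MAC

MACS_PER_TOKEN = 1e12     # ~1 MAC per weight for a 1T-param forward pass

print(f"digital: {MACS_PER_TOKEN * E_DIGITAL_MAC_J:.3f} J/token")  # 1.000
print(f"optical: {MACS_PER_TOKEN * E_OPTICAL_MAC_J:.3f} J/token")  # 0.005
print(f"gap:     {E_DIGITAL_MAC_J / E_OPTICAL_MAC_J:.0f}x")        # 200x
```

Even these assumptions give a two-orders-of-magnitude gap, before counting the data-movement savings.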


5. Direct silicon photonics interconnect

Rather than electrical GPU peer-to-peer networks:

  • Photonic mesh

  • Terabyte-per-second-class chip-to-chip bandwidth

  • Ultra-low latency

This is essential for training 100T-scale models.

All replies:

Today's AI has probably reached its limit, or rather walked into a dead end. -隻關心中股- 11/03/2025 12:21:50

It has driven the economy, and there is still a market for basic, simple applications, but the overall direction is problematic. -胡雪鹽8- 11/03/2025 12:27:00

I have always thought NVIDIA's GPU compute story is a scam. -隻關心中股- 11/03/2025 12:32:33

A pig that accidentally landed in the tailwind. -胡雪鹽8- 11/03/2025 12:35:49

TSMC is not allowed to fab for Huawei; otherwise NVIDIA would hardly matter. -隻關心中股- 11/03/2025 12:40:49

Huawei used to be hardware-first and had no OS platform yet. -胡雪鹽8- 11/03/2025 12:53:06

The phone itself is the platform; the OS is like a waiter. -隻關心中股- 11/03/2025 12:58:35

There is an article saying top AI conferences are really a contest of GPUs. -外鄉人- 11/03/2025 12:42:53

Right now there is only one pig. -胡雪鹽8- 11/03/2025 12:47:46
