An on-chip SRAM AI ASIC is an accelerator where most of the working set (activations, partial sums, and sometimes weights) stays in SRAM located physically on the compute die, rather than being fetched from off-chip DRAM/HBM.
Approximate access latency:

- SRAM: ~0.3–1 ns
- HBM (effective): ~50–100 ns
- DDR: 100+ ns
For token-by-token inference, this difference dominates user-visible latency.
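To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The latencies are midpoints of the ranges above, and the count of serialized memory round-trips per generated token is a made-up illustrative figure, not a measurement.

```python
# Back-of-the-envelope: how per-access latency adds up during one decode step.
# Latencies are midpoints of the ranges above; the round-trip count per token
# is a purely illustrative assumption, not a measured figure.

ACCESS_LATENCY_NS = {
    "on-chip SRAM": 0.7,      # ~0.3-1 ns
    "HBM (effective)": 75.0,  # ~50-100 ns
    "DDR": 120.0,             # 100+ ns
}

SERIAL_ROUND_TRIPS_PER_TOKEN = 10_000  # hypothetical dependent accesses per token

for memory, latency_ns in ACCESS_LATENCY_NS.items():
    per_token_us = SERIAL_ROUND_TRIPS_PER_TOKEN * latency_ns / 1_000
    print(f"{memory:>18}: {per_token_us:8.1f} us of pure access latency per token")
```

Even with this crude model, SRAM contributes single-digit microseconds of pure wait time per token, while HBM and DDR contribute hundreds of microseconds to over a millisecond.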
Approximate energy per access:
- SRAM: ~0.1–1 pJ/bit
- HBM: ~3–5 pJ/bit
- DDR: 10+ pJ/bit
LLMs are often memory-energy limited, not compute-limited.
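To make that concrete, here is a rough sketch of memory energy per generated token. The 7B-parameter, 8-bit, read-weights-once-per-token traffic model is a deliberately simplified assumption.

```python
# Rough memory-energy-per-token estimate from the pJ/bit figures above.
# Assumes a hypothetical 7B-parameter model with 8-bit weights, read once
# per generated token; real traffic patterns are more complicated.

PJ_PER_BIT = {
    "on-chip SRAM": 0.5,  # ~0.1-1 pJ/bit
    "HBM": 4.0,           # ~3-5 pJ/bit
    "DDR": 12.0,          # 10+ pJ/bit
}

PARAMS = 7e9
BITS_PER_WEIGHT = 8
bits_moved_per_token = PARAMS * BITS_PER_WEIGHT

for memory, pj_per_bit in PJ_PER_BIT.items():
    joules = bits_moved_per_token * pj_per_bit * 1e-12  # pJ -> J
    print(f"{memory:>13}: {joules:6.3f} J of memory energy per token")
```

The absolute numbers are crude, but the ratio is the point: the same traffic costs roughly an order of magnitude less energy when it never leaves the die.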
On-chip SRAM also makes timing deterministic:

- No DRAM scheduling, refresh, or bank conflicts
- Enables cycle-accurate pipelines (important for real-time systems)
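A minimal sketch of what cycle-accurate means in practice, assuming invented stage names and cycle counts: with no DRAM in the loop, every stage has a fixed cost, so end-to-end latency is a compile-time constant.

```python
# Minimal sketch of a statically scheduled pipeline: every stage has a fixed
# cycle cost, so total latency is known exactly at compile time. Stage names
# and cycle counts are invented for illustration.

CLOCK_GHZ = 1.0  # 1 cycle per nanosecond

pipeline = [
    ("read weight tile from SRAM", 4),
    ("matmul tile", 64),
    ("accumulate partial sums", 8),
    ("activation", 4),
    ("write result tile to SRAM", 4),
]

total_cycles = sum(cycles for _, cycles in pipeline)
latency_ns = total_cycles / CLOCK_GHZ  # no refresh stalls, no bank conflicts
print(f"end-to-end: {total_cycles} cycles = {latency_ns:.0f} ns, on every run")
```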
How much SRAM fits on one die depends on the chip class:

| Chip class | Typical on-chip SRAM |
|---|---|
| Mobile NPU | 4–32 MB |
| Edge inference ASIC | 32–128 MB |
| Datacenter inference ASIC | 100–300 MB |
| Wafer-scale (Cerebras) | 10s of GB |
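A quick way to use this table is a capacity check like the one below; the model configuration and the wafer-scale budget are hypothetical examples, not vendor specs.

```python
# Quick check: does a model's working set fit in a given on-chip SRAM budget?
# The model configuration and the wafer-scale budget are hypothetical examples.

def working_set_mb(params_billion: float, bits_per_weight: int,
                   kv_cache_mb: float, activation_mb: float) -> float:
    """Total MB that must stay resident for single-chip, all-SRAM inference."""
    weights_mb = params_billion * 1e9 * bits_per_weight / 8 / 1e6
    return weights_mb + kv_cache_mb + activation_mb

SRAM_BUDGET_MB = {
    "Mobile NPU": 32,
    "Edge inference ASIC": 128,
    "Datacenter inference ASIC": 300,
    "Wafer-scale": 40_000,  # "10s of GB", placeholder value
}

# Example: a 1B-parameter model quantized to 4 bits, with a modest KV cache.
need = working_set_mb(params_billion=1, bits_per_weight=4,
                      kv_cache_mb=64, activation_mb=16)

for chip, budget in SRAM_BUDGET_MB.items():
    verdict = "fits" if need <= budget else "must shard or go off-chip"
    print(f"{chip:>26}: need {need:,.0f} MB vs {budget:,} MB budget -> {verdict}")
```

If the working set does not fit, weights must be sharded across chips or streamed from off-chip memory, which reintroduces the latencies discussed earlier.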
Three broad design points have emerged:

- Fully on-chip, statically scheduled inference ASICs: the whole working set lives in on-chip SRAM, executed on a static schedule with no caches. Unmatched token latency, but limited flexibility and capacity.
- Large-SRAM accelerators: large SRAM buffers feeding matrix-centric compute, used as a training + inference hybrid.
- Wafer-scale engines (Cerebras): wafer-scale SRAM + compute that avoids off-chip memory entirely. Extreme cost, and extreme performance for certain models.
On-chip SRAM designs pay off most for:

- Ultra-low-latency LLM inference
- Real-time systems (finance, robotics, telecom)
- Edge or power-constrained environments
- Predictable workloads with known model shapes