Trainium as a numerical-computing substrate
The competitive context
Every major silicon roadmap from 2022 onward has converged on the same architecture: a tensor-native core with FP32 accumulation on BF16/FP8/FP4 inputs, and with FP64 halved, emulated, or absent. The peak-throughput figures below show the trend:
| Chip | FP64 tensor | Low-precision tensor (BF16 unless noted) | Ratio (FP64 : low-precision) |
|---|---|---|---|
| H100 SXM5 | 67 TFLOPS | 989 TFLOPS | 1:14.8 |
| B200 | ~37 TFLOPS | 2,250 TFLOPS | 1:61 |
| B300 Ultra | <1 TFLOPS | ~5,000 TFLOPS | <1:5000 |
| Trn2 (per NeuronCore-v3) | none | 79 TFLOPS BF16 | — |
| Trn3 (per chip) | none | 2,520 TFLOPS FP8 | — |
cuBLAS in CUDA 13 now emulates FP64 DGEMM via the Ozaki scheme on FP8 tensor cores (NVIDIA Developer Blog, January 2026). AMD halved FP64 matrix throughput from MI300X to MI355X. Google TPU has never exposed FP64. Intel Gaudi 3, Cerebras WSE-3, SambaNova SN40L — none.
Trainium was designed without an FP64 legacy to protect. It codified the post-FP64 architecture in 2020, roughly five years before the rest of the industry caught up. This is the context in which trnsci's algorithms make sense: not as workarounds for a missing FP64 path, but as the correct algorithms for hardware the industry is converging on.
The HPL-MxP benchmark — which scores the fastest achievable HPC performance using hardware-native precision with certified FP64-accuracy output — delivers 4–24× over HPL-FP64 on every top-10 system. Jack Dongarra's ISC 2026 keynote stated the conclusion plainly: "The most plausible path to effective zettascale is not brute-force FP64, but certified mixed-precision algorithms."
What Trainium actually is
A Trainium2 chip contains 8 NeuronCore-v3 cores. Each NeuronCore is a heterogeneous compute engine, not a GPU SM:
- TensorEngine: a 128×128 systolic array that consumes a stationary matrix and a moving matrix from SBUF and always accumulates into PSUM. Delivers 158 TFLOPS FP8 / 79 TFLOPS BF16 / 20 TFLOPS FP32 per core. Supported dtypes: FP8 (E4M3/E5M2), BF16, FP16, TF32, FP32, INT8.
- VectorEngine: elementwise/SIMD engine for axpy, LayerNorm, pooling, reductions, and transcendentals, with broadcast/scan semantics. On NC-v3, a new performance mode shares a memory bus with GpSimdE for a 2–4× uplift on selected BF16/FP16 ops; NC-v3 also allows VectorE and ScalarE to access PSUM in parallel, lifting a restriction present on NC-v2.
- ScalarEngine: pointwise non-linear ops (GELU, SIGMOID, EXP, bias+scale).
- GpSimdEngine: 8 fully programmable 512-bit general-purpose vector cores running straight-line C/C++ via Neuron Custom C++ Operators. Each has its own integrated DMA engine on NC-v3. This is the most under-appreciated Trainium feature: it lets the suite offload control flow, rank-revealing logic, adaptive-precision decisions, sparse index computation, and RNG to a Turing-complete engine without leaving the chip or blocking the TensorEngine. trnrand's Philox kernel targets GpSimdE directly.
- PSUM: a 2 MiB dedicated FP32 accumulation buffer, the exclusive write target of the systolic array, addressable by VectorE/ScalarE/GpSimdE for post-processing. SBUF is 24 MiB on NC-v2 and 28 MiB on NC-v3, organized as 128 partitions × 192 KiB or 224 KiB respectively.
Trainium3 (re:Invent 2025, 3 nm process): 2.52 PFLOPS FP8 per chip, 144 GB HBM3e at 4.9 TB/s, MXFP8 and MXFP4 microscaled formats per the OCP specification, and a NeuronSwitch-v1 all-to-all fabric for 144-chip UltraServers with 20.7 TB of shared HBM3e and 362 PFLOPS of MXFP8. The MXFP formats realize in hardware the per-block precision that mixed-precision HODLR, BLR-LU, and progressive-precision multigrid have so far assumed only in theory.
Four architectural principles for scientific computing
1. PSUM is a free FP32 accumulator
PSUM accumulates at FP32, wider than the BF16/FP8 operands staged in SBUF; it is written only by the systolic array, so the accumulation order is deterministic; and it is addressable by every other engine. This makes PSUM the natural target for error-free transformations at systolic scale:
- After a BF16 matmul writes C = A⊗B into PSUM at FP32, VectorE can compute the BF16 head C_hi = fl_BF16(C) and the residual C_lo = C − C_hi directly in PSUM (an in-place error-free split) while the TensorEngine proceeds with the matmul for the next tile. The Ogita–Rump–Oishi Dot2 construction becomes the default rather than an expensive overlay.
- For iterative refinement (Carson–Higham), PSUM is the high-precision residual buffer: compute r = b − Ax with A and x in BF16, accumulate in FP32 in PSUM, then downcast with stochastic rounding (SR). The inner loop can live entirely on-chip.
- trnblas Phase 2 target (trnblas#22): an nc_matmul_compensated kernel delivering FP32-accuracy output from BF16 inputs at roughly 2× the work of a naive BF16 matmul, exploiting PSUM as the hidden FP32 accumulator.
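The error-free split in the first bullet can be emulated on a CPU. The following is a minimal NumPy sketch, not NKI code: `to_bf16` is a hypothetical helper that emulates round-to-nearest-even BF16 by masking FP32 bit patterns, and an FP32 matmul of BF16-representable inputs stands in for the TensorEngine writing into PSUM.

```python
import numpy as np

def to_bf16(x):
    """Round FP32 values to the nearest BF16 value (ties to even), kept as FP32.

    Hypothetical helper for CPU emulation only; NaN/Inf handling is omitted.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # Adding 0x7FFF plus the lowest kept bit implements round-to-nearest-even
    # before the low 16 bits are cleared.
    bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
a = to_bf16(rng.standard_normal((128, 128)))
b = to_bf16(rng.standard_normal((128, 128)))

c = a @ b            # FP32 result, standing in for the PSUM tile
c_hi = to_bf16(c)    # BF16 head
c_lo = c - c_hi      # residual; this subtraction is exact in FP32

split_is_exact = np.array_equal(c_hi + c_lo, c)
residual_is_small = bool(np.all(np.abs(c_lo) <= 2.0**-8 * np.abs(c)))
print(split_is_exact, residual_is_small)
```

The head and residual reproduce the FP32 tile exactly, and the residual is at most 2⁻⁸ of the entry it corrects, which is what lets a second BF16 pass recover most of the accuracy lost by a single downcast.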
2. Engine concurrency is free adaptivity
The four engines run independently and overlap. This lets trnsci amortize adaptive logic:
- Adaptive Ozaki splitting. TensorEngine runs the k-th Ozaki split-product; simultaneously VectorE estimates the residual norm to decide whether to stop. The adaptive decision has effectively zero marginal cost because it runs on the idle engine.
- GMRES-IR inner orthogonalization. The Arnoldi Gram–Schmidt sweep runs on VectorE while TensorEngine runs the next Krylov matvec. trnsolver's cg/gmres paths have the scaffolding; exploitable once Phase 2's compensated-dot primitives land.
- Randomized sketching. For Hutch++ trace estimation, the m/3 Hutchinson queries run on VectorE while TensorEngine builds the low-rank projector Q.
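The Hutch++ split in the last bullet can be written down as a short CPU reference. This is a NumPy sketch under stated assumptions, not trnsci code: `hutchpp` and its arguments are hypothetical names, and the comments only mark which step the text assigns to which engine.

```python
import numpy as np

def hutchpp(matvec, n, m, rng):
    """Estimate trace(A) for a symmetric A with about m matrix-vector products.

    matvec(X) must return A @ X for an (n, k) block X. Hypothetical reference
    implementation of Hutch++ (Meyer, Musco, Musco, Woodruff).
    """
    k = m // 3
    s = rng.choice([-1.0, 1.0], size=(n, k))   # sketch for the range finder
    g = rng.choice([-1.0, 1.0], size=(n, k))   # Hutchinson query vectors

    # Low-rank projector Q (assigned to the TensorEngine in the text).
    q, _ = np.linalg.qr(matvec(s))

    # Hutchinson queries on the deflated operator (assigned to VectorE in the text).
    g_perp = g - q @ (q.T @ g)
    head = np.trace(q.T @ matvec(q))
    tail = np.trace(g_perp.T @ matvec(g_perp)) / k
    return head + tail

rng = np.random.default_rng(0)
n, rank = 200, 10
b = rng.standard_normal((n, rank))
a = b @ b.T                                    # symmetric PSD test matrix of rank 10
estimate = hutchpp(lambda x: a @ x, n, m=30, rng=rng)
exact = np.trace(a)
```

In this test the rank equals m/3, so the projector captures the whole range of the matrix and the estimate matches the exact trace to roundoff. For a general matrix the result is a randomized estimate whose variance comes only from the deflated remainder.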
3. Stochastic rounding is in the ISA
Per the Neuron rounding-modes documentation, stochastic rounding is programmable per-instruction in NKI/NISA from NeuronCore-v2 onward, and globally via NEURON_RT_STOCHASTIC_ROUNDING_EN=1. Connolly, Higham, and Mary (SIAM J. Sci. Comput., 2021) proved that SR rounding errors are mean-zero, which replaces Wilkinson's worst-case n·u inner-product error bound with a √n·u probabilistic bound. This is not a nice-to-have; it is a correctness requirement for BF16 Krylov on long sequences. For BF16 (u ≈ 2⁻⁸), the worst-case bound n·u reaches 1 at n = 256 and guarantees no correct digits beyond that length, whereas √n·u is still about 2⁻⁴ there; any dot product of length n ≥ ~300 should therefore use SR. Trainium makes this free; most other hardware requires a CPU fallback or significant overhead to achieve it.
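The practical difference between round-to-nearest and SR shows up when many small terms are accumulated in BF16. The following NumPy sketch emulates both modes on FP32 bit patterns; `round_nearest_bf16` and `round_stochastic_bf16` are hypothetical helpers for illustration and do not describe how the Neuron hardware implements SR.

```python
import numpy as np

def round_nearest_bf16(x):
    """Emulated round-to-nearest-even to BF16, kept in FP32 storage."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + bias) & np.uint32(0xFFFF0000)).view(np.float32)

def round_stochastic_bf16(x, rng):
    """Emulated stochastic rounding to BF16.

    Adding uniform 16-bit noise before truncation rounds away from zero with
    probability equal to the discarded fraction, so the error has zero mean.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    return ((bits + noise) & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
steps, trials = 2000, 256
inc = np.float32(0.001)          # smaller than half a BF16 ulp at 1.0

acc_rn = np.ones(trials, dtype=np.float32)
acc_sr = np.ones(trials, dtype=np.float32)
for _ in range(steps):
    acc_rn = round_nearest_bf16(acc_rn + inc)        # increment is always rounded away
    acc_sr = round_stochastic_bf16(acc_sr + inc, rng)

rn_final = float(acc_rn[0])      # stagnates at the starting value
sr_mean = float(acc_sr.mean())   # tracks the true sum 1 + 2000 * 0.001 = 3 on average
print(rn_final, sr_mean)
```

Round-to-nearest never leaves 1.0 because every increment falls below half an ulp, which is the correlated-error case behind the n·u bound. The SR accumulators are individually noisy, but their mean lands close to 3.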
4. Determinism is structural
The TensorEngine has a fixed reduction order, and the compiler statically schedules all instructions and DMAs. Bitwise reproducibility is therefore a structural property of trnsci kernels given a fixed seed, not a ReproBLAS-style overlay with 4–10× overhead. The only source of randomness is SR, which is seeded and so reproducible across runs.
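A fixed reduction order matters because floating-point addition is not associative: the same values summed in a different order can give a different FP32 result. A small NumPy illustration follows (the helper `ordered_sum` is hypothetical):

```python
import numpy as np

def ordered_sum(values, order):
    """Accumulate values in FP32, visiting them in the given index order."""
    acc = np.float32(0.0)
    for i in order:
        acc = np.float32(acc + values[i])
    return acc

values = np.array([1e8, 1.0, -1e8], dtype=np.float32)

left_to_right = ordered_sum(values, [0, 1, 2])   # (1e8 + 1) - 1e8: the 1 is absorbed
reordered = ordered_sum(values, [0, 2, 1])       # (1e8 - 1e8) + 1: the 1 survives
repeat = ordered_sum(values, [0, 1, 2])          # same order, same bits
print(left_to_right, reordered, left_to_right == repeat)
```

The first order returns 0.0 and the second returns 1.0, while repeating the first order reproduces it exactly. A statically scheduled reduction is always in the "same order" case.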
Trainium between SM and TPU
Architecturally, Trainium sits between NVIDIA's SIMT model and Google's TPU compiler-mediated model:
| Characteristic | NVIDIA SM | Google TPU | Trainium NKI |
|---|---|---|---|
| Programming model | Per-thread SIMT | Per-tensor XLA | Per-tile, per-engine Python |
| Tile size | Warp (32 lanes) + WMMA | Large systolic slice | 128 × (up to 512) fixed |
| Memory management | Shared memory (explicit) | Compiler-managed | SBUF/PSUM explicit; 128 partitions |
| Control flow | Full (GPU branches) | Graph-static | Static in affine_range; GpSimd for irregular |
| Precision control | Via cuBLAS/cuFFT APIs | Via XLA ops | Per-instruction in NKI |
| Stochastic rounding | No (vendor SR is black-box) | No | ISA-level, per-instruction |
| Accumulator | Hidden in WMMA | Compiler-managed | Named PSUM, addressable by all engines |
Same workload expressed in each model:
| Primitive | NVIDIA SM | Google TPU | Trainium NKI |
|---|---|---|---|
| GEMM tile | Warp-level MMA (WMMA / cutlass) | HLO DotGeneral on MXU | nisa.nc_matmul stationary + moving into PSUM |
| Compensated matmul | cuBLAS + separate kernel | Not standard | TensorEngine matmul → PSUM split via VectorE |
| FFT butterfly stage | Per-thread complex multiply-add | XLA fused reduction + permute | TE multiply, VE add, SBUF swap |
| Jacobi rotation | Warp updates two rows | HLO scatter | TE matmul pair + VE reduction to find max |
| Stochastic rounding | N/A or separate kernel | N/A | nisa.activation(..., round_mode="stochastic") |
| Integer bit logic (RNG) | CUDA thread, shared mem | N/A (JAX) | GpSimdE straight-line C |
What fits and what doesn't
| Area | Fit | Notes |
|---|---|---|
| Dense LU / QR / Cholesky / SVD | ✅ Clean | Flagship; DF-MP2 validated against PySCF to nanohartree agreement |
| Krylov (CG, GMRES, BiCGStab) | ✅ Clean with SR | SR required for BF16 convergence at n ≫ 300 |
| Block-sparse at 128×128 (BSR) | ✅ Clean | Native tile match; CSR is interop only |
| Hierarchical-matrix BLR/HODLR | ✅ Clean | Phase 2+ opportunity; mixed-precision bounds proved |
| FFT at small N | ✅ Clean | DFT-as-GEMM up to 14× at N ≤ 256 |
| FFT at large N | Partial | Error accumulates over log₂(N) stages; compensated butterfly + iterative refinement under development |
| Randomized NLA (RSVD, Hutch++) | ✅ Clean | Only need approximate arithmetic; Phase 3 flagship |
| Monte Carlo / QMC | ✅ Clean | SR-tolerant by construction |
| Tensor contractions / einsum | ✅ Clean | DF-MP2 validated end-to-end |
| Eigensolvers (Jacobi, Householder) | ✅ Clean | Jacobi is a natural NKI kernel (each rotation = 2-row matmul) |
| Mixed-precision multigrid | ✅ Clean | Phase 3 opportunity; progressive-precision framework fits |
| Direct sparse solvers (multifrontal) | Partial | Frontal-matrix updates fit; symbolic analysis on GpSimdE |
| ODE integrators (defect correction) | Partial | Stiff linear/semilinear fits; long nonlinear trajectories do not |
| Long MD trajectories | ❌ Doesn't fit | 10⁸–10¹² steps accumulate roundoff; FP32 is marginal |
| Classical CCSD(T) | Partial | Near-cancellation in correlated sums needs compensated summation everywhere; gated on trnblas#22 |
| Unstructured CSR SpMV | ❌ Doesn't fit | Irregular access hostile to systolic arrays; use CPU fallback |
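The "DFT-as-GEMM" entry in the table refers to evaluating a small DFT as a dense matrix product instead of a butterfly network. A minimal NumPy sketch follows, with the complex product split into four real matmuls because the TensorEngine dtypes listed earlier are all real-valued. The function names are hypothetical, and the sketch says nothing about the quoted speedup.

```python
import numpy as np

def dft_matrix(n):
    """Real and imaginary parts of the n x n DFT matrix exp(-2*pi*i*j*k/n)."""
    k = np.arange(n)
    # Reduce j*k modulo n before scaling so every angle stays within one turn.
    angle = -2.0 * np.pi * (np.outer(k, k) % n) / n
    return np.cos(angle), np.sin(angle)

def dft_as_gemm(x_re, x_im):
    """DFT along axis 0 of a batch of signals, using four real matmuls."""
    n = x_re.shape[0]
    w_re, w_im = dft_matrix(n)
    y_re = w_re @ x_re - w_im @ x_im
    y_im = w_re @ x_im + w_im @ x_re
    return y_re, y_im

rng = np.random.default_rng(0)
n, batch = 256, 16
x = rng.standard_normal((n, batch)) + 1j * rng.standard_normal((n, batch))

y_re, y_im = dft_as_gemm(x.real, x.imag)
reference = np.fft.fft(x, axis=0)
max_err = float(np.max(np.abs((y_re + 1j * y_im) - reference)))
print(max_err)
```

The matrix form costs O(N²) per signal instead of O(N log N), which is why the table restricts it to small N, where dense tiles keep a systolic array busy.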
Scale-out
Trn1: 2 NeuronCore-v2 per chip, 2D torus across 16 chips per instance (800 Gbps EFAv2).
Trn2: 8 NeuronCore-v3 per chip, 1 TB/s chip-to-chip, 2D torus within instance. Trn2 UltraServer: 64 chips, 512 NeuronCore-v3, 6 TB shared HBM, 185 TB/s aggregate, 12.8 Tbps EFAv3. 83.2 PFLOPS FP8 dense per UltraServer.
Trn3: NeuronSwitch-v1 all-to-all fabric, 144 chips per UltraServer, 20.7 TB HBM3e, 362 PFLOPS MXFP8. Dedicated collective-compute engines run in parallel with core compute, so residual-estimation matvecs can overlap with the next Ozaki split.
The scale-out topology suits embarrassingly parallel workloads: Monte Carlo, randomized NLA (RSVD with independent Gaussian sketches per chip), and ensemble PDE workflows. The trnsci Phase 4 roadmap targets these workloads with sharded tensor abstractions, collective ops, and dispatch glue that makes multi-chip operation transparent to the user.
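The embarrassingly parallel pattern can be sketched on a CPU with one independent Philox stream per simulated chip, followed by a single reduction. NumPy's `Philox` bit generator is used here only because the text mentions a Philox kernel; the chip count, the `estimate_pi` function, and the plain sum standing in for an all-reduce are illustrative assumptions, not the trnsci Phase 4 API.

```python
import numpy as np

def estimate_pi(seed, n_chips=16, samples_per_chip=100_000):
    """Monte Carlo estimate of pi from independent per-chip random streams."""
    # Spawn one statistically independent child seed per simulated chip.
    children = np.random.SeedSequence(seed).spawn(n_chips)
    hits = 0
    for child in children:                      # each iteration models one chip
        rng = np.random.Generator(np.random.Philox(child))
        xy = rng.random((samples_per_chip, 2))
        hits += int(np.count_nonzero((xy ** 2).sum(axis=1) <= 1.0))
    # Summing the per-chip counts stands in for an all-reduce across chips.
    return 4.0 * hits / (n_chips * samples_per_chip)

first = estimate_pi(seed=1234)
second = estimate_pi(seed=1234)
print(first, first == second)
```

No chip needs data from any other until the final reduction, and the same seed reproduces the same estimate regardless of the order in which the per-chip work is executed.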