trnsci blog

trntensor v0.12.0: the last NotImplementedError — completing the sharding contract

v0.12.0 closes the last NotImplementedError on the CPU-testable sharding surface. The _execute_sharded function could already name two operand types — output-parallel (sharding dimension maps to an output index) and reduce-parallel (sharding dimension maps to a contracted index) — but a single einsum containing both kinds raised NotImplementedError("Mixed…"). v0.12.0 replaces that raise with a nested dispatch loop. The architectural story is that Trainium's multi-chip topology named this loop structure before the code existed to implement it.
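
As a rough illustration of the nested loop shape described above, the sketch below evaluates `ik,kj->ij` with the first operand sharded along the output index `i` and the second sharded along the contracted index `k`. The helper names (`matmul`, `add`, `mixed_sharded_matmul`) are hypothetical and the code is pure Python on lists; it is not trntensor's `_execute_sharded`, only a sketch under those assumptions.

```python
def matmul(a, b):
    """Dense matmul on lists of lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mixed_sharded_matmul(a_row_shards, b_k_shards):
    """Hypothetical mixed dispatch for ik,kj->ij.

    a_row_shards: row blocks of A (output-parallel, sharded along i).
    b_k_shards:   row blocks of B (reduce-parallel, sharded along k).
    """
    out = []
    for a_block in a_row_shards:        # outer loop: output-parallel shards
        acc = None
        k0 = 0
        for b_block in b_k_shards:      # inner loop: reduce-parallel shards
            k1 = k0 + len(b_block)
            a_slice = [row[k0:k1] for row in a_block]  # matching k range of A
            partial = matmul(a_slice, b_block)
            acc = partial if acc is None else add(acc, partial)  # reduction step
            k0 = k1
        out.extend(acc)                 # concatenate results along i
    return out


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
B = [[1, 0], [0, 1], [2, 2], [3, 1]]
a_shards = [A[:2], A[2:]]   # split along i
b_shards = [B[:1], B[1:]]   # split along k
result = mixed_sharded_matmul(a_shards, b_shards)
reference = matmul(A, B)
```

The inner accumulation stands in for the all-reduce, and the outer `extend` stands in for gathering output shards; the unsharded `reference` gives the same matrix.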

trntensor v0.13.0: precision all the way down

v0.11.0 introduced precision="sr" and wired it to _execute_matmul. Two dispatch paths were never connected: _execute_bmm silently ran at fast rounding regardless of what the caller asked for, and multi_einsum had no precision= parameter at all. v0.13.0 closes both gaps. The fix was small; the reason it matters is architectural.

trntensor v0.11.0: stochastic rounding at the PSUM→SBUF boundary

v0.11.0's headline is one argument: precision="sr" on a matmul dispatch call. The architectural story is that Trainium's PSUM buffer makes stochastic rounding a one-instruction hardware primitive instead of an external library. Anyone using trntensor for iterative BF16 workloads — Krylov solvers, iterative refinement, or long batched contractions — should care about this.
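
For readers unfamiliar with stochastic rounding, here is a minimal software sketch of rounding an FP32 value to BF16 precision by adding random low bits before truncating. It is a generic illustration, not the Trainium instruction or trntensor's implementation; the function names are hypothetical, and NaN, infinity, and overflow near the largest finite value are deliberately not handled.

```python
import random
import struct


def f32_to_bits(x):
    """Reinterpret a Python float (rounded to FP32) as a 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(b):
    """Reinterpret a 32-bit integer as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def bf16_truncate(x):
    """Baseline for comparison: drop the low 16 bits of the FP32 pattern."""
    return bits_to_f32(f32_to_bits(x) & 0xFFFF0000)


def bf16_stochastic_round(x, rng):
    """Add 16 random bits below the BF16 mantissa, then truncate.

    The magnitude rounds up with probability equal to the discarded fraction,
    so the result is unbiased in expectation for finite inputs.
    """
    noisy = (f32_to_bits(x) + rng.getrandbits(16)) & 0xFFFFFFFF
    return bits_to_f32(noisy & 0xFFFF0000)


rng = random.Random(0)
x = 1.0 + 2.0 ** -10          # sits 1/8 of the way between two BF16 values
samples = [bf16_stochastic_round(x, rng) for _ in range(20000)]
sr_mean = sum(samples) / len(samples)
truncated = bf16_truncate(x)
```

Truncation always returns the lower neighbour and loses the small increment every time, while the stochastic samples average out close to `x`. That is why the mode matters for long accumulations in iterative BF16 workloads.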

trntensor v0.10.0: the test surface names the interface

v0.9.0 gated reduce-parallel sharding behind NotImplementedError on both CPU and Trainium — nki.collectives wasn't in NKI 0.3.0 Stable and that was that. v0.10.0 lifts the gate on CPU: reduce-parallel sharding now dispatches through a named mock all-reduce, making the full Phase 4 test surface runnable without hardware. The mock is not a shortcut — it is documentation of the exact interface nki.collectives.allreduce will need to satisfy when SDK 2.30+ ships.
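
To make the idea of a named mock concrete, here is a sketch of what a CPU stand-in for an all-reduce might look like. The name `mock_allreduce` and its list-of-lists signature are assumptions for illustration only; nothing here reflects the actual signature of `nki.collectives.allreduce` or trntensor's mock.

```python
def mock_allreduce(per_rank_partials):
    """Hypothetical CPU stand-in: sum elementwise across ranks.

    Takes one partial result per rank and returns one reduced copy per rank,
    so every rank ends up holding the same full sum.
    """
    total = [sum(vals) for vals in zip(*per_rank_partials)]
    return [list(total) for _ in per_rank_partials]


def reduce_parallel_matvec(a_col_shards, x_shards):
    """y = A @ x with the contracted index split across ranks."""
    partials = [
        [sum(a * b for a, b in zip(row, xs)) for row in a_block]
        for a_block, xs in zip(a_col_shards, x_shards)
    ]
    return mock_allreduce(partials)


# A = [[1, 2, 3], [4, 5, 6]], x = [1, 1, 2], split along the contracted index.
a_col_shards = [[[1, 2], [4, 5]], [[3], [6]]]
x_shards = [[1, 1], [2]]
per_rank_y = reduce_parallel_matvec(a_col_shards, x_shards)
```

The contract the mock pins down is small: as many inputs as ranks, and every rank receives an identical reduced result.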

trnfft: FFT is a GEMM, and then it isn't

trnfft v0.12–v0.15 shipped three new FFT dispatch paths — DFT-GEMM, Stockham radix-4 with twiddle precomputation, and Stockham radix-8 with a Tensor-engine W₈ kernel — producing 20–37% improvements over the butterfly baseline at medium and large N. The architectural argument running through all three is the same: on Trainium, the bottleneck is not arithmetic but engine utilization and kernel launches. Whether that argument holds at a given N, and at what cost, is where most of the engineering work actually lived.
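
The "FFT is a GEMM" half of the title can be sketched in a few lines: a batch of length-N signals times the N×N DFT matrix is one dense matrix product. This is a pure-Python illustration with hypothetical names (`dft_matrix`, `dft_gemm`), not trnfft's DFT-GEMM path, and it makes no claim about which N the approach pays off at.

```python
import cmath


def dft_matrix(n):
    """N x N DFT matrix F with F[j][k] = exp(-2*pi*i*j*k / n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]


def dft_gemm(batch):
    """Transform every row of `batch` at once as a (B x N) @ (N x N) product."""
    n = len(batch[0])
    f = dft_matrix(n)
    return [
        [sum(row[j] * f[j][k] for j in range(n)) for k in range(n)]
        for row in batch
    ]


signals = [
    [1, 0, 0, 0],    # impulse
    [1, 1, 1, 1],    # constant
    [0, 1, 0, -1],   # one cycle of a sine
]
spectra = dft_gemm(signals)
```

The matrix product costs O(N²) per signal instead of O(N log N), which is the "and then it isn't" half: the trade is only worthwhile while a single large matmul beats many small butterfly stages.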

trnsolver v0.10.0: Schur decomposition and the architecture of two-stage algorithms

v0.10.0 ships schur(A), which factors A = Q @ T @ Q.T with T quasi-upper-triangular, closing the last open item in trnsolver's factorization table. The implementation uses Householder-QR throughout, and the reason that works on Trainium is the same reason the earlier Jacobi path for eigh eventually had to grow a second stage: two-stage algorithms present uniform-shape kernel calls, and uniform shapes keep the NEFF cache warm.

trntensor v0.8.0–v0.9.0: dispatch owns routing and placement

v0.8.0 closed a gap where every binary step in a 3+ operand chain bypassed the dispatch layer entirely — calling torch.einsum directly, invisible to the FLOP threshold and the NKI kernel, silently skipping up to 134 GFLOP of Tensor Engine work per chain. v0.9.0 added the first multi-chip abstraction: ShardedTensor and output-parallel einsum dispatch, the foundation for DF-MP2 with basis sets that don't fit in single-chip HBM. Both changes are the same insight applied at different scales: the dispatch layer is where decisions about what code runs and where data lives belong.

trnblas Phase 3: from 215× slower to 3.6× faster in one kernel boundary move

The Phase 2 profiler post closed with an unexpected conclusion: the fused MP2 energy kernel hits its Amdahl ceiling, the remaining gap to the 3× target lives entirely in the step that surrounds it, and the only lever left is the kernel boundary. Phase 3 moved the boundary. The result is a 3.6× end-to-end speedup over the torch baseline at the small bench shape — the first energy path that actually beats chunk-GEMM.

trntensor v0.3.0–v0.7.0: dispatch granularity is the architecture

Phase 1 shipped fused NKI primitives for DF-MP2 and the 4-index AO→MO transform — single-program kernels that outperform the equivalent multi-dispatch CUDA plan sequences in HBM traffic. That architectural argument held. What didn't hold was the assumption that per-contraction dispatch was an acceptable cost in a workload that runs 4,096 of them. Phase 2 through v0.7.0 fixed that, and the fix is the same idea one level up: the dispatch boundary should be at the loop, not the iteration.

trnrand: the integer-multiply gap pointed to a better algorithm

The previous trnrand post closed with: "the silicon just needs one more op to let the library say it out loud." aws-neuron-sdk#1308 is still open. trnrand 0.4.0 ships hardware-validated uniform RNG on trn1 anyway — not by fixing Philox, but by using Threefry4x32-20, the PRNG Salmon et al. designed in the same SC'11 paper for hardware without fast integer multiply. The library said it out loud without waiting for the op.