trntensor¶

April 29, 2026
in Deep dive, trntensor
6 min read

trntensor v0.16.0: extending the precision contract to multi_einsum

v0.15.0 gave einsum() a target_forward_error= argument — the caller specifies a relative error bound, the library picks the cheapest mode that satisfies it. v0.16.0 extends the same contract to multi_einsum(). A precision API that works for single contractions but silently doesn't apply to batched calls is a leaky abstraction, and this is the patch that closes it.

April 27, 2026
in Deep dive, trntensor
7 min read

trntensor v0.15.0: the caller specifies the error; the library picks the mode

The precision scaffolding built across v0.11.0–v0.14.0 gave trntensor four accumulation modes. v0.15.0 gives the caller a single number. The question "which mode?" is now the library's problem, not the caller's.

April 24, 2026
in Deep dive, trntensor
8 min read

trntensor v0.14.0: four modes, four mocks — completing the precision contract

precision="dd" previously raised NotImplementedError everywhere — including CPU. v0.14.0 lifts the CPU gate with two named mock functions, completing a four-mode precision scaffolding that is now fully testable without hardware. CUDA programmers evaluating Trainium for iterative or mixed-precision workloads now have a CPU-testable spec for all four accumulation strategies.

April 22, 2026
in Deep dive, trntensor
8 min read

trntensor v0.12.0: the last NotImplementedError — completing the sharding contract

v0.12.0 closes the last NotImplementedError on the CPU-testable sharding surface. The _execute_sharded function could already name two operand types — output-parallel (sharding dimension maps to an output index) and reduce-parallel (sharding dimension maps to a contracted index) — but a single einsum containing both kinds raised NotImplementedError("Mixed…"). v0.12.0 replaces that raise with a nested dispatch loop. The architectural story is that Trainium's multi-chip topology named this loop structure before the code existed to implement it.

April 22, 2026
in Deep dive, trntensor
6 min read

trntensor v0.13.0: precision all the way down

v0.11.0 introduced precision="sr" and wired it to _execute_matmul. Two dispatch paths were never connected: _execute_bmm silently ran at fast rounding regardless of what the caller asked for, and multi_einsum had no precision= parameter at all. v0.13.0 closes both gaps. The fix was small; the reason it matters is architectural.

April 21, 2026
in Deep dive, trntensor
8 min read

trntensor v0.11.0: stochastic rounding at the PSUM→SBUF boundary

v0.11.0's headline is one argument: precision="sr" on a matmul dispatch call. The architectural story is that Trainium's PSUM buffer makes stochastic rounding a one-instruction hardware primitive instead of an external library. Anyone using trntensor for iterative BF16 workloads — Krylov solvers, iterative refinement, or long batched contractions — should care about this.

April 21, 2026
in Deep dive, trntensor
7 min read

trntensor v0.10.0: the test surface names the interface

v0.9.0 gated reduce-parallel sharding behind NotImplementedError on both CPU and Trainium — nki.collectives wasn't in NKI 0.3.0 Stable and that was that. v0.10.0 lifts the gate on CPU: reduce-parallel sharding now dispatches through a named mock all-reduce, making the full Phase 4 test surface runnable without hardware. The mock is not a shortcut — it is documentation of the exact interface nki.collectives.allreduce will need to satisfy when SDK 2.30+ ships.

April 18, 2026
in Deep dive, trntensor
7 min read

trntensor v0.8.0–v0.9.0: dispatch owns routing and placement

v0.8.0 closed a gap where every binary step in a 3+ operand chain bypassed the dispatch layer entirely — calling torch.einsum directly, invisible to the FLOP threshold and the NKI kernel, silently skipping up to 134 GFLOP of Tensor Engine work per chain. v0.9.0 added the first multi-chip abstraction: ShardedTensor and output-parallel einsum dispatch, the foundation for DF-MP2 with basis sets that don't fit in single-chip HBM. Both changes are the same insight applied at different scales: the dispatch layer is where decisions about what code runs and where data lives belong.

April 17, 2026
in Deep dive, trntensor
7 min read

trntensor v0.3.0–v0.7.0: dispatch granularity is the architecture

Phase 1 shipped fused NKI primitives for DF-MP2 and the 4-index AO→MO transform — single-program kernels that outperform the equivalent multi-dispatch CUDA plan sequences in HBM traffic. That architectural argument held. What didn't hold was the assumption that per-contraction dispatch was an acceptable cost in a workload that runs 4,096 of them. Phase 2 through v0.7.0 fixed that, and the fix is the same idea one level up: the dispatch boundary should be at the loop, not the iteration.

April 15, 2026
in Deep dive, trntensor
9 min read

trntensor: when the kernel boundary is the API

trntensor Phase 1 landed. The 2-index and batched nc_matmul NKI kernels validate on trn1. ContractionPlan.backend now reports "nki" when shapes qualify, "pytorch" otherwise — plan-time transparency about where work will actually land. And two fused multi-step primitives — a DF-MP2 correlation-energy kernel and a 4-index AO→MO integral transform — run contract → elementwise → reduce and contract → SBUF-resident → contract as single NKI programs.

The architectural point isn't "einsum on Trainium." It's that the kernel boundary is the design surface: what cuTENSOR hides behind a Plan object, NKI asks you to lay out in source. More work, but also where a tensor library can become a cuTENSOR superset rather than a port.