
trnsci blog

Three tracks: monthly suite digests, bylined technical deep-dives from sub-project maintainers when non-trivial work ships, and occasional thinking pieces about Trainium's place in the accelerator landscape.

trnfft: what trn1 and trn2 tell us about the Ozaki frontier

The v0.18 and v0.19 posts claimed hardware precision of O(sqrt(N)·u_bf16²) ≈ 1.6e-5 and O(sqrt(N)·u_bf16⁴) ≈ 2e-9 for the Ozaki modes. The trn1 hardware measurement says those numbers are wrong: both modes deliver ~1.7e-3, equivalent to single-pass BF16. The same characterization was then run on trn2, and the result is identical on both generations. The conclusion is still not "Ozaki is a dead end", but the generational gap theory needs revision.

trntensor v0.16.0: extending the precision contract to multi_einsum

v0.15.0 gave einsum() a target_forward_error= argument — the caller specifies a relative error bound, the library picks the cheapest mode that satisfies it. v0.16.0 extends the same contract to multi_einsum(). A precision API that works for single contractions but silently doesn't apply to batched calls is a leaky abstraction, and this is the patch that closes it.
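The "cheapest mode that satisfies the bound" rule can be sketched in a few lines. This is a minimal illustration, not trntensor's implementation: the mode table, the cost and error numbers, and the helper names pick_mode and pick_modes_for_batch are all hypothetical, and only the "dd" mode name appears elsewhere on this page.

```python
from typing import NamedTuple


class Mode(NamedTuple):
    name: str
    relative_cost: float  # hypothetical cost, 1.0 = cheapest
    error_bound: float    # hypothetical relative forward error bound


# Illustrative table only. These names and numbers are assumptions,
# not values taken from trntensor.
MODES = [
    Mode("fast", 1.0, 1e-2),
    Mode("refined", 2.0, 1e-5),
    Mode("compensated", 4.0, 1e-7),
    Mode("dd", 8.0, 1e-14),
]


def pick_mode(target_forward_error, modes=MODES):
    """Return the cheapest mode whose error bound meets the target."""
    candidates = [m for m in modes if m.error_bound <= target_forward_error]
    if not candidates:
        raise ValueError(
            f"no mode satisfies target_forward_error={target_forward_error}"
        )
    return min(candidates, key=lambda m: m.relative_cost)


def pick_modes_for_batch(targets):
    """Apply the same rule to every contraction in a batch (hypothetical)."""
    return [pick_mode(t).name for t in targets]


chosen = pick_modes_for_batch([1e-2, 1e-3, 1e-10])
```

The point of the sketch is that the batched path reuses the single-call selection rule instead of bypassing it, which is the gap the excerpt describes.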

trnfft: the residual must stay FP32

v0.19 ships precision="ozaki_hq": six BF16 matmuls that together reach O(sqrt(N)·u_bf16⁴) ≈ 2e-9 relative error, near-FP64 accuracy on the Tensor Engine. The implementation is a 40-line extension of the v0.18 Ozaki scheme. The algorithm turns entirely on one non-obvious constraint. Getting it wrong silently gives you 1-level accuracy out of a 2-level design, with no error raised.
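A scalar toy shows how a two-level split can silently degrade to one level. It is a sketch of one plausible reading of the title's constraint, not trnfft's code, and it may not be the exact failure the post describes. BF16 is emulated in pure Python by rounding FP32 bit patterns, and the helper names (to_f32, to_bf16, split_fp32_residual, split_bf16_residual) are assumptions made for this example.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to BF16 (round-to-nearest-even); NaN/inf are not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<I".replace("I", "f"), struct.pack("<I", bits))[0]


def split_fp32_residual(x):
    """Two-level split with the residual computed in FP32."""
    hi = to_bf16(x)
    residual = to_f32(x - hi)   # stays FP32 until the final narrowing
    lo = to_bf16(residual)
    return hi, lo


def split_bf16_residual(x):
    """Same split, but the input was narrowed to BF16 before subtracting."""
    x_narrow = to_bf16(x)
    hi = to_bf16(x_narrow)
    lo = to_bf16(x_narrow - hi)  # always 0.0: the second level is empty
    return hi, lo


x = to_f32(1.0 / 3.0)

hi_ok, lo_ok = split_fp32_residual(x)
hi_bad, lo_bad = split_bf16_residual(x)

err_two_level = abs(x - (hi_ok + lo_ok)) / abs(x)
err_one_level = abs(x - (hi_bad + lo_bad)) / abs(x)
```

Both paths run without raising, which is why this kind of mistake shows up only as worse accuracy: the second path returns lo == 0.0 and its error is just the single-level BF16 rounding error.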

trnsparse: what Trainium thinks a sparse matrix is

Block-sparse attention on a systolic array requires rethinking the data structure before touching the kernel. trnsparse v0.6.0 ships forward and backward NKI attention kernels, K-tiling for head_dim > 128, and — after a week fighting NKI 0.3.0's changed API — a simulator CI gate that actually tests the kernels rather than silently substituting PyTorch.

trntensor v0.14.0: four modes, four mocks — completing the precision contract

precision="dd" previously raised NotImplementedError everywhere — including CPU. v0.14.0 lifts the CPU gate with two named mock functions, completing a four-mode precision scaffolding that is now fully testable without hardware. CUDA programmers evaluating Trainium for iterative or mixed-precision workloads now have a CPU-testable spec for all four accumulation strategies.

trnfft: the missing dtype and the 80× cliff

The first working version of trnfft's NKI butterfly kernel passed every correctness test. It was also 80× slower than the PyTorch fallback for batched STFT — a regression so large the benchmark was assumed to be broken. It wasn't. The kernel was calling NKI once per batch row in a Python loop, paying full XLA graph compilation overhead for each row.

That discovery, and the fix, is what Phase 1 is mostly about.
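The shape of the problem, one dispatch per batch row versus one dispatch for the whole batch, can be shown with a stand-in kernel that counts how often it is invoked. Everything here is hypothetical: fake_kernel is a naive pure-Python DFT, not an NKI kernel, and the counter only models the number of dispatches, not XLA compilation cost.

```python
import cmath

launches = 0  # counts simulated kernel dispatches


def fake_kernel(rows):
    """Stand-in for one dispatch: naive DFT of every row it is handed."""
    global launches
    launches += 1
    out = []
    for row in rows:
        n = len(row)
        out.append([
            sum(row[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)
        ])
    return out


def per_row(batch):
    """Anti-pattern: a Python loop that dispatches once per batch row."""
    return [fake_kernel([row])[0] for row in batch]


def batched(batch):
    """Single dispatch covering the whole batch."""
    return fake_kernel(batch)


batch = [[float(r + t) for t in range(8)] for r in range(16)]

launches = 0
slow = per_row(batch)
per_row_launches = launches

launches = 0
fast = batched(batch)
batched_launches = launches
```

Both paths return the same values, so a correctness test cannot tell them apart. Only the dispatch count (16 versus 1 here) differs, which is consistent with a kernel that passes every test and still loses badly on a benchmark.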

trnfft: the FP32 accumulator you didn't know you had

trnfft v0.17 ships two new precision modes — "bf16" and "bf16_refined" — for the DFT-GEMM fast path. The headline numbers: 1.4–1.5× faster than FP32 at N=64–256 on trn1, with near-FP32 accuracy after one correction step. The mechanism is an architectural property of Trainium that was already present in every kernel, just never exploited.
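Why an FP32 accumulator matters for BF16 inputs can be seen in a small emulation. This sketch assumes the property referred to is FP32 accumulation of BF16 products. It does not model the "bf16_refined" correction step, and it is not trnfft or NKI code. BF16 is emulated by rounding FP32 bit patterns, and all helper names are made up for the example.

```python
import math
import struct


def to_f32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to BF16 (round-to-nearest-even); NaN/inf are not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, round_acc):
    """Dot product of BF16-rounded inputs with a chosen accumulator width."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_bf16(x) * to_bf16(y)  # product of two BF16 values fits in FP32
        acc = round_acc(acc + prod)     # accumulator rounded after every add
    return acc


n = 256
a = [1.0 / (i + 1) for i in range(n)]
b = [1.0] * n

# Reference: exact sum of the same BF16-rounded products.
reference = math.fsum(to_bf16(x) * to_bf16(y) for x, y in zip(a, b))

err_fp32_acc = abs(dot(a, b, to_f32) - reference) / abs(reference)
err_bf16_acc = abs(dot(a, b, to_bf16) - reference) / abs(reference)
```

With identical BF16 inputs, the only difference between the two runs is the accumulator width. The BF16 accumulator stops absorbing small terms once the running sum grows, while the FP32 accumulator tracks the reference closely.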