trnsci blog¶

Three tracks: monthly suite digests, bylined technical deep-dives from sub-project maintainers when non-trivial work ships, and occasional thinking pieces about Trainium's place in the accelerator landscape.

April 30, 2026
in Deep dive, trnfft
4 min read

trnfft: the NeuronCore is the unit of parallelism

v0.20 ships multi-NeuronCore batch-split FFT. B transforms dispatched across C NeuronCores gives B/C per core — no shared state, no communication, linear scaling up to B = C. The architecture made this obvious. Getting the dispatch infrastructure right was less so.

April 30, 2026
in Deep dive, trnfft
4 min read

trnfft: what trn1 and trn2 tell us about the Ozaki frontier

The v0.18 and v0.19 posts claimed hardware precision of O(sqrt(N)·u_bf16²) ≈ 1.6e-5 and O(sqrt(N)·u_bf16⁴) ≈ 2e-9 for the Ozaki modes. The trn1 hardware measurement says those numbers are wrong — both modes deliver ~1.7e-3, equivalent to single-pass BF16. trn2 was then tested with the same characterization. The result is identical. Both generations. The conclusion is still not "Ozaki is a dead end" — but the generational gap theory needs revision.

April 29, 2026
in Deep dive, trntensor
6 min read

trntensor v0.16.0: extending the precision contract to multi_einsum

v0.15.0 gave einsum() a target_forward_error= argument — the caller specifies a relative error bound, the library picks the cheapest mode that satisfies it. v0.16.0 extends the same contract to multi_einsum(). A precision API that works for single contractions but silently doesn't apply to batched calls is a leaky abstraction, and this is the patch that closes it.

April 28, 2026
in Deep dive, trnfft
6 min read

trnfft: the residual must stay FP32

v0.19 ships precision="ozaki_hq" — six BF16 matmuls that together reach O(sqrt(N)·u_bf16⁴) ≈ 2e-9 relative error, near-FP64 accuracy on the Tensor Engine. The implementation is a 40-line extension of the v0.18 Ozaki scheme. There is one non-obvious constraint that the algorithm turns entirely on. Getting it wrong gives you 1-level accuracy out of a 2-level design, silently, with no error.

April 27, 2026
in Deep dive, trnsparse
7 min read

trnsparse: what Trainium thinks a sparse matrix is

Block-sparse attention on a systolic array requires rethinking the data structure before touching the kernel. trnsparse v0.6.0 ships forward and backward NKI attention kernels, K-tiling for head_dim > 128, and — after a week fighting NKI 0.3.0's changed API — a simulator CI gate that actually tests the kernels rather than silently substituting PyTorch.

April 27, 2026
in Deep dive, trntensor
7 min read

trntensor v0.15.0: the caller specifies the error; the library picks the mode

The precision scaffolding built across v0.11.0–v0.14.0 gave trntensor four accumulation modes. v0.15.0 gives the caller a single number. The question "which mode?" is now the library's problem, not the caller's.

April 25, 2026
in Deep dive, trnfft
5 min read

trnfft: eight hardware runs, one wrong argument, and the Ozaki scheme

v0.18 ships precision="ozaki" — three BF16 matmuls that together deliver ~1e-5 relative error where a single BF16 matmul gives ~1e-3. The implementation took an afternoon. Getting the benchmark to record a single timing took eight hardware runs and produced a "What didn't work" section that was, frankly, humbling.

April 24, 2026
in Deep dive, trntensor
8 min read

trntensor v0.14.0: four modes, four mocks — completing the precision contract

precision="dd" previously raised NotImplementedError everywhere — including CPU. v0.14.0 lifts the CPU gate with two named mock functions, completing a four-mode precision scaffolding that is now fully testable without hardware. CUDA programmers evaluating Trainium for iterative or mixed-precision workloads now have a CPU-testable spec for all four accumulation strategies.

April 23, 2026
in Deep dive, trnfft
5 min read

trnfft: the missing dtype and the 80× cliff

The first working version of trnfft's NKI butterfly kernel passed every correctness test. It was also 80× slower than the PyTorch fallback for batched STFT — a regression so large the benchmark was assumed to be broken. It wasn't. The kernel was calling NKI once per batch row in a Python loop, paying full XLA graph compilation overhead for each row.

That discovery, and the fix, is what Phase 1 is mostly about.

April 22, 2026
in Deep dive, trnfft
6 min read

trnfft: the FP32 accumulator you didn't know you had

trnfft v0.17 ships two new precision modes — "bf16" and "bf16_refined" — for the DFT-GEMM fast path. The headline numbers: 1.4–1.5× faster than FP32 at N=64–256 on trn1, with near-FP32 accuracy after one correction step. The mechanism is an architectural property of Trainium that was already present in every kernel, just never exploited.