trnsci blog

trnsparse: the attention mask is a BSRMatrix

trnsparse v0.4.2 added a block-sparse attention primitive without writing a new kernel. The mechanism is straightforward in hindsight — BSRMatrix already stores a matrix as 128×128 blocks plus a block-level sparsity pattern, and the post-softmax attention weight matrix with a local-window mask is exactly that structure — but the implication took a moment to land: trnsparse, built as a quantum-chemistry cuSPARSE replacement, already spoke transformer.
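To make the "mask is a block pattern" observation concrete, here is a minimal NumPy sketch. It is not trnsparse code: the function name `local_window_block_pattern` and the sizes used are assumptions for illustration only. It builds a token-level local-window mask and reduces it to the block-level sparsity pattern that a 128×128 block format would store.

```python
import numpy as np


def local_window_block_pattern(seq_len, window, block=128):
    """Return (token_mask, block_pattern) for a local-window attention mask.

    token_mask[i, j] is True when |i - j| <= window.
    block_pattern[I, J] is True when block (I, J) holds at least one True entry.
    """
    if seq_len % block != 0:
        raise ValueError("this sketch assumes seq_len is a multiple of block")
    idx = np.arange(seq_len)
    token_mask = np.abs(idx[:, None] - idx[None, :]) <= window
    nb = seq_len // block
    # Split both axes into (block index, offset within block), then ask
    # whether any entry inside each block is set.
    block_pattern = token_mask.reshape(nb, block, nb, block).any(axis=(1, 3))
    return token_mask, block_pattern


# Hypothetical sizes: 512 tokens, a window of 64 tokens on each side.
token_mask, block_pattern = local_window_block_pattern(seq_len=512, window=64)
stored_blocks = int(block_pattern.sum())
total_blocks = block_pattern.size
```

With these sizes the pattern is block-tridiagonal: 10 of the 16 blocks are stored, and every block further than one step from the diagonal is skipped entirely.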

trnblas: four hypotheses, one profiler trace, and why 1.48× is the correct answer

The previous trnblas post shipped with a 1.48× speedup on the fused MP2 energy reduction and an open admission: the kernel underperforms the 3× RFC target, four hypotheses exist for why, and none had been tested against hardware. One Neuron Profiler 2.0 trace later, the answer is in — Vector Engine at 96.45% active, Tensor Engine at 0.000002%, HBM reads matching the analytical prediction to the byte. The 1.48× ceiling is an exact Amdahl prediction, and the kernel is near-optimal on its own step.
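For readers who want the Amdahl arithmetic spelled out, a small sketch follows. The helper names are hypothetical, and reading the 1.48× figure as the limit of an arbitrarily fast accelerated step is an assumption of this sketch, not something the excerpt states.

```python
def amdahl_speedup(fraction, step_speedup):
    """Overall speedup when `fraction` of the runtime is sped up by `step_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / step_speedup)


def amdahl_ceiling(fraction):
    """Limit of amdahl_speedup as step_speedup grows without bound."""
    return 1.0 / (1.0 - fraction)


def fraction_for_ceiling(ceiling):
    """Runtime fraction that would produce a given overall ceiling."""
    return 1.0 - 1.0 / ceiling


# Under the assumption above, a 1.48x ceiling corresponds to roughly
# a third of the original runtime sitting in the accelerated step.
implied_fraction = fraction_for_ceiling(1.48)
```

Under that reading, `implied_fraction` is about 0.324, and no amount of further tuning of that one step would move the end-to-end number past the ceiling.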

trnrand: RNG is a four-engine workload, if the silicon lets you say so

trnrand 0.3.0 shipped this week with the Philox 4×32-10 counter-based PRNG and the Box-Muller transform targeted at two non-Tensor-Engine resources on Trainium: GpSimd for the integer multiply-XOR rounds, and the Vector Engine for the cos/sin/log/sqrt pairs that turn uniforms into normals. The kernels compile, dispatch, and run the correct Python algorithm end to end. They do not currently produce correct numerical output, for a specific and reproducible reason that traces back to an NKI platform property — not to the kernel design. This is a retrospective about what the four-engine framing does for RNG, what shipped in 0.3.0, and the one integer-primitive gap that stands between the current state and hardware-validated Philox.
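The two stages named above can be sketched in plain Python. This is a host-side reference sketch, not the trnrand kernels: the constants and round structure follow the published Philox 4×32 definition as I understand it, the helper `to_unit_open` is a hypothetical name, and the output should be checked against the Random123 known-answer vectors before being relied on.

```python
import math

MASK32 = 0xFFFFFFFF
M0, M1 = 0xD2511F53, 0xCD9E8D57  # Philox 4x32 round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85  # Weyl increments applied to the key


def philox4x32_10(counter, key):
    """Ten Philox rounds on a 4-word counter with a 2-word key (32-bit words)."""
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for rnd in range(10):
        if rnd > 0:  # the key is bumped before every round except the first
            k0 = (k0 + W0) & MASK32
            k1 = (k1 + W1) & MASK32
        p0, p1 = M0 * c0, M1 * c2  # full 64-bit products
        hi0, lo0 = p0 >> 32, p0 & MASK32
        hi1, lo1 = p1 >> 32, p1 & MASK32
        c0, c1, c2, c3 = hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0
    return c0, c1, c2, c3


def to_unit_open(word):
    """Map a 32-bit word into the open interval (0, 1) so that log() is safe."""
    return (word + 0.5) / 4294967296.0


def box_muller(u1, u2):
    """Turn two uniforms in (0, 1) into two independent standard normals."""
    radius = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return radius * math.cos(theta), radius * math.sin(theta)


words = philox4x32_10(counter=(0, 0, 0, 0), key=(0, 0))
z0, z1 = box_muller(to_unit_open(words[0]), to_unit_open(words[1]))
```

The split is visible in the code: `philox4x32_10` is nothing but integer multiplies, shifts, and XORs, while `box_muller` is all transcendental floating-point work.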

trntensor: when the kernel boundary is the API

trntensor Phase 1 landed. The 2-index and batched nc_matmul NKI kernels validate on trn1. ContractionPlan.backend now reports "nki" when shapes qualify, "pytorch" otherwise — plan-time transparency about where work will actually land. And two fused multi-step primitives — a DF-MP2 correlation-energy kernel and a 4-index AO→MO integral transform — run contract → elementwise → reduce and contract → SBUF-resident → contract as single NKI programs.

The architectural point isn't "einsum on Trainium." It's that the kernel boundary is the design surface: what cuTENSOR hides behind a Plan object, NKI asks you to lay out in source. More work, but also where a tensor library can become a cuTENSOR superset rather than a port.
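As a reference for what a "contract, keep the intermediate, contract again" pipeline computes, here is a NumPy sketch of the 4-index AO→MO transform done one index at a time. It is not trntensor's API; the function name `ao_to_mo` and the random inputs are assumptions for illustration.

```python
import numpy as np


def ao_to_mo(eri_ao, coeff):
    """Transform (pq|rs) in the AO basis to (ij|kl) in the MO basis.

    Each step contracts one AO index with the coefficient matrix; the
    intermediate is what a fused kernel would keep on-chip between steps.
    """
    tmp = np.einsum("pqrs,pi->iqrs", eri_ao, coeff)
    tmp = np.einsum("iqrs,qj->ijrs", tmp, coeff)
    tmp = np.einsum("ijrs,rk->ijks", tmp, coeff)
    return np.einsum("ijks,sl->ijkl", tmp, coeff)


# Hypothetical sizes: 5 AO functions, 3 MOs, random data in place of integrals.
rng = np.random.default_rng(0)
eri_ao = rng.standard_normal((5, 5, 5, 5))
coeff = rng.standard_normal((5, 3))
eri_mo = ao_to_mo(eri_ao, coeff)
```

Doing the transform as four quarter-steps rather than one 8-index sum is the standard way to keep the cost at O(N⁵) instead of O(N⁸).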

The dev loop just got a lot shorter

Until this week, working on an NKI kernel inside any trnsci library looked roughly like this: edit the kernel, push to a branch, wait for the GitHub Actions runner to start a trn1.2xlarge instance over SSM, wait for user-data to finish, wait for the NEFF compile, finally run pytest -m neuron, read the result, stop the instance. Eight to twelve minutes per iteration in the best case. Longer when anything went sideways.

AWS's Neuron SDK 2.29 shipped a month ago with NKI 0.3.0 Stable. The change that matters most isn't in the SDK itself — it's in the fact that the full stack now installs cleanly on ubuntu-latest GitHub runners via AWS's pip index, and the CPU simulator exposed by nki.simulate(kernel)(numpy_args) runs the same kernels device-free. The dev loop dropped from minutes to seconds, and the cost-of-iteration barrier for contributing to a trnsci library just collapsed.

trnblas: fusing DF-MP2 energy reduction into one NKI kernel

trnblas v0.4.0 shipped hardware-validated NKI kernels for GEMM, SYRK, and a fused MP2 energy reduction on trn1. End-to-end density-fitted MP2 matches PySCF to 10 µHa (1×10⁻⁵ Ha) on H₂O, CH₄, and NH₃ at cc-pVDZ. The interesting story isn't the GEMM. It's the fused energy kernel, a single NKI pass that keeps the contraction, the orbital-denominator division, and the scalar sum-reduction SBUF-resident, and the fact that building it looks nothing like a cuBLAS port.
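The three fused stages map onto the textbook closed-shell DF-MP2 energy expression. The NumPy sketch below is a host-side reference, not the trnblas kernel; the function name `df_mp2_energy`, the array layout `B[Q, i, a]`, and the random inputs are assumptions for illustration.

```python
import numpy as np


def df_mp2_energy(B, e_occ, e_vir):
    """Closed-shell DF-MP2 correlation energy from density-fitted factors.

    B[Q, i, a] are the fitted occupied-virtual factors, so that
    (ia|jb) = sum_Q B[Q, i, a] * B[Q, j, b].
    """
    _, nocc, _ = B.shape
    energy = 0.0
    for i in range(nocc):
        for j in range(nocc):
            # Stage 1, contraction: T[a, b] = (ia|jb)
            T = B[:, i, :].T @ B[:, j, :]
            # Stage 2, orbital-energy denominator: e_i + e_j - e_a - e_b
            D = e_occ[i] + e_occ[j] - e_vir[:, None] - e_vir[None, :]
            # Stage 3, scalar reduction; T.T[a, b] = (ib|ja) is the exchange term
            energy += np.sum(T * (2.0 * T - T.T) / D)
    return energy


# Hypothetical sizes and random data standing in for real integrals.
rng = np.random.default_rng(1)
B = rng.standard_normal((7, 3, 4))
e_occ = -1.0 - rng.random(3)  # occupied orbital energies, negative
e_vir = 0.5 + rng.random(4)   # virtual orbital energies, positive
e_corr = df_mp2_energy(B, e_occ, e_vir)
```

In this form the (ia|jb) tile never needs to outlive one (i, j) pair, which is the property that makes holding all three stages on-chip plausible.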

trnfft: FFT on hardware that doesn't want to be an FFT engine

Between v0.7 and v0.12, trnfft's NKI story moved from one per-row butterfly dispatch into a batched butterfly plus a fused DFT-as-GEMM fast path, with opt-in Kahan-compensated precision. All of it is hardware-validated on trn1.2xlarge. What landed on silicon looks very little like cuFFT: no complex dtype, no thread-per-butterfly, no bit-reversal in the fast path. What Trainium's architecture — four programmable engines, a fixed 128-partition × 512-moving tile, explicit SBUF/PSUM memory — suggested was a different decomposition, and this post is the retrospective on what that turned out to be.
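The "DFT as GEMM, no complex dtype" idea can be shown in a few lines of NumPy. This is a sketch of the general technique, not trnfft's kernel; the function name `dft_as_real_gemm` and the batch shape are assumptions for illustration.

```python
import numpy as np


def dft_as_real_gemm(x_re, x_im):
    """Forward DFT over the last axis using only real matrix multiplies.

    With W = C - iS (C = cos, S = sin of 2*pi*k*n/N), the product W @ x
    splits into X_re = C x_re + S x_im and X_im = C x_im - S x_re.
    """
    n = x_re.shape[-1]
    k = np.arange(n)
    # Reduce k*n modulo N before scaling to keep the angles small and accurate.
    angle = 2.0 * np.pi * (np.outer(k, k) % n) / n
    C, S = np.cos(angle), np.sin(angle)
    # C and S are symmetric, so right-multiplying rows equals left-multiplying columns.
    X_re = x_re @ C + x_im @ S
    X_im = x_im @ C - x_re @ S
    return X_re, X_im


# Hypothetical batch: 8 signals of length 16, stored as separate real and imaginary planes.
rng = np.random.default_rng(2)
x_re = rng.standard_normal((8, 16))
x_im = rng.standard_normal((8, 16))
X_re, X_im = dft_as_real_gemm(x_re, x_im)
```

The whole batch goes through four dense real matmuls, with no bit-reversal and no per-butterfly control flow, at the price of O(N²) work per signal.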

trnsolver: Jacobi for Trainium — when the hardware inverts the algorithm choice

Symmetric eigh on Trainium wants Jacobi, not Householder-QR — even though Householder has the better asymptotic FLOP count. The inversion doesn't sit in the arithmetic; it sits in the 128-partition Tensor Engine tile and NKI 0.3.0's per-traced-graph compile model. Phase 1 is the simulator-validated correctness gate; hardware-perf numbers land in Phase 3, once the Tensor Engine reformulation and multi-core sharding are in.
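For reference, the classical cyclic Jacobi method looks like this in NumPy. It is a textbook sketch, not trnsolver's implementation; the function name `jacobi_eigh`, the sweep count, and the tolerance are assumptions for illustration.

```python
import numpy as np


def jacobi_eigh(A, max_sweeps=30, tol=1e-12):
    """Eigenvalues and eigenvectors of a symmetric matrix by cyclic Jacobi sweeps."""
    A = np.array(A, dtype=np.float64, copy=True)
    n = A.shape[0]
    V = np.eye(n)
    for _ in range(max_sweeps):
        off_norm = np.linalg.norm(A - np.diag(np.diag(A)))
        if off_norm < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p, q] == 0.0:
                    continue
                # Pick the rotation angle that zeroes A[p, q].
                tau = (A[q, q] - A[p, p]) / (2.0 * A[p, q])
                root = np.sqrt(1.0 + tau * tau)
                t = 1.0 / (tau + root) if tau >= 0.0 else 1.0 / (tau - root)
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = t * c
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = s, -s
                A = J.T @ A @ J  # similarity transform keeps the spectrum
                V = V @ J        # accumulate eigenvectors
    order = np.argsort(np.diag(A))
    return np.diag(A)[order], V[:, order]


# Hypothetical 6x6 symmetric test matrix.
rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
S = (M + M.T) / 2.0
w, V = jacobi_eigh(S)
```

Every step is a plane rotation expressed as a small dense update, with no data-dependent reflector chain, which is the shape of work that fits a fixed-tile matmul engine.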

trnsparse: the tile is the unit, not the nonzero

trnsparse shipped its first hardware-validated NKI SpMM kernel in v0.2.0, and the benchmark table was publicly worse than scipy across every configuration. That's not a failure — it's the evidence that led to the reframe v0.3.0 ships on: Trainium's sparse primitive isn't CSR, it's the 128×128 Tensor Engine tile. The CUDA sparse playbook is the wrong starting point for this hardware.
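A sketch of what "the tile is the unit" means for SpMM: the loop runs over stored blocks, and each stored block is one dense matmul. This is plain NumPy, not the trnsparse kernel; the helper names `dense_to_bsr` and `bsr_spmm` are hypothetical, and the demo uses a 4×4 block instead of 128×128 to stay small.

```python
import numpy as np


def dense_to_bsr(A, block):
    """Split A into block x block tiles and keep only tiles with a nonzero entry."""
    n_block_rows, n_block_cols = A.shape[0] // block, A.shape[1] // block
    indptr, indices, tiles = [0], [], []
    for I in range(n_block_rows):
        for J in range(n_block_cols):
            tile = A[I * block:(I + 1) * block, J * block:(J + 1) * block]
            if np.any(tile):
                indices.append(J)
                tiles.append(tile.copy())
        indptr.append(len(indices))
    return indptr, indices, tiles


def bsr_spmm(indptr, indices, tiles, X, block):
    """Y = A @ X where A is given as block rows of dense tiles."""
    n_block_rows = len(indptr) - 1
    Y = np.zeros((n_block_rows * block, X.shape[1]), dtype=X.dtype)
    for I in range(n_block_rows):
        for slot in range(indptr[I], indptr[I + 1]):
            J = indices[slot]
            # One dense tile matmul per stored block; empty blocks cost nothing.
            Y[I * block:(I + 1) * block] += tiles[slot] @ X[J * block:(J + 1) * block]
    return Y


# Hypothetical 16x16 matrix with only two populated 4x4 tiles.
rng = np.random.default_rng(4)
A = np.zeros((16, 16))
A[0:4, 4:8] = rng.standard_normal((4, 4))
A[8:12, 8:12] = rng.standard_normal((4, 4))
X = rng.standard_normal((16, 5))
indptr, indices, tiles = dense_to_bsr(A, block=4)
Y = bsr_spmm(indptr, indices, tiles, X, block=4)
```

Zeros inside a stored tile are multiplied anyway. That is the trade this layout makes: some wasted arithmetic in exchange for work that arrives in fixed dense tiles.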

Hello trnsci

The trnsci scientific computing suite for AWS Trainium is public. Six libraries covering the CUDA cu* equivalents the Neuron SDK ships without, a coordinating meta-package, full docs, seven PyPI packages, a conda-forge submission in review, and a five-phase roadmap from current alpha to generation-tuned stable. This is the first post of a blog series that will tell the project's story as it unfolds.