trnblas

trnblas Phase 3: from 215× slower to 3.6× faster in one kernel boundary move

The Phase 2 profiler post closed with an unexpected conclusion: the fused MP2 energy kernel hits its Amdahl ceiling, the remaining gap to the 3× target lives entirely in the step that surrounds it, and the only lever left is the kernel boundary. Phase 3 moved the boundary. The result is a 3.6× end-to-end speedup over the torch baseline at the small bench shape — the first energy path that actually beats chunk-GEMM.

trnblas: four hypotheses, one profiler trace, and why 1.48× is the correct answer

The previous trnblas post shipped with a 1.48× speedup on the fused MP2 energy reduction and an open admission: the kernel underperforms the 3× RFC target, four hypotheses exist for why, and none had been tested against hardware. One Neuron Profiler 2.0 trace later, the answer is in — Vector Engine at 96.45% active, Tensor Engine at 0.000002%, HBM reads matching the analytical prediction to the byte. The 1.48× ceiling is an exact Amdahl prediction, and the kernel is near-optimal on its own step.
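As a rough illustration of what "an exact Amdahl prediction" means, the sketch below evaluates Amdahl's law and back-solves the runtime fraction that a 1.48× ceiling would imply. It assumes the ceiling is the limit of an infinitely fast accelerated step; the resulting fraction is derived here for illustration and is not a figure taken from the profiler trace. The function names are hypothetical.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the runtime is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


def fraction_from_ceiling(ceiling: float) -> float:
    """Fraction p implied by a speedup ceiling, assuming s -> infinity (ceiling = 1 / (1 - p))."""
    return 1.0 - 1.0 / ceiling


# Assumption for illustration only: treat 1.48x as the s -> infinity ceiling.
p = fraction_from_ceiling(1.48)
print(f"implied accelerated fraction: {p:.3f}")
print(f"speedup if that step ran 10x faster: {amdahl_speedup(p, 10.0):.3f}")
```

Under that assumption, no amount of further tuning inside the accelerated step moves the end-to-end number past the ceiling, which is why the remaining lever is what the step covers rather than how fast it runs.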

trnblas: fusing DF-MP2 energy reduction into one NKI kernel

trnblas v0.4.0 shipped hardware-validated NKI kernels for GEMM, SYRK, and a fused MP2 energy reduction on trn1. End-to-end density-fitted MP2 matches PySCF to 10 µHa (1×10⁻⁵ Ha) on H₂O, CH₄, and NH₃ at cc-pVDZ. The interesting story isn't the GEMM. It's the fused energy kernel: a single NKI pass that keeps the contraction, the orbital-denominator division, and the scalar sum-reduction SBUF-resident, built in a way that looks nothing like a cuBLAS port.
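For readers who want the three fused steps spelled out, here is a plain NumPy reference for the closed-shell DF-MP2 correlation energy. It is a sketch of the textbook math, not the NKI kernel or the trnblas API; the function name, the argument layout `(naux, nocc, nvir)`, and the per-pair loop are assumptions made for illustration.

```python
import numpy as np


def df_mp2_energy(B: np.ndarray, e_occ: np.ndarray, e_vir: np.ndarray) -> float:
    """Closed-shell DF-MP2 correlation energy.

    B      : (naux, nocc, nvir) density-fitting factors, (ia|jb) ~ sum_Q B[Q,i,a] * B[Q,j,b]
    e_occ  : (nocc,) occupied orbital energies
    e_vir  : (nvir,) virtual orbital energies
    """
    nocc = e_occ.shape[0]
    energy = 0.0
    for i in range(nocc):
        for j in range(nocc):
            # Step 1, contraction: (ia|jb) for this (i, j) pair, indexed [a, b].
            iajb = B[:, i, :].T @ B[:, j, :]
            # Step 2, orbital-denominator division: e_i + e_j - e_a - e_b.
            denom = e_occ[i] + e_occ[j] - e_vir[:, None] - e_vir[None, :]
            # Step 3, scalar sum-reduction; iajb.T supplies the exchange term (ib|ja).
            energy += np.sum(iajb * (2.0 * iajb - iajb.T) / denom)
    return float(energy)
```

In this reference version each step materializes an intermediate array in host memory; the point of the fused kernel described in the post is that none of those intermediates leave on-chip memory.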