trnrand

April 16, 2026
in Deep dive, trnrand
8 min read

trnrand: the integer-multiply gap pointed to a better algorithm

The previous trnrand post closed with: "the silicon just needs one more op to let the library say it out loud." aws-neuron-sdk#1308 is still open. trnrand 0.4.0 ships hardware-validated uniform RNG on trn1 anyway — not by fixing Philox, but by using Threefry4x32-20, the PRNG Salmon et al. designed in the same SC'11 paper for hardware without fast integer multiply. The library said it out loud without waiting for the op.
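To make the multiply gap concrete, here is a minimal Python sketch contrasting the two kinds of round primitive. It is an illustration under stated assumptions, not trnrand's kernel code: the helper names (`philox_mulhilo`, `rotl32`, `threefry_mix`) are hypothetical, the multipliers are the ones used by the Random123 reference implementation of Philox4x32, and the rotation amount is left as a parameter rather than taken from the published schedule.

```python
MASK32 = 0xFFFFFFFF

# Philox4x32 round multipliers, as in the Random123 reference implementation.
PHILOX_M0 = 0xD2511F53
PHILOX_M1 = 0xCD9E8D57


def philox_mulhilo(a: int, b: int) -> tuple[int, int]:
    """Full 32x32 -> 64-bit product, split into (high, low) 32-bit words.

    This is the step that needs a wide integer multiply on the target.
    """
    product = (a & MASK32) * (b & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits (0 < r < 32)."""
    x &= MASK32
    return ((x << r) | (x >> (32 - r))) & MASK32


def threefry_mix(x0: int, x1: int, r: int) -> tuple[int, int]:
    """Threefry-style MIX step: 32-bit add, rotate, XOR. No multiply."""
    x0 = (x0 + x1) & MASK32
    x1 = rotl32(x1, r) ^ x0
    return x0, x1
```

A full Threefry4x32-20 applies this kind of MIX step across 20 rounds with a fixed rotation schedule, word permutations, and periodic key injection, all of which are omitted here. The point of the sketch is only that nothing in the mix step requires a 32×32→64 product.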

April 15, 2026
in Deep dive, trnrand
9 min read

trnrand: RNG is a four-engine workload, if the silicon lets you say so

trnrand 0.3.0 shipped this week with the Philox 4×32-10 counter-based PRNG and the Box-Muller transform targeted at two non-Tensor-Engine resources on Trainium: GpSimd for the integer multiply-XOR rounds, and the Vector Engine for the cos/sin/log/sqrt pairs that turn uniforms into normals. The kernels compile, dispatch, and run the correct Python algorithm end to end. They do not currently produce correct numerical output, for a specific and reproducible reason that traces back to an NKI platform property — not to the kernel design. This is a retrospective about what the four-engine framing does for RNG, what shipped in 0.3.0, and the one integer-primitive gap that stands between the current state and hardware-validated Philox.
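The Box-Muller step mentioned above can be sketched in plain Python. This is a host-side reference under stated assumptions, not the Vector Engine kernel: the function names are hypothetical, and the mapping from a 32-bit PRNG word to a float in (0, 1] is one common convention chosen here so that the logarithm never sees zero.

```python
import math


def uint32_to_unit_interval(x: int) -> float:
    """Map a 32-bit integer to a float in (0, 1] (assumed convention)."""
    return (x + 1) * 2.0 ** -32


def box_muller(u1: float, u2: float) -> tuple[float, float]:
    """Turn two uniforms into two independent standard normals.

    u1 must lie in (0, 1] so that log(u1) is finite; u2 lies in [0, 1).
    """
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)
```

Each pair of uniforms costs one log, one sqrt, one cos, and one sin, which is the transcendental workload the post assigns to the Vector Engine.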