trnrand

trnrand: the integer-multiply gap pointed to a better algorithm

The previous trnrand post closed with: "the silicon just needs one more op to let the library say it out loud." aws-neuron-sdk#1308 is still open. trnrand 0.4.0 ships hardware-validated uniform RNG on trn1 anyway, not by fixing Philox but by switching to Threefry4x32-20, the PRNG that Salmon et al. introduced alongside Philox in their SC'11 paper for hardware without fast integer multiply. The library said it out loud without waiting for the op.
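
To make the multiply gap concrete, here is a minimal Python sketch of the two primitive shapes involved: a Threefry-style mix step built only from 32-bit add, rotate, and XOR, and the 32×32→64-bit multiply that each Philox round relies on. The function names are hypothetical and are not trnrand's API, and the rotation amount used in the usage lines is an arbitrary illustrative value rather than one of the published Threefry constants.

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value inside 32 bits


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits (0 < r < 32)."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def arx_mix(x0: int, x1: int, r: int) -> tuple[int, int]:
    """Threefry-style mix on a pair of words: add, rotate, XOR. No multiply."""
    x0 = (x0 + x1) & MASK32   # wrapping 32-bit add
    x1 = rotl32(x1, r) ^ x0   # rotate, then XOR with the new sum
    return x0, x1


def mulhilo32(a: int, b: int) -> tuple[int, int]:
    """Philox-style primitive: full 64-bit product, split into (high, low) words."""
    p = a * b
    return (p >> 32) & MASK32, p & MASK32


# Usage: one mix step with an illustrative rotation of 10 bits.
print(arx_mix(1, 2, 10))
# Usage: the widening multiply that needs hardware support to be fast.
print(mulhilo32(0xFFFFFFFF, 0xFFFFFFFF))
```

The point of the contrast is that `arx_mix` needs nothing wider than a 32-bit adder, while `mulhilo32` needs the high half of a 64-bit product, which is the kind of operation the excerpt describes as missing.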

trnrand: RNG is a four-engine workload, if the silicon lets you say so

trnrand 0.3.0 shipped this week with the Philox 4×32-10 counter-based PRNG and the Box-Muller transform targeted at two non-Tensor-Engine resources on Trainium: GpSimd for the integer multiply-XOR rounds, and the Vector Engine for the cos/sin/log/sqrt pairs that turn uniforms into normals. The kernels compile, dispatch, and run the correct Python algorithm end to end. They do not yet produce correct numerical output, for a specific and reproducible reason that traces back to an NKI platform property, not to the kernel design. This is a retrospective about what the four-engine framing does for RNG, what shipped in 0.3.0, and the one integer-primitive gap that stands between the current state and hardware-validated Philox.
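
For readers who have not seen the second stage, the following is a minimal host-side Python sketch of the Box-Muller transform, showing where the cos/sin/log/sqrt calls come from. It is not the trnrand kernel: the function names are hypothetical, and the mapping from a 32-bit integer to a uniform in (0, 1] is one common convention assumed here so that `log` never sees zero.

```python
import math


def u32_to_unit(x: int) -> float:
    """Map a 32-bit integer to a float in (0, 1] (assumed convention)."""
    return (x + 1) * 2.0 ** -32


def box_muller(u1: float, u2: float) -> tuple[float, float]:
    """Turn two uniforms (u1 in (0, 1], u2 in [0, 1)) into two standard normals."""
    r = math.sqrt(-2.0 * math.log(u1))  # radius from the first uniform
    theta = 2.0 * math.pi * u2          # angle from the second uniform
    return r * math.cos(theta), r * math.sin(theta)


# Usage: two raw 32-bit words in, one pair of normal samples out.
z0, z1 = box_muller(u32_to_unit(0x12345678), u32_to_unit(0x9ABCDEF0))
print(z0, z1)
```

Each output pair costs one `log`, one `sqrt`, one `cos`, and one `sin`, all floating-point work, which is why this stage can sit on a different engine from the integer rounds.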