Architecture¶
Layout¶
```
trnrand/
├── trnrand/
│   ├── __init__.py       # Re-exports all RNG operations
│   ├── generator.py      # Generator class, seeding, state management
│   ├── distributions.py  # uniform, normal, exponential, bernoulli, etc.
│   ├── quasi.py          # sobol, halton, latin_hypercube
│   └── nki/
│       ├── __init__.py   # Backend dispatch (set_backend / HAS_NKI)
│       └── dispatch.py   # Philox kernel scaffold for on-device RNG
├── tests/
├── examples/mc_integration.py
└── benchmarks/
```
Use cases across the suite¶
| Use case | trnrand function | Consumer |
|---|---|---|
| Noise injection (speech training) | `normal()` | trnfft |
| Stochastic trace estimation | `normal()`, `sobol()` | trnsolver |
| Weight initialization | `truncated_normal()` | trnfft/nn.py |
| Monte Carlo integration | `sobol()`, `halton()` | trnblas (DF-MP2) |
| Hyperparameter sweeps | `sobol()` | Ablation studies |
| Data augmentation | `uniform()`, `bernoulli()` | General |
| Bayesian inference priors | `gamma()`, `beta()` | General scientific |
| Event-count / queuing simulation | `poisson()` | General scientific |
| Variance / goodness-of-fit tests | `chi_squared()` | General scientific |
NKI strategy¶
The Philox 4×32 counter-based RNG maps cleanly to Trainium:
- GpSimd engine runs the integer multiply-XOR rounds (the Tensor Engine would be wasted on this).
- Parallel generation: each tile gets a disjoint counter range; no cross-tile coordination is required.
- Deterministic: `(counter, key) → output`; no state to synchronize across cores.
Philox is preferred over Mersenne Twister precisely because it's stateless and trivially parallelizable. It's the same engine used by cuRAND and JAX.
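For reference, the counter-based structure can be sketched in pure Python. This is an illustrative host-side model of Philox 4×32-10 (constants from Salmon et al., SC'11), not the NKI kernel:

```python
# Host-side sketch of Philox 4x32-10 (Salmon et al., SC'11). Illustrates
# the counter-based design; the NKI kernel implements the same rounds.

PHILOX_M0, PHILOX_M1 = 0xD2511F53, 0xCD9E8D57  # round multipliers
PHILOX_W0, PHILOX_W1 = 0x9E3779B9, 0xBB67AE85  # Weyl key increments
MASK32 = 0xFFFFFFFF

def _mulhilo(a, b):
    """Full 32x32 -> 64-bit multiply, returned as (high, low) words."""
    p = a * b
    return (p >> 32) & MASK32, p & MASK32

def philox4x32(counter, key, rounds=10):
    """One 128-bit output block from a 4x32 counter and 2x32 key."""
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for _ in range(rounds):
        hi0, lo0 = _mulhilo(PHILOX_M0, c0)
        hi1, lo1 = _mulhilo(PHILOX_M1, c2)
        # Multiply-XOR round: permute words, mix in the key.
        c0, c1, c2, c3 = (hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0)
        k0 = (k0 + PHILOX_W0) & MASK32  # Weyl sequence bumps the key
        k1 = (k1 + PHILOX_W1) & MASK32
    return (c0, c1, c2, c3)

# Stateless: the same (counter, key) always yields the same block, so
# each tile can own a disjoint counter range with no coordination.
block_a = philox4x32((0, 0, 0, 0), (0, 0))
block_b = philox4x32((1, 0, 0, 0), (0, 0))
assert block_a == philox4x32((0, 0, 0, 0), (0, 0))  # deterministic
assert block_a != block_b                           # disjoint counters differ
```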
Box-Muller for normal()¶
The on-device normal path is a Box-Muller transform layered on the Philox uniform stream:
- Pairs of uniforms `(u1, u2)` map to standard-normal pairs `(z1, z2)` via `r = √(-2 ln u1)`, `θ = 2π u2`, `z1 = r cos θ`, `z2 = r sin θ`.
- Runs on the Vector Engine, which has hardware `cos`/`sin`/`log`/`sqrt`.
- Box-Muller is preferred over Marsaglia polar here: Marsaglia avoids the trig calls but uses rejection sampling, which serializes branch-divergent lanes and kills SIMD throughput. Box-Muller does constant work per pair.
Known gaps¶
- NKI Philox and Box-Muller kernels await validation. Both landed in
v0.1.0 with CPU conformance oracles (the three canonical Salmon et al.
SC'11 test vectors pass in
`tests/test_nki_philox.py::TestPhiloxReference::test_spec_vectors`). Tracked on the v0.3.0 milestone (#1 Philox, #2 Box-Muller, #18 Phase 1).
Phase 1 iteration is no longer hardware-gated. Neuron SDK 2.29
(April 2026) brought NKI 0.3.0 Stable with a CPU simulator
(`nki.simulate_kernel`); the test-simulator CI job exercises kernel
correctness on ubuntu-latest without a trn1 instance. Use
`pytest tests/ -v -m simulator` as the inner loop for NKI kernel
development — seconds per iteration, no AWS round-trip. Hardware runs
via `scripts/run_neuron_tests.sh` are reserved for final real-device
validation, not inner-loop iteration.
- Halton degrades above ~20 dimensions — known algorithmic limitation.
Sobol is preferred for d > 10.
- Quasi-random sequences are host-only. NKI scrambling for
Sobol/Halton is scoped for v0.3.0 (#11, #12).
- FP32 throughout. BF16 / FP16 output paths and multi-NeuronCore
sharding are v0.4+ candidates (see roadmap).
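To illustrate the Halton construction behind the dimension note above: each coordinate is a van der Corput radical inverse in a distinct prime base, and for large primes (i.e. high dimensions) the low-index points become strongly correlated, which is why Sobol is preferred for d > 10. A pure-Python sketch of the construction (not trnrand's implementation):

```python
def radical_inverse(n, base):
    """Van der Corput radical inverse: mirror the base-`base` digits
    of n across the radix point, e.g. 6 = 110_2 -> 0.011_2 = 0.375."""
    inv, f = 0.0, 1.0 / base
    while n > 0:
        n, digit = divmod(n, base)
        inv += digit * f
        f /= base
    return inv

PRIMES = [2, 3, 5, 7, 11, 13]  # one prime base per dimension

def halton(n_points, dim):
    """First n_points of the Halton sequence in [0, 1)^dim
    (this sketch supports dim <= len(PRIMES))."""
    return [[radical_inverse(i, PRIMES[d]) for d in range(dim)]
            for i in range(1, n_points + 1)]

pts = halton(8, 2)
# The base-2 coordinate follows the van der Corput ordering.
assert [p[0] for p in pts[:4]] == [0.5, 0.25, 0.75, 0.125]
```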