Architecture

Layout

trnrand/
├── trnrand/
│   ├── __init__.py        # Re-exports all RNG operations
│   ├── generator.py       # Generator class, seeding, state management
│   ├── distributions.py   # uniform, normal, exponential, bernoulli, etc.
│   ├── quasi.py           # sobol, halton, latin_hypercube
│   └── nki/
│       ├── __init__.py    # Backend dispatch (set_backend / HAS_NKI)
│       └── dispatch.py    # Philox kernel scaffold for on-device RNG
├── tests/
├── examples/mc_integration.py
└── benchmarks/

Use cases across the suite

Use case                           trnrand function        Consumer
---------------------------------  ----------------------  ------------------
Noise injection (speech training)  normal()                trnfft
Stochastic trace estimation        normal(), sobol()       trnsolver
Weight initialization              truncated_normal()      trnfft/nn.py
Monte Carlo integration            sobol(), halton()       trnblas (DF-MP2)
Hyperparameter sweeps              sobol()                 Ablation studies
Data augmentation                  uniform(), bernoulli()  General
Bayesian inference priors          gamma(), beta()         General scientific
Event-count / queuing simulation   poisson()               General scientific
Variance / goodness-of-fit tests   chi_squared()           General scientific

NKI strategy

The Philox 4×32 counter-based RNG maps cleanly to Trainium:

  • GpSimd engine runs the integer multiply-XOR rounds (the Tensor Engine is wasted on this).
  • Parallel generation: each tile gets a disjoint counter range, no cross-tile coordination required.
  • Deterministic: (counter, key) → output — no state to synchronize across cores.

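Philox is preferred over Mersenne Twister because it is counter-based: there is no sequential state to advance, so generation parallelizes trivially. cuRAND ships the same generator as Philox4_32_10, and JAX's default Threefry generator belongs to the same family of counter-based RNGs from Salmon et al. (SC'11).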
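As a rough illustration of the counter-based property described above, the following NumPy sketch runs the Philox4x32-10 rounds from Salmon et al. on the host. It is not the NKI kernel in trnrand/nki/dispatch.py. The function names (philox4x32_10, tile_counters) and the mapping from tile index to counter words are assumptions made for this example only.

```python
import numpy as np

M0, M1 = 0xD2511F53, 0xCD9E8D57   # Philox4x32 round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85   # Weyl increments for the key schedule
MASK32 = 0xFFFFFFFF


def _mulhilo(m, x):
    """Return the high and low 32-bit halves of the 64-bit product m * x."""
    prod = x.astype(np.uint64) * np.uint64(m)
    hi = (prod >> np.uint64(32)).astype(np.uint32)
    lo = (prod & np.uint64(MASK32)).astype(np.uint32)
    return hi, lo


def philox4x32_10(counters, key):
    """Map an (n, 4) array of uint32 counters to (n, 4) uint32 outputs.

    `key` is a pair of Python ints below 2**32. Each row is processed
    independently: the output depends only on (counter, key).
    """
    c = np.asarray(counters, dtype=np.uint32).reshape(-1, 4)
    c0, c1, c2, c3 = (c[:, i].copy() for i in range(4))
    k0, k1 = key
    for r in range(10):
        if r > 0:  # bump the key before every round except the first
            k0 = (k0 + W0) & MASK32
            k1 = (k1 + W1) & MASK32
        hi0, lo0 = _mulhilo(M0, c0)
        hi1, lo1 = _mulhilo(M1, c2)
        c0, c1, c2, c3 = (hi1 ^ c1 ^ np.uint32(k0), lo1,
                          hi0 ^ c3 ^ np.uint32(k1), lo0)
    return np.stack([c0, c1, c2, c3], axis=1)


def tile_counters(tile_index, tile_size):
    """Hypothetical tiling: tile t owns counters [t*size, (t+1)*size)."""
    idx = np.arange(tile_size, dtype=np.uint64) + np.uint64(tile_index * tile_size)
    c = np.zeros((tile_size, 4), dtype=np.uint32)
    c[:, 0] = (idx & np.uint64(MASK32)).astype(np.uint32)  # low counter word
    c[:, 1] = (idx >> np.uint64(32)).astype(np.uint32)     # high counter word
    return c
```

Because each tile derives its counters from its own index, two tiles generated separately concatenate to the same stream as one larger tile generated in a single call, which is the property that removes the need for cross-tile coordination.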

Box-Muller for normal()

The on-device normal path is a Box-Muller transform layered on the Philox uniform stream:

  • Pairs of uniforms (u1, u2) → standard-normal pairs (z1, z2) via r = √(-2 ln u1), θ = 2π u2, z1 = r cos θ, z2 = r sin θ.
  • Runs on the Vector Engine, which has hardware cos/sin/log/sqrt.
  • Box-Muller is preferred over Marsaglia polar here: Marsaglia avoids the trig calls but uses rejection sampling, which serializes branch-divergent lanes and kills SIMD throughput. Box-Muller has constant work per pair.
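The transform above can be sketched on the host in a few lines of NumPy. This is a minimal FP32 illustration, not the Vector Engine kernel. The helper names (bits_to_unit_interval, box_muller) are hypothetical, and NumPy's default generator stands in for the Philox uniform stream. Mapping the raw 32-bit words to (0, 1] rather than [0, 1) is an assumption made here so that ln u1 stays finite.

```python
import numpy as np


def bits_to_unit_interval(bits):
    """Map uint32 words to float32 uniforms in (0, 1] so log(u) is finite."""
    return (bits.astype(np.float32) + np.float32(1.0)) * np.float32(2.0 ** -32)


def box_muller(u1, u2):
    """Turn pairs of uniforms into pairs of standard normals.

    Every pair costs one log, one sqrt, one cos and one sin; there is no
    rejection loop, so the work per pair is constant.
    """
    r = np.sqrt(np.float32(-2.0) * np.log(u1))
    theta = np.float32(2.0 * np.pi) * u2
    return r * np.cos(theta), r * np.sin(theta)


# Stand-in uniform source for the example (assumption: any uint32 stream works).
rng = np.random.default_rng(0)
bits = rng.integers(0, 2**32, size=(2, 100_000), dtype=np.uint32)
u1 = bits_to_unit_interval(bits[0])
u2 = bits_to_unit_interval(bits[1])
z1, z2 = box_muller(u1, u2)
```

With a large sample, the mean and standard deviation of z1 and z2 should land close to 0 and 1. By contrast, the Marsaglia polar method would discard roughly 1 - π/4 of candidate pairs, which is where the lane divergence comes from.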

Known gaps

  • NKI Philox and Box-Muller kernels await validation. Both landed in v0.1.0 with CPU conformance oracles: the three canonical Salmon et al. SC'11 test vectors pass in tests/test_nki_philox.py::TestPhiloxReference::test_spec_vectors. Validation is tracked on the v0.3.0 milestone (#1 Philox, #2 Box-Muller, #18 Phase 1).

    Phase 1 iteration is no longer hardware-gated. Neuron SDK 2.29 (April 2026) brought NKI 0.3.0 Stable with a CPU simulator (nki.simulate_kernel); the test-simulator CI job exercises kernel correctness on ubuntu-latest without a trn1 instance. Use pytest tests/ -v -m simulator as the inner loop for NKI kernel development: seconds per iteration, no AWS round-trip. Hardware runs via scripts/run_neuron_tests.sh are reserved for final real-device validation, not inner-loop iteration.

  • Halton degrades above ~20 dimensions, a known algorithmic limitation. Sobol is preferred for d > 10.
  • Quasi-random sequences are host-only. NKI scrambling for Sobol/Halton is scoped for v0.3.0 (#11, #12).
  • FP32 throughout. BF16 / FP16 output paths and multi-NeuronCore sharding are v0.4+ candidates (see roadmap).
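The Halton limitation listed above can be seen in a small host-side sketch. It assumes an unscrambled Halton sequence with the i-th prime as the base for dimension i, which may differ from what quasi.py does. The helpers (radical_inverse, halton_column) are hypothetical names for this example. For large bases the first points are simply i/base, so neighbouring high dimensions move almost in lockstep.

```python
import numpy as np


def radical_inverse(i, base):
    """Van der Corput radical inverse of the integer i in the given base."""
    inv, f = 0.0, 1.0 / base
    while i > 0:
        i, digit = divmod(i, base)
        inv += digit * f
        f /= base
    return inv


def halton_column(n, base):
    """First n points (indices 1..n) of one Halton dimension."""
    return np.array([radical_inverse(i, base) for i in range(1, n + 1)])


n = 64
# Dimensions 1 and 2 use bases 2 and 3; dimensions 19 and 20 use 67 and 71.
low = np.corrcoef(halton_column(n, 2), halton_column(n, 3))[0, 1]
high = np.corrcoef(halton_column(n, 67), halton_column(n, 71))[0, 1]
print(f"corr(dim 1, dim 2)   = {low:+.3f}")
print(f"corr(dim 19, dim 20) = {high:+.3f}")
```

With only 64 points, both high-dimension columns are still on their first pass through i/base, so their correlation is essentially perfect. This is the kind of structure that scrambling, or switching to Sobol, is meant to break up.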