Roadmap
Forward-looking plan for trnrand. Work is tracked via GitHub milestones; this page is a browsable summary.
v0.1.0 — Foundation (shipped 2026-04-12)
Generator class with manual_seed / get_state / set_state, module-level
default generator, standard distributions (uniform, normal, exponential,
bernoulli, randint, randperm, truncated_normal), quasi-random sequences
(sobol, halton, latin_hypercube), Philox 4×32 NKI stub, Box-Muller NKI
stub, MC/QMC hypersphere-volume example.
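The idea behind the MC/QMC hypersphere-volume example can be sketched in plain Python. This is not the shipped example and does not call the trnrand API; the helper names (`radical_inverse`, `halton_point`, `ball_volume_qmc`) are hypothetical. The sketch builds a Halton sequence from the van der Corput radical inverse and estimates the volume of the unit 3-ball by counting the points that land inside it.

```python
import math


def radical_inverse(index: int, base: int) -> float:
    """Van der Corput radical inverse of `index` in the given base."""
    result, scale = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * scale
        scale /= base
    return result


def halton_point(index: int, bases: tuple) -> tuple:
    """One Halton point: one radical inverse per (coprime) base."""
    return tuple(radical_inverse(index, b) for b in bases)


def ball_volume_qmc(n_points: int, dim: int = 3) -> float:
    """Estimate the volume of the unit dim-ball with a Halton sequence."""
    bases = (2, 3, 5, 7, 11, 13)[:dim]
    inside = 0
    for i in range(1, n_points + 1):  # index 0 is the origin, so skip it
        point = halton_point(i, bases)
        # Map [0, 1)^dim onto the enclosing cube [-1, 1)^dim.
        if sum((2.0 * u - 1.0) ** 2 for u in point) <= 1.0:
            inside += 1
    # Fraction inside the ball times the cube volume 2^dim.
    return (2.0 ** dim) * inside / n_points


estimate = ball_volume_qmc(4096)
exact = 4.0 * math.pi / 3.0
print(f"QMC estimate {estimate:.4f} vs exact {exact:.4f}")
```

Swapping `halton_point` for pseudo-random uniforms turns the same loop into the plain Monte Carlo variant, which is the comparison the example is meant to illustrate.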
v0.1.1 — Docs and housekeeping (shipped 2026-04-13)
Docs badge and mkdocs.yml site_url pointing to trnsci.dev/trnrand/,
repo homepageUrl updated, ROADMAP.md skeleton.
v0.2.0 — CPU distributions (shipped 2026-04-13)
Gamma, chi-squared, beta, Poisson CPU implementations; all generator-aware for reproducibility. CPU path works today; NKI acceleration deferred.
v0.3.0 — NKI simulator + Philox byte-tile arithmetic (shipped 2026-04-15)
NKI 0.3.0 simulator dispatch via TRNRAND_USE_SIMULATOR=1 — contributors
can iterate kernel correctness on any Linux x86_64 host without hardware
dispatch cost. Philox 4×32 32-bit multiply reworked to 8-bit byte
decomposition (16 sub-products, byte-wise carry chain), validated bit-exact
against Python unbounded-int ground truth. nki-simulator CI job on
ubuntu-latest.
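The byte decomposition described above can be illustrated on the host in plain Python. This is a minimal sketch, not the NKI kernel: the function names are hypothetical, and the real implementation works on tiles rather than Python ints. Each 32-bit operand is split into four bytes, the 16 byte-by-byte sub-products are accumulated per output column, and a byte-wise carry chain produces the 64-bit result, which is then checked against Python's unbounded-int multiply.

```python
def to_bytes_le(x: int) -> list:
    """Split a 32-bit word into four little-endian bytes."""
    return [(x >> (8 * i)) & 0xFF for i in range(4)]


def mulhilo32_bytes(a: int, b: int) -> tuple:
    """32x32 -> 64-bit multiply using only 8-bit sub-products.

    Returns (hi, lo), the upper and lower 32 bits of a * b.
    """
    a_bytes, b_bytes = to_bytes_le(a), to_bytes_le(b)

    # 16 sub-products, each at most 255 * 255, summed per output byte column.
    # A column holds at most four of them, so it fits easily in 32 bits.
    columns = [0] * 8
    for i in range(4):
        for j in range(4):
            columns[i + j] += a_bytes[i] * b_bytes[j]

    # Byte-wise carry chain: keep the low 8 bits, push the rest upward.
    out_bytes, carry = [], 0
    for k in range(8):
        total = columns[k] + carry
        out_bytes.append(total & 0xFF)
        carry = total >> 8

    lo = sum(out_bytes[k] << (8 * k) for k in range(4))
    hi = sum(out_bytes[k + 4] << (8 * k) for k in range(4))
    return hi, lo


# Ground truth: Python's arbitrary-precision integer multiply.
a, b = 0xD2511F53, 0x12345678  # 0xD2511F53 is a Philox4x32 multiplier constant
full = a * b
assert mulhilo32_bytes(a, b) == (full >> 32, full & 0xFFFFFFFF)
```

Because every intermediate value stays well below 2^32, the same arithmetic can in principle be expressed on hardware that lacks a native 32-bit integer multiply.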
v0.4.0 — Threefry4×32-20 NKI kernel (shipped 2026-04-18)
Integer-multiply-free PRNG implemented entirely in GpSimd byte-tile arithmetic,
sidestepping aws-neuron-sdk#1308
at the algorithm level. Fused threefry_normal_kernel routes GpSimd output
directly into Vector Engine transcendentals with tiles SBUF-resident — no HBM
round-trip between RNG and Box-Muller transform. Three categories of
trn1-compiler-rejected constructs eliminated from the kernel call trees. trn1
hardware validation: 4/5 TestThreefryNKI pass; Threefry normal kernel blocked
by NCC_IBIR605 (trn1-only). Self-contained trn2 Terraform root; TMPDIR fix.
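Two building blocks of this milestone can be sketched on the host in plain Python. Both functions below are illustrative and hypothetical in name; they are not the NKI kernels. `threefry_mix` shows a single Threefry-style MIX step (add, rotate, xor), which is why the generator needs no integer multiply; the rotation amount is chosen for illustration, and the full Threefry4×32-20 round and key schedule are omitted. `box_muller` shows the transform that turns two 32-bit words into two standard normal samples, assuming a simple mapping of words to the interval (0, 1].

```python
import math

MASK32 = 0xFFFFFFFF


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits (0 < r < 32)."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def threefry_mix(x0: int, x1: int, rotation: int) -> tuple:
    """One MIX step: add, rotate, xor. No multiply is involved."""
    y0 = (x0 + x1) & MASK32
    y1 = rotl32(x1, rotation) ^ y0
    return y0, y1


def uint32_to_unit_open(word: int) -> float:
    """Map a 32-bit word to (0, 1] so that log() never sees zero."""
    return (word + 1) / 4294967296.0  # 2**32


def box_muller(word0: int, word1: int) -> tuple:
    """Turn two 32-bit words into two standard normal samples."""
    u1 = uint32_to_unit_open(word0)
    u2 = uint32_to_unit_open(word1)
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


# Feed mixed words straight into the transform, as a fused kernel would.
w0, w1 = threefry_mix(0x01234567, 0x89ABCDEF, rotation=10)
z0, z1 = box_muller(w0, w1)
print(z0, z1)
```

In the fused kernel the intermediate words never leave on-chip memory; on the host the equivalent is simply passing the mixed words directly to the transform without materialising a uniform array in between.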
v0.4.1 — trn2 hardware validation (pending — pre-staged in feat/trn2-validated)
Both test_normal_kernel_distribution and test_normal_kernel_matches_box_muller_cpu
pass on trn2.3xlarge (XPASS). NCC_IBIR605 confirmed trn1-only; does not affect
trn2+. xfail marks removed. Closes
#2.
v0.5.0 — QMC on-device + NKI distributions (planned)
Extend the stateless GpSimd infra to quasi-random sequences and the distribution kernels that currently use CPU fallbacks.
QMC on-device (GpSimd):
Distribution breadth (Vector Engine):
- #13 — Beta distribution (gamma-ratio method). CPU path shipped in v0.2.0; NKI pending.
- #14 — Gamma distribution (Marsaglia-Tsang + boost). CPU path shipped in v0.2.0; NKI pending.
- #15 — Poisson distribution (Knuth / Atkinson rejection). CPU path shipped in v0.2.0; NKI pending.
- #16 — Chi-squared distribution (sum of squared normals). CPU path shipped in v0.2.0; NKI pending.
- #17 — Truncated normal on the Vector Engine (currently host-side rejection).
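As a reference point for the Gamma item above, the Marsaglia-Tsang method with the shape boost can be sketched in plain Python. This is a host-side illustration using the standard library's `random.Random`, not the trnrand CPU path or a planned NKI kernel; the function name is hypothetical. For shape below 1 it uses the identity Gamma(a) = Gamma(a + 1) · U^(1/a); otherwise it runs the squeeze-free form of the rejection loop.

```python
import math
import random


def gamma_marsaglia_tsang(shape: float, rng: random.Random) -> float:
    """Draw one Gamma(shape, scale=1) sample."""
    if shape <= 0.0:
        raise ValueError("shape must be positive")

    if shape < 1.0:
        # Boost: sample at shape + 1, then scale by U^(1/shape).
        u = 1.0 - rng.random()  # in (0, 1]
        return gamma_marsaglia_tsang(shape + 1.0, rng) * u ** (1.0 / shape)

    d = shape - 1.0 / 3.0
    c = 1.0 / math.sqrt(9.0 * d)
    while True:
        x = rng.gauss(0.0, 1.0)
        v = (1.0 + c * x) ** 3
        if v <= 0.0:
            continue  # outside the support of the proposal, retry
        u = 1.0 - rng.random()  # in (0, 1], keeps log() finite
        if math.log(u) < 0.5 * x * x + d - d * v + d * math.log(v):
            return d * v


rng = random.Random(1234)  # explicit generator for reproducibility
samples = [gamma_marsaglia_tsang(2.5, rng) for _ in range(10_000)]
print(sum(samples) / len(samples))  # should be close to the shape, 2.5
```

The rejection loop has a data-dependent trip count, which is one reason an accelerator version is less straightforward than the CPU path.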
v0.6.0 — Phase 3: single-chip streaming performance (planned)
Batched-tile RNG streaming, NEFF compile-cache reuse, and per-kernel tuning, so that the NKI path is meaningfully faster than the PyTorch fallback.
Design: SBUF-resident streaming Generator RFC — pre-compiled streaming kernel that keeps Generator state SBUF-resident across multiple distribution calls, pipelining GpSimd / Vector / Scalar engines concurrently. This is qualitatively different from cuRAND's per-call dispatch model, not a perf tweak.
- #19 — Phase 3 tracker: trnrand.normal_into(buf) streaming API, Sobol/Halton perf parity, published tokens/sec + GB/s benchmarks.
v0.7.0 — Phase 4: multi-chip counter partitioning (planned)
Philox's counter-based design makes cross-chip sharding trivial: each NeuronCore gets a disjoint counter subrange, and the combined output is bit-exact against a single-chip run.
Design: Counter-partitioned multi-chip RNG RFC — the bit-exactness thesis: a 1-chip run and a 32-chip run with the same seed produce the same combined stream, byte-for-byte. GPU RNG libraries typically can't guarantee this; for MCMC / replication studies / cluster-reshape debugging, that's a qualitatively different property.
- #20 — Phase 4 tracker: Generator accepts partition_rank / partition_size; near-linear strong scaling on trn1.32xlarge.
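The bit-exactness property can be demonstrated with a toy counter-based generator in plain Python. This is a sketch under stated assumptions: a keyed BLAKE2b hash stands in for Philox, the helper names are hypothetical, and only the parameter names `partition_rank` and `partition_size` come from the tracker item above. Because each output word depends only on (seed, counter), splitting the counter range into disjoint contiguous subranges and concatenating the shards in rank order reproduces the single-partition stream exactly.

```python
import hashlib


def counter_word(seed: int, counter: int) -> int:
    """Stateless stand-in for a counter-based RNG: (seed, counter) -> uint32."""
    digest = hashlib.blake2b(
        counter.to_bytes(8, "little"),
        key=seed.to_bytes(8, "little"),
        digest_size=4,
    ).digest()
    return int.from_bytes(digest, "little")


def counter_subrange(total: int, partition_rank: int, partition_size: int) -> range:
    """Contiguous, disjoint slice of [0, total) owned by one partition."""
    base, extra = divmod(total, partition_size)
    start = partition_rank * base + min(partition_rank, extra)
    stop = start + base + (1 if partition_rank < extra else 0)
    return range(start, stop)


def generate(seed: int, total: int, partition_rank: int = 0,
             partition_size: int = 1) -> list:
    """Words for this partition's share of a stream of `total` words."""
    counters = counter_subrange(total, partition_rank, partition_size)
    return [counter_word(seed, c) for c in counters]


# One partition versus 32 partitions, same seed, same total length.
single = generate(seed=42, total=1000)
sharded = [w for rank in range(32) for w in generate(42, 1000, rank, 32)]
assert sharded == single  # identical combined stream
```

A stateful generator cannot offer this without coordinating state between workers; with a counter-based design the only shared inputs are the seed and the partition layout.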
v0.8.0 — Phase 5: trn2 wider-PSUM fast path (planned)
Exploit trn2's larger partition count without maintaining two separately tuned codebases; runtime capability detection picks the right kernel.
- #21 — Phase 5 tracker: trn2-specific Philox kernel + runtime hardware detection in dispatch.
Suite phase mapping
trnrand's roadmap aligns with the trnsci suite-wide phase plan:
| Suite Phase | trnrand Milestone | Tracker |
|---|---|---|
| Phase 1 — correctness on hardware | v0.4.0 / v0.4.1 | #18 |
| Phase 2 — precision | (N/A — trnrand is precision-neutral) | — |
| Phase 3 — single-chip perf | v0.6.0 | #19 |
| Phase 4 — multi-chip | v0.7.0 | #20 |
| Phase 5 — generation-specific | v0.8.0 | #21 |