Roadmap
Forward-looking plan for trnrand. Work is tracked via GitHub milestones; this page is a browsable summary.
v0.1.1 — post-transfer hygiene (in progress)
Housekeeping after the scttfrdmn/trnrand → trnsci/trnrand transfer. No
new features.
- #9 — Point Docs badge and `mkdocs.yml` `site_url` at `trnsci.dev`.
- #10 — Update GitHub repo `homepageUrl` to `trnsci.dev/trnrand/`.
v0.2.0 — CPU distributions + NKI hardware validation (CPU distributions shipped; hardware validation gated on trn1)
CPU implementations of the v0.3.0 distributions ship here so users can pick them up today without waiting for NKI hardware validation. The Philox and Box-Muller NKI scaffolds from v0.1.0 still need proving on real Trainium silicon — tracked in the same milestone.
Shipped (v0.2.0):
- Gamma, chi-squared, beta, and Poisson CPU implementations (see Distributions API).
Hardware-gated (open on this milestone):
- #1 — Validate NKI Philox 4×32-10 kernel on trn1/trn2.
- #2 — On-device Box-Muller transform (uniform → normal) on the Vector Engine.
- #3 — Benchmarks vs cuRAND on trn1.2xlarge.
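For readers unfamiliar with the two kernels named above, here is a minimal host-side sketch in pure Python of a Philox 4×32 block function followed by a Box-Muller transform. It is not the trnrand NKI kernel. The multiplier and key-schedule constants are the ones published with the Random123 reference implementation, while the function names, the integer-to-float mapping, and the way the seed and counter are packed into words are assumptions made for illustration. A sketch like this should itself be checked against the Random123 known-answer vectors before being used as a reference for hardware validation.

```python
import math

MASK32 = 0xFFFFFFFF
# Philox-4x32 multipliers and Weyl key-schedule constants (Random123 values).
M0, M1 = 0xD2511F53, 0xCD9E8D57
W0, W1 = 0x9E3779B9, 0xBB67AE85


def philox4x32(counter, key, rounds=10):
    """Map a 4-word counter and a 2-word key to four 32-bit output words."""
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for r in range(rounds):
        if r > 0:
            # The key is bumped between rounds, not before the first one.
            k0 = (k0 + W0) & MASK32
            k1 = (k1 + W1) & MASK32
        p0 = M0 * c0  # full 64-bit products; Python ints do not overflow
        p1 = M1 * c2
        hi0, lo0 = p0 >> 32, p0 & MASK32
        hi1, lo1 = p1 >> 32, p1 & MASK32
        c0, c1, c2, c3 = hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0
    return c0, c1, c2, c3


def to_unit_interval(word):
    """Map a 32-bit word to a float in (0, 1], so log() never sees zero."""
    return (word + 1) / 4294967296.0


def box_muller(u1, u2):
    """Turn two uniforms (u1 in (0, 1]) into two independent standard normals."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normals_from_counter(counter_index, seed):
    """Four standard normals for one counter value; no state is carried over."""
    # Assumed packing: counter index in word 0, seed in key word 0.
    words = philox4x32((counter_index & MASK32, 0, 0, 0), (seed & MASK32, 0))
    u = [to_unit_interval(w) for w in words]
    return box_muller(u[0], u[1]) + box_muller(u[2], u[3])
```

Because the output depends only on the counter and the key, the same call can be replayed on the host and on the device, and the two results compared word for word.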
v0.3.0 — QMC on-device + distribution breadth (planned)
Extend the stateless-Philox infra to quasi-random sequences on GpSimd; add the distributions that close the gap vs cuRAND / NumPy.
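To illustrate why quasi-random sequences fit a stateless design, the sketch below computes Halton points with the radical-inverse construction. Halton is chosen only because it is short; the function names are hypothetical and this is not the trnrand or GpSimd implementation. The relevant property is that point `i` is a pure function of its index, much like a Philox block is a pure function of its counter.

```python
def radical_inverse(index, base):
    """Reflect the base-`base` digits of `index` about the radix point."""
    result = 0.0
    weight = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * weight
        weight /= base
    return result


def halton_point(index, bases=(2, 3)):
    """Point `index` of a Halton sequence, one coordinate per coprime base."""
    return tuple(radical_inverse(index, b) for b in bases)


# Any index can be evaluated on its own, so disjoint index ranges can be
# handed to different workers without sharing state.
points = [halton_point(i) for i in range(1, 5)]
```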
QMC on-device (GpSimd):
Distribution breadth (Vector Engine):
- #13 — Beta distribution (gamma-ratio method). CPU path shipped in v0.2.0; NKI pending.
- #14 — Gamma distribution (Marsaglia-Tsang + boost). CPU path shipped in v0.2.0; NKI pending.
- #15 — Poisson distribution (Knuth / Atkinson rejection). CPU path shipped in v0.2.0; NKI pending.
- #16 — Chi-squared distribution (sum of squared normals). CPU path shipped in v0.2.0; NKI pending.
- #17 — Truncated normal on the Vector Engine (currently host-side rejection).
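The sampling methods named in this list can be sketched on the host with the standard library alone. The code below is an illustrative reference, not the shipped trnrand CPU path: the function names are hypothetical, the gamma sampler uses unit scale, and only Knuth's multiplication method is shown for Poisson (the Atkinson rejection variant for large rates is omitted).

```python
import math
import random


def gamma_marsaglia_tsang(shape, rng):
    """Gamma(shape, scale=1) via Marsaglia-Tsang squeeze-rejection."""
    if shape < 1.0:
        # Boost for shape < 1: Gamma(a) = Gamma(a + 1) * U**(1 / a).
        u = 1.0 - rng.random()  # in (0, 1]
        return gamma_marsaglia_tsang(shape + 1.0, rng) * u ** (1.0 / shape)
    d = shape - 1.0 / 3.0
    c = 1.0 / math.sqrt(9.0 * d)
    while True:
        x = rng.gauss(0.0, 1.0)
        v = (1.0 + c * x) ** 3
        if v <= 0.0:
            continue  # outside the support of the proposal; draw again
        u = 1.0 - rng.random()  # in (0, 1], keeps log(u) finite
        if math.log(u) < 0.5 * x * x + d - d * v + d * math.log(v):
            return d * v


def beta_gamma_ratio(a, b, rng):
    """Beta(a, b) as X / (X + Y) with X ~ Gamma(a) and Y ~ Gamma(b)."""
    x = gamma_marsaglia_tsang(a, rng)
    y = gamma_marsaglia_tsang(b, rng)
    return x / (x + y)


def chi_squared(dof, rng):
    """Chi-squared with integer `dof` as a sum of squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))


def poisson_knuth(lam, rng):
    """Knuth's multiplication method; cost grows with lam, so keep lam small."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)
samples = [gamma_marsaglia_tsang(2.5, rng) for _ in range(10000)]
sample_mean = sum(samples) / len(samples)  # should be close to the shape, 2.5
```

All four samplers consume only uniforms and standard normals, which is why they can sit on top of a Philox plus Box-Muller base without extra state.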
v0.4.0 — Phase 3: single-chip performance (planned)
Batched-tile RNG streaming, NEFF compile-cache reuse, and per-kernel tuning, so that the NKI path is meaningfully faster than the PyTorch fallback.
Design: SBUF-resident streaming Generator RFC — pre-compiled streaming kernel that keeps Generator state SBUF-resident across multiple distribution calls, pipelining GpSimd / Vector / Scalar engines concurrently. This is qualitatively different from cuRAND's per-call dispatch model, not a perf tweak.
- #19 — Phase 3 tracker: `trnrand.normal_into(buf)` streaming API, Sobol/Halton perf parity, published tokens/sec + GB/s benchmarks.
v0.5.0 — Phase 4: multi-chip counter partitioning (planned)
Philox's counter-based design makes cross-chip sharding trivial: each NeuronCore gets a disjoint counter subrange, and the combined output is bit-exact against a single-chip run.
Design: Counter-partitioned multi-chip RNG RFC — the bit-exactness thesis: a 1-chip run and a 32-chip run with the same seed produce the same combined stream, byte-for-byte. GPU RNG libraries typically can't guarantee this; for MCMC / replication studies / cluster-reshape debugging, that's a qualitatively different property.
- #20 — Phase 4 tracker: `Generator` accepts `partition_rank` / `partition_size`; near-linear strong scaling on `trn1.32xlarge`.
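The bit-exactness claim can be demonstrated on the host with any counter-based block function. The sketch below uses a BLAKE2b hash of `(seed, counter)` purely as a stand-in for Philox, and the `partition_range` and `generate` helpers are hypothetical; only the `partition_rank` / `partition_size` parameter names come from the tracker item above. It assumes each rank owns one contiguous counter subrange, which is one possible partitioning scheme, not necessarily the one the RFC specifies.

```python
import hashlib


def counter_block(seed, counter):
    """Stand-in for a Philox call: 16 bytes fixed by (seed, counter) alone."""
    message = seed.to_bytes(8, "little") + counter.to_bytes(16, "little")
    return hashlib.blake2b(message, digest_size=16).digest()


def partition_range(total_blocks, partition_rank, partition_size):
    """Contiguous, disjoint counter subrange owned by one rank."""
    base, extra = divmod(total_blocks, partition_size)
    start = partition_rank * base + min(partition_rank, extra)
    stop = start + base + (1 if partition_rank < extra else 0)
    return range(start, stop)


def generate(seed, total_blocks, partition_rank=0, partition_size=1):
    """Bytes produced by one rank for its share of the counter space."""
    counters = partition_range(total_blocks, partition_rank, partition_size)
    return b"".join(counter_block(seed, c) for c in counters)


# One "chip" versus 32 "chips" with the same seed: concatenating the shards
# in rank order reproduces the single-chip stream byte for byte.
single = generate(seed=1234, total_blocks=1000)
sharded = b"".join(generate(1234, 1000, rank, 32) for rank in range(32))
assert single == sharded
```

The equality holds because no rank's output depends on what any other rank generated; a stateful generator would need explicit skip-ahead to offer the same guarantee.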
v0.6.0 — Phase 5: trn2 wider-PSUM fast path (planned)
Exploit trn2's larger partition count without maintaining two separately tuned codebases; runtime capability detection picks the right kernel.
- #21 — Phase 5 tracker: trn2-specific Philox kernel + runtime hardware detection in dispatch.
Suite phase mapping
trnrand's roadmap aligns with the trnsci suite-wide phase plan:
| Suite Phase | trnrand Milestone | Tracker |
|---|---|---|
| Phase 1 — correctness on hardware | v0.3.0 | #18 |
| Phase 2 — precision | (N/A — trnrand is precision-neutral) | — |
| Phase 3 — single-chip perf | v0.4.0 | #19 |
| Phase 4 — multi-chip | v0.5.0 | #20 |
| Phase 5 — generation-specific | v0.6.0 | #21 |