Roadmap

Forward-looking plan for trnrand. Tracked via GitHub milestones — this page is a browsable summary.

v0.1.1 — post-transfer hygiene (in progress)

Housekeeping after the scttfrdmn/trnrand → trnsci/trnrand transfer. No new features.

  • #9 — Point Docs badge and mkdocs.yml site_url at trnsci.dev.
  • #10 — Update GitHub repo homepageUrl to trnsci.dev/trnrand/.

v0.2.0 — CPU distributions + NKI hardware validation (CPU distributions shipped; hardware validation gated on trn1)

CPU implementations of the v0.3.0 distributions ship here so users can pick them up today without waiting for NKI hardware validation. The Philox and Box-Muller NKI scaffolds from v0.1.0 still need proving on real Trainium silicon — tracked in the same milestone.

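Shipped (v0.2.0): CPU paths for the Beta, Gamma, Poisson, and Chi-squared distributions (#13 through #16, listed under v0.3.0).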

Hardware-gated (open on this milestone):

  • #1 — Validate NKI Philox 4×32-10 kernel on trn1/trn2.
  • #2 — On-device Box-Muller transform (uniform → normal) on the Vector Engine.
  • #3 — Benchmarks vs cuRAND on trn1.2xlarge.
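As a point of reference for the hardware validation above, here is a minimal pure-Python sketch of Philox 4×32-10 followed by a Box-Muller transform. It is not trnrand's kernel code: the function names, the way the seed and index are packed into the key and counter, and the mapping from 32-bit words to uniforms are all assumptions made for illustration. Only the round constants and round structure follow the published Philox 4×32 definition.

```python
import math

MASK32 = 0xFFFFFFFF
M0, M1 = 0xD2511F53, 0xCD9E8D57   # Philox 4x32 round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85   # Weyl constants used to bump the key


def philox4x32_10(counter, key):
    """Return four 32-bit words for a 4-word counter and a 2-word key."""
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for round_index in range(10):
        if round_index > 0:
            # The key is bumped before every round except the first.
            k0 = (k0 + W0) & MASK32
            k1 = (k1 + W1) & MASK32
        p0 = M0 * c0              # full 64-bit products, split into hi/lo
        p1 = M1 * c2
        c0, c1, c2, c3 = (
            (p1 >> 32) ^ c1 ^ k0,
            p1 & MASK32,
            (p0 >> 32) ^ c3 ^ k1,
            p0 & MASK32,
        )
    return (c0, c1, c2, c3)


def to_unit_interval(word):
    """Map a 32-bit word to a float in (0, 1], so log() never sees zero."""
    return (word + 1) / 4294967296.0


def box_muller(u1, u2):
    """Turn two uniforms in (0, 1] into two independent standard normals."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normals_for_counter(index, seed):
    """Four normals for one counter value (hypothetical seed/index packing)."""
    words = philox4x32_10((index & MASK32, 0, 0, 0), (seed & MASK32, 0))
    u = [to_unit_interval(w) for w in words]
    return box_muller(u[0], u[1]) + box_muller(u[2], u[3])
```

Because the output depends only on the counter and key, a host-side reference like this can be compared word-for-word against device output for the same inputs, and against published Philox known-answer vectors.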

v0.3.0 — QMC on-device + distribution breadth (planned)

Extend the stateless-Philox infrastructure to quasi-random sequences on GpSimd, and add the distributions that close the gap vs cuRAND / NumPy.

QMC on-device (GpSimd):

  • #11 — NKI Sobol scrambling kernel on GpSimd.
  • #12 — NKI Halton on-device generation (low-dim only).
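For readers unfamiliar with Halton sequences, the sketch below shows the underlying radical-inverse construction on the CPU. It is an unscrambled textbook version with hypothetical function names, not the planned NKI kernel. Each point depends only on its index, which is why the construction fits a stateless, counter-style design.

```python
def radical_inverse(index, base):
    """Reflect the base-`base` digits of `index` about the radix point."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_point(index, bases=(2, 3)):
    """One point of an unscrambled Halton sequence; `index` starts at 1."""
    return tuple(radical_inverse(index, b) for b in bases)


# First few 2-D points, using the first two primes as bases.
points = [halton_point(i) for i in range(1, 5)]
```

Each dimension uses a distinct prime base, so a d-dimensional sequence needs the first d primes.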

Distribution breadth (Vector Engine):

  • #13 — Beta distribution (gamma-ratio method). CPU path shipped in v0.2.0; NKI pending.
  • #14 — Gamma distribution (Marsaglia-Tsang, with the boost for shape < 1). CPU path shipped in v0.2.0; NKI pending.
  • #15 — Poisson distribution (Knuth / Atkinson rejection). CPU path shipped in v0.2.0; NKI pending.
  • #16 — Chi-squared distribution (sum of squared normals). CPU path shipped in v0.2.0; NKI pending.
  • #17 — Truncated normal on the Vector Engine (currently host-side rejection).
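The methods named in the list above can be sketched in plain Python as follows. This is an illustrative CPU version built on the standard library's `random.Random`, not trnrand's shipped implementation. The function names and signatures are hypothetical, and Poisson is shown with Knuth's method only, which suits small rates.

```python
import math
import random


def gamma_marsaglia_tsang(shape, rng):
    """Gamma(shape, scale=1) sample via Marsaglia-Tsang squeeze/rejection."""
    if shape < 1.0:
        # Boost: sample Gamma(shape + 1), then scale by U ** (1 / shape).
        u = 1.0 - rng.random()            # uniform in (0, 1]
        return gamma_marsaglia_tsang(shape + 1.0, rng) * u ** (1.0 / shape)
    d = shape - 1.0 / 3.0
    c = 1.0 / math.sqrt(9.0 * d)
    while True:
        x = rng.gauss(0.0, 1.0)
        v = 1.0 + c * x
        if v <= 0.0:
            continue                      # reject: cube would be non-positive
        v = v * v * v
        u = 1.0 - rng.random()            # in (0, 1], keeps log(u) finite
        if u < 1.0 - 0.0331 * x ** 4:     # cheap squeeze test
            return d * v
        if math.log(u) < 0.5 * x * x + d * (1.0 - v + math.log(v)):
            return d * v


def beta_gamma_ratio(a, b, rng):
    """Beta(a, b) as X / (X + Y) with X ~ Gamma(a), Y ~ Gamma(b)."""
    x = gamma_marsaglia_tsang(a, rng)
    y = gamma_marsaglia_tsang(b, rng)
    return x / (x + y)


def chi_squared(dof, rng):
    """Chi-squared with integer `dof` as a sum of squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))


def poisson_knuth(lam, rng):
    """Poisson(lam) by multiplying uniforms until the product drops below e^-lam."""
    limit = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1
```

A quick sanity check is to draw a few thousand samples from a seeded `random.Random` and compare the sample mean against the known mean of each distribution (shape for Gamma, a / (a + b) for Beta, dof for Chi-squared, lam for Poisson).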

v0.4.0 — Phase 3: single-chip performance (planned)

Batched-tile RNG streaming, NEFF compile-cache reuse, and per-kernel tuning, so that the NKI path is meaningfully faster than the PyTorch fallback.

Design: SBUF-resident streaming Generator RFC. A pre-compiled streaming kernel keeps Generator state SBUF-resident across multiple distribution calls and pipelines the GpSimd, Vector, and Scalar engines concurrently. This is a different execution model from cuRAND's per-call dispatch, not a performance tweak.

  • #19 — Phase 3 tracker: trnrand.normal_into(buf) streaming API, Sobol/Halton perf parity, published tokens/sec + GB/s benchmarks.

v0.5.0 — Phase 4: multi-chip counter partitioning (planned)

Philox's counter-based design makes cross-chip sharding trivial: each NeuronCore gets a disjoint counter subrange, and the combined output is bit-exact with a single-chip run.

Design: Counter-partitioned multi-chip RNG RFC. The bit-exactness thesis: a 1-chip run and a 32-chip run with the same seed produce the same combined stream, byte-for-byte. GPU RNG libraries typically can't guarantee this. For MCMC, replication studies, and cluster-reshape debugging, that is a qualitatively different property.

  • #20 — Phase 4 tracker: Generator accepts partition_rank / partition_size; near-linear strong scaling on trn1.32xlarge.
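The bit-exactness property can be illustrated with a small host-side sketch. The parameter names `partition_rank` and `partition_size` come from #20; everything else is an assumption, including the `generate` function, the contiguous slicing rule, and the use of a BLAKE2 hash as a stand-in for Philox. Any pure function of (seed, counter) shows the same behaviour.

```python
import hashlib
import struct


def counter_to_word(seed, counter):
    """Stand-in for Philox: a pure function of (seed, counter) -> uint32."""
    payload = struct.pack("<QQ", seed, counter)
    digest = hashlib.blake2b(payload, digest_size=4).digest()
    return struct.unpack("<I", digest)[0]


def generate(seed, total, partition_rank=0, partition_size=1):
    """Words for this rank's contiguous slice of the counters [0, total)."""
    start = partition_rank * total // partition_size
    stop = (partition_rank + 1) * total // partition_size
    return [counter_to_word(seed, c) for c in range(start, stop)]


# One "chip" versus 32 "chips" with the same seed.
single = generate(seed=1234, total=1000)
sharded = []
for rank in range(32):
    sharded.extend(generate(1234, 1000, partition_rank=rank, partition_size=32))

assert sharded == single  # identical stream, element for element
```

Since no rank carries state from another rank, the slices can be produced in any order or in parallel, and concatenating them in rank order recovers the single-chip stream.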

v0.6.0 — Phase 5: trn2 wider-PSUM fast path (planned)

Exploit trn2's larger partition count without maintaining two separately tuned codebases; runtime capability detection picks the right kernel.

  • #21 — Phase 5 tracker: trn2-specific Philox kernel + runtime hardware detection in dispatch.

Suite phase mapping

trnrand's roadmap aligns with the trnsci suite-wide phase plan:

| Suite Phase | trnrand Milestone | Tracker |
|---|---|---|
| Phase 1: correctness on hardware | v0.3.0 | #18 |
| Phase 2: precision | N/A (trnrand is precision-neutral) | |
| Phase 3: single-chip perf | v0.4.0 | #19 |
| Phase 4: multi-chip | v0.5.0 | #20 |
| Phase 5: generation-specific | v0.6.0 | #21 |