Roadmap¶
trnsci develops in five phases. The ordering is intentional — later phases only pay off when earlier ones are solid. Not every sub-project moves through every phase at the same pace; what matters is that a given sub-project has a credible Phase 1 before anyone starts caring about its Phase 3.
The detailed version of this roadmap, with tracking labels and cross-project dependencies, lives in ROADMAP.md on the umbrella repo.
Phase 1 — Single-chip correctness¶
Every public API runs through a real NKI kernel on a trn1 or trn2 NeuronCore, matching the PyTorch reference within float tolerance. Phase 1 replaces dispatch stubs with real kernels and validates them against published test vectors and cross-library integration demos.
Data-parallel multi-chip (replica per chip, embarrassingly parallel) falls out of Phase 1 automatically — no new code beyond a worker loop.
Phase 2 — Precision and numerical validation¶
The Carson–Higham three-precision iterative-refinement framework, Ozaki-scheme FP64 emulation on BF16/FP8 tensor units, and stochastic-rounding-enabled Krylov solvers. Concretely:
- trnblas#22 — double-double FP64 GEMM (keystone). A
nc_matmul_compensatedkernel exploiting PSUM as a free FP32 accumulator to deliver FP32-accuracy output from BF16 inputs at ~2× the work of naive BF16 matmul. This unblocks trnsolver iterative refinement, trntensorprecision="dd", and the Phase 2 direction for every library that does linear algebra. - Compensated reduction chains (Kahan/Neumaier) in trnsolver cg/gmres inner products and trntensor contractions.
- SR-enabled BF16 Krylov — stochastic rounding is already in the NKI ISA; wiring it as the default for long reductions (n ≫ √(1/u) ≈ 300 for BF16) turns convergence from aspirational to structural.
target_forward_errorAPI contract — unify trnsolver'siterative_refinement=True, trntensor'sprecision=, and trnfft'sprecision=into a cross-packagetarget_forward_errorkwarg. The user states "solve to ε"; trnsci selects factorization precision, Ozaki split count, and IR iterations automatically, returning an achieved-error bound.
Validation targets: DF-MP2 to nanohartree against PySCF at aug-cc-pVDZ (currently passing at cc-pVDZ), and HPL-MxP single-node on Trn2.
Phase 3 — Single-chip performance + randomized NLA flagship¶
Performance: NKI path becomes meaningfully faster than the PyTorch fallback and competitive per-dollar against vintage-matched NVIDIA instances (A10G vs Trn1, H100 vs Trn2, GB200 vs Trn3). Work per kernel: tile-shape sweeps, operand-stationarity choices, multi-engine scheduling, operation fusion, NEFF compile-cache reuse.
Randomized NLA flagship: ship rsvd, hutch_plus_plus, and sRR (sketched Rayleigh–Ritz, Nakatsukasa–Tropp 2024) as named entry points — natural home in trnsolver or a new trnrandla sibling, using existing trnblas, trnrand, and trnsolver.qr primitives. The flagship benchmark: randomized SVD of a 10⁶-dimensional kernel matrix on Trn2 UltraServer at BF16, verified to FP32 accuracy via iterative refinement, with bitwise-reproducible output given a seed. This is the highest-visibility Phase 3 differentiator — no GPU library can match ISA-level SR + documented-deterministic systolic reduction + Ozaki adaptivity in one package.
Additional Phase 3 targets: HPL-MxP multi-node submission on Trn2 UltraServer, BLR/HODLR direct solvers (trnsparse + trnblas frontal kernels), iterative FFT refinement research (trnfft Phase 2 research thread).
Phase 4 — Model-parallel multi-chip¶
Workloads whose tensors exceed a single chip's HBM — large-basis DF-MP2, N > 2²⁴ FFTs, sparse systems with >1B nonzeros — sharded across NeuronCores within a chip and chips within an instance. Introduces ShardedTensor abstractions, collective operations (blocked on nki.collectives, expected in Neuron SDK 2.30+), and dispatch glue transparent to the user. HPL-MxP UltraServer submission.
Cross-instance (EFA-interconnected) multi-chip is a Phase 4 follow-up.
Phase 5 — Generation-specific optimization + research¶
Trn1 (NeuronCore-v2) and Trn2/Trn3 (NeuronCore-v3) generation-specific fast paths: Trn3 MXFP8/MXFP4 microscaling as a native codepath for block-structured problems (directly applicable to mixed-precision HODLR and progressive-precision multigrid). Common Phase 3 paths stay the default; generation-specific paths are opt-in.
Research contributions back to the community: iterative FFT refinement theory, compensated systolic matmul error bounds (formal analysis for BF16+FP32-accum with ISA-level SR on a documented-deterministic array), SR-as-preconditioner analysis (Dexter et al. 2024 showed SR regularizes tall-skinny matrices — can this be formalized as a zero-memory preconditioner?), mixed-precision CCSD(T) cancellation analysis.
New domains: mixed-precision multigrid (trnpde or trnsolver submodule, Phase 3+), scientific optimization — LBFGS/ADMM/trust-region Newton (trnsolver submodule, Phase 3+), FEAST eigensolver via contour integration (Phase 3, maps naturally onto Trn3 UltraServer's 144 chips across quadrature points).
Where each library is today¶
As of April 2026.
| Package | Version | Phase | Status |
|---|---|---|---|
| trnfft | v0.15.0 | Phase 1 ✓ / Phase 2 active | Hardware-validated on trn1.2xlarge (70/70 cases). DFT-as-GEMM fast path (up to 14× at N ≤ 256), Stockham radix-4/8 with Tensor-engine W₈, compensated butterfly in precision="kahan" mode. Iterative FFT refinement open as research. |
| trnblas | v0.5.4 | Phase 1 ✓ | GEMM, SYRK, batched GEMM, fused DF-MP2 energy reduction hardware-validated. DF-MP2 matches PySCF to 10 µHa at cc-pVDZ on H₂O/CH₄/NH₃. FP32 precision sufficient to nanohartree at target molecules (trnblas#20 closed). Phase 2 next: double-double FP64 GEMM (trnblas#22). |
| trnsolver | v0.9.0 | Phase 1 hardware-pending | Full API shipped: eigh, eigh_generalized, Cholesky/LU/QR/SVD/pinv, cg/gmres with iterative refinement (κ ≲ 10⁷, Carson–Higham regime), Schur decomposition. NKI Jacobi kernel simulator-validated; hardware validation pending. |
| trnsparse | v0.4.3 | Phase 1 partial | BSRMatrix (128×128), bsr_spmm, screened_spmm, block-sparse attention, cg_bsr, chebyshev_bsr, richardson_bsr shipped. Hardware-validated via densify-then-GEMM path. Native BSR NKI kernel hardware validation pending. |
| trntensor | v0.3.0 | Phase 1 hardware-pending | Einsum with greedy path planner, CP/Tucker/TT decompositions, homogeneous-contraction batching, ShardedTensor for output-parallel multi-chip, to_xla/from_xla residency. Hardware validation pending. |
| trnrand | v0.4.1 | Phase 1 hardware-pending | Philox 4×32 and Threefry4×32-20 NKI kernels (Threefry hardware-validated for uniforms); Box-Muller NKI kernel; full distribution suite (normal, exponential, Bernoulli, etc.); Sobol/Halton/LHS QMC. Philox blocked on aws-neuron-sdk#1308. |
How to follow along¶
- Suite-wide coordination and cross-project dependencies: issues on
trnsci/trnsci - Per-library roadmap tracking: labels
phase-1-correctness/phase-2-precision/phase-3-perf/phase-4-multichip/phase-5-generationon each sub-project's issue tracker - NKI-native migration tracking: trnsci/trnsci#35 — removing torch-xla, targeting Neuron SDK 2.30
- Releases: each library versions independently. Current PyPI versions at pypi.org/project/trnsci/
- Blog: trnsci.dev/blog/ — retrospectives on shipped work, honest about what worked and what didn't