CUDA → trnsci Rosetta stone¶
A reference map between NVIDIA CUDA numerical libraries and their trnsci equivalents. If you're porting a CUDA codebase to Trainium, start here.
Library mapping¶
| CUDA | trnsci |
Scope | Notes |
|---|---|---|---|
| cuFFT | trnfft | FFT, complex tensors, STFT | Trainium has no complex dtype — split real/imag |
| cuBLAS | trnblas | BLAS Levels 1–3, batched GEMM | FP32-only; FP64 via double-double is future work |
| cuRAND | trnrand | Philox PRNG, Sobol/Halton QMC | Counter-based, stateless |
| cuSOLVER | trnsolver | Factorizations, eigendecomposition | Jacobi for eigh (tile-friendly) |
| cuSPARSE | trnsparse | Sparse formats, SpMV/SpMM | Gather-matmul-scatter over DMA + Tensor Engine |
| cuTENSOR | trntensor | Einstein summation with planning | Dispatch to matmul / bmm / einsum / NKI |
Per-library detail¶
cuFFT → trnfft¶
cuFFT exposes cufftPlan* + cufftExec* for 1D/2D/3D complex and real transforms, with plan caching for repeated shapes. trnfft mirrors this:
import trnfft
x = torch.randn(1024)
X = trnfft.fft(x) # like cufftExecC2C, forward
y = trnfft.ifft(X).real # like cufftExecC2C, inverse
R = trnfft.rfft(x) # like cufftExecR2C
S = trnfft.stft(x, n_fft=256, hop_length=128)
Semantic differences.
- Trainium has no
complex64dtype.trnfft.ComplexTensorstores real and imaginary parts as paired real tensors. Arithmetic is decomposed: complex multiply is 4 real multiplies + 2 adds. - Non-power-of-two sizes use Bluestein's chirp-z transform, padded to a power of two. FP32 accuracy degrades above ~N=500 through the 3-FFT Bluestein chain. Use FP64 on CPU (
x.double()) for higher precision; Trainium itself is FP32-only. - Plan caching is keyed by
(size, inverse). Plans are cheap to create but not free — reuse them when calling in a loop.
cuBLAS → trnblas¶
cuBLAS is a two-tier API: the classic BLAS (cublasSgemm etc.) and cublasLt for batched / strided / mixed-precision paths. trnblas offers the classic surface plus batched GEMM:
C = trnblas.gemm(1.0, A, B) # like cublasSgemm
Cb = trnblas.batched_gemm(1.0, A_batch, B_batch) # like cublasSgemmBatched
X = trnblas.trsm(1.0, L, B, uplo="lower", trans=True) # like cublasStrsm
Semantic differences.
- FP32-only. Trainium's Tensor Engine does not support FP64 natively. Chemistry workloads that need FP64 accuracy have to use double-double arithmetic (two FP32 values), which is documented but not yet implemented in trnblas.
- Level-1 and Level-2 (dot, axpy, gemv) are provided for API completeness but don't get NKI kernels. The Tensor Engine would be wasted on vector-only ops. Level-3 is where NKI acceleration lives.
- Tile shapes are fixed: 128 (partition) × 512 (moving). Matrix dimensions are padded implicitly.
cuRAND → trnrand¶
cuRAND provides two families: pseudo-random (Philox, XORWOW, MRG32k3a) and quasi-random (Sobol, scrambled Sobol). trnrand mirrors this:
g = trnrand.manual_seed(42)
x = trnrand.normal(1024, 1024, generator=g) # like curandGenerateNormal
q = trnrand.sobol(d=8, n=4096) # like curandGenerateQuasiSobol
lhs = trnrand.latin_hypercube(d=4, n=1024) # extra — not in cuRAND
Semantic differences.
- The Philox generator is the primary PRNG — counter-based and stateless, which makes it easy to place per-tile counters in parallel on the GpSimd engine.
- Box-Muller is used for the normal distribution. A future NKI Box-Muller kernel would run on the Vector Engine (cos, sin, log, sqrt).
- Halton loses quality above ~20 dimensions. Use Sobol for
d > 10.
cuSOLVER → trnsolver¶
cuSOLVER has a dense API (cusolverDnSpotrf, cusolverDnSsyevd, cusolverDnSgesvdj) and a sparse API. trnsolver currently covers the dense surface plus iterative Krylov methods:
L = trnsolver.cholesky(A) # like cusolverDnSpotrf
w, V = trnsolver.eigh(A) # like cusolverDnSsyevd
w, V = trnsolver.eigh_generalized(F, S)
x, info = trnsolver.cg(A, b) # iterative
Semantic differences.
eighuses Jacobi rotations, not Householder tridiagonalization + QR. Jacobi is O(n³) per sweep with O(n) sweeps — cubic overall, with a larger constant than QR — but each rotation is a fixed-size matmul on the Tensor Engine, which maps cleanly to NKI tiles. Forn < ~500Jacobi is competitive; above that, QR would win on a GPU. On Trainium, tile-friendliness matters more than asymptotic constant.inv_sqrt_spdcurrently uses eigendecomposition. Newton-Schulz (X_{k+1} = 0.5 X_k (3I − A X_k²)) is all GEMMs and maps better to Trainium; it's on the roadmap.
cuSPARSE → trnsparse¶
cuSPARSE handles sparse formats (CSR, CSC, COO, BSR) and sparse-BLAS kernels. trnsparse currently offers CSR / COO + SpMV / SpMM plus domain-specific screening:
A = trnsparse.CSRMatrix.from_dense(dense)
y = trnsparse.spmv(A, x) # like cusparseSpMV
Y = trnsparse.spmm(A, X) # like cusparseSpMM
Q = trnsparse.schwarz_bounds(shell_pair_integrals)
mask = trnsparse.screen_quartets(Q, threshold=1e-10)
Semantic differences.
- SpMM uses a gather-matmul-scatter pattern: DMA gathers non-zero columns into a dense SBUF tile, Tensor Engine multiplies against the RHS, DMA scatters back. Efficiency depends on the nnz distribution per row.
- Row-variable sparsity patterns need bucketing. Uniform nnz maps cleanly to fixed tiles; highly variable nnz is currently penalized.
cuTENSOR → trntensor¶
cuTENSOR is NVIDIA's general tensor-contraction library — it handles arbitrary einsum expressions by selecting contraction paths and dispatching to optimized kernels. trntensor offers the same shape:
C = trntensor.einsum("ijk,klm->ijlm", A, B) # like cutensorContract
plan = trntensor.plan_contraction("ijk,klm->ijlm", A, B)
flops = trntensor.estimate_flops("ijk,klm->ijlm", A, B)
factors = trntensor.cp_decompose(X, rank=8)
core, facs = trntensor.tucker_decompose(X, ranks=(4, 4, 4))
Semantic differences.
- The planner picks
matmul,bmm,torch.einsum, ornki(future) based on the subscript pattern. 2D-over-single-index →matmul; batched →bmm; complex multi-index →einsum. - Optimal contraction ordering (like
opt_einsum) is on the roadmap. Currently the planner handles one contraction at a time. - CP and Tucker decompositions use alternating least squares on top of
trnblas-style GEMM primitives. They can be fed back througheinsumat evaluation time, so a Tucker-compressed tensor can be contracted without materializing.
Reverse direction¶
The mapping goes both ways. A few idioms from trnsci are arguably cleaner than their CUDA counterparts — see examples/reverse_port_note.md for patterns a CUDA programmer might borrow back.