
CUDA → trnsci Rosetta stone

A reference map between NVIDIA CUDA numerical libraries and their trnsci equivalents. If you're porting a CUDA codebase to Trainium, start here.

Library mapping

| CUDA     | trnsci    | Scope                              | Notes                                            |
| -------- | --------- | ---------------------------------- | ------------------------------------------------ |
| cuFFT    | trnfft    | FFT, complex tensors, STFT         | Trainium has no complex dtype; split real/imag   |
| cuBLAS   | trnblas   | BLAS Levels 1–3, batched GEMM      | FP32-only; FP64 via double-double is future work |
| cuRAND   | trnrand   | Philox PRNG, Sobol/Halton QMC      | Counter-based, stateless                         |
| cuSOLVER | trnsolver | Factorizations, eigendecomposition | Jacobi for eigh (tile-friendly)                  |
| cuSPARSE | trnsparse | Sparse formats, SpMV/SpMM          | Gather-matmul-scatter over DMA + Tensor Engine   |
| cuTENSOR | trntensor | Einstein summation with planning   | Dispatch to matmul / bmm / einsum / NKI          |

Per-library detail

cuFFT → trnfft

cuFFT exposes cufftPlan* + cufftExec* for 1D/2D/3D complex and real transforms, with plan caching for repeated shapes. trnfft mirrors this:

import torch
import trnfft

x = torch.randn(1024)
X = trnfft.fft(x)          # like cufftExecC2C, forward
y = trnfft.ifft(X).real    # like cufftExecC2C, inverse
R = trnfft.rfft(x)         # like cufftExecR2C
S = trnfft.stft(x, n_fft=256, hop_length=128)

Semantic differences.

  • Trainium has no complex64 dtype. trnfft.ComplexTensor stores real and imaginary parts as paired real tensors. Arithmetic is decomposed: complex multiply is 4 real multiplies + 2 adds.
  • Non-power-of-two sizes use Bluestein's chirp-z transform, padded to a power of two. FP32 accuracy degrades above ~N=500 through the 3-FFT Bluestein chain. Use FP64 on CPU (x.double()) for higher precision; Trainium itself is FP32-only.
  • Plan caching is keyed by (size, inverse). Plans are cheap to create but not free — reuse them when calling in a loop.
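The decomposition in the first bullet is easy to see with plain real values. A minimal sketch; `complex_mul` is a hypothetical helper for illustration, not the trnfft.ComplexTensor API:

```python
def complex_mul(ar, ai, br, bi):
    # (ar + i*ai) * (br + i*bi) decomposed into real arithmetic:
    # 4 real multiplies and 2 additions (one of them a subtraction),
    # which is how products of split real/imag pairs are evaluated.
    real = ar * br - ai * bi
    imag = ar * bi + ai * br
    return real, imag

z = complex_mul(1.0, 2.0, 3.0, 4.0)   # (1+2i)(3+4i) = -5+10i
```

The same identity applies elementwise when `ar`, `ai`, `br`, `bi` are whole real tensors.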

cuBLAS → trnblas

cuBLAS is a two-tier API: the classic BLAS (cublasSgemm etc.) and cublasLt for batched / strided / mixed-precision paths. trnblas offers the classic surface plus batched GEMM:

C = trnblas.gemm(1.0, A, B)                    # like cublasSgemm
Cb = trnblas.batched_gemm(1.0, A_batch, B_batch)  # like cublasSgemmBatched
X = trnblas.trsm(1.0, L, B, uplo="lower", trans=True)   # like cublasStrsm

Semantic differences.

  • FP32-only. Trainium's Tensor Engine does not support FP64 natively. Chemistry workloads that need FP64 accuracy have to use double-double arithmetic (two FP32 values), which is documented but not yet implemented in trnblas.
  • Level-1 and Level-2 (dot, axpy, gemv) are provided for API completeness but don't get NKI kernels. The Tensor Engine would be wasted on vector-only ops. Level-3 is where NKI acceleration lives.
  • Tile shapes are fixed: 128 (partition) × 512 (moving). Matrix dimensions are padded implicitly.
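The double-double scheme mentioned above builds on error-free transformations such as Knuth's two-sum, which recovers the rounding error of an FP32 addition exactly. A minimal sketch with NumPy float32 scalars, not trnblas API:

```python
import numpy as np

def two_sum(a, b):
    # Error-free transformation: returns (s, e) with s = fl(a + b)
    # and s + e == a + b exactly. A double-double number keeps both
    # words, roughly doubling FP32's effective precision.
    s = a + b
    bb = s - a                      # the part of b that made it into s
    e = (a - (s - bb)) + (b - bb)   # what rounding discarded
    return s, e

a, b = np.float32(1.0), np.float32(1e-8)
s, e = two_sum(a, b)                # s rounds to 1.0; e recovers the lost 1e-8
```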

cuRAND → trnrand

cuRAND provides two families: pseudo-random (Philox, XORWOW, MRG32k3a) and quasi-random (Sobol, scrambled Sobol). trnrand mirrors this:

g = trnrand.manual_seed(42)
x = trnrand.normal(1024, 1024, generator=g)      # like curandGenerateNormal
q = trnrand.sobol(d=8, n=4096)                    # like curandGenerateQuasiSobol
lhs = trnrand.latin_hypercube(d=4, n=1024)        # extra — not in cuRAND

Semantic differences.

  • The Philox generator is the primary PRNG — counter-based and stateless, which makes it easy to place per-tile counters in parallel on the GpSimd engine.
  • Box-Muller is used for the normal distribution. A future NKI Box-Muller kernel would run on the Vector Engine (cos, sin, log, sqrt).
  • Halton sequence quality degrades as dimension grows, noticeably above ~20 dimensions; prefer Sobol for d > 10.
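The Box-Muller transform from the second bullet uses exactly the elementwise ops listed (log, sqrt, cos, sin). A scalar sketch; the hypothetical kernel would apply this to whole tiles of Philox-generated uniforms:

```python
import math

def box_muller(u1, u2):
    # Maps two independent uniforms in (0, 1] to two independent
    # standard normals using only log, sqrt, cos, sin.
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

z1, z2 = box_muller(0.5, 0.25)   # z1 ~ 0, z2 = sqrt(2 ln 2)
```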

cuSOLVER → trnsolver

cuSOLVER has a dense API (cusolverDnSpotrf, cusolverDnSsyevd, cusolverDnSgesvdj) and a sparse API. trnsolver currently covers the dense surface plus iterative Krylov methods:

L = trnsolver.cholesky(A)                        # like cusolverDnSpotrf
w, V = trnsolver.eigh(A)                          # like cusolverDnSsyevd
w, V = trnsolver.eigh_generalized(F, S)
x, info = trnsolver.cg(A, b)                      # iterative

Semantic differences.

  • eigh uses Jacobi rotations, not Householder tridiagonalization + QR. Each Jacobi sweep costs O(n³) and a handful of sweeps suffices in practice, so the total is cubic overall with a larger constant than QR; in exchange, each rotation is a fixed-size matmul on the Tensor Engine, which maps cleanly to NKI tiles. For n < ~500 Jacobi is competitive; above that, QR would win on a GPU. On Trainium, tile-friendliness matters more than the asymptotic constant.
  • inv_sqrt_spd currently uses eigendecomposition. Newton-Schulz (X_{k+1} = 0.5 X_k (3I − A X_k²)) is all GEMMs and maps better to Trainium; it's on the roadmap.
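The Newton-Schulz iteration above is all GEMMs, which is exactly what the Tensor Engine wants. A NumPy sketch under stated assumptions (hypothetical helper, not the trnsolver API): A must be SPD, and we rescale by the trace so the iteration's convergence condition holds, then undo the scaling:

```python
import numpy as np

def inv_sqrt_newton_schulz(A, iters=20):
    # Approximate A^(-1/2) for SPD A using only GEMMs:
    #   X_{k+1} = 0.5 * X_k @ (3I - A @ X_k^2)
    # The scalar version converges for eigenvalues in (0, 3), so we
    # rescale by the trace (an upper bound on the spectral radius for
    # SPD A) to land in (0, 1], then undo the scaling at the end.
    n = A.shape[0]
    s = np.trace(A)
    As = A / s
    X = np.eye(n)
    I3 = 3.0 * np.eye(n)
    for _ in range(iters):
        X = 0.5 * X @ (I3 - As @ X @ X)
    return X / np.sqrt(s)   # (A/s)^(-1/2) / sqrt(s) = A^(-1/2)

A = np.diag([4.0, 9.0])
X = inv_sqrt_newton_schulz(A)   # ~ diag(1/2, 1/3)
```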

cuSPARSE → trnsparse

cuSPARSE handles sparse formats (CSR, CSC, COO, BSR) and sparse-BLAS kernels. trnsparse currently offers CSR / COO + SpMV / SpMM plus domain-specific screening:

A = trnsparse.CSRMatrix.from_dense(dense)
y = trnsparse.spmv(A, x)                          # like cusparseSpMV
Y = trnsparse.spmm(A, X)                          # like cusparseSpMM
Q = trnsparse.schwarz_bounds(shell_pair_integrals)
mask = trnsparse.screen_quartets(Q, threshold=1e-10)

Semantic differences.

  • SpMM uses a gather-matmul-scatter pattern: DMA gathers non-zero columns into a dense SBUF tile, Tensor Engine multiplies against the RHS, DMA scatters back. Efficiency depends on the nnz distribution per row.
  • Row-variable sparsity patterns need bucketing. Uniform nnz maps cleanly to fixed tiles; highly variable nnz is currently penalized.
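The gather-matmul-scatter pattern can be sketched in NumPy, per-row for clarity; the real kernel would work on 128-row tiles and use DMA descriptors rather than fancy indexing (function name hypothetical):

```python
import numpy as np

def csr_spmm(indptr, indices, data, X):
    # Y = A @ X for A in CSR form, one row at a time:
    # gather the rows of X named by the column indices ("DMA gather"),
    # hit them with a dense product ("Tensor Engine"),
    # then write the result back ("DMA scatter").
    n_rows = len(indptr) - 1
    Y = np.zeros((n_rows, X.shape[1]), dtype=X.dtype)
    for r in range(n_rows):
        lo, hi = indptr[r], indptr[r + 1]
        gathered = X[indices[lo:hi]]     # dense tile of the needed RHS rows
        Y[r] = data[lo:hi] @ gathered    # nnz values times the gathered tile
    return Y

# A = [[1, 0, 2], [0, 3, 0]] in CSR
indptr  = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data    = np.array([1.0, 2.0, 3.0])
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
Y = csr_spmm(indptr, indices, data, X)
```

Note how the per-row gather size is `hi - lo`: this is the nnz variability the second bullet is about, and why uniform rows tile so much better.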

cuTENSOR → trntensor

cuTENSOR is NVIDIA's general tensor-contraction library — it handles arbitrary einsum expressions by selecting contraction paths and dispatching to optimized kernels. trntensor offers the same shape:

C = trntensor.einsum("ijk,klm->ijlm", A, B)       # like cutensorContract
plan = trntensor.plan_contraction("ijk,klm->ijlm", A, B)
flops = trntensor.estimate_flops("ijk,klm->ijlm", A, B)
factors = trntensor.cp_decompose(X, rank=8)
core, facs = trntensor.tucker_decompose(X, ranks=(4, 4, 4))

Semantic differences.

  • The planner picks matmul, bmm, torch.einsum, or nki (future) based on the subscript pattern. 2D-over-single-index → matmul; batched → bmm; complex multi-index → einsum.
  • Optimal contraction ordering (like opt_einsum) is on the roadmap. Currently the planner handles one contraction at a time.
  • CP and Tucker decompositions use alternating least squares on top of trnblas-style GEMM primitives. They can be fed back through einsum at evaluation time, so a Tucker-compressed tensor can be contracted without materializing.
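Contracting a Tucker-compressed tensor without materializing it, as the last bullet describes, comes down to re-associating the einsum so the work happens in the small core space. A NumPy sketch with hypothetical names, not the trntensor API:

```python
import numpy as np

def tucker_contract_last_mode(core, U1, U2, U3, v):
    # The dense tensor X = einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)
    # is never formed. To contract X with v along its last mode,
    # push v through the factors one at a time:
    w = U3.T @ v                              # project v onto the core's 3rd mode
    small = np.einsum('abc,c->ab', core, w)   # contract inside the tiny core
    return U1 @ small @ U2.T                  # expand back out to (i, j)

rng = np.random.default_rng(0)
core = rng.normal(size=(2, 2, 2))
U1, U2, U3 = (rng.normal(size=(3, 2)) for _ in range(3))
v = rng.normal(size=3)

# reference: materialize the dense tensor and contract directly
dense = np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)
out = tucker_contract_last_mode(core, U1, U2, U3, v)
```

The compressed path touches only rank-sized intermediates, so its cost scales with the Tucker ranks rather than the dense tensor's dimensions.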

Reverse direction

The mapping goes both ways. A few idioms from trnsci are arguably cleaner than their CUDA counterparts — see examples/reverse_port_note.md for patterns a CUDA programmer might borrow back.