Benchmarks¶
Performance numbers for trnsolver across three axes:
- CPU baselines: torch.linalg, scipy.linalg (LAPACK), and the trnsolver PyTorch path. Run on every CI build.
- GPU baselines: torch.linalg on CUDA (cuSOLVER / cuBLAS) on a vintage-matched Nvidia instance.
- Trainium NKI: the NKI path of trnsolver on trn1 / trn2. Pending hardware validation.
Vintage matching¶
Comparing a 2022 Trainium chip (trn1) to an H100 is not informative: both the arithmetic tier and the price gap distort the result. The benchmark table pairs each Trainium generation with the Nvidia GPU of the closest architectural generation and approximate price tier:
| Trainium chip | Launch | Nvidia peer | EC2 instance | On-demand cost |
|---|---|---|---|---|
| trn1 (NeuronCore v2) | Oct 2022 | A10G (Ampere, 2021) | g5.xlarge | ~$1.01/hr |
| trn2 (NeuronCore v3) | Dec 2024 | H100 (Hopper, 2022) | p5.4xlarge | ~$12/hr |
The default GPU baseline is g5.xlarge (A10G) because it is the vintage peer of trn1.2xlarge and sits in the same price tier. H100 baselines are an opt-in follow-up on p5.4xlarge (1× H100), which costs roughly 12× as much as the A10G box but is still far cheaper than p5.48xlarge (8× H100, ~$98/hr).
Reproducing locally (CPU)¶
pytest benchmarks/bench_solver.py -v -m "not neuron and not cuda" --benchmark-only
Reproducing on GPU (AWS)¶
cd infra/terraform
AWS_PROFILE=aws terraform apply -var=enable_gpu_ci=true \
-var=vpc_id=vpc-... -var=subnet_id=subnet-...
# Wait for user-data to finish (~5 min), then stop; scripts will wake it.
AWS_PROFILE=aws ./scripts/run_cuda_tests.sh g5
Reproducing on Trainium¶
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh trn1
CPU results (laptop reference — n=256)¶
Run on darwin/Apple-silicon CPU, Python 3.14, torch 2.x, scipy 1.17.1, float32 except where noted.
| Op | trnsolver | torch.linalg | scipy.linalg |
|---|---|---|---|
| eigh (symmetric) | 2,240 µs | 2,251 µs | 4,059 µs |
| cholesky | 69 µs | 68 µs | 100 µs |
| solve_spd | 86 µs | 143 µs (torch.linalg.solve) | 21 µs (cho_solve) |
| qr | 863 µs | — | — |
| inv_sqrt_spd (eig-based) | 2,320 µs | — | — |
| inv_sqrt_spd_ns (Newton-Schulz) | 4,782 µs | — | — |
| cg (no precond) | 161 µs | — | — |
| gmres | 2,549 µs | — | — |
These laptop numbers are for illustration only. On the benchmark-target Linux CI hardware, scipy.linalg is generally faster than torch.linalg for small dense factorizations, and the trnsolver CPU path (a thin layer over torch.linalg) should sit between them. Reference values live in the results.json artifacts from CI runs.
GPU results — A10G / g5.xlarge (cuSOLVER via torch.linalg)¶
Run on AWS g5.xlarge (1× A10G, 24 GB, Ampere). Numbers include an explicit torch.cuda.synchronize() so the timer captures kernel execution, not async launch. Mean of 5+ warm rounds, µs:
| Op | n=64 | n=128 | n=256 | n=512 |
|---|---|---|---|---|
| cholesky | 95 | 110 | 166 | 277 |
| qr | 283 | 615 | 1,028 | 2,248 |
| solve_spd | 195 | 293 | 560 | 1,306 |
| eigh | 944 | 2,095 | 5,463 | 15,919 |
| inv_sqrt_spd (eig-based) | 1,108 | 2,298 | 6,191 | 17,709 |
| inv_sqrt_spd_ns (Newton-Schulz) | 2,979 | 3,005 | 2,973 | 3,492 |
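The timing method described above (synchronize, start the clock, run, synchronize again, stop the clock) can be sketched as follows. This is a minimal illustration and not the project's harness: bench_us and its parameters are hypothetical names, and the demo runs on CPU with a stand-in sync callback so that it works without a GPU.

```python
import time
from statistics import mean


def bench_us(fn, sync=lambda: None, warmup=2, rounds=5):
    """Return the mean wall time of fn() in microseconds over `rounds` warm runs.

    `sync` is called right before the clock starts and right before it stops,
    so asynchronously launched work is included in the measurement.
    """
    for _ in range(warmup):              # warm caches, allocators, lazy init
        fn()
    samples = []
    for _ in range(rounds):
        sync()                           # drain pending work before timing
        start = time.perf_counter()
        fn()
        sync()                           # wait for the work just launched
        samples.append((time.perf_counter() - start) * 1e6)
    return mean(samples)


# CPU-only demo: the stand-in sync callback just records that it was called.
calls = []
elapsed = bench_us(lambda: sum(range(10_000)), sync=lambda: calls.append(1))
print(f"mean: {elapsed:.1f} us over 5 rounds, sync called {len(calls)} times")
```

On a CUDA machine one would pass torch.cuda.synchronize as the sync argument and a closure such as lambda: torch.linalg.cholesky(a), with a already on the device. Without the second synchronize the timer would mostly measure kernel launch overhead.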
The headline result: on GPU, inv_sqrt_spd_ns beats the eigendecomposition-based inv_sqrt_spd by 2.1× at n=256 and 5.1× at n=512 (it is still slower at n=64 and n=128). On CPU the eigh-based path wins because LAPACK is heavily optimized for it, but on an accelerator the all-GEMM shape of Newton-Schulz wins. This supports the expectation that the NS path will pay off on Trainium once the trnblas NKI GEMM backend lands.
Reproduce: AWS_PROFILE=aws ./scripts/run_cuda_tests.sh g5.
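The speedup figures quoted for the A10G run can be recomputed from the two inv_sqrt_spd rows of the table. The short sketch below only redoes that arithmetic; the dictionary names are illustrative.

```python
# Mean times in microseconds, copied from the A10G table above.
eig_us = {64: 1108, 128: 2298, 256: 6191, 512: 17709}   # inv_sqrt_spd (eig-based)
ns_us = {64: 2979, 128: 3005, 256: 2973, 512: 3492}     # inv_sqrt_spd_ns

# Speedup of Newton-Schulz over the eig-based path; values below 1 mean NS is slower.
speedup = {n: eig_us[n] / ns_us[n] for n in eig_us}

for n, s in speedup.items():
    print(f"n={n}: {s:.2f}x")
# n=64: 0.37x
# n=128: 0.76x
# n=256: 2.08x
# n=512: 5.07x
```

The crossover sits between n=128 and n=256: the Newton-Schulz time is nearly flat across sizes, while the eig-based time grows steeply with n.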
Trainium results (trn1 / trn2)¶
Pending — classical Jacobi was architecturally mismatched to NKI (see #9 post-mortem). Phase 1 redesign is iterating against the new NKI 0.3.0 CPU simulator (Neuron SDK 2.29, April 2026) with full-sweep Jacobi / Householder-QR / block-Jacobi candidates; hardware numbers will land once the redesigned kernel validates in simulation.
Notes¶
- inv_sqrt_spd_ns is Newton-Schulz: all GEMM. On CPU it loses to the eigendecomposition path because eigh is heavily optimized in LAPACK/MKL. On GPU it wins at n ≥ 256 because the workload shape fits the Tensor Core pipeline. Trainium's Tensor Engine should show a similar win once trnblas GEMM validates on hardware.
- cuSOLVER is invoked indirectly via torch.linalg on a CUDA tensor. We don't link cuSOLVER directly.
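To make the "all GEMM" point concrete, here is a minimal sketch of the textbook coupled Newton-Schulz iteration for the inverse square root of a symmetric positive-definite matrix. It is not trnsolver's implementation: the function names, the Frobenius-norm scaling, and the fixed iteration count are assumptions made for illustration, and plain Python lists are used so the example runs without dependencies.

```python
def matmul(a, b):
    """Dense product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inv_sqrt_newton_schulz(a, iters=30):
    """Approximate A^{-1/2} for a symmetric positive-definite matrix A."""
    n = len(a)
    # Scale by the Frobenius norm so the eigenvalues of Y0 lie in (0, 1],
    # which keeps the iteration inside its convergence region.
    norm = sum(x * x for row in a for x in row) ** 0.5
    y = [[x / norm for x in row] for row in a]          # Y -> (A/norm)^{1/2}
    z = [[1.0 if i == j else 0.0 for j in range(n)]     # Z -> (A/norm)^{-1/2}
         for i in range(n)]
    for _ in range(iters):
        zy = matmul(z, y)
        # T = (3I - Z Y) / 2
        t = [[((3.0 if i == j else 0.0) - zy[i][j]) / 2.0 for j in range(n)]
             for i in range(n)]
        y = matmul(y, t)
        z = matmul(t, z)
    # Undo the scaling: A^{-1/2} = (A/norm)^{-1/2} / sqrt(norm)
    s = norm ** 0.5
    return [[x / s for x in row] for row in z]


a = [[4.0, 1.0], [1.0, 3.0]]
x = inv_sqrt_newton_schulz(a)
# If X approximates A^{-1/2}, then X A X should be close to the identity.
check = matmul(matmul(x, a), x)
print(check)
```

Each iteration is three matrix products plus an elementwise update, with no pivoting, reductions across rotations, or data-dependent branching. That is the property that lets the method map onto GEMM-oriented hardware, at the price of more floating-point work than a single eigh call.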