Benchmarks¶
Performance results for trnsparse.spmm across four backends on the same
numeric inputs:
- scipy —
scipy.sparse.csr_matrix @ B(compiled C reference) - torch.sparse —
torch.sparse_csr_tensor @ B - trnsparse (pytorch) — v0.1.3 vectorized fallback that lowers to
torch.sparse - trnsparse (nki) — v0.2.0 NKI dispatch: materialize CSR dense, run
tiled GEMM on the Tensor Engine via
@nki.jit, return dense
Hardware¶
- Instance:
trn1.2xlarge(single NeuronCore v2) - AMI: Deep Learning AMI Neuron PyTorch 2.9 · Ubuntu 24.04
- Neuron SDK: 2.24 ·
neuronxcc2.24
Results (SpMM, mean time in μs)¶
Columns are size M=K, density, N (RHS width). Lower is better.
| Size | Density | N | scipy | torch.sparse | trnsparse pytorch | trnsparse nki |
|---|---|---|---|---|---|---|
| 256 | 0.001 | 32 | 8.8 | 15.4 | 48.3 | 1397 |
| 256 | 0.001 | 128 | 7.6 | 29.1 | 44.7 | 1365 |
| 256 | 0.01 | 32 | 8.8 | 15.6 | 28.9 | 1248 |
| 256 | 0.01 | 128 | 20.4 | 27.5 | 41.7 | 1370 |
| 256 | 0.1 | 32 | 40.0 | 18.5 | 31.7 | 1278 |
| 256 | 0.1 | 128 | 137 | 34.2 | 48.6 | 1428 |
| 1024 | 0.001 | 32 | 14.4 | 28.9 | 43.4 | 1732 |
| 1024 | 0.001 | 128 | 39.8 | 27.2 | 47.0 | 2067 |
| 1024 | 0.01 | 32 | 72.4 | 31.8 | 48.1 | 1847 |
| 1024 | 0.01 | 128 | 257 | 46.5 | 72.5 | 2212 |
| 1024 | 0.1 | 32 | 609 | 75.0 | 95.5 | 2151 |
| 1024 | 0.1 | 128 | 2475 | 248 | 274 | 2479 |
What to read into this¶
v0.2.0 ships NKI for correctness, not speed. At every data point the NKI path is slower than both CPU backends. Two reasons, both structural:
- No sparsity exploitation yet. v0.2.0's NKI kernel materializes the CSR
into a dense
(M, K)tile before the matmul. At density 0.001 on a1024 × 1024matrix, this means 1000× more work than scipy does. Real sparse speedups come from row-bucketing + gather-matmul-scatter, which lands in v0.3.0 (#15). - Kernel-launch overhead dominates. The NKI times are roughly constant at ~1.3-2.5 ms across densities — the NEFF dispatch + HBM round-trip amortizes over large workloads, not these small ones. trn2 (v0.2.x Phase 5, #17) and NEFF-cache reuse (part of Phase 3) close this gap for steady-state workloads.
What v0.2.0 does deliver:
set_backend("nki")now actually runs SpMM on the Tensor Engine.- Forward + backward (via
torch.autograd.Function) both validated on hardware;torch.autograd.gradcheckpasses atatol=1e-4. - The full Neuron toolchain — compile, NEFF cache, XLA dispatch, PyTorch integration — is exercised end-to-end, which is what Phase 1 is for.
Reproducing¶
# CPU only (scipy + torch.sparse + trnsparse pytorch)
pytest benchmarks/bench_spmm.py --benchmark-only
# On trn1, add NKI column:
AWS_PROFILE=aws ./scripts/run_benchmarks.sh trn1
Results above were generated at commit
cee2e34 on
trn1.2xlarge.
Results (BSR SpMM, v0.3.0)¶
BSRMatrix stores a matrix as stacked (block_size, block_size) dense
blocks plus a block-level CSR pattern. At block_size=128 each nonzero
block is exactly a Tensor-Engine tile — no gather overhead.
Sizes are (m_blocks × 128, n_blocks × 128). block_density is the
fraction of (m_blocks, n_blocks) blocks actually stored. Lower is
better; dense is torch.matmul(A_dense, B) on the trn1.2xlarge CPU
as a ceiling.
| m_blocks | n_blocks | block_density | N | bsr PyTorch (ms) | bsr NKI (ms) | dense CPU (ms) |
|---|---|---|---|---|---|---|
| 4 | 4 | 0.10 | 128 | 0.11 | 1.85 | 0.12 |
| 4 | 4 | 0.25 | 128 | 0.18 | 1.87 | 0.12 |
| 4 | 4 | 0.50 | 128 | 0.43 | 2.05 | 0.13 |
| 4 | 8 | 0.50 | 256 | 1.01 | 3.00 | 0.49 |
| 8 | 4 | 0.50 | 256 | 1.05 | 3.11 | 0.47 |
Reading the BSR table¶
Three observations, all consistent with the architectural framing:
bsr PyTorchscales linearly with block density (0.11 → 0.43 ms as density goes 0.10 → 0.50, ~4× more work done). It's the straightforward "iterate stored blocks, doA_block @ B_slice" reference. This is what the NKI path has to beat.bsr NKIis roughly constant at 1.8-3.1 ms across configurations. Kernel dispatch + compilation + HBM round-trips dominate at these small sizes, not the matmul work itself. This mirrors the v0.2.0 CSR finding — Trainium rewards amortizing the kernel launch.- Dense CPU is the ceiling to beat — scipy / torch.sparse / BSR PyTorch are all within 2-4× of it on the trn1.2xlarge's Xeon. NKI needs workloads where the matmul work is large enough to swamp dispatch, or where the kernel launch can be amortized across many operations.
Where NKI wins — shipped v0.4.x¶
Not in raw "one-shot SpMM at N=128 vs CPU." The BSR table above measures kernel dispatch overhead. The architectural wins come from fusing multiple host passes into one dispatch:
screened_spmm(v0.4.0, #19) ✅ shipped: fuses Schwarz bound + threshold + mask + matmul into one NKI dispatch. On Fock-build-sized inputs: ~30–50% faster than the unfused 4-step path (schwarz_bounds → screen_quartets → from_dense → spmm) because it eliminates 3 HBM round-trips and the host-side CSR build.- Block-sparse attention (v0.4.2, #21)
✅ shipped:
bsr_spmmhandlesattn_weights @ Vfor local-window / dilated / global-token masks. At seq_len=8192, window=2: 4.9% block density means 95% ofnc_matmulcalls are skipped. Current example still materializes the full score matrix (follow-up: #25). - #22 — on-chip iterative solvers: CG/power-iteration with A SBUF-resident across all iterations (replaces #20). v0.3.2 ships the Python-loop plumbing; fused kernel gated on NKI scalar-carry capability.
See docs/architecture.md for why BSR is the
Trainium-native format.
BSR results from commit
411a3a7.