Benchmarks

Performance results for trntensor — einsum dispatch, contraction planning, and CP / Tucker decompositions — comparing the PyTorch fallback and the NKI Trainium path on the same machine.

Reproducing

```bash
# Local CPU
pytest benchmarks/bench_einsum.py --benchmark-only

# On-hardware (via SSM orchestration)
AWS_PROFILE=aws ./scripts/run_benchmarks.sh trn1

# Force the PyTorch path on hardware (apples-to-apples baseline)
TRNTENSOR_FORCE_BACKEND=pytorch pytest benchmarks/ --benchmark-only
```

Results (trn1.2xlarge, neuronxcc 2.24)

Both columns ran on the same trn1.2xlarge instance — CPU (Intel Xeon 8375C) vs NKI (1 NeuronCore).

| Op | Shape | FLOPs | PyTorch (trn1 CPU) | NKI (trn1) | NKI / CPU (time) |
| --- | --- | --- | --- | --- | --- |
| einsum ap,bp->ab | 48×128 × 48×128 | 295 K | 19.6 µs | 1047 µs | 53.4× |
| einsum mi,mnP->inP (4-index) | 32×8, 32×32×64 | 524 K | 35.4 µs | 35.1 µs | 0.99× |
| einsum ij,jk->ik | 512³ | 134 M | 481 µs | 1452 µs | 3.0× |
| tucker_decompose | 16³, ranks (4,4,4) | n/a | 875 µs | 859 µs | 0.98× |
| einsum bij,bjk->bik | 16×256³ | 268 M | 953 µs | 2162 µs | 2.3× |
| einsum ij,jk->ik | 1024³ | 1.07 G | 3402 µs | 4022 µs | 1.2× |
| cp_decompose | 16³, rank 8 (20 iters) | n/a | 21.3 ms | 21.9 ms | 1.03× |
| einsum ij,jk->ik | 2048³ | 8.6 G | 27.4 ms | 16.9 ms | 0.62× |
| einsum bij,bjk->bik | 32×1024³ | 34.4 G | 126.3 ms | 190.8 ms | 1.5× |

NKI / CPU is the ratio of NKI time to CPU time, so values below 1.0 mean NKI is faster.

NKI wins: 2048×2048 matmul (1.6× faster than CPU). All other sizes still favor CPU on this hardware.
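The FLOPs column counts one multiply per element of the contraction's full index space, i.e. the product of the extents of every distinct index. A minimal sketch of that convention (the helper name is illustrative, not part of trntensor), which reproduces the table's numbers:

```python
from functools import reduce
from operator import mul

def contraction_flops(extents):
    """Multiplies for a dense einsum contraction: the product of the
    extents of every distinct index (output indices + contracted ones)."""
    return reduce(mul, extents, 1)

# Entries from the table above:
contraction_flops([48, 48, 128])           # ap,bp->ab            -> 294_912 (~295 K)
contraction_flops([32, 8, 32, 64])         # mi,mnP->inP          -> 524_288 (~524 K)
contraction_flops([2048, 2048, 2048])      # ij,jk->ik, 2048³     -> ~8.6 G
contraction_flops([32, 1024, 1024, 1024])  # bij,bjk->bik, 32×1024³ -> ~34.4 G
```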

DF-MP2 energy: fused kernel vs Python loop

Fused trntensor.mp2_energy (single NKI program) vs the reference loop from examples/df_mp2_einsum.py (25–256 einsum calls), both on trn1.2xlarge:

| Workload | Python loop | Fused NKI | Winner |
| --- | --- | --- | --- |
| (nocc, nvir, naux) = (5, 19, 72) | 1.5 ms | 15.6 ms | Loop (10×) |
| (nocc, nvir, naux) = (16, 128, 128) | 25.5 ms | 41.3 ms | Loop (1.6×) |

The fused kernel is architecturally what we want — one program, no intermediate HBM materialization of the four-index T tensor, PSUM/SBUF-resident across contract → elementwise → reduce. But NKI's per-call dispatch + compile overhead (~15–40 ms floor) still eats the win at current sizes. The gap closes with scale (10× → 1.6× as workload grows), consistent with the overhead being a fixed cost the compute amortizes. See #33 and #34.
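To make the overhead story concrete, the reference loop's structure (a schematic with assumed shapes, not the exact code in examples/df_mp2_einsum.py) is a per-pair contraction: nocc² small einsum calls, each individually far below the NKI dispatch floor.

```python
import numpy as np

def df_mp2_loop(Lov, e_occ, e_vir):
    """Schematic DF-MP2 pair-energy loop.

    Lov: (naux, nocc, nvir) three-center integrals (hypothetical layout).
    Each (i, j) pair issues one small einsum dispatch, nocc**2 in total,
    which is exactly the per-call-overhead-dominated regime.
    """
    nocc = Lov.shape[1]
    energy = 0.0
    for i in range(nocc):
        for j in range(nocc):
            # (ia|jb) for this pair: contract over the auxiliary index P.
            g = np.einsum("Pa,Pb->ab", Lov[:, i, :], Lov[:, j, :])
            # MP2 amplitude denominator e_i + e_j - e_a - e_b.
            denom = e_occ[i] + e_occ[j] - e_vir[:, None] - e_vir[None, :]
            t = g / denom
            # Pair energy: sum_ab t_ab * (2 (ia|jb) - (ib|ja)).
            energy += np.einsum("ab,ab->", t, 2.0 * g - g.T)
    return energy
```

The fused kernel performs the same contract, elementwise divide, and reduce, but in a single program with the pair intermediates kept on-chip.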

Size-based dispatch threshold

Because per-call NKI dispatch currently carries ~1 ms of XLA launch overhead, nki_matmul and nki_batched_matmul short-circuit to the PyTorch path when the contraction is below TRNTENSOR_MIN_NKI_FLOPS (default 2 GFLOPs, calibrated at ≈ half the smallest NKI-winning size). The plan.backend field reflects this: it reports "nki" only when the dispatch will actually invoke a kernel, and "pytorch" otherwise.
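The dispatch rule can be sketched as follows. This is an illustrative model, not trntensor's internals: the function names and the multiplies-only FLOP convention (matching the table's 2048³ ≈ 8.6 G) are assumptions; the environment variables are the documented ones.

```python
import os

DEFAULT_MIN_NKI_FLOPS = 2_000_000_000  # 2 GFLOPs default threshold

def matmul_flops(m, k, n):
    # One multiply per inner-product term, matching the FLOPs column
    # in the results table (no factor of 2 for the adds).
    return m * k * n

def choose_backend(m, k, n):
    """Pick the backend a hypothetical m×k @ k×n dispatch would take."""
    if os.environ.get("TRNTENSOR_FORCE_BACKEND") == "pytorch":
        return "pytorch"
    threshold = float(os.environ.get("TRNTENSOR_MIN_NKI_FLOPS",
                                     DEFAULT_MIN_NKI_FLOPS))
    return "nki" if matmul_flops(m, k, n) >= threshold else "pytorch"
```

With the default threshold, a 2048³ matmul (8.6 GFLOPs) goes to NKI while 1024³ (1.07 GFLOPs) stays on PyTorch, which matches the crossover in the results table.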

Overrides:

  • TRNTENSOR_MIN_NKI_FLOPS=0 — always attempt NKI (useful for kernel validation).
  • TRNTENSOR_FORCE_BACKEND=pytorch — always use the PyTorch path (useful for CPU baselines).
  • trntensor.set_backend("pytorch") — same, via API.

Interpretation

The kernels themselves are correct (validated by 9 hardware tests in tests/test_nki_kernels.py) and the 2048×2048 case confirms NKI can beat CPU at large enough workloads. The gap at smaller sizes is dominated by per-call overhead — not by the Tensor Engine itself — and is tracked as follow-up work in #33.

Recommendation for users:

  • Call trntensor normally. The dispatch layer makes the right choice for typical sizes.
  • For GEMM-dominant workloads with single contraction sizes ≥ 2 GFLOPs, NKI will be invoked automatically.
  • Until #33 lands, tight loops of small contractions (DF-MP2 pair-energy style) see no NKI benefit and are served by the PyTorch path.