trnblas: four hypotheses, one profiler trace, and why 1.48× is the correct answer¶
The previous trnblas post shipped with a 1.48× speedup on the fused MP2 energy reduction and an open admission: the kernel underperforms the 3× RFC target, four hypotheses exist for why, and none had been tested against hardware. One Neuron Profiler 2.0 trace later, the answer is in — Vector Engine at 96.45% active, Tensor Engine at 0.000002%, HBM reads matching the analytical prediction to the byte. The 1.48× ceiling is an exact Amdahl prediction, and the kernel is near-optimal on its own step.
The problem¶
The fused energy kernel shipped at 1.48× vs. the torch reduction path, with four open hypotheses for the remaining gap to the 3× design target:
| Hypothesis | Tested in | Result |
|---|---|---|
| Denominator hoisting out of (i, j) loop | #32 | 1.48× → 1.50×, within noise |
| Cross-pair SBUF batching (K pairs, one HBM store) | #36 | 1.48× → 1.49×, within noise |
| Per-pair HBM-store fence overhead | folded into #36 | no change |
| Dispatch overhead above NEFF cache | direct measurement | NEFF cache already amortizes |
All four targeted the energy kernel's own step. None moved the number by more than 0.02×. The profiler investigation (#33) was queued to look at what the hardware actually does during a call.
What the architecture suggests¶
Trainium's four programmable engines — Tensor, Vector, Scalar, GpSimd — are distinct physical execution units. The profiler reports each independently. That separation is what makes Neuron Profiler output architecturally informative rather than just numerically informative: a 96.45% VE / 0.000002% TE split is a hardware-level readout, not a heuristic estimate.
The MP2 energy expression is:
E_partial = T * (2T − Tᵀ) / Δ, summed over virtual orbital pairs (a, b)
Every operation in this chain — multiply, subtract, reciprocal, free-dim sum — is element-wise arithmetic. There is no matrix product. Element-wise ops on Trainium run on the Vector Engine. Systolic matrix multiplication runs on the Tensor Engine. A kernel whose expression is entirely element-wise should show near-zero Tensor Engine utilization; 0.000002% is the confirmation.
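As a reference for what the kernel computes, the expression above can be written out for a single occupied pair (i, j). This is a minimal pure-Python sketch, not the NKI kernel. The function name `mp2_pair_energy` is hypothetical, and the sketch assumes the conventional MP2 denominator Δ = ε_i + ε_j − ε_a − ε_b, which the post does not spell out.

```python
def mp2_pair_energy(T, eps_i, eps_j, eps_vir):
    """Sum T*(2T - T^T)/delta over virtual pairs (a, b) for one (i, j) pair.

    T is an nvir x nvir nested list; eps_vir has length nvir.
    Assumption: delta = eps_i + eps_j - eps_a - eps_b (standard MP2 convention).
    """
    nvir = len(eps_vir)
    total = 0.0
    for a in range(nvir):
        for b in range(nvir):
            delta = eps_i + eps_j - eps_vir[a] - eps_vir[b]
            # Multiply, subtract, divide, accumulate: all element-wise,
            # with no matrix product anywhere in the chain.
            total += T[a][b] * (2.0 * T[a][b] - T[b][a]) / delta
    return total


# Toy input using the same orbital energies as the warmup script
# (occupied = -0.5, virtual = 0.5), so delta = -2 for every (a, b).
T = [[1.0, 2.0], [3.0, 4.0]]
energy = mp2_pair_energy(T, -0.5, -0.5, [0.5, 0.5])
print(energy)  # -15.5
```

The only cross-element dependency is the transposed read `T[b][a]`; everything else touches one element at a time, which is the pattern that maps to the Vector Engine rather than the Tensor Engine.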
This is also why the four prior hypotheses were always targeting the wrong thing. The energy kernel's own step takes 0.21 s of the fused path's 5.43 s total. The GEMM that precedes it (T_flat = B_chunk @ B_flat.T, running on the Tensor Engine) takes ~5.2 s. No amount of Vector Engine tuning in the energy kernel can move the total by much when that kernel accounts for about 4% of it.
The architecture also points directly at the Phase 3 solution. PSUM is the Tensor Engine's 32-bit accumulation buffer, on-chip. After a GEMM tile completes, its result sits in PSUM before being stored to HBM. A kernel that fuses GEMM and energy reduction can read that PSUM output directly with Vector Engine instructions — T_flat never touches HBM. The current two-kernel split (GEMM → HBM → energy kernel) materializes 6.58 GB of T_flat to HBM per energy step and then reads it back; Phase 3 eliminates that round-trip.
flowchart LR
subgraph Current["Current (two dispatches)"]
direction TB
G["TE: B_chunk @ B_flat.T\n~5.2 s"]
TF[/"HBM: T_flat\n16.6 GB at medium shape"/]
EK["VE: T·(2T−Tᵀ)/Δ\n~0.21 s"]
G -->|"6.58 GB write"| TF
TF -->|"6.58 GB read"| EK
end
subgraph Phase3["Phase 3 (one NEFF)"]
direction TB
G3["TE: GEMM tile"]
PS[/"PSUM (on-chip)"/]
EK3["VE: T·(2T−Tᵀ)/Δ, sum"]
G3 -->|"no HBM"| PS
PS --> EK3
end
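The Phase 3 data flow in the diagram above can be modelled conceptually in a few lines. This is a pure-Python sketch under stated assumptions, not NKI code: the function name `fused_pair_energy` is hypothetical, the local variable `tile` merely stands in for PSUM, and the denominator is assumed to follow the standard MP2 convention Δ = ε_i + ε_j − ε_a − ε_b.

```python
def fused_pair_energy(B_i, B_j, eps_i, eps_j, eps_vir):
    """Compute one pair's T tile, reduce it to a scalar, and discard the tile.

    B_i and B_j are nvir x naux nested lists. The local `tile` plays the
    role of PSUM here: it exists only for the duration of this call.
    """
    nvir = len(eps_vir)
    # "GEMM" step: tile[a][b] = sum_k B_i[a][k] * B_j[b][k], i.e. B_i @ B_j.T
    tile = [[sum(x * y for x, y in zip(B_i[a], B_j[b])) for b in range(nvir)]
            for a in range(nvir)]
    # Reduction step, applied before the tile is ever stored anywhere else.
    total = 0.0
    for a in range(nvir):
        for b in range(nvir):
            delta = eps_i + eps_j - eps_vir[a] - eps_vir[b]
            total += tile[a][b] * (2.0 * tile[a][b] - tile[b][a]) / delta
    return total  # only this scalar leaves the function


# Toy sizes: nvir = 2, naux = 2.
B = [[1.0, 0.0], [1.0, 1.0]]
partial = fused_pair_energy(B, B, -0.5, -0.5, [0.5, 0.5])
print(partial)  # -3.5
```

The point of the sketch is the lifetime of `tile`: in the current two-dispatch design its equivalent is written to HBM and read back, whereas here one scalar per pair is the only output.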
The approach¶
Neuron Profiler 2.0, available in Neuron SDK 2.29, replaced the old inspect/show-session workflow with two commands:
neuron-profile capture -n <model.neff> -s profile.ntff
neuron-profile view -n <model.neff> -s profile.ntff --output-format summary-text
capture executes the NEFF and records a hardware trace. --io-from=neff (the default) allocates IO tensors from the NEFF's own declared shapes — no input .npy files needed. view --output-format summary-text returns per-engine utilization and instruction counts to stdout, with no InfluxDB, no browser, and no local viewer required.
NEFF isolation is clean given NKI's compile-once cache model: clear /var/tmp/neuron-compile-cache, run a Python warmup that calls only nki_mp2_energy, and the single resulting model.neff is unambiguously the energy kernel. NEFF size was 17 MB; captured trace was 191 MB.
Implementation¶
# mp2_warmup.py — compiles _mp2_energy_kernel to an isolated NEFF
import sys
sys.path.insert(0, '/home/ubuntu/trnblas')
import torch, trnblas
trnblas.set_backend('nki')
ic, nocc, nvir = 64, 64, 448 # medium bench shape, single-chunk all-pairs form
T_flat = torch.zeros(ic * nvir, nocc * nvir)
eps_occ_chunk = torch.full((ic,), -0.5)
eps_occ_full = torch.full((nocc,), -0.5)
eps_vir = torch.full((nvir,), 0.5)
result = trnblas.nki.nki_mp2_energy(T_flat, eps_occ_chunk, eps_occ_full, eps_vir)
# Isolate, capture, extract
rm -rf /var/tmp/neuron-compile-cache/*
python mp2_warmup.py
NEFF=$(find /var/tmp/neuron-compile-cache -name model.neff | head -1)
sudo -u ubuntu HOME=/home/ubuntu neuron-profile capture -n "$NEFF" -s profile.ntff
sudo -u ubuntu HOME=/home/ubuntu neuron-profile view -n "$NEFF" -s profile.ntff \
--output-format summary-json
The full orchestration (SSM session management, polling, base64 encoding) is in scripts/run_neuron_profile.sh. The profiled shape (ic=nocc=64, nvir=448) matches the medium bench's all-pairs form; engine utilization ratios should carry over to other chunk sizes, since the kernel structure is identical regardless of i_block.
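The shell snippet above takes the first `model.neff` that `find` returns, so it relies on the cache holding exactly one. A small guard can make that isolation assumption explicit. This is a sketch only: the helper name `find_isolated_neff` is hypothetical and is not part of trnblas or the Neuron tooling.

```python
from pathlib import Path


def find_isolated_neff(cache_dir="/var/tmp/neuron-compile-cache"):
    """Return the only model.neff under cache_dir, or raise if isolation failed."""
    neffs = sorted(Path(cache_dir).rglob("model.neff"))
    if len(neffs) != 1:
        raise RuntimeError(
            f"expected exactly one model.neff under {cache_dir}, found {len(neffs)}"
        )
    return neffs[0]
```

Calling `find_isolated_neff()` after the warmup and before the capture fails loudly if the cache was not cleared or if the warmup compiled more than one kernel.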
What didn't work¶
Neuron Profiler API breaking change without a migration note. The first attempt, on April 14, used neuron-profile inspect → show-session, the workflow in the 2.28 documentation. In 2.29, show-session rejects the NTFF format that inspect itself writes (error: rejected: unsupported NTFF version 130), and view --disable-ui --ingest-only requires InfluxDB, which the 2.29 DLAMI doesn't pre-install. Neither the 2.29 release notes nor the DLAMI's bundled documentation marks this as a breaking change. Discovery required a probe SSM command (neuron-profile --help, neuron-profile capture --help, neuron-profile view --help | grep output-format) to find the new API surface. Concrete ask for the Neuron team: the show-session rejection message should name the replacement. "Try neuron-profile capture + view --output-format summary-text instead" in the error text would have saved a failed attempt and a day's gap.
SSM heredocs. When bash reads a script from stdin (the SSM pattern printf '%s' $B64 | base64 -d | bash), heredocs inside that script also read from stdin and consume the remaining script body as their content, breaking silently at the first << 'EOF' boundary. Workaround: double base64 — the Python warmup is a second base64-encoded string inside the bash body, decoded on the instance with printf '%s' $PY_B64 | base64 -d > /tmp/mp2_warmup.py. No heredocs in the transmitted script. This pattern is documented in the script header for future reference.
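The double-base64 workaround described above can be sketched on the sending side as follows. The function name `build_ssm_command`, the `{py_b64}` placeholder, and the toy inputs are all hypothetical; the actual script in the repository may be organized differently.

```python
import base64


def build_ssm_command(bash_body_template, python_source):
    """Wrap a Python file and a bash script into one heredoc-free command.

    bash_body_template must contain the placeholder {py_b64}. The Python
    source is base64-encoded and substituted there, then the whole bash
    body is base64-encoded for transmission.
    """
    py_b64 = base64.b64encode(python_source.encode()).decode()
    bash_body = bash_body_template.replace("{py_b64}", py_b64)
    # Conservative check: rejects any "<<", including here-strings.
    if "<<" in bash_body:
        raise ValueError("bash body must not contain heredocs")
    body_b64 = base64.b64encode(bash_body.encode()).decode()
    return f"printf '%s' {body_b64} | base64 -d | bash"


# Toy inputs for illustration only.
warmup_py = "print('warmup')\n"
bash_template = (
    "printf '%s' {py_b64} | base64 -d > /tmp/mp2_warmup.py\n"
    "python /tmp/mp2_warmup.py\n"
)
command = build_ssm_command(bash_template, warmup_py)
print(command)
```

The standard base64 alphabet contains no whitespace or shell metacharacters, which is why both encoded strings can be passed to `printf` without extra quoting.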
$HOME not set for neuron-profile. SSM sessions run as root without $HOME. Both capture and view read $HOME for a config directory and fail with $HOME is not defined when it's absent. The fix is a short prefix, sudo -u ubuntu HOME=/home/ubuntu neuron-profile ..., but the error message does not mention that SSM's root session is why $HOME is missing. Worth knowing before the next profile run.
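If the profile run is driven from Python rather than bash, the same prefix can be built into the argument lists. This is a sketch under assumptions: `profile_commands` is a hypothetical helper, and it uses only the `capture` and `view` flags already shown in this post.

```python
def profile_commands(neff, ntff="profile.ntff", user="ubuntu", home="/home/ubuntu"):
    """Build argv lists for capture and view, run as `user` with HOME set."""
    prefix = ["sudo", "-u", user, f"HOME={home}", "neuron-profile"]
    capture = prefix + ["capture", "-n", neff, "-s", ntff]
    view = prefix + ["view", "-n", neff, "-s", ntff,
                     "--output-format", "summary-text"]
    return capture, view


capture_cmd, view_cmd = profile_commands("model.neff")
print(" ".join(capture_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on the instance; keeping HOME in the prefix means the commands do not depend on the calling environment.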
Four falsified hypotheses, all at the wrong level. In retrospect the Amdahl math makes this obvious: moving the energy kernel's own step from 0.21 s to zero would change the 5.43 s fused total to 5.22 s, a 1.04× improvement on top of the current 1.48×. All four hypotheses were targeting those 0.21 s. The GEMM at 5.2 s was a floor none of them touched. The hypothesis list needed a profiler data point before the next round of kernel tuning; skipping that step cost four PR iterations on the wrong bottleneck.
Numbers¶
Per-engine profile, trn1.2xlarge, Neuron runtime 2.31.24, compiler 2.24.5133:
| Engine | Active % | Wall time | Instructions |
|---|---|---|---|
| Vector | 96.45% | 206 ms | 403,039 |
| DMA | 26.42% | 57 ms | 18,935 transfers |
| Scalar | 3.67% | 8 ms | 16,407 |
| GpSimd | 0.010% | 22 µs | 42,928 |
| Tensor | 0.000002% | 0.48 µs | 21 |
Total kernel wall time: 214 ms. The 21 Tensor Engine instructions are XLA graph setup overhead, not kernel body.
HBM bandwidth:
| Metric | Measured | Analytical prediction |
|---|---|---|
| HBM reads | 6.58 GB | 6.579 GB ✓ |
| HBM writes | 1.75 MB | — |
| Effective read bandwidth | 30.8 GB/s | — |
Analytical prediction: IC × NOCC × NSTRIP × 2 × P_TILE × NVIR × 4 bytes = 64 × 64 × 4 × 2 × 112 × 448 × 4 = 6,578,757,632 bytes. The fusion is working as designed — intermediates are SBUF-resident. The previous 33 GB napkin estimate was for the unfused torch path, where T, Tᵀ, Δ, and the product each round-trip through HBM separately.
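For readers who want to re-run the arithmetic, the snippet below evaluates the product of the factors exactly as they are listed in the formula. The variable names mirror the post's symbols; `FACTOR_2` is just the literal 2 from the formula, and no interpretation of it is assumed here.

```python
# Factors as listed in the analytical prediction.
IC, NOCC, NSTRIP, P_TILE, NVIR = 64, 64, 4, 112, 448
FACTOR_2 = 2          # literal factor from the formula
BYTES_PER_FP32 = 4

predicted_bytes = IC * NOCC * NSTRIP * FACTOR_2 * P_TILE * NVIR * BYTES_PER_FP32
predicted_gb = predicted_bytes / 1e9

# The strips tile the virtual dimension exactly: NSTRIP * P_TILE == NVIR.
assert NSTRIP * P_TILE == NVIR

print(f"predicted HBM reads: {predicted_gb:.2f} GB")
```

At the two-decimal precision the table uses for the measured value, the product comes out at 6.58 GB; printing `predicted_bytes` gives the integer for a direct comparison with the byte count quoted above.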
Amdahl decomposition, medium bench (trn1.2xlarge, warm cache):
| Step | Torch path | Fused path |
|---|---|---|
| GEMM (T_flat = B_chunk @ B_flat.T) | 5.2 s | 5.2 s |
| Energy reduction | 2.83 s | 0.21 s |
| Energy step total | 8.03 s | 5.43 s |
The energy kernel achieves ~13.5× on the reduction step alone (2.83 s → 0.21 s). With f = 0.35 (2.83 s / 8.03 s, the fraction of the torch path that is reduction) and s = 13.5:
speedup = 1 / ((1 − f) + f/s) = 1 / (0.65 + 0.026) ≈ 1.48×
Exact match to the measured result.
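The decomposition can be reproduced directly from the table's timings. This is a minimal sketch; the function and variable names are illustrative, and the inputs are the rounded values from the table above.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the runtime is accelerated by s."""
    return 1.0 / ((1.0 - f) + f / s)


gemm_s = 5.2               # unchanged by the fused energy kernel
torch_reduction_s = 2.83
fused_reduction_s = 0.21

f = torch_reduction_s / (gemm_s + torch_reduction_s)   # fraction that is reduction
s = torch_reduction_s / fused_reduction_s              # speedup on that fraction
overall = amdahl_speedup(f, s)
ceiling = 1.0 / (1.0 - f)  # limit as s grows without bound

print(f"f = {f:.2f}, s = {s:.1f}")
print(f"overall = {overall:.2f}x, ceiling = {ceiling:.2f}x")
```

With these inputs the overall figure rounds to 1.48× and the ceiling to about 1.54×, so even an infinitely fast energy kernel would stay well short of the 3× target while the GEMM time is unchanged.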
What's next¶
- Phase 3 RFC: fused GEMM+energy. One NEFF: Tensor Engine computes B_i @ B_j.T, PSUM output flows directly to Vector Engine instructions for the T*(2T−Tᵀ)/Δ chain, scalar partial writes to HBM once per pair. Eliminates the 6.58 GB HBM round-trip. The profiler data now firmly motivates the design.
- #26: GEMM tile-shape autotuner. Measured-best tile per shape replaces the current fixed (128, 128, 512). Independent of Phase 3.
- Perfetto artifact. The .pftrace is at /home/ubuntu/profiles/run-1776296734/ on the CI instance. Opening it in ui.perfetto.dev gives an instruction-level timeline, relevant if the question of cross-pair pipeline overlap becomes load-bearing in Phase 3 design.
Phase tracker: trnsci ROADMAP. The #33 findings doc has the raw JSON profile data and the full analytical derivation.
Takeaway¶
The _mp2_energy_kernel is near-optimal as written: Vector Engine at 96.45% active, HBM reads matching the analytical prediction to the byte, ~13× speedup on the step it owns. The 1.48× overall speedup is not a failure of the kernel — it is an exact Amdahl prediction given that the GEMM accounts for 96% of the fused path's wall time. Four successive tuning hypotheses, all targeting the kernel's own 0.21 s step, could not have moved the 5.43 s total meaningfully; one profiler run established this in an afternoon. The path to 3× was already the right design — Phase 3's fused GEMM+energy kernel — and the profiler data now makes the motivation concrete rather than architectural speculation.