Iterative solvers over BSR¶
trnsparse v0.3.2 adds cg_bsr and power_iteration_bsr — Conjugate
Gradient and power iteration on block-sparse row matrices. The API is
stable; the architectural story below explains why there's a v0.4.0
follow-up.
Why this matters¶
Large SPD linear systems and dominant-eigenpair problems show up across scientific computing:
- Quantum chemistry: Hamiltonian eigenvalue problems (HF, DFT, CI), response equations (CPSCF).
- PDE discretizations: stiffness-matrix solves for finite element methods, graph Laplacian systems.
- Graph learning: spectral embeddings, PageRank-like iterations.
The matrix A in each case is typically block-sparse (Fock matrices
after Schwarz screening; FEM stiffness tied to mesh connectivity;
graph adjacency). BSR is the Trainium-native representation for those
matrices (see architecture.md).
v0.3.2 — plumbing¶
import trnsparse
A = trnsparse.BSRMatrix.from_dense(fock_matrix, block_size=128)
b = compute_rhs()
x, iters, rel = trnsparse.cg_bsr(A, b, tol=1e-6, max_iter=1000)
# Jacobi-preconditioned variant:
M = trnsparse.jacobi_preconditioner_bsr(A)
x, iters, rel = trnsparse.cg_bsr(A, b, tol=1e-6, M=M)
lam, v, iters = trnsparse.power_iteration_bsr(A, max_iter=500)
Under the hood, each CG iteration calls bsr_spmm(A, x.unsqueeze(1))
once. On the NKI backend that's one kernel dispatch + one HBM
round-trip per iteration. On CPU it's torch.sparse-backed and
roughly on par with scipy.sparse.linalg.cg (benchmarked: 369 μs vs
310 μs at 128×128 SPD, 1.19× slower).
v0.4.0 — fused kernel with SBUF-resident A¶
The architectural claim from
#22: Trainium's 32 GB
SBUF per NeuronCore fits a 5000×5000 BSR Hamiltonian on-chip. CG
doesn't need to round-trip A to HBM at all — only x, r, and p.
The shape of the v0.4.0 kernel:
@nki.jit
def _cg_spd_kernel(A_blocks, A_cols, A_row_ptrs, b, max_iter):
# Load A blocks once into SBUF at the top.
A_sbuf = nl.load(A_blocks)
# State: x, r, p in SBUF registers.
x = nl.zeros(...)
r = nl.copy(b)
p = nl.copy(r)
rr = nl.reduce(r * r)
for k in nl.affine_range(max_iter):
Ap = _bsr_matvec_sbuf(A_sbuf, A_cols, A_row_ptrs, p) # all SBUF
pAp = nl.reduce(p * Ap)
alpha = rr / pAp
x = x + alpha * p
r = r - alpha * Ap
rr_new = nl.reduce(r * r)
beta = rr_new / rr
p = r + beta * p
rr = rr_new
return x, residual_norm_history
Fixed max_iter, no early exit — returning the full residual history
lets the host pick the convergence point post-hoc (standard
NKI-inside-loop constraint: no dynamic control flow).
Expected regime where this wins: when
max_iter × dispatch_overhead > fused_kernel_cost + hbm_load_A_once.
For a 4000×4000 BSR Hamiltonian with ~100 iterations to convergence,
that's roughly an order of magnitude of wall-time reduction.
What needs to land first:
- Simulator-iterated kernel skeleton (now tractable thanks to the
nki-simulatorCI gate — see the NKI 0.3.0 migration in v0.3.1). - SBUF sizing model for
A— some workloads overflow 32 GB; then fall back to the v0.3.2 plumbing. - Dispatcher logic in
cg_bsrthat picks the fused kernel whenmax_iter * nexceeds a threshold.
Tracked in a dedicated sub-issue on trnsci/trnsparse.
Why CG and power iteration, not GMRES / Lanczos / Davidson¶
v0.3.2 covers the two most common algorithm families:
- CG for SPD systems — the workhorse for Hamiltonian solves, stiffness systems, and many PDE discretizations.
- Power iteration for dominant eigenpairs — the starting point for spectral methods, PageRank-like fixed points, and the iteration core inside Lanczos / Arnoldi / Davidson.
GMRES (general non-symmetric systems), Lanczos / Arnoldi (full spectrum), and Davidson (interior eigenvalues) are follow-ups when users ask for them. The v0.4.0 fused kernel's structure (load A once, iterate on-chip) generalizes trivially to those variants — it's the same architectural pattern.