The dev loop just got a lot shorter
Until this week, working on an NKI kernel inside any trnsci library looked roughly like this: edit the kernel, push to a branch, wait for the GitHub Actions runner to start a trn1.2xlarge instance over SSM, wait for user-data to finish, wait for the NEFF compile, finally run pytest -m neuron, read the result, stop the instance. Eight to twelve minutes per iteration in the best case. Longer when anything went sideways.
AWS's Neuron SDK 2.29 shipped a month ago with NKI 0.3.0 Stable. The change that matters most isn't in the SDK itself. It's that the full stack now installs cleanly on ubuntu-latest GitHub runners via AWS's pip index, and the CPU simulator exposed by nki.simulate(kernel)(numpy_args) runs the same kernels without a device. The dev loop dropped from minutes to seconds, and the cost-of-iteration barrier for contributing to a trnsci library just collapsed.
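To make the new loop concrete, here is a minimal sketch of a simulator-backed correctness check. It relies only on the call shape quoted above, simulate(kernel)(numpy_args). The helper check_kernel, the stand-in fake_simulate, and the placeholder add_kernel are all hypothetical names invented for illustration. The sketch does not import the Neuron SDK, so it runs anywhere NumPy does. With the SDK installed you would presumably pass nki.simulate and a real NKI kernel instead, but that wiring is an assumption, not something verified here.

```python
import numpy as np


def check_kernel(simulate, kernel, reference, *args, rtol=1e-5, atol=1e-6):
    """Run `kernel` through a simulator and compare it to a NumPy reference.

    `simulate` is assumed to follow the call shape quoted in the post:
    simulate(kernel)(*numpy_args) returns a NumPy result.
    """
    got = simulate(kernel)(*args)
    want = reference(*args)
    # Raises AssertionError with a readable diff if the outputs disagree.
    np.testing.assert_allclose(got, want, rtol=rtol, atol=atol)
    return got


def fake_simulate(kernel):
    """Hypothetical stand-in for the simulator: returns the kernel unchanged."""
    return kernel


def add_kernel(a, b):
    """Placeholder for a real NKI kernel; plain NumPy so the sketch runs."""
    return a + b


a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.ones((2, 3), dtype=np.float32)
out = check_kernel(fake_simulate, add_kernel, np.add, a, b)
```

Passing the simulator in as an argument keeps the check itself independent of the SDK, so the same test body could run against the CPU simulator in CI and against hardware wherever a device is available.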