AWS Setup for Neuron Tests
To run pytest -m neuron against real Trainium hardware, we use a local workflow:
- Provision a Trainium EC2 instance with Terraform (stays stopped when not testing)
- Run the test script locally from your machine, using AWS_PROFILE=aws
- The script starts the instance, runs pytest via SSM, prints the output, and stops the instance
GitHub Actions does not touch AWS. All AWS interaction is human-initiated.
One-time setup
1. Provision the CI instance
Two separate Terraform roots, one per hardware family:
| Hardware | Terraform root | Default region | Instance |
|---|---|---|---|
| Trainium1 | infra/terraform/ | us-east-1 | trn1.2xlarge |
| Trainium2 | infra/terraform-trn2/ | sa-east-1 | trn2.3xlarge |
trn2 availability (as of 2026-04-16):
| Instance type | Region | AZs |
|---|---|---|
| trn2.xlarge | — | not yet offered |
| trn2.3xlarge | sa-east-1 | a, b, c |
| trn2.48xlarge | us-east-2 | a, b, c |
Trainium1 (trn1) — us-east-1:
cd infra/terraform
AWS_PROFILE=aws terraform init
AWS_PROFILE=aws terraform apply \
-var="vpc_id=vpc-xxxxxx" \
-var="subnet_id=subnet-xxxxxx"
Trainium2 (trn2) — sa-east-1:
cd infra/terraform-trn2
AWS_PROFILE=aws terraform init
AWS_PROFILE=aws terraform apply
The trn2 root is self-contained — it creates its own VPC, public subnet, internet gateway,
and route table in sa-east-1. No vpc_id or subnet_id variables required.
The default AZ is sa-east-1a. If apply fails with InsufficientInstanceCapacity,
retry with a different AZ:
AWS_PROFILE=aws terraform apply -var="az_suffix=b" # or az_suffix=c
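The manual retry above could also be scripted. The following is a minimal sketch, not part of the repository: the function name `apply_with_az_fallback` and the `TERRAFORM` override variable are hypothetical, and `-auto-approve` is added so the loop does not stop at a confirmation prompt. Note that it retries on any apply failure, not only on InsufficientInstanceCapacity, so check the error output before relying on it.

```shell
# Hypothetical helper: try each AZ suffix in turn until `terraform apply` succeeds.
# TERRAFORM is an assumed override hook (defaults to the real terraform binary)
# so the loop can be exercised without touching real infrastructure.
apply_with_az_fallback() {
    for suffix in a b c; do
        echo "trying az_suffix=$suffix" >&2
        # Terraform's own output goes to stderr so stdout carries only the result.
        if "${TERRAFORM:-terraform}" apply -auto-approve -var="az_suffix=$suffix" >&2; then
            echo "$suffix"   # print the suffix that worked
            return 0
        fi
    done
    echo "apply failed for every AZ suffix" >&2
    return 1
}
```

Run from infra/terraform-trn2 as `AWS_PROFILE=aws apply_with_az_fallback`; the suffix that succeeded is printed on stdout.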
User-data takes ~5 minutes to install the Neuron SDK and clone trnblas.
Stop the instance once ready:
# trn1
cd infra/terraform
AWS_PROFILE=aws aws ec2 stop-instances \
--instance-ids $(AWS_PROFILE=aws terraform output -raw instance_id) \
--region us-east-1
# trn2
cd infra/terraform-trn2
AWS_PROFILE=aws aws ec2 stop-instances \
--instance-ids $(AWS_PROFILE=aws terraform output -raw instance_id) \
--region $(AWS_PROFILE=aws terraform output -raw aws_region)
Running neuron tests
# trn1 (us-east-1, default)
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh
# trn2 (sa-east-1, auto-detected from instance type)
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh trn2
# Override region explicitly if needed
AWS_REGION=us-east-2 AWS_PROFILE=aws ./scripts/run_neuron_tests.sh trn2
The script resolves the default region from the instance type argument:
trn2* → sa-east-1, everything else → us-east-1. AWS_REGION overrides.
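The resolution rule just described can be written as a small shell function. This is an illustrative sketch only; the function name `resolve_region` is an assumption, and the actual logic in run_neuron_tests.sh may be structured differently.

```shell
# Hypothetical sketch of the region rule: AWS_REGION wins if set,
# otherwise trn2* maps to sa-east-1 and everything else to us-east-1.
resolve_region() {
    if [ -n "${AWS_REGION:-}" ]; then
        printf '%s\n' "$AWS_REGION"
        return 0
    fi
    case "${1:-trn1}" in
        trn2*) printf '%s\n' "sa-east-1" ;;
        *)     printf '%s\n' "us-east-1" ;;
    esac
}
```

With AWS_REGION unset, `resolve_region trn2` prints sa-east-1 and `resolve_region` with no argument prints us-east-1.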
The script will:
- Look up the tagged instance (Name=trnblas-ci-trn1 by default, Name=trnblas-ci-trn2 for trn2)
- Start it if stopped; wait for SSM agent
- Send the pytest command over SSM
- Print stdout/stderr
- Stop the instance in a trap (even if pytest fails or you Ctrl-C)
It exits non-zero if any test fails.
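The start, run, and stop-in-a-trap sequence above follows a common shell pattern, sketched below. The three helper functions are hypothetical stand-ins for the real `aws ec2` and `aws ssm` calls, so the control flow can be read and run without credentials; the actual script may differ.

```shell
# Hypothetical stand-ins for the AWS calls; replace with real aws ec2 / aws ssm commands.
start_instance() { echo "start $1"; }
stop_instance()  { echo "stop $1"; }
run_remote()     { echo "run on $1: $2"; return "${FAKE_EXIT:-0}"; }

# run_with_cleanup <instance-id> <remote-command>
# The body is a subshell, so the EXIT trap fires when the function finishes,
# whether the remote command succeeded or failed.
run_with_cleanup() (
    instance_id=$1
    remote_cmd=$2
    trap 'stop_instance "$instance_id"' EXIT
    # Turn Ctrl-C / TERM into a normal exit so the EXIT trap also runs then.
    trap 'exit 130' INT TERM
    start_instance "$instance_id"
    rc=0
    run_remote "$instance_id" "$remote_cmd" || rc=$?
    exit "$rc"   # propagate the remote exit code; the EXIT trap stops the instance
)
```

Because the stop call lives in the EXIT trap rather than at the end of the script, a failing pytest run still stops the instance, and the function's exit code is still the remote command's exit code.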
Running the DF-MP2 bench
The bench uses the same instance and the same SSM mechanism. It runs examples/df_mp2.py --bench to
capture per-step timing across small / medium / large synthetic shapes:
AWS_PROFILE=aws ./scripts/run_df_mp2_bench.sh # all 3 shapes, torch energy
AWS_PROFILE=aws ./scripts/run_df_mp2_bench.sh --shape medium # one shape
AWS_PROFILE=aws ./scripts/run_df_mp2_bench.sh --compare-all # 3-way: torch vs fused-gemm vs batched-pair
Each shape runs cold then warm in the same Python process, so NEFF cache
effects are visible in the reported numbers. --compare-all runs all three
energy paths in one SSM session — the full Phase 3 table.
Running PySCF precision tests
PySCF is not in the trn1 user-data by default. run_pyscf_tests.sh
installs trnblas[pyscf] in the Neuron venv before running, so there
is no permanent instance change needed:
AWS_PROFILE=aws ./scripts/run_pyscf_tests.sh # fast: h2o/ch4/nh3 at sto-3g / cc-pVDZ
AWS_PROFILE=aws ./scripts/run_pyscf_tests.sh --slow # + glycine/cc-pVDZ, h2o_trimer, h2o/cc-pVTZ
These are the FP32 precision envelope tests (#20). The --slow set
populates the TBD rows in docs/architecture.md and determines whether
double-double (#22) is needed.
GPU (A10G) companion instance
For cuBLAS head-to-head benchmarks on the same DF-MP2 workload, a
vintage-matched single-A10G instance lives in infra/terraform-cuda/:
cd infra/terraform-cuda
AWS_PROFILE=aws terraform init
AWS_PROFILE=aws terraform apply \
-var="vpc_id=vpc-xxxxxx" -var="subnet_id=subnet-xxxxxx"
A10G (GA102 Ampere, Apr 2021) is the closest single-GPU AWS match for Trainium1 (Oct 2022). Runs via:
AWS_PROFILE=aws ./scripts/run_cuda_bench.sh --shape medium
Uses the AWS Deep Learning AMI (PyTorch + CUDA 13). The runner passes
--device cuda to the bench, so inputs go straight to GPU memory and
the kernel path is cuBLAS via torch.matmul.
Cost
Stopped = EBS only (~$10/mo for 100 GB gp3). Running:
| Type | Hourly | Typical run (10 min) |
|---|---|---|
| trn1.2xlarge | $1.34 | $0.22 |
| trn2.3xlarge | $10.00 | $1.67 |
| inf2.xlarge | $0.76 | $0.13 |
| g5.xlarge (A10G) | $1.006 | $0.17 |
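The "Typical run" column is the hourly rate multiplied by the run length in hours (10 minutes is 1/6 of an hour). A small helper, hypothetical and not part of the repository, makes it easy to estimate other run lengths:

```shell
# run_cost <hourly-usd> <minutes>: prints the on-demand cost rounded to cents.
# Hypothetical helper; the rates are whatever you pass in, not looked up from AWS.
run_cost() {
    awk -v rate="$1" -v minutes="$2" 'BEGIN { printf "%.2f\n", rate * minutes / 60 }'
}
```

For example, `run_cost 1.34 10` reproduces the trn1.2xlarge row, and `run_cost 10.00 45` estimates a 45-minute trn2 session.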
Troubleshooting
"No instance found with Name=trnblas-ci-trn1"
— Run terraform apply first, or check that the tag matches.
SSM InvalidInstanceId error
— Instance hasn't finished booting/registering. Wait 1-2 minutes and retry.
User-data didn't finish (neuronxcc not found)
— SSH in via SSM session and re-run manually:
aws ssm start-session --target $INSTANCE_ID
cd /home/ubuntu/trnblas && pip install -e '.[neuron,dev]'
InsufficientInstanceCapacity when starting the instance
— AWS may temporarily be out of Trainium in that AZ. Wait and retry, or re-provision in a
different AZ. For trn2, the terraform root accepts -var="az_suffix=b" or =c to move
the subnet to a different AZ without destroying and recreating everything.
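For the SSM InvalidInstanceId case, polling for registration can replace a fixed 1-2 minute wait. The sketch below is an assumption about how one might do this, not the repository's code: `wait_until` and `ssm_online` are hypothetical names, and the check relies on `aws ssm describe-instance-information` reporting a PingStatus of Online once the agent has registered.

```shell
# wait_until <max-attempts> <delay-seconds> <command...>
# Re-runs the command until it succeeds or the attempts are used up.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Succeeds once the instance reports PingStatus=Online to SSM.
# Uses the ambient AWS_PROFILE / AWS_REGION.
ssm_online() {
    ping=$(aws ssm describe-instance-information \
        --filters "Key=InstanceIds,Values=$1" \
        --query 'InstanceInformationList[0].PingStatus' \
        --output text 2>/dev/null)
    [ "$ping" = "Online" ]
}
```

A call such as `wait_until 30 10 ssm_online "$INSTANCE_ID"` checks every 10 seconds for up to 30 attempts before giving up.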