# AWS Setup for Neuron Tests
To run `pytest -m neuron` against real Trainium hardware, we use a local workflow:

- Provision a Trainium EC2 instance with Terraform (it stays stopped when not testing)
- Run the test script locally from your machine with `AWS_PROFILE=aws`
- The script starts the instance, runs pytest via SSM, prints the output, and stops the instance

GitHub Actions does not touch AWS. All AWS interaction is human-initiated.
## One-time setup
### 1. Provision the CI instance
Pick a VPC + subnet in a region with trn1/trn2/inf2 capacity. `trn1.2xlarge` is the cheapest option for basic validation.
```shell
cd infra/terraform
AWS_PROFILE=aws terraform init
AWS_PROFILE=aws terraform apply \
  -var="vpc_id=vpc-xxxxxx" \
  -var="subnet_id=subnet-xxxxxx" \
  -var="instance_type=trn1.2xlarge"
```
Capture `instance_id` from the Terraform outputs. User data takes ~5 minutes to install the Neuron SDK and clone trnsolver.
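If you'd rather not repeat the `-var` flags on every apply, the same values can live in a `terraform.tfvars` file (a sketch — the variable names are taken from the apply command above; adjust if the module defines them differently):

```hcl
# infra/terraform/terraform.tfvars (illustrative values)
vpc_id        = "vpc-xxxxxx"
subnet_id     = "subnet-xxxxxx"
instance_type = "trn1.2xlarge"
```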
Stop the instance once ready:

```shell
AWS_PROFILE=aws aws ec2 stop-instances --instance-ids $(terraform output -raw instance_id)
```
## Running Neuron tests
```shell
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh
# or for trn2 / inf2:
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh trn2
```
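The positional argument presumably just selects which tagged instance to target; a tiny illustrative helper (the function name is made up, but the `trnsolver-ci-<type>` tag pattern comes from this doc):

```shell
# Hypothetical helper: map the runner's hardware argument to the instance's
# Name tag. Defaults to trn1, matching the script's default tag.
instance_tag() {
  local hw="${1:-trn1}"
  echo "trnsolver-ci-${hw}"
}

instance_tag        # -> trnsolver-ci-trn1
instance_tag trn2   # -> trnsolver-ci-trn2
```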
The script will:

- Look up the tagged instance (`Name=trnsolver-ci-trn1` by default)
- Start it if stopped, and wait for the SSM agent
- Send the pytest command over SSM
- Print stdout/stderr
- Stop the instance in a trap (even if pytest fails or you Ctrl-C)
It exits non-zero if any test fails.
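The stop-in-a-trap guarantee is the important part of that list. The pattern looks roughly like this — a sketch with a stub `stop_instance` standing in for the real `aws ec2 stop-instances` call, so it runs anywhere:

```shell
# Cleanup runs via an EXIT trap whether the remote pytest succeeds or fails
# (the real script presumably also handles INT/TERM so Ctrl-C cleans up too).
run_with_cleanup() {
  stop_instance() { echo "stop-instances called"; }  # stand-in for the AWS call
  trap stop_instance EXIT
  "$@"   # the "run pytest over SSM" step; may fail
}

# Run in a subshell so the failing command doesn't kill this shell;
# the EXIT trap still fires inside the subshell:
result=$( ( run_with_cleanup false ) ) || true
echo "$result"   # -> stop-instances called
```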
## GPU baseline instance (cuSOLVER comparison)
For vintage-matched cuSOLVER baselines (see Benchmarks), provision the GPU CI instance with:
```shell
cd infra/terraform
AWS_PROFILE=aws terraform apply -var=enable_gpu_ci=true \
  -var="vpc_id=vpc-xxxxxx" -var="subnet_id=subnet-xxxxxx"
```
Defaults to `g5.xlarge` (A10G Ampere, the trn1 vintage peer). Run the benchmarks with:

```shell
AWS_PROFILE=aws ./scripts/run_cuda_tests.sh g5
```
Same SSM-based pattern as the Neuron runner: it starts the instance, runs `pytest benchmarks/bench_cuda.py -m cuda --benchmark-only`, prints the output, and stops the instance via a trap. Costs ~$0.17 per 10-minute run on g5.xlarge.
To opt into an H100 comparison for trn2:
```shell
AWS_PROFILE=aws terraform apply -var=enable_gpu_ci=true \
  -var=gpu_instance_type=p5.4xlarge -var=gpu_instance_tag=trnsolver-ci-p5 \
  -var="vpc_id=..." -var="subnet_id=..."
AWS_PROFILE=aws ./scripts/run_cuda_tests.sh p5
```
That's ~$12/hr for a single H100, so it's still pricey; use it sparingly.
## Cost
Stopped = EBS only (~$10/mo for 100 GB gp3). Running:
| Type | Hourly | Typical run (10 min) |
|---|---|---|
| trn1.2xlarge | $1.34 | $0.22 |
| trn2.8xlarge | $10.00 | $1.67 |
| inf2.xlarge | $0.76 | $0.13 |
| g5.xlarge (A10G, CUDA) | $1.01 | $0.17 |
| p5.4xlarge (H100×1, CUDA) | $12.00 | $2.00 |
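The "typical run" column is just the hourly rate divided by 6 (a 10-minute run), which you can sanity-check with awk:

```shell
# 10-minute cost = hourly rate / 6, rounded to cents.
for rate in 1.34 10.00 0.76 1.01 12.00; do
  awk -v r="$rate" 'BEGIN { printf "$%.2f\n", r / 6 }'
done
# -> $0.22, $1.67, $0.13, $0.17, $2.00
```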
## Troubleshooting
"No instance found with Name=trnsolver-ci-trn1"
— Run terraform apply first, or check that the tag matches.
SSM InvalidInstanceId error
— Instance hasn't finished booting/registering. Wait 1-2 minutes and retry.
User-data didn't finish (neuronxcc not found)
— Open an SSM session and re-run the install manually:

```shell
aws ssm start-session --target $INSTANCE_ID
cd /home/ubuntu/trnsolver && pip install -e '.[neuron,dev]'
```
InsufficientInstanceCapacity when starting the instance
— AWS may temporarily be out of Trainium in that AZ. Wait and retry, or re-provision in a different AZ.
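Capacity errors usually clear on their own, so wrapping the start call in a retry loop is often enough. A sketch with a stub in place of the real `aws ec2 start-instances` call (the stub fails twice, then succeeds, to exercise the loop):

```shell
# Hypothetical retry loop: keep retrying the start call with a pause between
# attempts. try_start stubs the AWS call for illustration only.
attempts=0
try_start() { attempts=$((attempts + 1)); [ "$attempts" -ge 3 ]; }

for _ in 1 2 3 4 5; do
  if try_start; then
    echo "started after $attempts attempts"
    break
  fi
  sleep 0   # use a real delay (e.g. 60s) when calling AWS
done
# -> started after 3 attempts
```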