# AWS Setup for Neuron Tests
To run `pytest -m neuron` against real Trainium hardware, we use a local workflow:

- Provision a Trainium EC2 instance with Terraform (it stays stopped when not testing)
- Run the test script locally from your machine with `AWS_PROFILE=aws`
- The script starts the instance, runs pytest via SSM, prints the output, and stops the instance

GitHub Actions does not touch AWS. All AWS interaction is human-initiated.
## One-time setup
### 1. Provision the CI instance
Pick a VPC + subnet in a region with trn1/trn2/inf2 capacity. `trn1.2xlarge` is the cheapest option for basic validation.
```shell
cd infra/terraform
AWS_PROFILE=aws terraform init
AWS_PROFILE=aws terraform apply \
  -var="vpc_id=vpc-xxxxxx" \
  -var="subnet_id=subnet-xxxxxx" \
  -var="instance_type=trn1.2xlarge"
```
Capture `instance_id` from the Terraform outputs. User data takes ~5 minutes to install the Neuron SDK and clone trnsolver.
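If you'd rather not repeat the `-var` flags on every apply, the same values can live in a `terraform.tfvars` file (a sketch — the variable names are taken from the apply command above; adjust if the module defines them differently):

```hcl
# infra/terraform/terraform.tfvars (illustrative values)
vpc_id        = "vpc-xxxxxx"
subnet_id     = "subnet-xxxxxx"
instance_type = "trn1.2xlarge"
```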
Stop the instance once ready:

```shell
AWS_PROFILE=aws aws ec2 stop-instances --instance-ids $(terraform output -raw instance_id)
```
## Running Neuron tests
```shell
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh
# or for trn2 / inf2:
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh trn2
```
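The positional argument presumably just selects which tagged instance to target; a tiny illustrative helper (the function name is made up, but the `trnsolver-ci-<type>` tag pattern comes from this doc):

```shell
# Hypothetical helper: map the runner's hardware argument to the instance's
# Name tag. Defaults to trn1, matching the script's default tag.
instance_tag() {
  local hw="${1:-trn1}"
  echo "trnsolver-ci-${hw}"
}

instance_tag        # -> trnsolver-ci-trn1
instance_tag trn2   # -> trnsolver-ci-trn2
```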
The script will:

- Look up the tagged instance (`Name=trnsolver-ci-trn1` by default)
- Start it if stopped, and wait for the SSM agent
- Send the pytest command over SSM
- Print stdout/stderr
- Stop the instance in a trap (even if pytest fails or you Ctrl-C)
It exits non-zero if any test fails.
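The stop-in-a-trap guarantee is the important part of that list. The pattern looks roughly like this — a sketch with a stub `stop_instance` standing in for the real `aws ec2 stop-instances` call, so it runs anywhere:

```shell
# Cleanup runs via an EXIT trap whether the remote pytest succeeds or fails
# (the real script presumably also handles INT/TERM so Ctrl-C cleans up too).
run_with_cleanup() {
  stop_instance() { echo "stop-instances called"; }  # stand-in for the AWS call
  trap stop_instance EXIT
  "$@"   # the "run pytest over SSM" step; may fail
}

# Run in a subshell so the failing command doesn't kill this shell;
# the EXIT trap still fires inside the subshell:
result=$( ( run_with_cleanup false ) ) || true
echo "$result"   # -> stop-instances called
```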
## GPU baseline instance (cuSOLVER comparison)
For vintage-matched cuSOLVER baselines (see Benchmarks), provision the GPU CI instance with:
```shell
cd infra/terraform
AWS_PROFILE=aws terraform apply -var=enable_gpu_ci=true \
  -var="vpc_id=vpc-xxxxxx" -var="subnet_id=subnet-xxxxxx"
```
Defaults to `g5.xlarge` (A10G Ampere, the trn1 vintage peer). Run the benchmarks with:

```shell
AWS_PROFILE=aws ./scripts/run_cuda_tests.sh g5
```
Same SSM-based pattern as the Neuron runner: it starts the instance, runs `pytest benchmarks/bench_cuda.py -m cuda --benchmark-only`, prints the output, and stops the instance via a trap. Costs ~$0.17 per 10-minute run on g5.xlarge.
To opt into an H100 comparison for trn2:
```shell
AWS_PROFILE=aws terraform apply -var=enable_gpu_ci=true \
  -var=gpu_instance_type=p5.4xlarge -var=gpu_instance_tag=trnsolver-ci-p5 \
  -var="vpc_id=..." -var="subnet_id=..."
AWS_PROFILE=aws ./scripts/run_cuda_tests.sh p5
```
That's ~$12/hr for a single H100, so it's still pricey; use it sparingly.
## Cost
Stopped = EBS only (~$10/mo for 100 GB gp3). Running:
| Type | Hourly | Typical run (10 min) |
|---|---|---|
| trn1.2xlarge | $1.34 | $0.22 |
| trn2.8xlarge | $10.00 | $1.67 |
| inf2.xlarge | $0.76 | $0.13 |
| g5.xlarge (A10G, CUDA) | $1.01 | $0.17 |
| p5.4xlarge (H100×1, CUDA) | $12.00 | $2.00 |
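The "typical run" column is just the hourly rate divided by 6 (a 10-minute run), which you can sanity-check with awk:

```shell
# 10-minute cost = hourly rate / 6, rounded to cents.
for rate in 1.34 10.00 0.76 1.01 12.00; do
  awk -v r="$rate" 'BEGIN { printf "$%.2f\n", r / 6 }'
done
# -> $0.22, $1.67, $0.13, $0.17, $2.00
```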
## Troubleshooting
"No instance found with Name=trnsolver-ci-trn1"
— Run terraform apply first, or check that the tag matches.
SSM InvalidInstanceId error
— Instance hasn't finished booting/registering. Wait 1-2 minutes and retry.
User-data didn't finish (neuronxcc not found)
— Open an SSM session and re-run the install manually:

```shell
aws ssm start-session --target $INSTANCE_ID
cd /home/ubuntu/trnsolver && pip install -e '.[neuron,dev]'
```
InsufficientInstanceCapacity when starting the instance
— AWS may temporarily be out of Trainium in that AZ. Wait and retry, or re-provision in a different AZ.
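Capacity errors usually clear on their own, so wrapping the start call in a retry loop is often enough. A sketch with a stub in place of the real `aws ec2 start-instances` call (the stub fails twice, then succeeds, to exercise the loop):

```shell
# Hypothetical retry loop: keep retrying the start call with a pause between
# attempts. try_start stubs the AWS call for illustration only.
attempts=0
try_start() { attempts=$((attempts + 1)); [ "$attempts" -ge 3 ]; }

for _ in 1 2 3 4 5; do
  if try_start; then
    echo "started after $attempts attempts"
    break
  fi
  sleep 0   # use a real delay (e.g. 60s) when calling AWS
done
# -> started after 3 attempts
```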