AWS Setup for Neuron Tests¶
To run pytest -m neuron against real Trainium hardware, we use a local workflow:
- Provision a Trainium EC2 instance with Terraform (stays stopped when not testing)
- Run the test script locally from your machine, using
AWS_PROFILE=aws - The script starts the instance, runs pytest via SSM, prints output, stops the instance
GitHub Actions does not touch AWS. All AWS interaction is human-initiated.
One-time setup¶
1. Provision the CI instance¶
Pick a VPC + subnet in a region with trn1/trn2/inf2 capacity. trn1.2xlarge is cheapest for basic validation.
cd infra/terraform
AWS_PROFILE=aws terraform init
AWS_PROFILE=aws terraform apply \
-var="vpc_id=vpc-xxxxxx" \
-var="subnet_id=subnet-xxxxxx" \
-var="instance_type=trn1.2xlarge"
Capture instance_id from the outputs. User-data takes ~5 minutes to install the Neuron SDK and clone trntensor.
Stop the instance once ready:
AWS_PROFILE=aws aws ec2 stop-instances --instance-ids $(terraform output -raw instance_id)
Running neuron tests¶
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh
# or for trn2 / inf2:
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh trn2
The script will:
- Look up the tagged instance (
Name=trntensor-ci-trn1by default) - Start it if stopped; wait for SSM agent
- Send the pytest command over SSM
- Print stdout/stderr
- Stop the instance in a trap (even if pytest fails or you Ctrl-C)
It exits non-zero if any test fails.
Cost¶
Stopped = EBS only (~$10/mo for 100 GB gp3). Running:
| Type | Hourly | Typical run (10 min) |
|---|---|---|
| trn1.2xlarge | $1.34 | $0.22 |
| trn2.8xlarge | $10.00 | $1.67 |
| inf2.xlarge | $0.76 | $0.13 |
Troubleshooting¶
"No instance found with Name=trntensor-ci-trn1"
— Run terraform apply first, or check that the tag matches.
SSM InvalidInstanceId error
— Instance hasn't finished booting/registering. Wait 1-2 minutes and retry.
User-data didn't finish (neuronxcc not found)
— SSH in via SSM session and re-run manually:
aws ssm start-session --target $INSTANCE_ID
cd /home/ubuntu/trntensor && pip install -e '.[neuron,dev]'
InsufficientInstanceCapacity when starting the instance
— AWS may temporarily be out of Trainium in that AZ. Wait and retry, or re-provision in a different AZ.