AWS Setup for Neuron Tests
To run pytest -m neuron against real Trainium hardware, we use a local workflow:
- Provision a Trainium EC2 instance with Terraform (stays stopped when not testing)
- Run the test script locally from your machine, using AWS_PROFILE=aws
- The script starts the instance, runs pytest via SSM, prints the output, and stops the instance
GitHub Actions does not touch AWS. All AWS interaction is human-initiated.
One-time setup
1. Provision the CI instance
There are two separate Terraform roots, one per hardware family:
| Hardware | Terraform root | Default region | Instance |
|---|---|---|---|
| Trainium1 | infra/terraform/ | us-east-1 | trn1.2xlarge |
| Trainium2 | infra/terraform-trn2/ | sa-east-1 | trn2.3xlarge |
trn2 availability (as of 2026-04-16):
| Instance type | Region | AZs |
|---|---|---|
| trn2.xlarge | — | not yet offered |
| trn2.3xlarge | sa-east-1 | a, b, c |
| trn2.48xlarge | us-east-2 | a, b, c |
Trainium1 (trn1) — us-east-1:
cd infra/terraform
AWS_PROFILE=aws terraform init
AWS_PROFILE=aws terraform apply \
  -var="vpc_id=vpc-xxxxxx" \
  -var="subnet_id=subnet-xxxxxx"
Trainium2 (trn2) — sa-east-1:
cd infra/terraform-trn2
AWS_PROFILE=aws terraform init
AWS_PROFILE=aws terraform apply
The trn2 root is self-contained — it creates its own VPC, public subnet, internet gateway,
and route table in sa-east-1. No vpc_id or subnet_id variables required.
The default AZ is sa-east-1a. If apply fails with InsufficientInstanceCapacity,
retry with a different AZ:
AWS_PROFILE=aws terraform apply -var="az_suffix=b" # or az_suffix=c
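The manual retry can also be wrapped in a small loop. The following is a minimal sketch, assuming the `az_suffix` variable behaves as described here; the helper name `apply_with_az_fallback` is hypothetical and not part of the repository. It moves on to the next suffix after any failed apply, not only after InsufficientInstanceCapacity, so read the Terraform output before trusting a retry.

```shell
# Hypothetical helper: try each AZ suffix in order until one apply succeeds.
# The body runs in a subshell "( ... )" so its variables do not leak.
apply_with_az_fallback() (
  for az in "$@"; do
    echo "Trying az_suffix=${az}" >&2
    if terraform apply -var="az_suffix=${az}"; then
      echo "applied with az_suffix=${az}"
      exit 0
    fi
  done
  echo "terraform apply failed for: $*" >&2
  exit 1
)

# Usage, from infra/terraform-trn2:
#   export AWS_PROFILE=aws
#   apply_with_az_fallback b c
```

Terraform still asks for confirmation on each attempt, which keeps every apply human-initiated.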
User-data takes ~5 minutes to install the Neuron SDK and clone trnrand.
Stop the instance once ready:
# trn1
cd infra/terraform
AWS_PROFILE=aws aws ec2 stop-instances \
  --instance-ids "$(AWS_PROFILE=aws terraform output -raw instance_id)" --region us-east-1
# trn2
cd infra/terraform-trn2
AWS_PROFILE=aws aws ec2 stop-instances \
  --instance-ids "$(AWS_PROFILE=aws terraform output -raw instance_id)" \
  --region "$(AWS_PROFILE=aws terraform output -raw aws_region)"
Running neuron tests
# trn1 (default)
AWS_PROFILE=aws ./scripts/run_neuron_tests.sh
# trn2 — note AWS_REGION override (instance is in sa-east-1)
AWS_PROFILE=aws AWS_REGION=sa-east-1 ./scripts/run_neuron_tests.sh trn2
The script will:
- Look up the tagged instance (Name=trnrand-ci-trn1 by default)
- Start it if stopped; wait for SSM agent
- Send the pytest command over SSM
- Print stdout/stderr
- Stop the instance in a trap (even if pytest fails or you Ctrl-C)
It exits non-zero if any test fails.
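To make the start, test, always-stop pattern concrete, here is a rough sketch of how such a flow could be written with the AWS CLI. It is not the contents of scripts/run_neuron_tests.sh; the function name `run_remote_tests` and its arguments are assumptions made for illustration. The sketch skips the wait for SSM agent registration and relies on the stock `aws ssm wait command-executed` waiter, which gives up after a fixed number of polls, so a long pytest run would need its own polling loop.

```shell
# Hypothetical sketch, not the real script. The body runs in a subshell
# "( ... )" so the EXIT trap fires as soon as the function finishes.
run_remote_tests() (
  tag="$1"          # value of the Name tag, e.g. trnrand-ci-trn1
  remote_cmd="$2"   # command to run on the instance, e.g. "pytest -m neuron"

  id="$(aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=${tag}" \
    --query 'Reservations[0].Instances[0].InstanceId' --output text)"
  if [ -z "$id" ] || [ "$id" = "None" ]; then
    echo "No instance found with Name=${tag}" >&2
    exit 1
  fi

  # Always stop the instance on the way out, including after Ctrl-C.
  trap 'aws ec2 stop-instances --instance-ids "$id" >/dev/null' EXIT
  trap 'exit 130' INT TERM

  aws ec2 start-instances --instance-ids "$id" >/dev/null || exit 1
  aws ec2 wait instance-running --instance-ids "$id" || exit 1
  # A real script would also wait here for the SSM agent to register.

  # Shorthand list syntax; commands containing commas or quotes need JSON input.
  cmd_id="$(aws ssm send-command --instance-ids "$id" \
    --document-name AWS-RunShellScript \
    --parameters "commands=[\"${remote_cmd}\"]" \
    --query 'Command.CommandId' --output text)" || exit 1

  # The waiter exits non-zero when the remote command fails, so ignore its
  # status and read the final result from get-command-invocation instead.
  aws ssm wait command-executed --command-id "$cmd_id" --instance-id "$id" || true

  aws ssm get-command-invocation --command-id "$cmd_id" --instance-id "$id" \
    --query 'StandardOutputContent' --output text
  aws ssm get-command-invocation --command-id "$cmd_id" --instance-id "$id" \
    --query 'StandardErrorContent' --output text >&2
  status="$(aws ssm get-command-invocation --command-id "$cmd_id" \
    --instance-id "$id" --query 'Status' --output text)"
  [ "$status" = "Success" ]
)

# Usage:
#   export AWS_PROFILE=aws
#   run_remote_tests trnrand-ci-trn1 "pytest -m neuron"
```

Putting the stop call in an EXIT trap, rather than at the end of the function, is what keeps the instance from being left running when a step in the middle fails.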
Running the DF-MP2 bench
The bench uses the same instance and the same SSM mechanism. It runs examples/df_mp2.py --bench to
capture per-step timing across small, medium, and large synthetic shapes:
AWS_PROFILE=aws ./scripts/run_df_mp2_bench.sh # all 3 shapes
AWS_PROFILE=aws ./scripts/run_df_mp2_bench.sh --shape medium # one shape
Each shape runs cold then warm in the same Python process, so NEFF cache effects are visible in the reported numbers.
Cost
A stopped instance incurs only EBS charges (~$10/mo for 100 GB gp3). While running:
| Type | Hourly | Typical run (10 min) |
|---|---|---|
| trn1.2xlarge | $1.34 | $0.22 |
| trn2.3xlarge | $10.00 | $1.67 |
| inf2.xlarge | $0.76 | $0.13 |
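The per-run column is the hourly price scaled by the run length in minutes. As a quick check of the arithmetic, the sketch below recomputes it; `run_cost` is a hypothetical helper name, and the prices are the ones listed in the table rather than live AWS pricing.

```shell
# Hypothetical helper: cost of one run in USD.
#   $1 = hourly price in USD, $2 = run length in minutes
run_cost() {
  awk -v hourly="$1" -v minutes="$2" \
    'BEGIN { printf "%.2f\n", hourly * minutes / 60 }'
}

run_cost 1.34 10    # trn1.2xlarge -> 0.22
run_cost 10.00 10   # trn2.3xlarge -> 1.67
run_cost 0.76 10    # inf2.xlarge  -> 0.13
```

The same function can be used to estimate longer runs, for example `run_cost 10.00 45` for a 45-minute trn2 session.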
Troubleshooting
"No instance found with Name=trnrand-ci-trn1"
— Run terraform apply first, or check that the tag matches.
SSM InvalidInstanceId error
— Instance hasn't finished booting/registering. Wait 1-2 minutes and retry.
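Instead of waiting a fixed time, the registration can be polled. This is a minimal sketch, assuming the instance reports to SSM through `aws ssm describe-instance-information`; the helper name `wait_for_ssm` and its defaults (12 attempts, 10 seconds apart, roughly two minutes) are illustrative choices, not values taken from the project scripts.

```shell
# Hypothetical helper: poll until the SSM agent on an instance reports Online.
#   $1 = instance ID, $2 = max attempts (default 12), $3 = delay in seconds (default 10)
wait_for_ssm() (
  id="$1"; attempts="${2:-12}"; delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    status="$(aws ssm describe-instance-information \
      --filters "Key=InstanceIds,Values=${id}" \
      --query 'InstanceInformationList[0].PingStatus' --output text 2>/dev/null)"
    [ "$status" = "Online" ] && exit 0
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Instance ${id} did not register with SSM in time" >&2
  exit 1
)

# Usage:
#   export AWS_PROFILE=aws
#   wait_for_ssm "$INSTANCE_ID" && echo "SSM agent is online"
```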
User-data didn't finish (neuronxcc not found)
— SSH in via SSM session and re-run manually:
aws ssm start-session --target $INSTANCE_ID
cd /home/ubuntu/trnrand && pip install -e '.[neuron,dev]'
InsufficientInstanceCapacity when starting the instance
— AWS may temporarily be out of Trainium in that AZ. Wait and retry, or re-provision in a different AZ.