Bizarre InternalTorchDynamoError with locally and formerly working code

Hi, has anyone seen this error in the last 3 months or so when running a compiled PyTorch model on multiple GPUs?

[rank0]: File "/usr/lib/python3/dist-packages/torch/_dynamo/variables/builder.py", line 529, in _wrap
[rank0]: if has_triton():
[rank0]: File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 37, in has_triton
[rank0]: return is_device_compatible_with_triton() and has_triton_package()
[rank0]: File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 33, in is_device_compatible_with_triton
[rank0]: if device_interface.is_available() and extra_check(device_interface):
[rank0]: File "/usr/lib/python3/dist-packages/torch/utils/_triton.py", line 23, in cuda_extra_check
[rank0]: return device_interface.Worker.get_device_properties().major >= 7
[rank0]: File "/usr/lib/python3/dist-packages/torch/_dynamo/device_interface.py", line 191, in get_device_properties
[rank0]: return caching_worker_device_properties["cuda"][device]
[rank0]: torch._dynamo.exc.InternalTorchDynamoError: IndexError: list index out of range

[rank0]: from user code:
[rank0]: File "/usr/lib/python3/dist-packages/torch/_dynamo/external_utils.py", line 40, in inner
[rank0]: return fn(*args, **kwargs)

[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

The same code was running fine last October-November on the same type of instance and still ran fine last week on my local machine. I have tried changing the version of PyTorch (2.4.1, 2.5.1, 2.6.0) and Triton (3.2.0, 3.1.0, 3.0.0, and the last 2.x), but none of that helped. torch.cuda.is_available() and torch.cuda.device_count() look fine, and simple test code (e.g. MHA w/ mask on DDP · GitHub) also works fine. What other libraries/drivers might contribute to this?
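
For what it's worth, here is the kind of per-device sanity check I mean, mirroring the compute-capability test from the traceback above (just a sketch: it only calls torch.cuda.get_device_properties directly and does not go through Dynamo's cached worker path):

import torch

# Mirror the Triton compatibility check from torch/utils/_triton.py:
# every visible CUDA device should report compute capability >= 7.0.
print("is_available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, capability {props.major}.{props.minor}")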

The error is probably very environment-specific, but the exact steps to reproduce on a Lambda 8x A100 (40 GB SXM4) instance are as follows:

pip3 install --upgrade requests
pip3 install wandb
pip3 install schedulefree

git clone https://github.com/EIFY/mup-vit.git
cd mup-vit
NUMEXPR_MAX_THREADS=116 torchrun main.py --fake-data --batch-size 4096 --log-steps 100

[Bug]: Cannot Load any model. IndexError with CUDA, multiple GPUs · Issue #4069 · vllm-project/vllm · GitHub is the closest issue I can find, but the author "Solved it with a fresh install with a new docker container" without nailing down the cause…

You are calling main_worker here with:

args.gpu: None, args: Namespace(data='imagenet', workers=4, prefetch_factor=1, hidden_dim=384, input_resolution=224, patch_size=16, num_layers=12, num_heads=6, posemb='sincos2d', mlp_head=False, representation_size=None, pool_type='gap', register=0, epochs=90, log_steps=100, log_epoch=[], start_step=0, batch_size=4096, accum_freq=1, schedule_free=False, warmup=10000, lr=0.001, beta1=0.9, beta2=0.999, polynomial_weighting_power=0.0, decoupled_weight_decay=True, weight_decay=0.0001, grad_clip_norm=1.0, torchvision_inception_crop=False, lower_scale=0.05, upper_scale=1.0, mixup_alpha=0.2, randaug=True, randaug_magnitude=10, print_freq=100, resume='', evaluate=False, world_size=1, rank=-1, dist_url='env://', dist_backend='nccl', seed=None, gpu=None, multiprocessing_distributed=False, fake_data=True, logs='./logs/', name='2025_02_07-22_08_43-lr_0.001-b_4096', distributed=False, ngpus_per_node=8, checkpoint_path='./logs/2025_02_07-22_08_43-lr_0.001-b_4096/checkpoints')

so args.gpu is left as None, which causes the issue, and I'm unsure how this code worked before.
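
For context, a rough sketch of the device-selection pattern main_worker usually follows (this mirrors the torchvision ImageNet reference script that this CLI resembles; the actual code in mup-vit may differ):

import torch

# Hypothetical sketch modeled on the ImageNet reference example, not on mup-vit itself.
def main_worker(gpu, ngpus_per_node, args):
    args.gpu = gpu  # only a real index when one process is spawned per GPU
    if args.gpu is not None:
        torch.cuda.set_device(args.gpu)  # pin this process to a single device
    # ... build the model, wrap it in DistributedDataParallel, train ...
    # With args.gpu left as None nothing pins the process to one device,
    # which appears to be what trips up the cached device-property lookup above.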

I just checked, and locally on a single-GPU machine it still runs fine, either with torchrun or with plain python. So I am not sure what made it intolerant of an 8-GPU instance…

(All the additional arguments are to shrink the memory footprint and see some logs sooner)

$ torchrun main.py --fake-data --batch-size 1024 --accum-freq 8 --dist-backend gloo --log-steps 1
=> Fake data is used!
Compiling model...
Test: [ 0/49]	Time  9.935 ( 9.935)	Loss 6.9078e+00 (6.9078e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.02 (  0.00)
 *   Acc@1 0.001 Acc@5 0.005
Test: [ 0/49]	Time 11.767 (11.767)	Loss 6.9078e+00 (6.9078e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.00 (  0.00)
 *   Acc@1 0.001 Acc@5 0.005
Test: [ 0/49]	Time  7.029 ( 7.029)	Loss 6.9078e+00 (6.9078e+00)	Acc@1   0.00 (  0.00)	Acc@5   0.01 (  0.01)

To launch torchrun on multiple devices you would use torchrun --nproc_per_node=8 ..., which will then correspond to the --local-rank argument inside your script as described here.
In your approach you are launching your script with torchrun alone and are not using --local-rank at all, so again I'm unsure how this should have ever worked.
Alternatively, you can also use a multiprocessing approach inside your script which will spawn the processes there as described in this tutorial.
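
As a rough sketch of what the torchrun route looks like inside the script (torchrun exports LOCAL_RANK, RANK and WORLD_SIZE to every worker; the call below assumes the NCCL backend and the default env:// rendezvous):

import os

import torch
import torch.distributed as dist

# Launched e.g. as: torchrun --nproc_per_node=8 main.py ...
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each worker process
torch.cuda.set_device(local_rank)           # pin this worker to its own GPU
dist.init_process_group(backend="nccl")     # picks up RANK/WORLD_SIZE from the environment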

Thanks for the explanation. The last time I ran it on a multi-GPU instance I think I had set --multiprocessing-distributed. Since it still runs locally without that flag, I thought it should work. I will try it as soon as I grab an instance.


Update: adding --multiprocessing-distributed back indeed fixed it. The error message could be less mysterious but ultimately the false alarm is on me. Thanks for the help!
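
(For the record, that presumably amounts to appending the flag to the launch line from the repro steps above:

NUMEXPR_MAX_THREADS=116 torchrun main.py --fake-data --batch-size 4096 --log-steps 100 --multiprocessing-distributed

though the exact invocation isn't shown here.)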