I’m getting the error below when trying to launch a training script via AWS EMR. When I run the same script directly on an EC2 instance (a p3.2xlarge, which has 1 GPU), the issue does not occur. The strange part is that the PyTorch Lightning startup log shows the GPU being detected and used:
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
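
In case it helps with debugging, this is the kind of check I can drop at the top of finetune_claims_frequency.py to see what the launched process can actually see (a minimal sketch; it assumes nothing beyond torch and the standard library):

import os
import torch

# What this Python process can see.
print("torch.cuda.is_available():", torch.cuda.is_available())
print("torch.cuda.device_count():", torch.cuda.device_count())

# On EMR/YARN, GPU visibility can be restricted per container,
# so also check the environment variables that control it.
for var in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
    print(var, "=", os.environ.get(var))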
I’m not sure what is causing this error. Has anyone come across it and solved it? Here is the full output, including the traceback:
/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/lightning/pytorch/loops/utilities.py:70: PossibleUserWarning: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
  rank_zero_warn(
[rank: 0] Global seed set to 8637
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Traceback (most recent call last):
  File "/home/hadoop/deep_behavior_embedding/finetune_claims_frequency.py", line 139, in <module>
    trainer.fit(model, datamodule=dm)
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 531, in fit
    call._call_and_handle_interrupt(
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 41, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 91, in launch
    return function(*args, **kwargs)
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 570, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 933, in _run
    self.strategy.setup_environment()
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/lightning/pytorch/strategies/ddp.py", line 143, in setup_environment
    self.setup_distributed()
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/lightning/pytorch/strategies/ddp.py", line 192, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/lightning/fabric/utilities/distributed.py", line 246, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "/home/hadoop/deep_behavior_embedding/.env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 897, in _new_process_group_helper
    pg = ProcessGroupNCCL(prefix_store, group_rank, group_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Command exiting with ret '1'
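
From the traceback, the DDP strategy launches a subprocess (subprocess_script.py) and it is the NCCL process-group initialization inside that subprocess that fails. Would forcing the process-group backend to gloo, roughly like below, be a reasonable workaround, or would that just mask the real problem? (A sketch only; DDPStrategy and its process_group_backend argument are from lightning.pytorch.strategies in Lightning 2.x, and the other Trainer arguments are illustrative, not my exact config.)

from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy

# Illustrative only: initialize the process group with gloo instead of
# NCCL, so torch.distributed init does not require NCCL to find a GPU.
trainer = Trainer(
    accelerator="gpu",
    devices=1,
    strategy=DDPStrategy(process_group_backend="gloo"),
)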