Hello,
I’m trying to train a YOLOv7 model on Databricks using a GPU, but I’m encountering an error when running the following training command:
!python train.py --batch-size 2 --epochs 8 --data data.yaml --weights 'yolov7_training.pt' --workers 2
The error message is:
File "/Workspace/Users/Yolov7/train.py", line 624, in <module>
train(hyp, opt, device, tb_writer)
File "/Workspace/Yolov7/train.py", line 368, in train
pred = model(imgs) # forward
File "/databricks/python/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/databricks/python/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/databricks/python/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 993, in _run_ddp_forward
inputs, kwargs = _to_kwargs(
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 94, in _to_kwargs
_recursive_to(inputs, device_id, use_side_stream_for_tensor_copies)
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 86, in _recursive_to
res = to_map(inputs)
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 77, in to_map
return list(zip(*map(to_map, obj)))
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 55, in to_map
if obj.device == torch.device("cuda", target_gpu):
RuntimeError: Device index must not be negative
I want to avoid using Distributed Data Parallel (DDP) because I've encountered issues with it before. When I print the device variable right before the error, it returns "cuda:0".
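For reference, this is the kind of diagnostic I can add just before the failing forward pass to inspect the CUDA state and the launcher variables that DDP normally relies on (RANK, LOCAL_RANK, WORLD_SIZE):

import os
import torch

print(torch.cuda.is_available())        # whether a CUDA device is visible
print(torch.cuda.device_count())        # number of visible GPUs
print(torch.cuda.current_device())      # index of the active device
print(os.environ.get("RANK"))           # launcher variables, if any are set
print(os.environ.get("LOCAL_RANK"))
print(os.environ.get("WORLD_SIZE"))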
Here are the details about the environment I’m using:
- Databricks Runtime: 13.3 LTS ML (includes Apache Spark 3.4.1, GPU, Scala 2.12)
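Since I'd rather stay on a single GPU, my guess (and it is only a guess) is that the runtime pre-sets distributed launcher variables, which pushes train.py down the DDP code path with a negative local rank. A minimal sketch of the workaround I have in mind, run in a notebook cell before the training command:

import os

# Assumption: these launcher variables are what triggers the DDP wrapping in
# train.py; unsetting them should force the plain single-GPU path.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    os.environ.pop(var, None)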
Any insights into what might be causing this issue and how I can resolve it would be greatly appreciated.
Thank you!