RuntimeError: Device index must not be negative

Hello,

I’m trying to train a YOLOv7 model on Databricks using a GPU, but I’m encountering an error when running the following training command:

!python train.py --batch-size 2 --epochs 8 --data data.yaml --weights 'yolov7_training.pt' --workers 2

The error message is:

File "/Workspace/Users/Yolov7/train.py", line 624, in <module>
    train(hyp, opt, device, tb_writer)
  File "/Workspace/Yolov7/train.py", line 368, in train
    pred = model(imgs)  # forward
  File "/databricks/python/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/databricks/python/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 993, in _run_ddp_forward
    inputs, kwargs = _to_kwargs(
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 94, in _to_kwargs
    _recursive_to(inputs, device_id, use_side_stream_for_tensor_copies)
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 86, in _recursive_to
    res = to_map(inputs)
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 77, in to_map
    return list(zip(*map(to_map, obj)))
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 55, in to_map
    if obj.device == torch.device("cuda", target_gpu):
RuntimeError: Device index must not be negative

I want to avoid using Distributed Data Parallel (DDP) because I’ve encountered issues with it before. When I print the device variable right before the error, it returns "cuda:0".
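
As far as I can tell from the last frame of the traceback, the failure happens while building torch.device("cuda", target_gpu) with a negative index, so the problem does not seem to be the device variable itself. The same RuntimeError can be reproduced in isolation with this minimal standalone snippet (not taken from train.py):

    import torch

    # Minimal reproduction of the same RuntimeError, outside of YOLOv7:
    # a negative device index is rejected as soon as the torch.device is built.
    torch.device("cuda", -1)  # RuntimeError: Device index must not be negative

So it looks like something is handing DDP a device index of -1 even though the device variable itself resolves to cuda:0.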

Here are the details about the environment I’m using:

  • Databricks Runtime: 13.3 LTS ML (includes Apache Spark 3.4.1, GPU, Scala 2.12)

Any insights into what might be causing this issue and how I can resolve it would be greatly appreciated.

Thank you!

Hi mohamed! If you share the part of train.py that is raising this error, it will be easier to help you.

Thank you, Eduardo, for your reply.

I’m using the code directly from the official YOLOv7 repository with the following initialization:

dist.init_process_group(backend='nccl')

I’m doing this to bypass the DDP protocol.
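
To be clear about what I mean by “bypass”: on a single-GPU cluster I would be happy with something like the guard below, so that the process group (and therefore the DDP wrapper) is only set up in a real multi-process launch. This is an illustrative sketch of what I’m aiming for, not what is currently in my copy of train.py:

    import os
    import torch.distributed as dist

    # Illustrative guard: only initialize the process group when the script is
    # launched with multiple processes (torchrun / torch.distributed.run sets WORLD_SIZE).
    if int(os.environ.get('WORLD_SIZE', '1')) > 1:
        dist.init_process_group(backend='nccl')
    # otherwise the process group stays uninitialized, so single-GPU training
    # should never go through the DistributedDataParallel code path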

Here’s the relevant part of my code:

    opt.total_batch_size = opt.batch_size
    device = select_device(opt.device, batch_size=opt.batch_size)
    
    # Hyperparameters
    with open(opt.hyp) as f:
        hyp = yaml.load(f, Loader=yaml.SafeLoader)  # load hyps

    # Train
    logger.info(opt)
    if not opt.evolve:
        tb_writer = None  # init loggers
        if opt.global_rank in [-1, 0]:
            prefix = colorstr('tensorboard: ')
            logger.info(f"{prefix}Start with 'tensorboard --logdir {opt.project}', view at http://localhost:6006/")
            tb_writer = SummaryWriter(opt.save_dir)  # Tensorboard
        train(hyp, opt, device, tb_writer)
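
For completeness, this is the kind of check I can drop in right before pred = model(imgs) to see whether the model actually got wrapped in DDP (an illustrative helper; so far the only thing I have actually printed is the device variable):

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    def debug_device_state(model, device):
        """Illustrative helper to call just before the forward pass in train().
        `model` and `device` are the variables already present in the training loop."""
        print('select_device returned:', device)                  # cuda:0 in my run
        print('model weights live on:', next(model.parameters()).device)
        if isinstance(model, DDP):
            # A negative entry here (e.g. [-1]) would line up with the
            # "Device index must not be negative" error in the traceback.
            print('model is wrapped in DDP, device_ids =', model.device_ids)
        else:
            print('model is not wrapped in DDP')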

For the device argument, I don’t pass --device on the command line, so it falls back to the default defined in the parser:

parser.add_argument('--device', default='0', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
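
For reference, my understanding is that select_device('0') boils down to roughly the following (a simplified sketch of the helper in utils/torch_utils.py; the real function also logs device info and validates the request):

    import os
    import torch

    def select_device_sketch(device=''):
        """Simplified sketch of what select_device does with --device '0'."""
        cpu = device.lower() == 'cpu'
        if not cpu and device:
            # restrict visibility to the requested GPU(s), e.g. '0'
            os.environ['CUDA_VISIBLE_DEVICES'] = device
        if not cpu and torch.cuda.is_available():
            return torch.device('cuda:0')  # consistent with the cuda:0 I see printed
        return torch.device('cpu')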