RuntimeError: Device index must not be negative

Hello,

I’m trying to train a YOLOv7 model on Databricks using a GPU, but I’m encountering an error when running the following training command:

!python train.py --batch-size 2 --epochs 8 --data data.yaml --weights 'yolov7_training.pt' --workers 2

The error message is:

File "/Workspace/Users/Yolov7/train.py", line 624, in <module>
    train(hyp, opt, device, tb_writer)
  File "/Workspace/Yolov7/train.py", line 368, in train
    pred = model(imgs)  # forward
  File "/databricks/python/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/databricks/python/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 993, in _run_ddp_forward
    inputs, kwargs = _to_kwargs(
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 94, in _to_kwargs
    _recursive_to(inputs, device_id, use_side_stream_for_tensor_copies)
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 86, in _recursive_to
    res = to_map(inputs)
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 77, in to_map
    return list(zip(*map(to_map, obj)))
  File "/databricks/python/lib/python3.10/site-packages/torch/distributed/utils.py", line 55, in to_map
    if obj.device == torch.device("cuda", target_gpu):
RuntimeError: Device index must not be negative

I want to avoid using Distributed Data Parallel (DDP) because I’ve encountered issues with it before. When I print the device variable right before the error, it returns "cuda:0".
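
As far as I can tell from the last frame of the traceback, the failure happens while building torch.device("cuda", target_gpu) with a negative index, so the problem does not seem to be the device variable itself. The same RuntimeError can be reproduced in isolation with this minimal standalone snippet (not taken from train.py):

    import torch

    # Minimal reproduction of the same RuntimeError, outside of YOLOv7:
    # a negative device index is rejected as soon as the torch.device is built.
    torch.device("cuda", -1)  # RuntimeError: Device index must not be negative

So it looks like something is handing DDP a device index of -1 even though the device variable itself resolves to cuda:0.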

Here are the details about the environment I’m using:

  • Databricks Runtime: 13.3 LTS ML (includes Apache Spark 3.4.1, GPU, Scala 2.12)

Any insights into what might be causing this issue and how I can resolve it would be greatly appreciated.

Thank you!

Hi mohamed! If you share the part of train.py that is raising this error, it will be easier to help you.

Thank you, Eduardo, for your reply.

I’m using the code directly from the official YOLOv7 repository with the following initialization:

dist.init_process_group(backend='nccl')

I’m doing this to bypass the DDP protocol.
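
To be clear about what I mean by “bypass”: on a single-GPU cluster I would be happy with something like the guard below, so that the process group (and therefore the DDP wrapper) is only set up in a real multi-process launch. This is an illustrative sketch of what I’m aiming for, not what is currently in my copy of train.py:

    import os
    import torch.distributed as dist

    # Illustrative guard: only initialize the process group when the script is
    # launched with multiple processes (torchrun / torch.distributed.run sets WORLD_SIZE).
    if int(os.environ.get('WORLD_SIZE', '1')) > 1:
        dist.init_process_group(backend='nccl')
    # otherwise the process group stays uninitialized, so single-GPU training
    # should never go through the DistributedDataParallel code path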

Here’s the relevant part of my code:

    opt.total_batch_size = opt.batch_size
    device = select_device(opt.device, batch_size=opt.batch_size)
    
    # Hyperparameters
    with open(opt.hyp) as f:
        hyp = yaml.load(f, Loader=yaml.SafeLoader)  # load hyps

    # Train
    logger.info(opt)
    if not opt.evolve:
        tb_writer = None  # init loggers
        if opt.global_rank in [-1, 0]:
            prefix = colorstr('tensorboard: ')
            logger.info(f"{prefix}Start with 'tensorboard --logdir {opt.project}', view at http://localhost:6006/")
            tb_writer = SummaryWriter(opt.save_dir)  # Tensorboard
        train(hyp, opt, device, tb_writer)
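
For completeness, this is the kind of check I can drop in right before pred = model(imgs) to see whether the model actually got wrapped in DDP (an illustrative helper; so far the only thing I have actually printed is the device variable):

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    def debug_device_state(model, device):
        """Illustrative helper to call just before the forward pass in train().
        `model` and `device` are the variables already present in the training loop."""
        print('select_device returned:', device)                  # cuda:0 in my run
        print('model weights live on:', next(model.parameters()).device)
        if isinstance(model, DDP):
            # A negative entry here (e.g. [-1]) would line up with the
            # "Device index must not be negative" error in the traceback.
            print('model is wrapped in DDP, device_ids =', model.device_ids)
        else:
            print('model is not wrapped in DDP')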

For the device argument, I don’t pass --device on the command line, so it falls back to the default defined in the parser:

parser.add_argument('--device', default='0', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
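
For reference, my understanding is that select_device('0') boils down to roughly the following (a simplified sketch of the helper in utils/torch_utils.py; the real function also logs device info and validates the request):

    import os
    import torch

    def select_device_sketch(device=''):
        """Simplified sketch of what select_device does with --device '0'."""
        cpu = device.lower() == 'cpu'
        if not cpu and device:
            # restrict visibility to the requested GPU(s), e.g. '0'
            os.environ['CUDA_VISIBLE_DEVICES'] = device
        if not cpu and torch.cuda.is_available():
            return torch.device('cuda:0')  # consistent with the cuda:0 I see printed
        return torch.device('cpu')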