Default collate_fn sending data to cuda:0

Hi, I used to have a single GPU, but now that I have two, I tried to run my code on cuda:1 rather than cuda:0, which I normally use.

However, I ran into the following error:

  File "/Hard_3rd/harry/TOF_hj_0306/train/model_trainers/trainer_CU_MixRes_scale.py", line 297, in _train_epoch
    for step, data in data_loader:
  File "/home/user/anaconda3/envs/TOF/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
    for obj in iterable:
  File "/home/user/anaconda3/envs/TOF/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
    return self._process_next_batch(batch)
  File "/home/user/anaconda3/envs/TOF/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/home/user/anaconda3/envs/TOF/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/user/anaconda3/envs/TOF/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 68, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/user/anaconda3/envs/TOF/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 68, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/user/anaconda3/envs/TOF/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 42, in default_collate
    out = batch[0].new(storage)
RuntimeError: Attempted to set the storage of a tensor on device "cuda:1" to a storage on different device "cuda:0".  This is no longer allowed; the devices must match.

I guess the issue comes from the default collate_fn trying to send data to cuda:0 when it is already on cuda:1. How can I stop this from happening? Is there a way I can keep using the default collate_fn while running my code on cuda:1?

cc @vincentqb for dataloader questions :slight_smile:

@vincentqb
Can I get some help here? :slight_smile:

Have you tried setting the CUDA_VISIBLE_DEVICES environment variable before launching the process? It would also be clearer if you shared a minimal code snippet :slight_smile:
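For example, a minimal sketch: the variable has to be set before CUDA is initialized, so before the first `torch.cuda` call and safest before `import torch`.

```python
import os

# Expose only physical GPU 1 to this process. Must run before CUDA is
# initialized (safest: before `import torch`). The remaining GPU is then
# addressed as "cuda:0" inside the process, so any code or library that
# defaults to cuda:0 lands on the GPU you actually want.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
```

Equivalently, launch from the shell with `CUDA_VISIBLE_DEVICES=1 python your_script.py`.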


As you mentioned, you can specify a custom collate_fn. Have you tried doing so? Could you provide a minimal code snippet we could use to reproduce the issue?
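For reference, here is a rough sketch of such a collate_fn. I'm assuming each sample is a tuple of tensors (your dataset may differ), and note that collating CUDA tensors generally requires `num_workers=0`, since worker processes cannot hand CUDA storage back through the default shared-memory path.

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate  # import path varies by version

# Hypothetical target device for illustration; falls back to CPU when a
# second GPU is not available.
TARGET = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

def collate_to_device(batch):
    # Move every tensor of every sample onto one common device first;
    # mixing cuda:0 and cuda:1 samples is what trips the
    # "storage on different device" check inside default_collate.
    batch = [tuple(t.to(TARGET) for t in sample) for sample in batch]
    return default_collate(batch)

# Toy data standing in for the real dataset: a list of (input, label) pairs.
samples = [(torch.randn(3), torch.tensor(i)) for i in range(4)]
loader = DataLoader(samples, batch_size=2, collate_fn=collate_to_device,
                    num_workers=0)  # keep collation in the main process
xb, yb = next(iter(loader))
```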


I didn’t realize I could do this, setting CUDA_VISIBLE_DEVICES to a single GPU. Thank you very much for your help!! :grinning: