Hi, I am using this implementation of DeepLab for PyTorch and would like to perform hyperparameter optimization by training multiple models at the same time.
This implementation uses
nn.DataParallel to train a model on multiple GPUs.
I start one training process after executing
export CUDA_VISIBLE_DEVICES=0,1. When I want to start the other training process, I get two different errors depending on the GPU ids passed to nn.DataParallel.
If I type
export CUDA_VISIBLE_DEVICES=0,1,2,3 and train the second model on GPU 2 and 3, I receive:
RuntimeError: module must have its parameters and buffers on device cuda:2 (device_ids) but found one of them on device: cuda:0
On the other hand, if I execute
export CUDA_VISIBLE_DEVICES=2,3, I receive:
AssertionError: Invalid device id.
How can I train two models at the same time while another process has already loaded its input data onto a GPU?
Do I have to specify on which device the model should run using .to(device)?
Yes, I would recommend using a single script with the
DataLoader, creating multiple models, pushing each one to the desired device via
to('cuda:id'), and just passing the data to each model.
Since the training is done on different devices, it should be executed in parallel.
Your approach of running multiple scripts with
CUDA_VISIBLE_DEVICES would make it unnecessarily complicated to share the data between these processes.
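A minimal sketch of this single-script approach. The toy nn.Linear models are hypothetical stand-ins for the DeepLab instances, and the GPU branch assumes at least four visible devices; on a machine without enough GPUs the models simply stay on the CPU:

```python
import torch
import torch.nn as nn

def build_parallel_models():
    # Hypothetical toy models; substitute your DeepLab instances.
    model_a = nn.Linear(10, 2)
    model_b = nn.Linear(10, 2)
    if torch.cuda.device_count() >= 4:
        # Each model gets its own pair of GPUs; .to() moves the
        # parameters to the first device in its device_ids list.
        model_a = nn.DataParallel(model_a, device_ids=[0, 1]).to('cuda:0')
        model_b = nn.DataParallel(model_b, device_ids=[2, 3]).to('cuda:2')
    return model_a, model_b

model_a, model_b = build_parallel_models()

x = torch.randn(8, 10)
# Move the batch to each model's primary device before the forward
# pass (a no-op when everything is on the CPU).
device_a = next(model_a.parameters()).device
device_b = next(model_b.parameters()).device
out_a = model_a(x.to(device_a))
out_b = model_b(x.to(device_b))
```

Since each forward/backward pass runs on its own pair of devices, the two training loops can proceed concurrently within one process and share the same DataLoader.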
Thank you. Once I stopped using
CUDA_VISIBLE_DEVICES and specified .to(device), I could train multiple models, each on multiple GPUs, at the same time.
It looks like
_check_balance(device_ids) somehow gets the device id 0 even though
nn.DataParallel(model, device_ids=[2,3]) has GPU IDs 2 and 3. That is why using
CUDA_VISIBLE_DEVICES=2,3 resulted in an
AssertionError: Invalid device id.
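A short sketch of why that combination fails, as I understand it: CUDA_VISIBLE_DEVICES=2,3 renumbers the visible GPUs, so the physical GPUs 2 and 3 appear to PyTorch as cuda:0 and cuda:1, and device_ids=[2,3] no longer refers to any visible device (the GPU branch below assumes at least two GPUs are visible):

```python
import os
# Must be set before torch is imported (or before the process starts)
# for the masking to take effect: only physical GPUs 2 and 3 are visible.
os.environ.setdefault('CUDA_VISIBLE_DEVICES', '2,3')

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # hypothetical stand-in for the real model
if torch.cuda.device_count() >= 2:
    # The visible GPUs are renumbered to cuda:0 and cuda:1, so
    # device_ids must use the renumbered ids, not the physical ones.
    # device_ids=[2, 3] would fail _check_balance with
    # "AssertionError: Invalid device id".
    model = nn.DataParallel(model, device_ids=[0, 1]).to('cuda:0')

x = torch.randn(4, 10)
out = model(x.to(next(model.parameters()).device))
```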
Hello, have you solved this problem? I am new to this field. I would like to run inference with two models on two GPUs using the same input. Which method do you recommend? I saw that Distributed Data Parallel only supports running a single model across multiple GPUs.