Run multiple independent models on same input data

Hi, I am using this implementation of DeepLab for PyTorch and would like to perform hyperparameter optimization by training multiple models at the same time.
This implementation uses nn.DataParallel to train a model on multiple GPUs.

I start one training process after executing export CUDA_VISIBLE_DEVICES=0,1. When I want to start the other training process, I get two different errors depending on the GPU ids in CUDA_VISIBLE_DEVICES.

If I type export CUDA_VISIBLE_DEVICES=0,1,2,3 and train the second model on GPU 2 and 3, I receive:
RuntimeError: module must have its parameters and buffers on device cuda:2 (device_ids[0]) but found one of them on device: cuda:0

On the other hand, if I execute export CUDA_VISIBLE_DEVICES=2,3, I receive AssertionError: Invalid device id.

How can I train two models at the same time when another process has already loaded the input data onto a GPU?
Do I have to specify on which device the model should run using .to_device(cuda:x)?

Yes, I would recommend using a single script with one DataLoader: create multiple models, push each one to the desired device via to('cuda:id'), and pass the same data to each model.
Since the training is done on different devices, it should be executed in parallel.

Your approach of running multiple scripts with CUDA_VISIBLE_DEVICES would make it unnecessarily complicated to share the data between these processes.
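
Here is a minimal sketch of the single-script idea, in case it helps. The nn.Linear model, the random dataset, and the helper names pick_devices and train_independent_models are placeholders I made up for illustration, not part of the DeepLab implementation. It falls back to the CPU when there are fewer GPUs than models, so it should run anywhere. CUDA kernels are launched asynchronously, so the work on different GPUs can overlap as long as the loop does not force a synchronization (for example by calling .item() on every step).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def pick_devices(n_models):
    """Return one device per model; fall back to CPU when GPUs are missing."""
    n_gpus = torch.cuda.device_count()
    if n_gpus >= n_models:
        return [torch.device(f"cuda:{i}") for i in range(n_models)]
    return [torch.device("cpu")] * n_models


def train_independent_models(loader, learning_rates, epochs=1):
    """Train one model per learning rate, feeding every model the same batches."""
    devices = pick_devices(len(learning_rates))
    # Hypothetical stand-in for the real network (e.g. DeepLab).
    models = [nn.Linear(8, 2).to(d) for d in devices]
    optimizers = [torch.optim.SGD(m.parameters(), lr=lr)
                  for m, lr in zip(models, learning_rates)]
    criterion = nn.CrossEntropyLoss()
    last_losses = [None] * len(models)

    for _ in range(epochs):
        for inputs, targets in loader:
            for i, (model, opt, dev) in enumerate(zip(models, optimizers, devices)):
                # Copy the shared batch to this model's device.
                x = inputs.to(dev, non_blocking=True)
                y = targets.to(dev, non_blocking=True)
                opt.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                opt.step()
                # Keep the loss on its device to avoid a sync on every step.
                last_losses[i] = loss.detach()
    return models, [loss.item() for loss in last_losses]


torch.manual_seed(0)
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)
models, losses = train_independent_models(loader, learning_rates=[0.1, 0.01])
```

Each model has its own optimizer, so the hyperparameters can differ per model while the DataLoader is iterated only once.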


Thank you. Once I stopped using CUDA_VISIBLE_DEVICES and specified '.to(device)' instead, I could train multiple models, each on multiple GPUs, at the same time.

It looks like _check_balance(device_ids) somehow gets the device id 0 even though nn.DataParallel(model, device_ids=[2,3]) was given GPU IDs 2 and 3. That is why using CUDA_VISIBLE_DEVICES=2,3 resulted in AssertionError: Invalid device id.
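
For anyone who lands here later, this is roughly the pattern that worked, written as a small sketch. The helper place_model and the nn.Linear models are hypothetical placeholders, not code from the DeepLab repository. The important part is that the model is moved to device_ids[0] before it is wrapped, which is what the RuntimeError above was complaining about. The sketch falls back to a plain CPU model when the requested GPUs are not present.

```python
import torch
import torch.nn as nn


def place_model(model, device_ids):
    """Put `model` on device_ids[0] and wrap it in nn.DataParallel.

    Falls back to a plain CPU model when the requested GPUs do not exist,
    so the sketch also runs on a machine without enough GPUs.
    """
    if torch.cuda.device_count() > max(device_ids):
        primary = torch.device(f"cuda:{device_ids[0]}")
        # DataParallel expects parameters and buffers on device_ids[0].
        model = model.to(primary)
        return nn.DataParallel(model, device_ids=device_ids), primary
    return model.to("cpu"), torch.device("cpu")


# Hypothetical stand-ins for two differently configured networks.
model_a, device_a = place_model(nn.Linear(8, 2), device_ids=[0, 1])
model_b, device_b = place_model(nn.Linear(8, 2), device_ids=[2, 3])

# The same input batch is sent to each model's primary device.
batch = torch.randn(4, 8)
out_a = model_a(batch.to(device_a))
out_b = model_b(batch.to(device_b))
```

The outputs are gathered on each model's primary device, so the targets for the loss should be moved there as well.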

Hello, have you solved this problem? I am new to this field. I would like to run inference with two models on two GPUs using the same input. Which method do you recommend? From what I saw, Distributed Data Parallel only covers running a single model across multiple GPUs.
